Data custodians should ensure, as far as practicable, the accuracy, currency and completeness of data supplied to an integrating authority. The need to ensure data quality aligns with Principle 1 – Strategic Resource of the High Level Principles for Data Integration involving Commonwealth Data for Statistical and Research Purposes, which specifies the importance of data custodians following good data management practices and maintaining the quality attributes of data.
For personal information, quality assurance is also consistent with Australian Privacy Principle 10 of the Privacy Act 1988, which requires an APP entity (in this case the Commonwealth data custodian) to ensure that the personal information it collects is accurate, up to date and complete, and that reasonable steps are taken to ensure that the personal information it uses or discloses is also accurate, up to date, complete and relevant, having regard to the purpose of the use or disclosure.
Equivalent principles for business data should also be considered.
Providing data quality statements
Data quality statements are one tool available to Commonwealth data custodians to clearly communicate key characteristics of the data which impact on quality. The purpose of quality statements is to help integrating authorities and data users make informed decisions about fitness for use for their particular purposes. Quality statements should report both the strengths and limitations of the data.
The Australian Bureau of Statistics (ABS) has developed a data quality framework to assist in the preparation of data quality statements, based on the Statistics Canada Quality Assurance Framework and the European Statistics Code of Practice. The ABS Data Quality Statement Checklist is a tool available to assist in the preparation of data quality statements, using the ABS Data Quality Framework. The ABS data quality framework is outlined below as an example of the issues that should be addressed in a data quality statement. However, alternative frameworks for assessing and documenting data quality may also be appropriate.
The ABS data quality framework is based on seven dimensions of quality: institutional environment, relevance, timeliness, accuracy, coherence, interpretability and accessibility. All seven dimensions should be included for the purpose of quality assessment and reporting. However, these dimensions are not necessarily equally weighted, as the importance of each dimension may vary depending on the data source and its context. Judgement should be used in making assessments of data quality, ensuring the quality dimensions are evaluated appropriately for the particular context, for example, for the purpose of linking with another dataset for a particular project.
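As an illustration only, a draft quality statement can be held as a simple record with one field per dimension, so that any dimension not yet addressed is easy to spot. The sketch below is not part of the ABS framework or checklist; the class name, the function name and the sample text are assumptions made for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class DataQualityStatement:
    """One free-text entry per dimension named in the ABS data quality framework."""
    institutional_environment: str = ""
    relevance: str = ""
    timeliness: str = ""
    accuracy: str = ""
    coherence: str = ""
    interpretability: str = ""
    accessibility: str = ""


def missing_dimensions(statement):
    """Return the names of dimensions that have no text yet, in framework order."""
    return [f.name for f in fields(statement)
            if not getattr(statement, f.name).strip()]


# Hypothetical draft: only two of the seven dimensions have been written up.
draft = DataQualityStatement(
    institutional_environment="Collected by a (hypothetical) agency under its enabling legislation.",
    relevance="Covers all registered clients; reference period is one financial year.",
)
print(missing_dimensions(draft))
```

A check like this only confirms that each dimension has been addressed; how much weight each one deserves remains a matter of judgement for the particular project.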
Institutional environment refers to the institutional and organisational factors which may influence the effectiveness and credibility of the agency producing the statistics. In the context of administrative data collections, usually all that will be needed is the name of the organisation collecting the data. The authority for collecting the data (e.g. relevant legislation) may also be included.
Relevance refers to how well the dataset meets the needs of users in terms of the concepts measured and the populations represented, and will usually be covered in the metadata as well. In the context of administrative data collections, it might include information on:
- scope and coverage – for example, identify the target population and document who the data represent, who is excluded and whether there are any impacts caused by exclusion of particular people, areas or groups;
- the reference period (i.e. the time period for which the information is collected); and
- geographic detail, both about the level of detail available for the data (e.g. postcode area) and the actual geographic regions for which data is available (e.g. the whole of Australia, or selected states and territories).
Timeliness refers to the delay between the reference period and the date at which the data become available. The main things to consider here are the frequency of the collection (e.g. ongoing or one-off), and whether there are likely to be updates or revisions to the data.
Adding data over time may impose additional difficulty on a data integration project. Policies need to be established to account for changes or updates in the data for a given time period. It is common for agencies providing administrative data to add, delete, modify or update their records over time. For example, an agency may update its records to account for new information such as a change in address or family name. As a consequence, data received in one time period may not be the complete dataset for that period. Policies should impose time cut-offs on data that arrive late, so that late data do not affect the integration exercise and do not result in major adjustments to the outputs over time.
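A cut-off policy of this kind can be sketched as a simple rule applied when records are received. The example below is a minimal illustration, assuming each record carries a hypothetical `received` date and that the cut-off date has been agreed in advance; the field names and dates are invented for the example.

```python
from datetime import date


def split_by_cutoff(records, cutoff):
    """Separate records received on or before the agreed cut-off from late arrivals."""
    on_time = [r for r in records if r["received"] <= cutoff]
    late = [r for r in records if r["received"] > cutoff]
    return on_time, late


# Hypothetical records for one reference period, with the date each was received.
records = [
    {"id": 1, "received": date(2014, 7, 10)},
    {"id": 2, "received": date(2014, 9, 2)},
]
on_time, late = split_by_cutoff(records, cutoff=date(2014, 8, 31))
print(len(on_time), "on time;", len(late), "late")
```

Late records are set aside rather than discarded, so that the policy can specify whether and when they are incorporated in a later revision.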
Issues arise when the definitions of reference periods vary between data sources. It is important that data received from the different data providers refer to the same time period.
Ensuring that data obtained from multiple data providers share the same reference period is easier said than done. Agencies use different dates to refer to different parts of the process they use to gather records. For example, there could be a lodgement date for when a record is received by the agency, a processing date for when it is entered into a computer system, a registration date for when it is registered or accepted by the agency, and perhaps another date for the time period to which the record actually refers. These kinds of issues mean that it is essential that data custodians provide clear metadata.
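To show why this metadata matters, the sketch below assumes each custodian has stated which date field holds the period a record actually refers to. The provider names, field names and dates are all hypothetical; the point is only that filtering on the wrong date field can silently exclude records that belong in the reference period.

```python
from datetime import date

# Hypothetical metadata supplied by each custodian: the field holding the
# date the record actually refers to (not when it was lodged or processed).
REFERENCE_DATE_FIELD = {
    "agency_a": "event_date",
    "agency_b": "service_date",
}


def in_reference_period(record, provider, start, end):
    """True if the record's reference date falls within [start, end]."""
    field = REFERENCE_DATE_FIELD[provider]
    return start <= record[field] <= end


start, end = date(2013, 7, 1), date(2014, 6, 30)

# A record that refers to June 2014 but was lodged and processed in July 2014.
record_b = {
    "service_date": date(2014, 6, 28),
    "date_lodged": date(2014, 7, 3),
    "date_processed": date(2014, 7, 20),
}
print(in_reference_period(record_b, "agency_b", start, end))
```

Here the record falls inside the reference period on its service date, but would have been excluded had the lodgement or processing date been used instead.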
In the context of administrative data collections, the accuracy will largely reflect the quality of the data submitted by the individual or institution. Data custodians should describe any steps taken to improve the quality of the data (for example, comparison with other data sources, checks for duplicates, imputation of missing/non-response information) and any issues that might impact on the data such as areas of the population which are unaccounted for in data collection.
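Two of the checks mentioned above, duplicate detection and missing information, can be summarised with a few lines of code before the results are described in a quality statement. This is a minimal sketch under assumed field names (`client_id`, `postcode`); real collections would need checks suited to their own identifiers and required items.

```python
from collections import Counter


def quality_summary(records, key_fields, required_fields):
    """Count records, list duplicated keys, and count missing values per required field."""
    keys = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicate_keys = [k for k, n in keys.items() if n > 1]
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, ""))
        for f in required_fields
    }
    return {"records": len(records), "duplicate_keys": duplicate_keys, "missing": missing}


# Hypothetical extract: one duplicated client and one blank postcode.
records = [
    {"client_id": "A1", "postcode": "2600"},
    {"client_id": "A1", "postcode": "2600"},
    {"client_id": "B2", "postcode": ""},
]
summary = quality_summary(records, key_fields=["client_id"], required_fields=["postcode"])
print(summary)
```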
Accuracy will also be supported by the establishment of governance protocols between data custodians, integrating authorities and data users. These protocols should provide for investigating and resolving data quality issues that were not previously identified or that arise from the creation of the new integrated datasets. (There are a number of other factors to consider in relation to sample surveys, such as sampling error, which are covered in the ABS's data quality framework.)
Coherence reflects the extent of comparability of data sources from either a single statistical program, or data brought together across multiple data sets or statistical programs. Fully coherent data are logically consistent: internally, over time, across products and programs, and with the real world (externally). Achieving coherence depends on three key elements: the use of statistical standards (concepts, classifications, methods); the use of common shared statistical infrastructure and processes (to mitigate operational and technological factors which can impact on coherence); and alignment with the real world (statistics about the same or related events tell a consistent story, presenting a clear, accurate and coherent picture of the economy, society and the environment).
Interpretability refers to the availability of information that will help people to understand the data. To address this dimension of quality, data custodians may choose simply to provide a brief reference to published metadata, with a link to it.
For accessibility, data custodians should document any procedures that people need to follow in order to gain access to the data. Information about costs is also relevant here. Providing broad access to data for research purposes is consistent with Principle 1 – Strategic Resource of the High Level Principles for Data Integration involving Commonwealth Data for Statistical and Research Purposes.
Examples of data quality statements
Two examples of data quality statements for statistics based on administrative collections are:
- the statement for the South Australian Cancer Registry; and
- the quality declaration for Overseas Arrivals and Departures (statistics compiled from administrative data provided by the Australian Government Department of Immigration and Border Protection).
Some examples of data quality statements from the Australian Bureau of Statistics (relating to census and sample survey data) are below: