Data Quality

What is quality?

A big part of testing in a data-driven project, such as a BI project, is testing the quality of the data. Data quality needs to be measured at several points in the data landscape: at the source, but also after the migration/transformation (the ETL process). See building block: BI Testing.

Quality can be defined as: "the totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs." (Source: TMAP Quality for DevOps teams)

The TMAP method describes several quality characteristics that help measure the quality of software products, but these are not well suited to data-driven environments. For testing data quality specifically, the UK chapter of DAMA (the Data Management Association) wrote a paper describing new measures for data quality. (Source: The Six Primary Dimensions for Data Quality, DAMA UK, October 2013)

Six primary dimensions of data quality

These are the six primary dimensions of data quality; a sketch of how some of them can be checked in practice follows the list:

Figure 1: Six primary dimensions of data quality
  • Completeness: is all data from the source present in the target?
  • Uniqueness: are there duplicate records?
  • Timeliness: are changes in the source reflected on time in the dashboard?
  • Validity: is the data valid (does it adhere to the business rules e.g. no insurance customers younger than 18 years)?
  • Accuracy: is the data accurate (e.g. no customers with impossible age: 121)?
  • Consistency: is the data consistent across sources?
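To make these checks concrete, here is a minimal Python/pandas sketch applied to a hypothetical customers extract. The column names, reference date and business rules are illustrative assumptions; timeliness and consistency are omitted because they require a comparison with the source system.

```python
# Four of the six dimensions, sketched on a hypothetical "customers"
# extract; column names, the reference date and the rules are assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "birth_date": ["1990-05-01", "2010-02-14", "2010-02-14", "1880-01-01"],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Completeness: percentage of records with all mandatory fields filled.
completeness = customers.notna().all(axis=1).mean() * 100

# Uniqueness: percentage of records that are not duplicates.
uniqueness = (~customers.duplicated()).mean() * 100

# Validity: business rule "no insurance customers younger than 18".
age = (pd.Timestamp("2024-01-01")
       - pd.to_datetime(customers["birth_date"])).dt.days / 365.25
validity = (age >= 18).mean() * 100

# Accuracy (plausibility): no customers with an impossible age (> 120).
accuracy = (age <= 120).mean() * 100

print(f"completeness {completeness:.0f}%  uniqueness {uniqueness:.0f}%  "
      f"validity {validity:.0f}%  accuracy {accuracy:.0f}%")
```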

These dimensions are primarily intended for checking the quality of the data at the target, but most of them can also be applied to the source data (see also building block: Data Profiling); the only exception is 'timeliness'. How to calculate these dimensions is explained in the next section.

How to use and measure the quality dimensions

The listed data quality dimensions can be used as KPIs for the quality of the data, but the data quality assessment needs to be done in a methodical fashion. Not all dimensions have to be weighed equally: some dimensions may be more important to the project than others, and some data items may need a different weighting than other data items. It is also important to define the requirements for acceptance. In short, go through the following steps (as advised by DAMA UK):

  1. Identify which data items need to be assessed for data quality; typically these are the data items deemed critical to business operations.
  2. Assess which data quality dimensions to use and their associated weighting.
  3. For each data quality dimension, define values or ranges representing good and bad quality data.
  4. Apply the assessment criteria to the data items (calculate the score; see the sketch after this list).
  5. Review the results and determine if data quality is acceptable or not.
  6. Where appropriate take corrective actions e.g. clean the data and improve data handling processes to prevent future recurrences.
  7. Repeat the above on a periodic basis to monitor trends in Data Quality.
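As an illustration of steps 2 through 5, the sketch below combines per-dimension scores into one weighted score. The weights, scores and the acceptance threshold are made-up assumptions; each project has to decide these for itself.

```python
# Illustrative weighted data quality score (steps 2-5); the weights,
# dimension scores and 95% acceptance threshold are assumptions.
weights = {"completeness": 0.30, "uniqueness": 0.10, "timeliness": 0.15,
           "validity": 0.20, "accuracy": 0.15, "consistency": 0.10}

scores = {"completeness": 99.2, "uniqueness": 100.0, "timeliness": 97.5,
          "validity": 98.1, "accuracy": 96.4, "consistency": 99.0}

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1

overall = sum(weights[d] * scores[d] for d in weights)
acceptable = overall >= 95.0  # step 5: acceptance criterion

print(f"overall data quality score: {overall:.1f}% "
      f"({'acceptable' if acceptable else 'not acceptable'})")
```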

In step 4 you calculate the score for each data quality dimension:

Figure 2: Definition and calculation of the quality dimensions
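The figure itself is not reproduced here, but the general pattern in the DAMA UK paper is to express each dimension as a percentage: the number of units of data that pass the check, divided by the total number of units measured. A minimal sketch, assuming a simple pass/total count:

```python
def dimension_score(passed: int, total: int) -> float:
    """Per-dimension score: units of data passing the check as a
    percentage of the total number of units measured."""
    return 100.0 * passed / total

# e.g. 990 of 1,000 customer records pass the e-mail validity check:
print(dimension_score(990, 1_000))  # 99.0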

* Note that accuracy cannot always be tested, because the real-world value to compare against is not always available. In that case, check plausibility instead. Examples to check for are: impossible birth dates, e-mail addresses that don't contain the @ sign or that contain spaces, and non-existent combinations of postal code and city (in the Netherlands: the postal code/house number check).
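A short sketch of such plausibility checks is given below. The reference date and the deliberately simple e-mail pattern are assumptions, and the postal code/house number check is omitted because it requires an external reference dataset.

```python
# Plausibility checks of the kind mentioned above; the reference date
# and the simple e-mail pattern are illustrative assumptions.
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # has an @, no spaces

def plausible_birth_date(d: date, today: date = date(2024, 1, 1)) -> bool:
    """Reject birth dates in the future or implying an age over 120."""
    return date(today.year - 120, today.month, today.day) <= d <= today

def plausible_email(address: str) -> bool:
    return bool(EMAIL_RE.match(address))

print(plausible_birth_date(date(1890, 1, 1)))   # False: age over 120
print(plausible_email("jan van@example.com"))   # False: contains a space
```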

Further considerations for data quality

In addition to the primary data quality dimensions, there are other dimensions that might be taken into consideration. This list is by no means complete, but it gives an idea of what else to consider:

  • Usability: is the data understandable and relevant?
  • Security: is the data auditable and are privacy regulations maintained (e.g. "right to be forgotten")?
  • Value: is there a good cost/benefit case for the data in terms of storage capacity (and therefore money)?
  • Sustainability: is there a good cost/benefit case for the data in terms of power usage?