Data Profiling

What is data profiling?
Definition: Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. (Source: Wikipedia)

With data profiling you look at the source data and use those aspects of the data for testing purposes. As a BI tester or a tester in a data migration project you will have as your testbase a source-to-target mapping (STM) and you will have access to the source databases, files, etc. With the use of data profiling you can start testing and building up your testcases before the datawarehouse (DWH) or datamigration is done.

The STM will hold information about which source field will go to what target field in the new DWH and whether any transformations are done or if it's a direct mapping. It will also hold information about the target field type, length, null values, etc.. Data profiling in this case is a form of static testing where a (BI) tester can assess the quality of the STM by checking if the source data matches with what is expected in the STM.

Benefits of data profiling

There are three major benefits a tester can have from data profiling:

  1. Find bugs early in the design, helping to improve the data model (shift-left)
  2. Improve your own testcases by finding special cases
  3. Assess the data quality

These can be made clear with an example.

Practical example
data profiling
Figure 1: Source–to–target mapping example

Figure 1 contains an example STM. Data profiling is "looking at the source data and collecting statistics", these statistics can be such things as maximum length, missing values, character sets, uniqueness, etc. From this example a BI tester can gather the following points of interest:

  • First name: Check the character sets used in the source (ASCII, Cyrillic, Greek, Arabic, etc.) and ask the designer if they were taken into account for the new DWH/migration. Also use this information to create testcases to test the transformations.
  • Last name: Check for missing values as the target column will not accept null values. Also check the character sets.
  • Gender: Do a value distribution on the source to see if there are truly no other values than Male, Female and Unknown since the transformation will break if it encounters any other values. Also check for missing values.
  • ID: Check for maximum value length as the target field is smaller than the source. Check also for uniqueness as this field will become a primary key and, meaning it needs to be unique. Also check for missing values and for pattern type to see if there are only numbers used in the source as the target data type is Numeric.
  • Email: Check the value patterns.

After doing data profiling you might have the following findings: in the source the last name is sometimes missing, the gender field also has values "Man" and "Woman" and the customer ID sometimes ends with a letter. This is important information for the data modeler so these issues can be fixed early in the project instead of finding out the ETL fails after it's all built!

For Email address there is a direct mapping so in theory all data should be good, but with the value pattern check you could find that there are addresses that are missing the ‘@’ or have written out ‘[at]’ instead or maybe have space in them. This means they cannot be used for automatic email distribution as they are not valid email addresses. This lack of quality should be addressed with the business or product owner. See also building block: Data Quality.

Tips and advanced data profiling

A few items have already been mentioned in the previous part and there is much more to be found online, but some of the more common things to look for are:

  • Data type
  • Value length (maximum, minimum)
  • Missing values
  • Characters sets used
  • Uniqueness
  • Value distribution
  • Decimals
  • Datetime notation
  • Value patterns (incl. outliers)
  • Currencies

Data profiling can be done with SQL, but the use of (data cleaning) tools is more common. This building block only discusses data profiling on a single column at a time. Another approach is to also look at relationships between columns in a table or between columns from different tables. For a (BI) tester however this sort of advanced data profiling is not needed. Understanding the possibilities and being able to do single column data profiling will already provide a lot of insight into the data and data quality.