A few years ago, our data platform team set out to pinpoint the primary concerns of our data users. We surveyed the people who interact with our data platform, and, unsurprisingly, the top concern was data quality.
The initial response, characteristic of our engineering mindset, was to build data quality tooling. We introduced an internal tool named Contessa. Despite being somewhat cumbersome and requiring significant manual configuration, Contessa supported checks across the standard dimensions of data quality: consistency, timeliness, validity, uniqueness, accuracy, and completeness (a sketch of what such checks can look like follows the list below). After running the tool for a couple of months with hundreds of data quality checks, we concluded that:
- Data quality checks occasionally helped data users discover sooner that the data was compromised and could not be relied upon.
- Despite the frequent execution of the checks, there was no noticeable improvement in the subjective perception of data quality.
- For a significant share of issues, particularly those identified by automated checks such as consistency or validity, no corrective action was ever taken.
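For concreteness, here is a minimal sketch of the kind of checks Contessa ran, written in Python with pandas. Contessa itself is internal and its actual API is not shown here; the function names, the thresholds, and the sample orders table are all hypothetical.

```python
import pandas as pd

def check_completeness(df: pd.DataFrame, column: str, threshold: float = 0.99) -> bool:
    """Pass if the share of non-null values in `column` meets the threshold."""
    return df[column].notna().mean() >= threshold

def check_uniqueness(df: pd.DataFrame, column: str) -> bool:
    """Pass if `column` contains no duplicate values."""
    return not df[column].duplicated().any()

def check_timeliness(df: pd.DataFrame, ts_column: str, max_lag_hours: int = 24) -> bool:
    """Pass if the newest record is fresher than `max_lag_hours`."""
    lag = pd.Timestamp.now(tz="UTC") - df[ts_column].max()
    return lag <= pd.Timedelta(hours=max_lag_hours)

# A hypothetical orders table to run the checks against.
now = pd.Timestamp.now(tz="UTC")
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "revenue": [10.0, None, 7.5],  # one missing value -> completeness fails
    "created_at": [now - pd.Timedelta(hours=30),
                   now - pd.Timedelta(hours=5),
                   now - pd.Timedelta(hours=1)],
})

print({
    "completeness(revenue)": check_completeness(orders, "revenue"),    # False
    "uniqueness(order_id)": check_uniqueness(orders, "order_id"),      # True
    "timeliness(created_at)": check_timeliness(orders, "created_at"),  # True
})
```

Note that a failing check, like the completeness check above, only tells you that something is wrong; as our conclusions suggest, it says nothing about whether anyone will act on it.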
Surveys and objective measurements are useful tools, but nothing can replace a discussion over coffee and cake, as Caroline Carruthers writes in her book, “The Chief Data Officer’s Playbook”. I recommend this approach to anybody: one-on-one conversations helped us discover another important angle on the situation. Some of these conversations unfolded as follows:
“Hey, you say that data quality is poor; what do you mean by that?”
#1 Pricing business analyst: “We are working on setting the price for the ancillary product X. The dataset we use is missing the actual revenue from product X for each order. We have this dataset, but it contains only the expected revenue from X at the time of purchase. We can also see the actual revenue per product, but not at the order granularity.”
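The analyst’s complaint is essentially a granularity mismatch, which a small hypothetical example makes concrete: expected revenue is recorded per order, but actual revenue exists only per product, so the two cannot be joined at the order level. All table and column names below are invented for illustration.

```python
import pandas as pd

# Order-level dataset: expected revenue from product X, known at purchase time.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "expected_revenue_x": [5.0, 0.0, 8.0],
})

# Product-level dataset: actual revenue, aggregated across all orders.
actuals = pd.DataFrame({
    "product": ["X"],
    "actual_revenue": [11.5],
})

# `actuals` has no order_id, so the actual revenue cannot be joined back to
# individual orders; the per-order attribution the analyst needs is missing.
print(orders)
print(actuals)
```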