How can I make sure that data actually supports the business objectives of our customers?
As chief technology officer, I dedicate a lot of time to thinking about how to solve data health problems such as this. Data health involves tools, of course, but also processes and people. It concerns every employee who has contact with data, so the approach to data health must be pervasive.
Data quality is essential to data health. Traditionally, data originates from human entry or the integration of third-party sources, both of which are prone to errors and extremely difficult to control. In addition, data that works beautifully for its intended applications can give rise to objective quality problems when extracted for another use – typically analytics. Outside of its intended environment, the data is removed from the business logic that puts it into context, and from the habits, workarounds and best practices of regular users, which often go undocumented. So even when a format or content choice is not objectively a quality issue within its original silo, it will almost certainly become one when the data is extracted and combined with other sources for an integration or analytics project.
Academic research describes up to 10 data quality dimensions, but, in practice, there are five that are critical to most users: completeness, timeliness, accuracy, consistency and accessibility. Each of these dimensions corresponds to a challenge for an analytics group: if the data doesn’t provide a clear and accurate picture of reality, it will lead to poor decisions, missed opportunities, increased costs or compliance risks. This makes measuring data quality a complex, multidimensional problem.
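To make the multidimensional nature of this measurement concrete, the sketch below scores three of the five dimensions over a batch of records. It is a minimal illustration, not any vendor's implementation; the function names, record shape and rules are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def completeness(records, required_fields):
    """Fraction of required field values that are actually populated."""
    total = len(records) * len(required_fields)
    filled = sum(1 for r in records
                 for f in required_fields if r.get(f) not in (None, ""))
    return filled / total if total else 1.0

def timeliness(records, ts_field, max_age, now):
    """Fraction of records updated within the agreed freshness window."""
    if not records:
        return 1.0
    fresh = sum(1 for r in records if now - r[ts_field] <= max_age)
    return fresh / len(records)

def consistency(records, field, conforms):
    """Fraction of values that conform to an agreed format rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if conforms(r.get(field))) / len(records)
```

Accuracy and accessibility are harder to score mechanically: accuracy typically requires a reference dataset to compare against, and accessibility is a property of the systems around the data rather than of the values themselves, which is part of why measuring quality is a multidimensional problem.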
Data quality assessment must be a continuous process, as more data flows into the organisation all the time. Traditionally, data quality assessment has been done on top of the applications, databases, data lakes or data warehouses where data lives. Many data quality products must collect data in their own system before they can run the assessment like an audit, as part of a data governance workflow.
A more modern approach is pervasive data quality, integrated directly into the data supply chain. The more upstream the assessment is made, the earlier risks are identified, and the less costly the remediation will be.
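As a sketch of what an upstream check can look like, the hypothetical ingestion gate below validates records as they arrive and quarantines the failures, so bad rows never reach downstream stores where remediation is far more expensive. The rule names and record shape are assumptions for the example, not any product's API.

```python
def validate_record(record, rules):
    """Return the names of the rules this record violates."""
    return [name for name, check in rules.items() if not check(record)]

def ingest(records, rules):
    """Split an incoming batch into accepted and quarantined records.

    Problems are caught at the point of ingestion, upstream of any
    data lake or warehouse, where fixing them is cheapest.
    """
    accepted, quarantined = [], []
    for record in records:
        violations = validate_record(record, rules)
        if violations:
            quarantined.append((record, violations))
        else:
            accepted.append(record)
    return accepted, quarantined
```

Quarantined rows, together with the rules they broke, can then be routed to a stewardship workflow for remediation instead of silently polluting downstream analytics.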
The assessment of data quality typically starts by observing the data and computing the relevant data quality metrics. But companies should also be looking at quality metrics that can be aggregated across dimensions, such as the Talend Trust Score. Static or dynamic reports, dashboards and drill-down explorations that focus on data quality issues and how to resolve them (not to be confused with business intelligence) provide perspective on overall data quality. For more fine-grained insight, issues will be tagged or highlighted with various visualisation techniques. And good data quality software will add workflow techniques, such as notifications or triggers, for timely remediation of data quality issues as they arise.
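The Talend Trust Score formula itself is proprietary, so the sketch below shows only the general idea described above: aggregating per-dimension metrics into a single score and wiring a threshold trigger to it for timely remediation. The weights, threshold and message text are all hypothetical.

```python
def aggregate_score(dimension_scores, weights):
    """Weighted average of per-dimension quality scores, each in [0, 1]."""
    total = sum(weights[d] for d in dimension_scores)
    return sum(s * weights[d] for d, s in dimension_scores.items()) / total

def check_and_alert(score, threshold, notify):
    """Fire a remediation notification when the score drops below threshold."""
    if score < threshold:
        notify(f"Data quality score {score:.2f} fell below {threshold:.2f}")
        return True
    return False
```

In practice, `notify` would post to a stewardship queue or a messaging channel; here it is simply a callback, which keeps the trigger logic testable.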
Reacting to problems after they happen remains very costly, and companies that are reactive, instead of proactive, about data issues will continue to suffer from questionable decisions and missed opportunities.
Too often data quality is viewed only through the lens of an assessment, as a sort of necessary evil similar to a security or financial audit. But the value truly lies in continuous improvement. Data quality should be a cycle: the assessment runs regularly – or even better, continuously – automation is refined all the time, and new actions are taken at the source, before bad data enters the system.
As with any governance process, data quality improvement is a balance between tools, processes and people. Putting humans in the loop – people who are experts on the data but not experts on data quality – requires a highly specialised workflow and user experience that few products are able to provide. Talend is leading the way here, with tools including the Trust Score formula, Data Inventory and Data Stewardship to enable the collaborative curation of data with human-generated metadata, such as ratings and tagging.
As in medicine, we may never have a perfect picture of all the factors that affect our data health. But by establishing a culture of continuous improvement, backed by people equipped with the best tools and software available for data quality, we can protect ourselves from the biggest and most common risks. And if we embed quality functionality into the data lifecycle, before data ever enters the pipeline, we can make data health a way of life.
Krishna Tammana is chief technology officer at Talend
This article was originally published in the Winter 21/22 issue of Technology Record. To get future issues delivered directly to your inbox, sign up for a free subscription.