Why You Should Test Your Data Assets and Pipelines, Not Just Your Software UIs and APIs

__________

“We’re testing our software . . . a lot. Why should we test our data? 
Aren’t we doing enough already?”

“Why is one dashboard showing us that we had 80 orders yesterday, 
but the other one says we had 220?”

Two questions asked by the same VP on different days

____________

[Image: Ultranauts - Fatal Error]

Software testing (whether manual or automated) focuses on making sure the form and function of our user interfaces, APIs, and backend programs produce the behavior and output we’ve specified. Developers set up unit tests as checkpoints to evaluate the smallest foundational components. Quality assurance engineers add integration tests, regression tests, and continuous integration pipelines to ensure that no new software that fails critical tests is deployed to general use.

So how is it possible that with all those tests, our reports and dashboards might still be full of numbers and values of dubious origin? 

Here’s why. Data moves and flows and changes. We translate raw data into new structures, and transform it into new information and knowledge. There are a multitude of steps, and rarely do organizations have full transparency regarding what all those steps are. All pipelines look something like this:

Input → Processing Tasks → Output

Inside the processing step, there might be tens or hundreds of tasks. Even if you’re responsible for some of those tasks, it’s likely that you have limited visibility into what happens before and after the scope of your work. And if you’re not sure whether the inputs you’re getting meet your expectations, you won’t be able to guarantee the integrity of the results you produce and share with others.
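
To make this concrete, here is a minimal sketch (the column and table names are invented for illustration) of a processing task that checks its inputs before doing any work, rather than assuming that whatever arrived from upstream is trustworthy:

import pandas as pd

def check_inputs(orders: pd.DataFrame) -> list:
    """Return a list of problems found in the incoming data."""
    problems = []
    if orders.empty:
        problems.append("input is empty")
    if orders["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (orders["order_total"] < 0).any():
        problems.append("negative order totals")
    return problems

def process_orders(orders: pd.DataFrame) -> pd.DataFrame:
    problems = check_inputs(orders)
    if problems:
        # Fail loudly rather than silently passing bad data downstream.
        raise ValueError(f"Upstream data failed checks: {problems}")
    return orders.groupby("region", as_index=False)["order_total"].sum()

If a check fails, the pipeline stops at the step where the problem appeared, instead of the problem surfacing weeks later as two dashboards that disagree.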

And there’s more… because context and interpretation matter. When different people use data and information in different ways to generate insights and make decisions, they may not share the same expectations for what makes that data “good”. We can’t write specifications for exactly what to test, so we fall back on checking general characteristics: whether an output matches a word or phrase we expect, or whether a number falls within a plausible range. And we can’t rely solely on testing methods that verify a data asset is identical at an earlier point and a later point, like when you lift and shift a database to the cloud.
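
In practice, those general characteristics translate into checks like the following sketch, where the allowed statuses and the plausible range are judgment calls rather than hard specifications (the field names are made up):

EXPECTED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}

def status_looks_valid(status: str) -> bool:
    # Matches a phrase we expect, rather than an exact specification.
    return status.strip().lower() in EXPECTED_STATUSES

def daily_order_count_is_plausible(count: int, low: int = 50, high: int = 500) -> bool:
    # Wide enough to allow normal variation, tight enough to flag the kind of
    # 80-versus-220 surprise our VP ran into.
    return low <= count <= high

Checks like these won’t prove the data is correct, but they will catch the kinds of changes most likely to mislead the people downstream.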

But what we can do is strategically insert checkpoints on our data assets, as they evolve from raw data to clean data to processed or prepared business objects, and also on our pipelines, the workhorses of our data ecosystem. By catching issues as far upstream as possible, we protect all the people who use that data and information as inputs later, and make it faster and cheaper to find and fix problems when they do occur.

Fortunately, adding tests to your data assets and pipelines is easier now than it has ever been. My favorite testing framework for this job is Great Expectations, a Python-based framework whose product roadmap includes a low-code/no-code cloud platform intended to make it easier for data stewards to become data quality experts. Engineers enjoy working with it, and it fits nicely into data management ecosystems built on common products like Snowflake and dbt. In fact, because the engineering side is so approachable, the hardest part of building tests for your data assets and pipelines is choosing the highest-value spots to place them. This is something Ultranauts teams have been doing for clients for the past couple of years.
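
As a rough sketch of what a checkpoint on a cleaned data asset might look like, here are a few expectations expressed with Great Expectations’ pandas-flavored API (entry points vary by version, and the file and column names are placeholders):

import great_expectations as ge

orders = ge.read_csv("clean_orders.csv")

orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("order_total", min_value=0)
orders.expect_column_values_to_be_in_set(
    "status", ["pending", "shipped", "delivered", "cancelled"]
)

results = orders.validate()
print(results.success)  # overall pass/fail for the whole suite

Each expectation reads almost like plain English, which is exactly what makes it practical to hand these checkpoints to data stewards rather than reserving them for engineers.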

If you feel like the VP we quoted at the top of this post, you probably want to take steps to assure yourself that your development teams aren’t accidentally missing issues that data testing could reveal in an instant. In addition, by instituting a program to strategically test your data assets and pipelines, you’ll be taking a concrete step toward data-centric AI, as recommended by industry leader Andrew Ng (Radziwill 2022).

Ultranauts helps companies establish and continually improve data quality through efficient, effective data governance frameworks and other aspects of data quality management systems (DQMS), especially high impact data value audits. If you need to improve data understanding at your organization, or figure out which data assets and pipelines to test to generate the greatest business value, get in touch with us. Ultranauts can quickly help you identify opportunities for improvement that will drive value, reduce costs, and increase revenue.

Additional Reading:

Radziwill, N. M. (2022, July 28). The Key to Unleashing the Value of AI Systems? It’s Not More Data. Ultranauts Blog. Available at https://info.ultranauts.co/blog/the-key-to-unleashing-the-value-of-ai-systems