Why do we need a Data Quality Checker?
“Garbage in, garbage out” is a popular adage in machine learning. At Elula we process up to 5 billion data points monthly from each customer, and we need to identify issues within minutes. To achieve the data quality required to scale and automate our AI software services, we needed a fast-running tool to automate data checks and detect anomalies, ensuring no garbage gets in. A poorly monitored production model can quietly degrade, causing problems that flow through to customer outcomes.
Many of the tools currently available on the market either could not—and still cannot—handle big data, are designed for manual fixes, or are too generalised to handle our complex situation. We needed to develop our own software, which we call the Data Quality Checker (DQC).
DQC automates data checks and requires no human intervention; software continues to run, customers access their retention leads quickly, and hundreds of monthly person-hours are saved. And if critical errors are found, much like with Toyota’s stop-the-line manufacturing practice, DQC allows us to quickly halt production, intervene and remediate, saving costs and time.
What it does
In designing the DQC, we considered all that could go wrong with regular data updates. Obvious mistakes, such as missing columns or files (“Where’s all the accounting data?!”), are easy to identify. Errors like all-zero or all-null columns are also easily detectable; one of our major focuses is retail finance, so a typical example might be, “Are we sure Australia had no new mortgages in November?”
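Obvious checks like these are straightforward to express in code. The sketch below is a minimal plain-Python illustration (the DQC itself runs on PySpark; the expected schema and column names here are hypothetical, not our real configuration):

```python
# Hypothetical schema for illustration only.
EXPECTED_COLUMNS = {"account_id", "balance", "open_date"}

def structural_issues(rows, expected=EXPECTED_COLUMNS):
    """Return a list of obvious problems: missing columns,
    all-null columns, and all-zero numeric columns."""
    issues = []
    if not rows:
        return ["no rows received"]
    present = set(rows[0])
    for col in sorted(expected - present):
        issues.append(f"missing column: {col}")
    for col in sorted(present & expected):
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        if not non_null:
            issues.append(f"all-null column: {col}")
        elif all(v == 0 for v in non_null):
            issues.append(f"all-zero column: {col}")
    return issues
```

A row set with no `open_date` column and a `balance` column of all zeros would yield both a "missing column" and an "all-zero column" issue.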
But nuanced issues are where the power of the DQC comes in, such as column averages dramatically decreasing (e.g., a call centre expects 50 daily inbound calls but suddenly receives only 5; the values are valid, but something has significantly changed). This may reflect a legitimate underlying business change, but we’d still want to confirm with customers.
Another common but subtle issue is backdating. A transaction date is sometimes overwritten with a later date when it falls out of sync with the date the transaction is actually written to the database. Understanding the details of this backdating is important in assessing data quality.
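One way to profile backdating is to measure, per record, the lag between the stated transaction date and the date it landed in the database. This sketch uses illustrative field names, not our actual schema:

```python
from datetime import date

def backdating_lags(transactions):
    """Lag in days (loaded date minus transaction date) per record;
    unusually large lags suggest dates were overwritten or backdated."""
    return [(t["loaded"] - t["txn_date"]).days for t in transactions]

def flag_backdating(transactions, max_lag_days=3):
    """Return records whose lag exceeds an assumed tolerance."""
    lags = backdating_lags(transactions)
    return [t for t, lag in zip(transactions, lags) if lag > max_lag_days]
```

In practice the tolerance would depend on each customer's known load schedule rather than a fixed constant.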
How it works
DQC is designed in two parts: metrics calculations and threshold checks. DQC first calculates statistics for every column (min, max, average, percentages of zeros/nulls, and distribution measures such as standard deviation), as well as categorical proportions (e.g., the percentage of investor versus owner-occupied loans). Each of these metrics can also be examined over various time periods (e.g., daily, weekly, monthly). This enables inspection, for example, of the average loan balance across all loans, for each month, over a year. The DQC can then determine whether values are consistent with historically observed information. For example, if average mortgage balances historically vary between $100k and $500k, an average mortgage balance of $300k in a new dataset would not be flagged as an issue—but an average of $800k would be. If values exceed thresholds, checks fail and users are notified (“DQC found that average customer age went from 32 in April to 48 in May—a 50% increase”; as a result, the business owner and data engineers can investigate).
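The two-part flow can be sketched in a few lines of plain Python: compute per-column metrics, then compare each against a historically observed band. The numbers mirror the mortgage-balance example above; the thresholds are illustrative, not our real configuration:

```python
import statistics

def column_metrics(values):
    """Part one: summary statistics for a single column."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "pct_null": (len(values) - len(non_null)) / len(values),
    }

def check_against_history(metric_value, history_min, history_max):
    """Part two: pass only if the new value sits in the historical band."""
    return history_min <= metric_value <= history_max

m = column_metrics([250_000, 350_000, None, 300_000])
ok = check_against_history(m["mean"], 100_000, 500_000)   # $300k mean: passes
bad = check_against_history(800_000, 100_000, 500_000)    # $800k: would fail
```

A failed check would then raise a notification for the business owner and data engineers to triage.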
As we started testing DQC, we found that checks were flagging too many false positives, yielding unnecessary interventions. Our solution was threefold. First, we upgraded the DQC review UI, adding severity levels to assist triage. Second, we built smarter checks and thresholds that adjust dynamically based on recent distributions, minima/maxima, and month-to-month changes: it can be unclear whether a “10% decrease in complaints” is reasonable, but by checking that similar swings have appeared before, and that the data are still in line with recent distributions and trends, we can answer this question with minimal human judgement. Third, our next DQC release will incorporate machine learning forecasting of expected values and thresholds; these dynamic checks account for the seasonality and complex trends that make data validation so hard for humans.
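The second idea, dynamic thresholds, can be illustrated with a simple deviation test: rather than a fixed band, a new monthly value is flagged only if it sits further from the recent distribution than swings seen before. The tolerance `k` here is an assumed tuning knob:

```python
import statistics

def is_anomalous(history, new_value, k=3.0):
    """True if new_value lies more than k standard deviations from the
    mean of recent history, so previously seen swings pass quietly."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) > k * stdev

monthly_complaints = [100, 95, 105, 110, 90, 102]
dip_flagged = is_anomalous(monthly_complaints, 90)    # ~10% dip: in line with history
crash_flagged = is_anomalous(monthly_complaints, 10)  # collapse: clearly anomalous
```

A production version would also weight recent months more heavily and, as noted above, eventually replace the fixed `k` with a forecast band.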
Unfortunately, some data quality issues are still ambiguous, particularly those requiring domain knowledge. For example, if all customers receive interest rate discounts, the average value for “interest rate” will suddenly drop. There is no real error, but it might be flagged as a quality issue. Striking the balance between surfacing potential issues and not flagging every change is an ongoing challenge we continually pursue.
In our big data environment, DQC speed is critical
With billions of data points to check within minutes, DQC needed to be fast. This was no simple engineering task and required much behind-the-scenes work by our platform team. We chose PySpark to take advantage of its lazy execution model, which defers computation until results are needed, widening optimisation opportunities and improving cost and runtime. Running on serverless AWS Glue gives us the flexibility and scalability to adjust provisioned infrastructure rapidly. Together, these decisions let us complete the checks we need within the required timeframe.
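Spark specifics aside, the lazy model can be illustrated with a plain-Python generator pipeline (an analogy, not actual PySpark code): transformations are described up front but nothing executes until a terminal action consumes the result.

```python
def loans(n):
    # Lazily "read" n synthetic records; nothing runs until consumed.
    return ({"balance": 100_000 + i} for i in range(n))

def large_loans(records, threshold):
    # A lazy transformation: no work is done when this is called.
    return (r for r in records if r["balance"] > threshold)

pipeline = large_loans(loans(1_000_000), 1_099_990)  # still no work done
count = sum(1 for _ in pipeline)                     # the "action" triggers execution
```

In PySpark the same principle means the whole chain of checks can be optimised as one plan before any data is touched.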
Ensuring the ongoing performance of the machine learning products we provide to our customers will always rely on the quality of the data we put in. Having these automated tools in place gives us confidence in the data, and allows us to spend our time efficiently.