The Hidden Bugs in Data Pipelines That No One Tests For
Data pipelines pass all tests but silently lose millions in revenue. Discover Automated Data Tests (ADT)—lightweight checks that catch join drops, sum errors, and aggregation glitches across billions of rows. Python and SQL solutions coming next.
Introduction
A few years back, I inherited a data pipeline that merged dozens of tables across dozens of steps. The tables had hundreds of columns and billions of rows. The pipeline ran flawlessly for weeks, until the monthly reports showed revenue off by millions. There was no code crash and no source outage. The data itself had silently gone wrong: a join dropped 2% of orders, an aggregation summed nulls as zero (a group-by glitch), and normalization scattered keys into oblivion. And the integration tests? Green every time. They checked whether the pipeline ran, not whether the output matched reality.
This isn't rare. Data pipelines have exploded, from ETL to lakes to ML features, and their complexity has outpaced our testing. We unit-test transformations and integration-test flows, but rarely verify the one thing that matters: does the end match the beginning? Do intermediate steps preserve invariants like total row counts, revenue sums, or country-level aggregates? In my work, I've seen pipelines with 50+ joins where one bad join condition silently loses 10% of the sales data. You discover it when execs yell, or worse, when AI models train on garbage.

Standard integration tests fall short here because they mock inputs and assert shapes, not semantics. The pipeline "succeeds" if it runs; data correctness hides underneath. Complex operations amplify this: joins carry Cartesian-product risks, normalizations demand key integrity, group-bys invent sums from thin air, and metrics drift on edge cases. With billions of rows, sampling misses the needles and full scans cost a fortune.
Why Data Quality Testing Matters More Than Ever
Consider this simple real-world sales pipeline, stripped to essentials (we'll code it fully in later posts):
- Pull customers from Salesforce CRM: 10 million rows with customer_id and country (DE, FR, IT, ES, UK for the EU scope).
- Pull orders from the Shopify API: 12 million rows with order_id, customer_id, revenue (€10-€1000), and validity flags.
- Pull weekly sales goals from a country_goals table: a simple lookup with country, week, target_revenue.
- Join + transform: INNER JOIN orders to customers on customer_id (dropping unmatched rows), LEFT JOIN goals on country/upper(week), filter out invalid orders, and group by country and EU for net revenue totals.
The output feeds dashboards, forecasts, and compliance reports. Simple to follow, yet it breaks in classic ways.
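The four steps above can be sketched in a few lines of pandas. This is a minimal toy version, not the production job: the DataFrame contents, the hard-coded week, and the row counts are all illustrative.

```python
import pandas as pd

# Toy stand-ins for the real sources (Salesforce, Shopify, country_goals).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["DE", "FR", "IT"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 4],   # customer 4 is missing from the CRM
    "revenue": [100.0, 250.0, 80.0, 500.0],
    "valid": [True, True, False, True],
})
goals = pd.DataFrame({
    "country": ["DE", "FR", "IT"],
    "week": ["2024-W01"] * 3,
    "target_revenue": [1000.0, 900.0, 700.0],
})

# Step 1: INNER JOIN silently drops orders without a matching customer.
joined = orders.merge(customers, on="customer_id", how="inner")

# Step 2: attach weekly goals, keeping orders even when no goal row exists.
joined["week"] = "2024-W01"  # illustrative; real code derives this from order dates
joined = joined.merge(goals, on=["country", "week"], how="left")

# Step 3: filter invalid orders, then aggregate net revenue per country.
net = (joined[joined["valid"]]
       .groupby("country", as_index=False)["revenue"].sum())

print(net)
```

Note that order 13 vanishes between `orders` and `joined` with no error raised, which is exactly the failure mode the rest of this post is about.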
Here's the killer: the CRM sync lags, so 5% of customer_ids (600k orders, €15M in revenue) exist in Shopify but are missing from Salesforce. The INNER JOIN silently drops them. No error. The filter passes the remaining rows. The aggregation completes with shrunken EU totals. Execs see "EU sales down 5%" and freeze hiring; inventory rots; the AI forecast declines. I've lived this nightmare many times: a manual LEFT JOIN postmortem reveals the gap, the CRM team scrambles, and pipelines pause.
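That postmortem query can be automated up front. As a sketch, pandas' `indicator=True` option on `merge` labels each order as matched or unmatched, so the drop is quantified before the INNER JOIN ever runs (the DataFrame names and values here are toy assumptions):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 4],  # customer 4 never synced from the CRM
    "revenue": [100.0, 250.0, 80.0, 500.0],
})

# indicator=True adds a _merge column: 'both' (matched) or 'left_only' (no CRM match)
probe = orders.merge(customers, on="customer_id", how="left", indicator=True)
dropped = probe[probe["_merge"] == "left_only"]

drop_pct = len(dropped) / len(orders) * 100
lost_revenue = dropped["revenue"].sum()
print(f"{drop_pct:.1f}% of orders (€{lost_revenue:.0f}) would be lost by an INNER JOIN")
```

The same probe works at warehouse scale as a LEFT JOIN with a `WHERE customers.customer_id IS NULL` count, so it costs one aggregate query rather than a row-by-row audit.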
Industry stats confirm it. Data teams lose 30-50% of their time to quality fires. Monte Carlo finds that 80% of pipelines hide schema drifts. Gartner tallies $15M in annual losses per firm from bad data. The syntax tests stay green while the semantics vanish.
The fix: Automated Data Tests (ADT), light and simple checks after every step, automated in CI/CD or as post-run hooks. They verify that invariants are preserved: totals before and after each step, EU and country-level sums. And they work on aggregates for scale, with no row-by-row scans.
It seems straightforward until reality hits. Where exactly do you wire these tests in (Airflow DAGs? dbt hooks?)? How do you maintain them as schemas evolve? Most crucially, how do you react to failures? In practice, flipping on ADT floods you with alerts for edge cases you'd ignored forever.
The key? Budgets and tiers. Set tolerances: a total drop under 0.5% is green (not worth spending time on); 0.5-1% is yellow (monitor, notify Slack); over 1% is red (halt the pipeline, page on-call). Now errors surface predictably: you triage once, fix the root cause (the CRM sync cron), and reclaim days. No more "always-broken" surprises. We'll detail budgets, tooling, and reactions in the coming posts.
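The tiering above fits in a few lines of Python. A minimal sketch, where the function name, default thresholds, and the baseline-vs-pipeline comparison are illustrative choices rather than a fixed API:

```python
def adt_status(baseline_total: float, pipeline_total: float,
               yellow_pct: float = 0.5, red_pct: float = 1.0) -> str:
    """Classify a total-revenue invariant check into green/yellow/red tiers.

    Thresholds mirror the budgets described above; tune them per pipeline.
    """
    if baseline_total == 0:
        return "red"  # no baseline means we cannot trust the run
    drop_pct = (baseline_total - pipeline_total) / baseline_total * 100
    if drop_pct > red_pct:
        return "red"      # halt the pipeline, page on-call
    if drop_pct > yellow_pct:
        return "yellow"   # keep running, notify Slack
    return "green"        # within budget, no action

print(adt_status(1_000_000, 997_000))  # 0.3% drop -> green
print(adt_status(1_000_000, 992_000))  # 0.8% drop -> yellow
print(adt_status(1_000_000, 950_000))  # 5.0% drop -> red
```

The point of the function is that the budget lives in one place: tightening a tolerance is a one-line config change, not a hunt through ad-hoc assertions.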
Industry Standards: Where They Fall Short, What Works
Everyone talks about data quality tools now; it's 2026's hot topic. Open source leads the way: Great Expectations (Python/SQL expectations, extensive docs), dbt tests (warehouse-native, generic row-count and sum checks), Soda Core (YAML-defined SQL checks, anomaly detection), and Deequ (Spark-scale metrics). All are pipeline-embedded and CI/CD friendly.
On the enterprise side: Monte Carlo and Anomalo (ML-driven observability with root-cause lineage, no rules needed), plus Soda Cloud and Bigeye for SLAs. ETL giants like Talend and Informatica bake quality checks in.
| Tool | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Great Expectations | Expressive Python/SQL, profiling | Setup overhead | Pandas/Spark pipelines |
| dbt Tests | Embedded in transformations | Warehouse-only | SQL-heavy ELT |
| Soda Core | Lightweight SQL, anomalies | Less custom logic | Airflow/dbt integrations |
| Deequ | Big data scale (Spark) | Scala learning curve | Lakehouses |
| Monte Carlo | Auto-detection, lineage | SaaS cost | Enterprise observability |
These tools excel at schemas, freshness checks, and value distributions. However, they often fall short on transformation integrity: the critical need to ensure totals and group-by sums remain preserved across pipeline steps. Great Expectations offers expect_column_sum_to_be_between for part of this. dbt supports custom generic tests that can assert aggregates such as row counts and sums. Many teams layer complementary approaches: dbt handles model transformations while Soda validates source data.
The best data teams follow clear practices. They shift testing left into CI/CD pipelines for early detection. They prioritize aggregates over exhaustive row-by-row checks for scale. They build in tolerances for floating-point calculations. They layer observability tools that catch anomalies after deployment. Even so, surveys show 60% of teams still resort to manual SQL queries for debugging.
Automated Data Tests fill this gap perfectly. You can embed them in raw Pandas scripts, dbt macros, or Spark jobs. They require zero external tooling to start. Their simplicity makes them instantly actionable across any stack.
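To make "zero external tooling to start" concrete, here is a minimal sketch of an invariant check you could drop between any two pandas steps. The helper name, the default tolerance, and the toy data are assumptions for illustration:

```python
import pandas as pd

def check_invariant(name: str, before: float, after: float,
                    tolerance_pct: float = 0.5) -> None:
    """Fail loudly if an aggregate drifts beyond the tolerance between steps."""
    if before == 0:
        raise ValueError(f"ADT {name}: baseline is zero, cannot compare")
    drift_pct = abs(before - after) / abs(before) * 100
    if drift_pct > tolerance_pct:
        raise AssertionError(
            f"ADT {name}: drift {drift_pct:.2f}% exceeds {tolerance_pct}% budget"
        )

orders = pd.DataFrame({"customer_id": [1, 2, 4], "revenue": [100.0, 250.0, 500.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 4]})  # fully synced in this toy case

before = orders["revenue"].sum()
joined = orders.merge(customers, on="customer_id", how="inner")
after = joined["revenue"].sum()

check_invariant("net_revenue_after_join", before, after)  # passes: no rows dropped
```

Because the check compares two scalars, it costs one aggregation per step regardless of row count, which is what makes it viable at billions of rows.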
What's Next
This post lays the foundation, but the real power comes when we build it. In the next post, I'll walk through Python implementations using Pandas: dead-simple validators that run on our sales pipeline example, plus a survey of tools like Great Expectations and Pandera.
The third post tackles SQL and dbt for production scale, with concrete queries and test patterns that handle billions of rows.
Start experimenting with Automated Data Tests today. Your pipelines—and your sleep—will improve immediately.
What pipeline failures have you debugged the hard way? Share in the comments.
Written by Egor Burlakov
Engineering and Science Leader with experience building scalable data infrastructure, data pipelines and science applications. Sharing insights about data tools, architecture patterns, and best practices.