Alertee LogoAlertee
Back to Blog
monitoringengineering

The data quality failure your users notice before you do

The import stops writing, the dashboards keep loading yesterday's numbers, and nothing fires — because only a check run from outside the pipeline can see that the data simply didn't arrive.

ByNiallNiall

Our Dagster instance broke while our founder was on holiday, and nobody caught it until the next day. The app stayed up. The dashboards kept loading. They just loaded yesterday's numbers, and yesterday's, because the import that should have refreshed them had stopped writing rows. A user noticed before we did. Everything we'd built to catch bad data was sitting inside Dagster, so when Dagster went quiet, the checks went quiet with it.

This is the failure most people have in mind when they decide to monitor data quality, and most of the advice points at the wrong place. The usual answer is two things: constraints on your tables, and tests in your pipeline. Both are worth having. Neither would have caught our outage, and the reason is structural, not a matter of writing better tests.

Checks fall into three groups by where they run: as a row is written, as the pipeline transforms it, and continuously against the table the data lands in. The first two live inside the system that produces your data. The third runs outside it, on its own clock. The failures that actually reach users — a stale import, a missing nightly load, a region that dropped off — produce no bad row to reject and trip no transform to fail. The data is simply absent, and only a check that reasons about what should be present, run from outside the pipeline, ever sees it.

This is about the database your application writes to: production Postgres or ClickHouse. For warehouse platforms like Snowflake or BigQuery, the data quality tools comparison goes vendor by vendor. For the same split walked through a payments incident moment by moment, see the data quality monitoring pillar.

Layer 1: write-time constraints

The cheapest layer is the one already installed: your database. A NOT NULL, a foreign key, a CHECK, a unique index — these reject a malformed row at the moment it's inserted, cost nothing to run, and never drift out of date.

-- A refund can't exceed the original charge
ALTER TABLE refunds
  ADD CONSTRAINT refund_within_charge
  CHECK (amount <= original_amount);

-- Every order belongs to a real customer
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id);

-- Status has to be one of the values the app knows about
ALTER TABLE orders
  ADD CONSTRAINT orders_status_valid
  CHECK (status IN ('placed', 'shipped', 'completed', 'returned'));

Add these first, whatever else you do. One practical note: adding a constraint to a table that already holds violating rows fails — which is itself a useful discovery. Postgres lets you add a CHECK or foreign key as NOT VALID so it's enforced for new writes immediately, then VALIDATE CONSTRAINT once you've cleaned up the history. ClickHouse is deliberately weaker here: it accepts CONSTRAINT ... CHECK on insert but enforces neither foreign keys nor uniqueness, so more of the checking moves downstream.

The limit is the whole point of this layer. A write-time constraint can only judge a row that arrives. The import that wrote zero rows violated nothing, because there was no row to violate anything. Absence is invisible at write time, by construction.

Layer 2: transform-time tests

The second layer asserts your assumptions as data moves through the pipeline. If your data flows through dbt, its tests are the cheapest coverage you'll add this year. The generic ones — unique, not_null, accepted_values, relationships — are a few lines of YAML next to the model. Singular tests are plain SQL files: a select that returns the rows that disprove your assertion, and the test fails if it returns any. Great Expectations is the heavyweight open-source version of the same idea — a Python framework with a large library of assertions about null rates, value ranges, and distributions — at the cost of hosting and scheduling it yourself.

This layer catches a real class of bug: a transform that starts producing wrong values, a source whose shape changed, a join that began dropping rows. Catch it in CI before it ships and you've earned the test cost back many times over.

But look at when it runs. A dbt test runs when dbt runs. A source can break a minute after the morning run, and nothing looks again until tomorrow's. Worse for our case: these tests live inside the orchestrator. When Dagster died, the tests died with it — there was no run, so there was no failing run, so nothing fired. And dbt only ever sees tables it touches; the table your backend writes to directly was never in the DAG to begin with. A test that runs inside the pipeline cannot report that the pipeline didn't run.

That's the structural gap. Layers 1 and 2 both sit inside the system that produces your data. When that system stops, they stop, silently, and the failure that reaches your users is precisely the one neither layer is positioned to see — not a bad row, but a missing one.

Layer 3: continuous checks against the landed table

The third layer is a query that runs against the production table on a schedule of its own, independent of any pipeline. It doesn't ask "is this row correct?" It asks: did the data that should be here actually arrive — fresh, and complete? That question has an answer even when the import never ran, because the check reasons about expected presence, not about rows in front of it.

The honest incumbent here is a cron job: a script runs a query, compares the result to a threshold, posts to Slack. It's free and it's entirely yours, and for the first handful of checks it's the right call. What erodes is everything around the query — checks scatter across repos until nobody can list what's monitored, there's no history, and alerts land in a channel that belongs to everyone and so to no one. We build Alertee for exactly this layer, so that's where our interest lies — but every check below runs equally well from psql or a cron line, and the layer matters more than the tool.

Three checks, each tied to a failure we've actually hit.

Stale import: the data stopped updating

This is our holiday outage. The import didn't error; it just stopped writing. The table was still there, still queryable, still full of last week's rows. No constraint cares — the rows it holds are all valid. No transform cares — nothing transformed them, because nothing ran. The only signal is the timestamp on the freshest row, asked from outside the import job:

-- How old is the newest row in the table the import feeds?
SELECT MAX(updated_at) AS freshest_row
FROM listings;
-- Alert if freshest_row < NOW() - INTERVAL '25 hours'

The threshold is one expected cadence plus a margin. A nightly import gets 25 hours, not 24, so a run that starts a little late doesn't trip it. The check needs no knowledge of why the data went stale — broken auth upstream, a dead scheduler, a paused job — only that the freshest row is older than it should be. Run it hourly, and a stale import surfaces the morning it happens instead of after a user emails.

Missing import: today's load wrote nothing

A freshness check catches a table that fell behind. But some loads are append-only batches where the timestamp logic is fiddlier, and what you really want to assert is simpler: the batch that runs every day actually landed today. The failure here is a job that finished green — exit code zero, success ping, dashboard cheerful — while writing zero rows, because an upstream API quietly returned an empty page.

-- Did today's import write any rows at all?
SELECT COUNT(*) AS rows_today
FROM daily_orders_import
WHERE loaded_at >= CURRENT_DATE;
-- Alert if rows_today = 0 after 07:00

The after 07:00 matters: before the import's scheduled window, zero rows is correct, not an incident. This is the failure write-time and transform-time checks are most blind to. There is nothing malformed to reject, no transform to fail — there is just a count that should be in the thousands and is zero. Only a query that knows a number should be there catches it.

Dropped region: a slice that's always present went silent

The subtlest one. The import ran. It wrote plenty of rows. Totals look normal. But after a deploy, one region stopped reporting — its events route through a config that broke — and the rows for every other region papered over the gap. A row count against the whole table passes. A region-by-region check doesn't:

-- Which regions that normally report had no activity today?
WITH expected AS (
  -- regions that appeared on every one of the last 14 days
  SELECT region
  FROM orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '14 days'
    AND created_at <  CURRENT_DATE
  GROUP BY region
  HAVING COUNT(DISTINCT created_at::date) = 14
),
today AS (
  SELECT DISTINCT region
  FROM orders
  WHERE created_at >= CURRENT_DATE
)
SELECT e.region
FROM expected e
LEFT JOIN today t ON t.region = e.region
WHERE t.region IS NULL;
-- Alert if any rows come back

The trick is deriving "what should be present" from the data's own recent history rather than hard-coding a region list that goes stale the day you launch a new one. A region that showed up every day for two weeks and is absent today is the kind of fact you'd never write a constraint for — there's no rule that region X must exist, only an observed regularity that just broke. The same shape works for a tenant that went quiet or an integration that stopped syncing; Alertee runs these as one check that reports per region, so a single alert tells you which slice dropped, not just that a total moved.

Placing the tools you already use

The useful exercise is to take every check you run today and drop it into one of the three layers, then ask which failure class is left uncovered.

LayerToolsCatchesStructurally can't see
Write-time constraintsNOT NULL, foreign keys, CHECK, unique indexes; Zod / Pydantic at the app boundaryMalformed, contradictory, or duplicate rows, at the moment of insertA row that never arrives; data going stale; a slice dropping out
Transform-time testsdbt tests, Great ExpectationsTransform bugs and broken source assumptions, each time the pipeline runsAnything between runs; tables the pipeline doesn't touch; the orchestrator itself dying
Continuous production checksCron + SQL, AlerteeImports that didn't run, stale tables, missing batches, regions and tenants gone quietOnly what someone wrote a check for

If your data quality story is constraints plus dbt tests, you have the first two layers and a real hole under them. Both run inside the pipeline that produces your data, so both go dark in exactly the failure that matters most: the one where the pipeline stops. The row that never arrived isn't there to be rejected or transformed. The third layer is the only one that compares the data that should be there against what landed — and it's the one most teams discover they're missing the morning a user finds the gap first.

If you can finish the sentence "alert me when ___" — the import didn't write rows, last night's data is stale, a region went quiet — those are layer-three checks, and they're plain SQL you can run anywhere. A cron job is the honest floor. When the checks outgrow what one person can hold in their head, connecting a Postgres or ClickHouse database and turning one check on is what we built Alertee for: the same SQL, scheduled, owned, and kept out of the noise. If you need lineage or learned anomaly models across a warehouse, that's a different aisle — the data quality tools comparison weighs those.