Data quality monitoring: catching the failures your checks ran too early to see

A Friday deploy changed how a webhook handler parsed Stripe's payload. The handler kept returning 200 — Stripe was satisfied, so no retries fired — and the insert into payments quietly stopped. Six hours of successful charges were never recorded. Support found out when a customer asked why their receipt hadn't arrived.

The team had checks. Constraints on the tables, dbt tests on the models, schema assertions in the pipeline. None of them fired, and most of them couldn't have. Every one of them does the same job: it checks that the rows which arrive are correct. This failure wasn't an incorrect row. It was a row that should have existed and didn't — and a check that waits for a row to inspect has nothing to inspect when the row never comes.

That gap is the whole problem. A check that validates a value asks is this row right? The question nobody was asking was did the row that should be here show up at all? — and that question only gets answered if something runs continuously against production, on its own schedule, independent of whatever pipeline was supposed to write the data. This post walks the checks you already run, shows why each one was blind to the payments gap, and ends on the query that would have caught it in about twenty minutes.

It helps to place every check at one of three moments: when a row is written, when a pipeline transforms it, or continuously while the data sits in production. The moment decides what a check can see. The payments incident was invisible at the first two moments — that's why it ran for six hours, and why the third moment is the one that catches it.

This post covers the database your application writes to: production Postgres or ClickHouse. If you're weighing warehouse platforms like Monte Carlo, Metaplane, or Bigeye for Snowflake or BigQuery, the data quality tools comparison goes vendor by vendor instead.

Checks that run when a row is written

A check that judges a row at the moment it's inserted can only judge rows that show up. This is the cheapest coverage to add, and the broken webhook handler is exactly the failure it can't see, because it produced no row to judge.

Start at the application boundary. Zod and Pydantic validate a request before it becomes a row — wrong shape, missing field, bad type. They're cheap and you should use them. But once a request passes and the row is written, they're done. They can't notice that requests stopped arriving.

The most underused check is the one already installed: your database. Postgres constraints reject contradictory rows at insert, cost nothing to run, and never drift out of date:

-- A refund can't exceed the original charge
ALTER TABLE refunds
  ADD CONSTRAINT refund_within_charge
  CHECK (amount <= original_amount);

-- Every subscription belongs to a real customer
ALTER TABLE subscriptions
  ADD CONSTRAINT subscriptions_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id);

-- At most one active subscription per customer per product
CREATE UNIQUE INDEX one_active_subscription
  ON subscriptions (customer_id, product_id)
  WHERE status = 'active';

One practical detail: adding a constraint to a table that already contains violating rows fails — which is itself a useful discovery. Postgres lets you add a CHECK or foreign key as NOT VALID, so it's enforced for new writes immediately, then run VALIDATE CONSTRAINT once you've cleaned up the history.

ClickHouse is deliberately weaker here. It accepts CONSTRAINT ... CHECK on insert but enforces neither foreign keys nor uniqueness — it's built for ingest speed, and deduplication through engines like ReplacingMergeTree is eventual, not guaranteed. If ClickHouse holds your analytics data, more of the checking moves downstream.

The limit is the same everywhere: a write-time check can only judge a row that arrives. The handler violated nothing — the bad outcome was a row that didn't exist. Absence, staleness, and volume drops are invisible at write time, full stop. This is the first place monitoring departs from the checks you have, and it never comes back.

Checks that run when the pipeline runs

Checks wired into the pipeline run when the pipeline runs. Between runs, and on tables the pipeline never touches, nothing is looking.

If your data flows through dbt, its tests are the cheapest coverage you'll add this year. Generic tests (unique, not_null, accepted_values, relationships) are a few lines of YAML next to the model; singular tests are plain SQL files that fail when they return rows. They version with your code and run in CI — exactly the right place to catch a transform-logic bug before it ships.

The catch is the moment. dbt tests run when dbt runs. A source can break minutes after the morning run and nothing looks again until the next one. And tables dbt never touches — the application database your backend writes to directly, like that payments table — never get tested at all.

Great Expectations is the heavyweight open-source option: a Python framework with a large library of assertions covering null rates, value ranges, distributions, and schema shape. The cost is operational. GX Core is a framework, not a service — you write Python to configure it, host it, schedule it with an orchestrator, and wire failures into alerting yourself. GX Cloud removes the hosting in exchange for commercial pricing. With a data platform team, the library earns its weight; without one, standing it up is most of the project.

(If what you want to test is the schema and functions themselves rather than the data in them, pgTAP does that for Postgres in CI. Adjacent, but a different job.)

Both dbt and GX run only when something invokes them. If the orchestrator itself dies, the checks die with it, and nothing says so. So the second moment passes over the payments gap too — the handler wrote to a table no pipeline transformed, and even the tables that are tested are only tested when a run fires.

Checks that run against production on their own schedule

This is the moment the whole post builds to. A query runs every few minutes against production, independent of any pipeline — the only place a row that never arrived can be noticed. The handler was up, no transform was involved, and the failure was rows not appearing; only something looking at the table itself, on its own schedule, had any chance. This isn't a stricter version of your existing checks. It's something outside the pipeline whose job is to notice absence.

The incumbent here is a cron job — a script runs a query, compares the result to a threshold, posts to Slack. It's free, it's entirely yours, and it works. What erodes is everything around the query: checks scatter across repos until nobody can list what's monitored, there's no history of past failures, and alerts land in a channel where they belong to everyone, which means no one. The query was never the hard part. Operating it is.

Soda covers this with Soda Core (open source — a CLI you still schedule yourself) and Soda Cloud (commercial, where scheduling and alerting live). Checks are written in SodaCL, a YAML language that compiles to SQL, and the product leans toward contracts, governance, and audit trails. If showing auditors what's monitored is part of your job, that's a real fit. If you just want to know when payments stop being recorded, SodaCL is one more syntax for something SQL already says.

This is the moment we built Alertee for — it's our product, so take this section knowing that. It monitors production Postgres and ClickHouse: you describe the outcome that should keep happening in plain English and review the SQL it generates against your schema — ordinary SQL you can edit or replace. When a check fails, one inbox item goes to a specific person. Classifying each incident as real, transient, expected, or noisy tunes the check over time, which matters because monitoring setups don't usually die from a missed failure — they die from a team that learned to skim past the channel. What it isn't: no lineage, no learned anomaly models, no Snowflake or BigQuery.

The payments incident as a monitoring check

Here's the query that turns six silent hours into about twenty minutes, built up in three stages, because the first version you'd write has a problem worth understanding.

The naive check:

SELECT COUNT(*) AS payments_last_hour
FROM payments
WHERE created_at > NOW() - INTERVAL '1 hour';
-- Alert if 0

This fires every quiet night. At 3am on a Sunday, zero payments in an hour may be completely normal for your volume. Widen the window until it stops false-alarming and you've also widened your detection delay to most of a day. The fix is to stop comparing against a fixed threshold and compare against your own past — was this hour quiet last week too?

SELECT
  COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '1 hour')
    AS this_hour,
  COUNT(*) FILTER (WHERE created_at BETWEEN NOW() - INTERVAL '7 days 1 hour'
                                        AND NOW() - INTERVAL '7 days')
    AS same_hour_last_week
FROM payments
WHERE created_at > NOW() - INTERVAL '8 days'
   OR created_at > NOW() - INTERVAL '1 hour';
-- Alert if this_hour = 0 AND same_hour_last_week > 0

That's already usable. But for payments specifically there's a stronger version, because you have a second source of truth: the provider's webhooks. If you log webhook receipts (most apps do, or should — it's one insert in the handler before any business logic), you can reconcile the provider's view against your own tables:

-- Charges the provider confirmed that never became payment rows
SELECT w.external_id, w.received_at
FROM webhook_events w
LEFT JOIN payments p ON p.external_id = w.external_id
WHERE w.event_type = 'charge.succeeded'
  AND w.received_at BETWEEN NOW() - INTERVAL '24 hours'
                        AND NOW() - INTERVAL '10 minutes'
  AND p.id IS NULL;
-- Alert if any rows come back

Three tuning notes from running checks shaped like this. The 10-minute exclusion at the recent edge keeps you from racing your own handler — a charge received 30 seconds ago may simply not be processed yet. Run it every 5–15 minutes; payment gaps are the kind of failure where an hour of latency is expensive. And require two consecutive failures before alerting, which absorbs deploys and brief queue backups for the price of a few minutes' delay.

This is the whole argument in one check. It asserts a relationship between two tables — "every confirmed charge has a payment row" — which a constraint can't state (the missing row isn't there to constrain), and a dbt or GX test would only evaluate whenever the pipeline next ran, if those tables are in the pipeline at all. That's the line drawn in SQL: a value check confirms the rows you have are correct, this check looks for the row you should have and don't. Run from the outside, on a schedule, it would have caught the gap in about twenty minutes. The check didn't change; the moment did.

All three queries run anywhere: psql, a cron job, or a tool like Alertee that handles the scheduling, history, and ownership.

The three moments side by side

When it runs	Tools	What it catches	What it can't see
At write time	Postgres constraints, ClickHouse `CHECK`, Zod / Pydantic at the app boundary	Malformed, contradictory, or duplicate rows, at the moment of insert	Rows that never arrive; data going stale; volume drops
At transform time	dbt tests, Great Expectations	Transform bugs and broken source assumptions, each time the pipeline runs	Anything between runs; tables the pipeline doesn't touch; the orchestrator itself dying
Continuously in production	Cron + SQL scripts, Soda, Alertee	Outcomes that stop happening: unrecorded payments, zero-row imports, stale tables, quiet tenants	Only what someone has written a check for

Only the third moment catches an outcome that stopped happening.

Where we'd start

Add the constraints first, whatever else you do. They're free, they need no scheduler, and a CHECK added today prevents a class of bad rows forever. If your data flows through dbt, enable the generic tests next — a few lines of YAML, run in CI, no reason not to. But know what you've covered: both check the rows you have, and neither watches production between runs.

The payments incident lived in the third moment, and that's the one monitoring exists for. A cron job is the honest zero-cost option there, and it holds up while checks are few and someone genuinely owns the Slack channel. Great Expectations fits when you already have orchestration running and Python-defined assertion suites earn their operational weight; GX Cloud trades the hosting for commercial pricing. Soda fits when governance and audit trails are part of the requirement, not just detection. And if your checks are plain SQL assertions about outcomes users would notice — payments recorded, imports wrote rows, no tenant gone quiet — use Alertee: that's exactly what we built it for, every check is SQL you can read and edit, and the scheduling, history, and incident ownership are handled.

If you need lineage or learned anomaly models, you've outgrown this post's frame — the data quality tools comparison weighs the platforms that do that.

Whichever tools you pick, the test of your monitoring is the one this post ran: for each moment a failure can happen, something is looking at that moment. The payments incident wasn't missed because the team had no checks. It was missed because everything they had ran at the wrong time — and nothing was watching production for the row that never came.