Alertee LogoAlertee
Back to Blog
monitoringengineering

Why your zero-row alert fires every Saturday: tuning the SQL that catches silent data failures

A continuous data check is only as good as how it handles expected absence — here's the tuning that separates a check that pages on the real failure from one your team mutes.

ByNiallNiall

The first continuous data check most teams write is a COUNT(*). It works the day they write it. Then it pages at 06:00 on a Saturday because the source doesn't run on weekends, again on the morning after a deploy because a backfill re-stamped yesterday's rows, and a third time when a small tenant that's bursty by nature happens to send nothing for a day. Three false alarms in a fortnight, and the check gets muted. Two weeks later the import genuinely writes zero rows, and nobody is watching the channel anymore.

That failure — a real outcome stopping while the service stays up — is the gap we built Alertee for, and the continuous monitoring walkthrough makes the case for where a check has to run to see it: outside the pipeline, on its own clock. This post is about the part that comes after you've put a check there and it's misbehaving. A query that detects absence has a problem a query that validates a value never has: most of the time the thing it's looking for is legitimately missing, and the whole skill is telling "missing because the load broke" apart from "missing because nothing was supposed to happen." Get that wrong in either direction and the check is useless — too eager and it gets muted, too forgiving and it sleeps through the incident.

So this isn't a catalogue of checks to paste in. Your tables, entities, cadence, thresholds, and business semantics aren't ours, and a check that fits us would misfire for you. It's five failure modes used to work through five distinct tuning problems — the clock guard, the baseline, the per-entity threshold, the rate-of-change signal, and the deploy window. The SQL is illustrative; the reasoning is the transferable part. Every query here runs as well from psql and a cron line as it does from our own product.

The clock guard: an import that writes zero rows

The failure. A nightly import finishes green — exit code zero, success ping, dashboard cheerful — while writing zero rows, because an upstream API changed its auth or returned an empty page and the script swallowed the error. The table is still queryable, still full of last week's rows. A cron healthcheck confirms the job ran; it says nothing about what the job did, so the only honest signal is a count of what landed in the load window:

-- Did today's load write any rows at all?
SELECT COUNT(*) AS rows_today
FROM daily_orders_import
WHERE loaded_at >= CURRENT_DATE;

The tuning problem: zero is the correct answer most of the day. A naive "alert if rows_today = 0" fires every night before the import has run, because before the window opens, zero rows is not an incident — it's the expected state. The whole check turns on a clock guard: only evaluate the condition after the window should have closed (after 07:00, or whenever yours does). That single guard is the difference between a check that's right twice a day and one that's right always. Two more guards follow from the same logic. A manual backfill re-runs the load and shifts loaded_at, which a naive check can misread as a second missing window — a short grace period absorbs that. And if the source itself is quiet on weekends, comparing against the same weekday a week ago instead of against zero teaches the check the difference between "no data because the load broke" and "no data because nothing happened."

The baseline: a tenant goes quiet under a healthy total

The failure. The import ran. Plenty of rows landed. The daily total sits comfortably in its usual band. But one tenant — often a large, important one — stopped producing rows after a config change or a broken integration on their side, and every other tenant's volume papers over the gap. A whole-table count passes. The customer who went dark is the one who calls.

The tuning problem: there's no rule to assert, only a regularity that broke. You can't write a constraint that says tenant X must report today — you'd be maintaining that list against every signup and churn. The tunable move is to derive the expected set from the data's own recent history, then diff today against it:

-- Tenants that reported every day for two weeks but are silent today
WITH expected AS (
  SELECT tenant_id
  FROM events
  WHERE created_at >= CURRENT_DATE - INTERVAL '14 days'
    AND created_at <  CURRENT_DATE
  GROUP BY tenant_id
  HAVING COUNT(DISTINCT created_at::date) = 14
),
today AS (
  SELECT DISTINCT tenant_id
  FROM events
  WHERE created_at >= CURRENT_DATE
)
SELECT e.tenant_id
FROM expected e
LEFT JOIN today t ON t.tenant_id = e.tenant_id
WHERE t.tenant_id IS NULL;

The knob is the HAVING clause — the bar for "normally present." Set it to every one of the last 14 days and you only watch tenants reliable enough that silence is a real signal; the bursty long tail, which legitimately skips days, falls below the bar and never pages you. Loosen it and you trade quiet for coverage. There's no universal right setting, which is exactly why this can't be a copy-paste check: the bar is a statement about your tenants' normal behaviour. The same expected/today diff is the reusable part — swap tenant_id for integration_id, account_id, or source and you're watching whatever "is always there until it isn't" applies to your business.

The rate-of-change signal: a queue stops draining

The failure. A worker is online — it shows in your process monitor, it's heartbeating, a liveness probe is happy — but it's no longer making progress. A poison message, a deadlock, a downstream dependency timing out, and items pile up. The distinction that matters isn't "is the consumer alive" but "is the backlog shrinking," and only the second one matters to the user waiting on their export.

The tuning problem: depth alone fires on healthy bursts. The obvious check — alert when pending exceeds N — pages every time a legitimate burst of work arrives that a healthy worker will clear in minutes. The signal that actually means stalled is the age of the oldest waiting item, because a steadily climbing oldest-wait means nothing is leaving the front of the queue:

-- Backlog size and the age of the oldest item still pending
SELECT
  COUNT(*)                 AS pending,
  MAX(NOW() - enqueued_at) AS oldest_wait
FROM job_queue
WHERE status = 'pending';

Trigger on oldest_wait against your real SLA, not a round number, and depth becomes context rather than the alarm. To catch a slow leak as well as a hard stall, compare the backlog against itself an hour ago and two hours ago — a window function over periodic depth snapshots, or a small table you append readings to — and alert on a monotonic climb. The deploy guard here is a rolling restart that briefly pauses consumption: require the condition to hold across two consecutive checks so a momentary pause doesn't page anyone.

Reconciling two ledgers: payments succeed but no rows insert

The failure. A deploy changes how a webhook handler parses the provider's payload. The handler still returns 200, so the provider is satisfied and never retries, and the insert into payments quietly stops while charges keep succeeding on the provider's side. Your table stops growing. The first person to notice is a customer asking where their receipt went. No write-time constraint helps — the missing row isn't there to violate anything — and no aggregate looks wrong yet.

The tuning problem: you need a second source of truth, and a margin for your own latency. When a provider confirms an event you also record locally, you have a ledger to check yourself against. If you log webhook receipts — one insert before any business logic — you can ask which confirmed charges never became payment rows:

-- Confirmed charges that never became local payment rows
SELECT w.external_id, w.received_at
FROM webhook_events w
LEFT JOIN payments p ON p.external_id = w.external_id
WHERE w.event_type = 'charge.succeeded'
  AND w.received_at BETWEEN NOW() - INTERVAL '24 hours'
                        AND NOW() - INTERVAL '10 minutes'
  AND p.id IS NULL;

The tunable edge is that NOW() - INTERVAL '10 minutes' exclusion. Without it the check races your own handler: a charge received 30 seconds ago hasn't been processed yet, so it shows up as "missing" and you page yourself for a row that's about to arrive. The exclusion sets a margin wider than your normal processing latency. Then require two consecutive failures before alerting, which absorbs a deploy or a brief queue backup for the price of a few minutes' delay on a real gap. If you don't log webhook receipts, the weaker fallback is a volume comparison against the same hour last week — useful, but it can only tell you a total moved, not which charges went missing.

The deploy window: a region goes silent

The failure. After a release, traffic from one region stops arriving — its events route through a config or an edge that broke — and every other region's rows keep the total looking normal. Structurally it's the quiet tenant again: a slice that's always present went silent, derive the expected set and diff:

-- Regions present every day for two weeks but absent today
WITH expected AS (
  SELECT region
  FROM orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '14 days'
    AND created_at <  CURRENT_DATE
  GROUP BY region
  HAVING COUNT(DISTINCT created_at::date) = 14
),
today AS (
  SELECT DISTINCT region
  FROM orders
  WHERE created_at >= CURRENT_DATE
)
SELECT e.region
FROM expected e
LEFT JOIN today t ON t.region = e.region
WHERE t.region IS NULL;

The tuning problem: cadence, not threshold. The baseline logic is identical to the tenant check, so the new lever is when you run it. A region failure caused by a release is found cheaply if you check minutes after a deploy and expensively if you wait for the daily run — a region silent for ten minutes is a config rollback; a region silent overnight is a morning of lost orders and an apology. So this is the check worth running on a tighter cadence right after a deploy than on its normal daily schedule. The false positive to watch is a genuinely low-traffic region that has natural gaps; raise the HAVING bar or widen the window past 14 days to drop it below the line. On ClickHouse, where this slice-by-slice scan is cheap, running it often costs almost nothing.

The five problems, side by side

The checks differ, but the work doesn't — and these levers transfer to failure modes we didn't list:

Failure modeThe naive check's failureThe lever that fixes it
Import writes zero rowsFires before the load has runClock guard on the window; weekday-over-weekday for quiet sources
Tenant goes quietTotal papers over one entityPer-entity baseline derived from recent history; tune the HAVING bar
Queue stops drainingDepth fires on healthy burstsTrigger on oldest-item age and rate of change, not raw depth
Payments succeed, no rowsNo bad row, no aggregate moves yetReconcile against a second ledger; exclusion margin for your own latency
Region goes silentDaily run finds it hours lateSame baseline, tighter cadence right after a deploy

Every one of these is the same underlying mistake: a check that treats absence as binary, when absence is the normal state for most of the day, for the long tail, between bursts, and on weekends. The tuning is teaching the check the shape of normal absence so the only thing left that fires is abnormal absence. That's the part a SELECT doesn't capture, and it's why a finished check copied from a blog post — including this one — will misfire against your data until you've fitted these knobs to it.

For the prior question — where a check has to run to see absence at all, versus a write-time constraint or a dbt test that's structurally blind to it — the continuous monitoring walkthrough lays out the three layers, and the data quality monitoring pillar walks one payments incident through them moment by moment. If you're choosing what to run these in, the data quality tools comparison goes tool by tool.

The honest floor for all of this is a cron job: a script runs the query, applies the threshold and the guards above, posts to Slack. It's free and it's yours, and for the first handful of checks it's the right call. What erodes is the tuning itself — the guards live in scripts nobody revisits, there's no record of which alerts were real, and a muted channel can't tell you the day it stops being safe to mute. We built Alertee so that classifying a fired check as real, transient, expected, or noisy feeds back into the check instead of into a person's memory of which alarms to ignore — and so the per-entity, multi-dimensional version runs as one check that reports which slice dropped, not a query per tenant. When the tuning is the part that's hurting, connect a Postgres or ClickHouse database and turn one check on. The queries are the easy part; keeping them tuned and out of the noise is the part worth not doing by hand.