Alertee LogoAlertee
Back to Blog
monitoringengineering

Databricks data quality monitoring: the silent failures the lakehouse never errors on

A MERGE that upserted nothing, a stream that fell behind, a column schema evolution dropped — none error, and here are the Spark SQL checks that catch each one.

ByNiallNiall

A Delta MERGE ran every hour, returned success, and upserted nothing. The source frame had changed its key column during a refactor, so the ON clause matched no target rows and the WHEN NOT MATCHED branch — gated by a condition that was now never true — inserted none either. The job's operationMetrics recorded numTargetRowsInserted and numTargetRowsUpdated at zero, the run logged green, and the table kept answering every query with last hour's rows. Nobody reads operationMetrics by hand. The first person to notice was an analyst reconciling a number a day later and finding it frozen.

That is the failure most of the advice misses. The cluster is up, the job succeeded, the SQL warehouse returns rows, and the data quietly stopped advancing — an absence, not a bad row, with no malformed value for a constraint to reject and no failing transform for a test to trip. It's the same shape we've written about on ClickHouse and Postgres and on Snowflake. The only thing that catches it is a query that knows what should be present, run on its own clock against the live table. This post covers four such failures — a MERGE that wrote zero rows, a Structured Streaming job that fell behind, schema evolution that dropped a column without erroring, and a partition that went quiet under a healthy total — with the reasoning and a toy Spark SQL check for each.

Up front: Alertee, which we build, monitors Postgres and ClickHouse natively and does not connect to Databricks. So this is a how-to — implement these in a scheduled Databricks SQL query, a Lakeflow job, or wherever you already run scheduled SQL. The lakehouse gives you two good places to look: the data tables, which tell you whether the outcome happened, and the Delta transaction log via DESCRIBE HISTORY, which tells you whether the write that should have produced it actually did anything.

A Delta MERGE that matched zero rows

A Delta MERGE INTO is the workhorse of lakehouse upserts, and it has a quiet edge: if the ON condition matches no target rows and the WHEN NOT MATCHED branch inserts none, the operation succeeds having changed nothing. No exception. The target table is untouched — which is exactly correct behaviour when there genuinely was nothing to apply, and exactly wrong when the source key drifted, a join condition went stale after a refactor, or an upstream filter quietly emptied the source frame. The rows already in the target are valid, so nothing about a SELECT against them looks off. The newest row just stopped advancing.

The first check is the one measured in what users see: the freshness of the target table itself. Take the max timestamp, compare it to now.

-- Toy schema: an hourly MERGE upserts into orders, stamped updated_at.
-- How fresh is the newest row the MERGE has written?
SELECT max(updated_at) AS freshest
FROM orders;
-- Alert if freshest < current_timestamp() - INTERVAL 2 HOURS

The threshold is one expected cadence plus a margin — an hourly MERGE gets two hours, so a run that lands a little late doesn't trip the check. That catches the symptom. To attribute it — was the MERGE a no-op, or did it not run at all? — read the table's own write history, which Databricks records per commit:

-- The last few writes to this table, with what each one actually did.
SELECT version, timestamp, operation,
       operationMetrics.numTargetRowsInserted AS inserted,
       operationMetrics.numTargetRowsUpdated  AS updated,
       operationMetrics.numSourceRows         AS source_rows
FROM (DESCRIBE HISTORY orders)
ORDER BY version DESC
LIMIT 5;
-- A MERGE that ran but matched nothing shows a recent row here
-- with source_rows > 0 but inserted = 0 and updated = 0 — the no-op.
-- No recent MERGE row at all means the job didn't run: a different incident.

The split matters because it routes the page to the right person. A MERGE that ran with numSourceRows above zero but numTargetRowsInserted and numTargetRowsUpdated both zero is a logic or key-drift bug in your transform. No recent MERGE row in the history at all means the job that should have invoked it didn't fire — an orchestration problem. The freshness check on the target catches both, which is why you run it first; DESCRIBE HISTORY is how you decide who owns the fix. The numSourceRows vs. numTargetRows* comparison is the difference between "nothing arrived to apply" and "plenty arrived and none of it stuck."

A stream that fell behind its retention window

A Structured Streaming query reading a Delta table must run at least once within the source table's retention window — 7 days by default for VACUUM-removed data files. Fall behind that window and the stream fails with DELTA_FILE_NOT_FOUND_DETAILED and has to be reset with a full refresh. That's the loud version. The quiet version is the workaround: Databricks is blunt that setting spark.sql.files.ignoreMissingFiles to true "silently produces incorrect results," so a team that reached for the flag now has a sink that drops files without complaint. Two adjacent failures, one loud and one silent.

Even the loud one benefits from an outside check, because a streaming job that died at 02:00 doesn't always page anyone, and the sink it fed just stops growing. The first check is the same as before: the sink's newest row against the clock.

-- Toy schema: a streaming job appends into events_silver, stamped ingest_ts.
-- Is the newest appended row recent enough?
SELECT max(ingest_ts) AS last_append
FROM events_silver;
-- Alert if last_append < current_timestamp() - INTERVAL 15 MINUTES
-- (the margin is your micro-batch trigger interval plus slack, not a round number)

The margin is the tunable edge. A stream triggered every few minutes is bursty at the row level; set the window narrower than your real trigger interval and you'll page yourself during every quiet stretch between batches. Set it to your longest healthy gap plus a margin. The backlog itself is observable while the stream is alive — the source reports numBytesOutstanding and numFilesOutstanding in its progress metrics, the bytes and files still waiting to be processed — and a climbing outstanding count is the early warning that the job is losing the race against its retention window before it actually trips. Those metrics live in the streaming query's progress, not in a table you can SELECT, so the practical outside check is the sink-freshness one above; the backlog metrics are what you watch inside the job to catch the fall-behind before freshness does.

Schema evolution that dropped a column

With automatic schema evolution enabled on a MERGE, a column quietly missing from the source no longer raises an error — it lands NULL in the unspecified columns and the write succeeds. The table gains rows. A downstream metric that sums or filters on the now-all-NULL column silently reads zero or drops the rows, and nothing in the pipeline objected. The rows are individually "valid" — NULL is a legal value — so a row-level expectation that doesn't assert non-null on that exact column won't catch it.

The check that does isn't a freshness check on the table but on the column's content: when did this column last receive a non-null value, and at what fill rate?

-- Toy schema: orders has a region column that should always be populated.
-- Has anything non-null landed in region recently?
SELECT
    count(*)                                              AS rows_today,
    count(region)                                         AS region_non_null,
    count(region) * 1.0 / nullif(count(*), 0)             AS fill_rate
FROM orders
WHERE updated_at >= current_date();
-- Alert if fill_rate drops below its normal floor (say 0.98),
-- not on "any null exists" — some nulls may be legitimate.

The other half is reading the schema itself as a baseline, because schema evolution makes the column set a moving target. DESCRIBE HISTORY records schema-changing commits, and comparing the live column set against a known-good snapshot turns "a column appeared or vanished" into an explicit signal rather than something you discover when a dashboard goes blank:

-- The columns the table currently has.
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'orders';
-- Diff this against a stored expected set; a column that disappeared
-- or one that appeared unannounced is a schema-evolution event to review.

The knob is the fill_rate floor, and it's a statement about your data, not a universal number — a column that's legitimately sparse needs a low floor or a different check entirely, while one that should always be present gets a floor close to 1. You can't hand this to a copy-paste rule, because "how full should this column normally be" is something only your data knows.

A slice that went quiet under a healthy total

One partition — a region, a tenant, a source system — stopped producing rows after a config change, while every other partition's volume papered over the gap. The load ran, plenty of rows landed, the daily total sat in its usual band, and a whole-table count passed clean. The slice that went dark is the one that opens a ticket.

The check that surfaces it can't be a static list of "partitions that must report today" — you can't hand-maintain one against every change to your data. So derive the expected set from the data's own recent history and diff today against it:

-- Partitions present every day for two weeks but silent today.
WITH expected AS (
  SELECT region
  FROM orders
  WHERE updated_at >= current_date() - INTERVAL 14 DAYS
    AND updated_at <  current_date()
  GROUP BY region
  HAVING count(DISTINCT date(updated_at)) = 14
),
today AS (
  SELECT DISTINCT region
  FROM orders
  WHERE updated_at >= current_date()
)
SELECT e.region
FROM expected e
LEFT JOIN today t ON t.region = e.region
WHERE t.region IS NULL;

The knob is the HAVING clause — the bar for "normally present." Require all 14 of the last 14 days and you only watch partitions reliable enough that silence is a real signal; the bursty long tail that legitimately skips days falls below the bar and never pages you. Loosen it and you trade quiet for coverage. There's no universal setting, which is exactly why this can't be a copy-paste check: the bar is a statement about your partitions' normal behaviour. The expected/today diff is the reusable part — swap region for tenant_id, source, or whatever your table partitions on, and you're watching whatever "is always there until it isn't" applies to your data. The same idea covers row-count drift that isn't a clean zero: if a partition's daily volume usually sits in a band and quietly halves, compare today's per-partition count against that partition's own recent baseline — a window function over daily counts grouped by partition — and alert when it falls outside the band. The grouping makes the per-slice signal visible where the total hides it; that's the whole move.

Where this transfers, and where it doesn't

The four checks generalise past the specific failure that taught each one:

FailureWhy it's invisibleThe SQL pattern
MERGE matched zero rowsJob logs green; target holds valid but stale rowsNewest row vs. now; DESCRIBE HISTORY numSourceRows vs. numTargetRows*
Stream fell behind retentionSink stops growing; the dead job pages no oneSink freshness vs. trigger interval; watch outstanding backlog metrics
Schema evolution dropped a columnRows are valid; the column is silently all NULLColumn fill-rate floor; diff the live column set against a snapshot
Per-partition stalenessOne slice goes quiet under a healthy totalPer-partition baseline from recent history; tune the HAVING bar

What doesn't transfer is the idea that any of these is a finished check you paste in. Your MERGE cadences, your stream trigger interval, your retention window, your columns' normal fill rates, your partitions' normal behaviour, your tolerance for drift — those are statements about your data, not ours. A query lifted from this post will misfire against your schema until you've fitted the buckets, margins, and thresholds to it. That's not a Databricks quirk; it's why a generic check library is worth less than the reasoning behind it.

The floor for running these on Databricks is good: a scheduled SQL query or a Lakeflow job that runs the query, applies the threshold, and posts to Slack. It's native, and for the first few checks it's the right call. What erodes is the tuning — the margins and bars live in scheduled queries nobody revisits, there's no record of which alerts were real, and a muted channel can't tell you the day it stopped being safe to mute. That's the gap we built Alertee to close — though it runs against Postgres and ClickHouse, not Databricks, so on the lakehouse the value we can give you is the method itself.

If your stack also includes the engines we do run natively, the same method on ClickHouse and Postgres walks the failure modes peculiar to those — async inserts that never flush, materialized views that stop populating, uncollapsed duplicates — and the Snowflake version covers suspended tasks and stale streams. The general case for where a check has to run to see absence at all is in the continuous monitoring walkthrough, and the part that comes after — keeping a check from firing every quiet Saturday — is tuning against expected absence. Same method, different warehouse, each time.