Alertee LogoAlertee
Back to Blog
monitoringengineering

The data failures on ClickHouse that never throw an error

Stale views, unflushed async inserts, and uncollapsed ReplacingMergeTree duplicates throw no error; here's the SQL that catches each one.

ByNiallNiall

The failures that reach your users on a ClickHouse and Postgres stack are not generic. They have the shape of this stack. An async insert acknowledged before it ever reached storage. A materialized view that stopped populating because inserts into its source table started failing. Rows in a ReplacingMergeTree table that a background merge never collapsed, so a count reads double. A tenant whose hourly import quietly went to zero while every other tenant papered over the total. None of these throws an error. The pipeline stays green. The first person to notice is a customer asking why a number looks wrong.

We build Alertee to monitor this stack — production Postgres and ClickHouse — so these are failure modes we watch natively, and the SQL below is what we run. The product is ours; the value here is the pattern, not a check to paste in. A word first on where each query runs, because it isn't uniform. The freshness and dedup checks are single-engine: the freshness one runs in either clickhouse-client or psql, the dedup one is ClickHouse-specific. The reconciliation check spans both engines in one query and needs a federated read to do that — ClickHouse's postgresql() table function or a Postgres foreign data wrapper. We'll flag that at the query. None of this is exotic, but "paste it into a single psql session" is true for some of these and not others, and pretending otherwise would cost you an afternoon.

Every one of these is invisible for the same reason: each is an absence, not a bad row. There is no malformed value for a constraint to reject, no failing transform for a test to catch. The data that should be there simply isn't, and the only thing that sees that is a query that knows what should be present, run on its own clock against the live table. For the general version of that argument, the continuous monitoring walkthrough lays out the layers. Here we cover three failure modes peculiar to ClickHouse and Postgres, and the SQL behind each.

A materialized view that quietly stopped populating

This one bites teams new to ClickHouse. A ClickHouse materialized view is not a cache you REFRESH on a schedule like a Postgres MATERIALIZED VIEW. It's a trigger: each insert into the source table runs the view's SELECT and writes the result into the view's target table. That design has a sharp edge. The view only sees rows inserted after it exists, and if inserts into the source start failing — or get rerouted, or the source table is swapped during a migration — the view's target table simply stops growing. Nothing errors. The view definition is still there. Queries against it still return rows: yesterday's rows, and the day before's.

Freshness is a question about the newest row, not the rows you have. Bucket the target table by time, find the most recent bucket that has data, and compare it to now.

-- Toy schema: events flow into events_raw; a materialized view
-- rolls them into hourly_metrics keyed by event_hour.
-- How fresh is the newest bucket the view has written?
SELECT max(event_hour) AS freshest_bucket
FROM hourly_metrics;
-- Alert if freshest_bucket < now() - INTERVAL 2 HOUR   -- ClickHouse
-- (Postgres: now() - INTERVAL '2 hours')

This one is portable: a single max() over the view's target table, and the only thing that differs between the two engines is the interval literal — INTERVAL 2 HOUR in ClickHouse, INTERVAL '2 hours' in Postgres.

The threshold is one expected cadence plus a margin. An hourly rollup gets two hours, not one, so a view that runs a little late doesn't trip the check. The reason this works where a row-level check fails: the view's target table is full of valid rows, so nothing about the rows is wrong. The only signal that the view died is that its newest bucket stopped advancing — and that's a single max() away, asked from outside whatever feeds the view.

Two refinements make this honest on a real stack. Compare the view's freshest bucket against the source table's freshest bucket, not just against the wall clock — if the source itself went quiet, the failure is upstream of the view, and that's a different incident with a different owner. And if you've enabled async inserts, give the margin a little more room, because buffered rows aren't queryable until they flush, which is the next failure mode.

A fire-and-forget insert that was acknowledged but never landed

ClickHouse's asynchronous inserts buffer incoming rows server-side and flush them to storage once a size or time threshold is hit. With the default wait_for_async_insert=1, the insert is only acknowledged after the flush, so a failure comes back to the client. With wait_for_async_insert=0 — fire-and-forget, which teams reach for under high write volume — the server acknowledges as soon as the row is buffered, before it reaches disk. ClickHouse's own docs are blunt about the trade: there's no dead-letter queue, errors only surface at flush, and "the risk of silent failure remains." Your insert client got a success. The row may never have landed.

The write side can't see this by construction — it already moved on. So the check lives downstream: something counted the work on the way in, and you compare it to what actually landed. This is the one query in the post that crosses both engines — the intent ledger lives in Postgres, the rows you're reconciling against live in ClickHouse — so it can't run in a plain psql or clickhouse-client session as written. You read one engine from the other. The shape below runs inside ClickHouse, pulling the Postgres ledger in through the postgresql() table function; a Postgres foreign data wrapper would let you write the mirror image from psql. Either way it's a federated read you set up once, not a query you paste into a single shell.

-- Toy schema: the app records an intended write in Postgres
-- (ingest_log) before firing an async insert into ClickHouse (events).
-- Run from ClickHouse; ingest_log is read over the postgresql() table function.
-- Which intended writes never showed up in ClickHouse?
SELECT count(*) AS missing
FROM postgresql('pg-host:5432', 'app', 'ingest_log', 'user', '{{password}}') AS l
LEFT JOIN events AS e ON e.event_id = l.event_id
WHERE l.created_at BETWEEN now() - INTERVAL 6 HOUR
                       AND now() - INTERVAL 5 MINUTE
  AND e.event_id = ''   -- no matching row in events
-- Alert if missing exceeds a small tolerance

(The exact null-test depends on your event_id type and join — in ClickHouse a LEFT JOIN fills unmatched right-side columns with the type's default, so test against that default rather than IS NULL. The point is the reconciliation, not the cast.)

The tunable edge is now() - INTERVAL 5 MINUTE. Without it the check races the buffer: a row fired thirty seconds ago may still be sitting unflushed, so it reads as "missing" and you page yourself for a row that's about to arrive. The exclusion sets a margin wider than your flush interval. The deeper point is that fire-and-forget removes the signal you'd normally trust — the acknowledgement — so you have to manufacture a new one by counting intent before the insert and reconciling after. If you don't keep a second ledger, the weaker fallback is a freshness or volume check on the target table; it can tell you the stream stalled, but not which specific rows fell on the floor.

Duplicates a ReplacingMergeTree merge never collapsed

The subtle one, and the most ClickHouse-specific. A ReplacingMergeTree table deduplicates rows with the same ORDER BY key — but, in ClickHouse's own words, deduplication "occurs only during a merge," merges happen "in the background at an unknown time," and "some of the data may remain unprocessed." The engine does not guarantee the absence of duplicates. It's a space optimisation, not a uniqueness constraint. If your reads don't say FINAL and don't aggregate the duplicates away, a plain count() or sum() can read high — sometimes double — for as long as the rows sit unmerged, which can be a while after a retry storm or a backfill that re-inserted a partition.

The rows are individually correct; there are just too many of them, and only sometimes, depending on whether a merge has run. Count what a deduplicated read returns, count what the raw table returns, and alert when the gap between them crosses a line.

-- Toy schema: orders is a ReplacingMergeTree ORDER BY (order_id),
-- versioned so the latest row per order_id wins.
-- ClickHouse: raw row count vs. distinct keys right now.
SELECT
    count()                  AS raw_rows,
    countDistinct(order_id)  AS distinct_keys,
    raw_rows - distinct_keys AS uncollapsed
FROM orders
WHERE order_date >= today() - 1;
-- Alert if uncollapsed / distinct_keys exceeds, say, 0.05

Two ClickHouse details worth naming. count() - countDistinct(order_id) measures extra rows per key — the duplicate rows waiting for a merge — which is the signal you want; a key with one stray duplicate adds one, a key re-inserted by a backfill adds however many copies landed, so the gap reads high after a retry storm and not always cleanly double. And referencing raw_rows and distinct_keys in the third select-list expression works because ClickHouse lets you reuse a select alias within the same SELECT; Postgres doesn't, so the equivalent there would repeat the two aggregates or wrap them in a subquery. This check is ClickHouse-specific anyway — ReplacingMergeTree is a ClickHouse engine — so the alias shorthand costs you nothing here.

A small, steady gap is normal — it's the backlog of rows waiting for the next merge, and on a busy table it's never quite zero. The signal is the delta moving: a gap that jumps after a deploy or a backfill, or that climbs and stays climbing instead of being absorbed by merges. That's why the alert is on a ratio crossing a threshold, not on "any duplicates exist" — the latter pages constantly and gets muted within a week. Tune the ratio to what your read path tolerates: if every query already uses FINAL or aggregates by key, you care less about the raw gap and more about it growing unbounded, which is its own signal that merges have stopped keeping up.

The transferable shape outlives ReplacingMergeTree. Any time "the number the user sees" is computed differently from "the rows on disk" — a CollapsingMergeTree whose sign rows haven't netted out, a Postgres soft-delete where a query forgot WHERE deleted_at IS NULL — the check is the same: compute the gap between the two readings and threshold the delta, not the absolute.

Where these generalise

Each transfers past the specific bug that taught it:

FailureWhy it's invisibleThe SQL pattern
Materialized view stopped populatingTarget table holds valid but stale rows; nothing errorsNewest bucket vs. now, and vs. the source
Fire-and-forget insert never landedThe client got an ack before the row reached diskReconcile a second ledger against the target, with a flush-width margin
ReplacingMergeTree didn't collapseRows are individually correct; only the count is wrong, and only sometimesRaw count vs. collapsed count, alert on the ratio

None of these is a copy-paste check that fits everyone. Your ORDER BY keys, your flush thresholds, your view cadence, your tolerance for an uncollapsed gap are statements about your data, not ours. A check lifted from this post will misfire against your schema until you've fitted the buckets, the margins, and the thresholds to it.

The honest floor for running these is a cron job: a script runs the query, applies the threshold, posts to Slack. It's free and it's yours, and for the first few checks it's the right call. What erodes is the tuning — the flush margins and delta ratios live in scripts nobody revisits, there's no record of which alerts were real, and a muted channel can't tell you the day it stopped being safe to mute. We built Alertee to run these against your real ClickHouse and Postgres schema natively: it reads your tables, helps you turn the patterns above into editable SQL fitted to your keys and cadences — including the cross-engine reconciliation, which it can run against both connected databases without you standing up a foreign data wrapper — schedules them, and turns a fired check into an owned incident you classify so the noise drops over time. When the tuning is the part that's hurting, connect a database and turn one check on.

These three checks carry over to a warehouse — freshness, ledger reconciliation, and delta thresholds don't depend on the engine — but the failure modes that trigger them are this stack's. For where a check has to run to see absence in the first place, the continuous monitoring walkthrough covers the layers; for the part that comes after you've placed a check and it's firing on a quiet Saturday, tuning a check against expected absence is its own skill; and if you're choosing what to run these in, the tools comparison goes vendor by vendor.