Alertee LogoAlertee
Back to Blog
monitoringengineering

Your Snowflake task suspended itself a week ago and nobody noticed

It failed three runs, Snowflake paused it to save credits, and the table it fed still answers every query with frozen rows — here's the freshness check that would have caught it the next hour.

ByNiallNiall

A Snowflake task that ran every hour for a year stops. It failed three runs in a row — a permission change on its target table — and Snowflake suspended it, because SUSPEND_TASK_AFTER_NUM_FAILURES was set to do exactly that. No exception reaches anyone. The table the task fed is still there, still queryable, still full of yesterday's rows. Every dashboard built on it keeps loading. The first person to notice is whoever reconciles a number a week later and finds it frozen.

That is the shape of a silent data failure on Snowflake: the warehouse is up, the queries return, and the data quietly stopped advancing. It's the same shape we've written about on ClickHouse and Postgres — an absence, not a bad row — and the method that catches it transfers cleanly. This post applies that method to Snowflake's own failure modes: suspended tasks, a Snowpipe that stopped, a stale stream, and per-tenant row counts that drift. For each one we show the reasoning and a toy SQL check.

One thing up front: Alertee, which we build, monitors Postgres and ClickHouse natively and does not connect to Snowflake. So this is a how-to, not a pitch — it teaches the failure modes and the check pattern so you can implement them yourself, on a Snowflake task, in a stored procedure, or wherever you already run scheduled SQL. The same reasoning that drives what we run on Postgres and ClickHouse drives a check you can write on Snowflake today, and Snowflake gives you unusually good native plumbing to run one: a scheduled query that posts to Slack.

Two places hold the answers. Your data tables — the rows a load was supposed to write — tell you whether the outcome happened. The ACCOUNT_USAGE and INFORMATION_SCHEMA metadata views — TASK_HISTORY, COPY_HISTORY, PIPE_USAGE_HISTORY — tell you whether the machinery ran. The strongest checks read the data tables, because that's what users actually see; the metadata views are how you attribute a gap to a cause. We'll use both.

A task that suspended itself and stopped feeding a table

A Snowflake task is created suspended and resumed with ALTER TASK ... RESUME; from then on it runs on its schedule until something stops it. The sharp edge is SUSPEND_TASK_AFTER_NUM_FAILURES: after that many consecutive failed runs, Snowflake suspends the task automatically. That's a sensible guard against a task burning credits in a failure loop — but a suspended task is silent. It doesn't run, so it doesn't fail, so nothing fires. The table it populated simply stops getting new rows.

The rows already in the table are valid, so nothing about them is wrong. The only signal is that the newest row stopped advancing. Freshness is a question about the newest row, not the rows you have — bucket by time, take the max, compare to now:

-- Toy schema: a task refreshes daily_revenue once an hour.
-- How fresh is the newest bucket it has written?
SELECT MAX(revenue_hour) AS freshest_bucket
FROM daily_revenue;
-- Alert if freshest_bucket < DATEADD('hour', -2, CURRENT_TIMESTAMP())

The threshold is one expected cadence plus a margin — an hourly task gets two hours, so a run that lands a little late doesn't trip the check. That data-table freshness check is the one that matters most, because it's measured in what users see. To attribute the gap — was it the task, or something upstream of it? — read the task's own run history:

-- When did this task last actually run, and how did it end?
SELECT name, state, scheduled_time, return_value
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
  TASK_NAME => 'REFRESH_DAILY_REVENUE',
  SCHEDULED_TIME_RANGE_START => DATEADD('hour', -6, CURRENT_TIMESTAMP())))
ORDER BY scheduled_time DESC;
-- A task that's healthy shows recent SUCCEEDED rows on schedule.
-- A suspended one stops producing rows here entirely — the absence is the signal.

Here's the catch: a suspended task doesn't write a FAILED row you can alert on, it stops writing rows to TASK_HISTORY at all. So the metadata check is itself a freshness check — "when did this task last appear in its own history" — and the same max-bucket-vs-now logic applies. The data-table check tells you the outcome broke; the history check tells you which task to go fix.

A Snowpipe that quietly stopped loading

Snowpipe loads files into a table as they land in a stage, driven by cloud event notifications — an S3, GCS, or Azure queue tells Snowpipe a new file arrived. That event chain is where it goes quiet. If the notification integration breaks, the bucket's event configuration is changed, or the queue stops delivering, files pile up in the stage and never load. The pipe object still exists. The target table still answers queries. It just stops growing, and nothing about a SELECT against it looks wrong.

Same reasoning, same first check: the load target's newest row against the clock.

-- Toy schema: a Snowpipe loads events into raw_events as files arrive.
-- Is the newest loaded row recent enough?
SELECT MAX(loaded_at) AS last_load
FROM raw_events;
-- Alert if last_load < DATEADD('minute', -30, CURRENT_TIMESTAMP())
-- (the margin is your expected file cadence plus slack, not a round number)

The margin is the tunable edge. Snowpipe is micro-batched, not instant, and file arrival is often bursty — set the window narrower than your real cadence and you'll page yourself every quiet stretch. Set it to your longest healthy gap plus a margin. Then, to tell "the pipe stopped" apart from "no files were produced upstream," read the pipe's load history:

-- Files actually loaded by this pipe in the last few hours.
SELECT file_name, last_load_time, row_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'RAW_EVENTS',
  START_TIME => DATEADD('hour', -4, CURRENT_TIMESTAMP())))
ORDER BY last_load_time DESC;
-- No recent rows here while files are arriving in the stage = the pipe stopped,
-- not the source. That's a different incident with a different owner.

The split matters because the two failures need different people. A pipe that stopped is an integration or queue problem on your side; a source that went quiet is someone else's. The data-table freshness check catches both — which is why it's the one you run first — and COPY_HISTORY is how you decide who to page.

A stream that went stale and lost its changes

This one is the most likely to bite quietly. A Snowflake stream tracks change data on a table — the rows inserted, updated, or deleted since you last consumed it — and a task usually drains it on a schedule. A stream has a hard constraint most people meet the hard way: it has to be consumed within the table's data retention window. If the consuming task is suspended (the first failure mode above) and stays suspended past that window, the stream goes stale. A stale stream returns no change rows. The changes it was tracking are gone, not queued — and the downstream table that depended on those changes silently stops reflecting them.

The failure compounds: a suspended task doesn't just stop the load, it can stale the stream so that even resuming the task doesn't recover the missed changes. Snowflake exposes the state directly, so the check is a status read rather than a count:

-- Is this stream still consumable, and how close is it to going stale?
SHOW STREAMS LIKE 'orders_stream';
SELECT "name", "stale", "stale_after"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
-- Alert if stale = true (already lost changes),
-- and warn well before stale_after passes (about to).

When an engine gives you an explicit health field, the check reads the field — but the useful alert fires before the failure, on stale_after approaching, not after, on stale = true. A stale-stream alert tells you change data was already lost; a "stale within the hour" alert tells you a task has stopped draining in time to do something about it. Pair it with the task-history check above and you have the whole chain: the task suspended, the stream is about to stale, the downstream table is about to drift.

Per-tenant row counts that drift while the total looks fine

The load ran. Plenty of rows landed. The daily total sits in its usual band. But one tenant — often a large one — stopped producing rows after a config change on their side or a broken slice of the load, and every other tenant's volume papers over the gap. A whole-table count passes. The tenant who went dark is the one who opens a ticket.

There's no rule to assert here — you can't hand-maintain a list of "tenants that must report today" against every signup and churn. So derive the expected set from the data's own recent history and diff today against it:

-- Tenants that reported every day for two weeks but are silent today.
WITH expected AS (
  SELECT tenant_id
  FROM events
  WHERE created_at >= DATEADD('day', -14, CURRENT_DATE)
    AND created_at <  CURRENT_DATE
  GROUP BY tenant_id
  HAVING COUNT(DISTINCT DATE_TRUNC('day', created_at)) = 14
),
today AS (
  SELECT DISTINCT tenant_id
  FROM events
  WHERE created_at >= CURRENT_DATE
)
SELECT e.tenant_id
FROM expected e
LEFT JOIN today t ON t.tenant_id = e.tenant_id
WHERE t.tenant_id IS NULL;

The knob is the HAVING clause — the bar for "normally present." Require all 14 of the last 14 days and you only watch tenants reliable enough that silence is a real signal; the bursty long tail that legitimately skips days falls below the bar and never pages you. Loosen it and you trade quiet for coverage. There's no universal setting, which is exactly why this can't be a copy-paste check: the bar is a statement about your tenants' normal behaviour. The expected/today diff is the reusable part — swap tenant_id for warehouse, region, account_id, or source and you're watching whatever "is always there until it isn't" applies to your data.

The same idea covers row-count drift that isn't a clean zero. If a tenant's daily volume usually sits in a band and quietly drops by half, compare today's per-tenant count against that tenant's own recent baseline — a window function over daily counts grouped by tenant — and alert when it falls outside the band. The grouping makes the per-entity signal visible where the total hides it; that's the whole move.

Where this transfers, and where it doesn't

The four checks generalise past the specific failure that taught each one:

FailureWhy it's invisibleThe SQL pattern
Task suspended itselfTable holds valid but stale rows; nothing errorsNewest bucket vs. now, plus task history as a freshness check
Snowpipe stopped loadingTarget answers queries; files pile up unloadedNewest loaded row vs. cadence; COPY_HISTORY to attribute
Stream went staleA stale stream returns no changes; downstream driftsRead the stale field; alert before stale_after, not after
Per-tenant row driftOne slice goes quiet under a healthy totalPer-entity baseline from recent history; tune the HAVING bar

What doesn't transfer is the idea that any of these is a finished check you paste in. Your task cadences, your file arrival pattern, your retention window, your tenants' normal behaviour, your tolerance for drift — those are statements about your data, not ours. A query lifted from this post will misfire against your schema until you've fitted the buckets, margins, and thresholds to it. That's not a Snowflake quirk; it's why a generic check library is worth less than the reasoning behind it.

The honest floor for running these is good on Snowflake: a task that runs the query, applies the threshold, and calls a notification integration or a small procedure that posts to Slack. It's native, it's free, and for the first few checks it's the right call. What erodes is the tuning — the margins and bars live in tasks nobody revisits, there's no record of which alerts were real, and a muted channel can't tell you the day it stopped being safe to mute. That's the gap we built Alertee to close — though it runs against Postgres and ClickHouse, not Snowflake, so on Snowflake the value we can give you is the method itself.

If your stack also includes the engines we do run natively, the same method on ClickHouse and Postgres walks the failure modes peculiar to those — async inserts that never flush, materialized views that stop populating, uncollapsed duplicates. The general case for where a check has to run to see absence at all is in the continuous monitoring walkthrough, and the part that comes after — keeping a check from firing every quiet Saturday — is tuning against expected absence. Same method, different warehouse, each time.