
Why Your REST API Returns 200 When It Should Crash: The Silent Poison of Over-Engineering HTTP Semantics

I sat in a war room at a fintech startup’s San Francisco office at 2:17 a.m., staring at a curl command that made no sense:

```bash
curl -X POST https://api.stripe.com/v1/charges \
  -H "Authorization: Bearer sk_test_..." \
  -d "amount=999" \
  -d "currency=usd" \
  -d "source=tok_chargeDeclined"
```

It returned HTTP/2 200 OK with this body:

```json
{
  "id": "ch_1Pv8zZL4eYbQa5c6d7e8f9g0",
  "object": "charge",
  "status": "failed",
  "failure_code": "card_declined",
  "failure_message": "Your card was declined.",
  "amount": 999,
  "currency": "usd"
}
```

No 402 Payment Required. No 400 Bad Request. Not even a 409 Conflict. Just… 200 OK, like everything was fine.

That curl command ran exactly as designed — and cost us a significant sum in accidental double-charges over 4.7 days.

Here’s how it happened: our frontend SDK (used by 12,000+ merchants) had a retry policy that triggered on any non-2xx status. But because the SDK team had “simplified” error handling — overriding the HTTP status parser to “always trust the JSON body” — it treated that 200 OK response as a success and tried to render the charge object. When the UI failed to find charge.receipt_email, it crashed silently. Then our error boundary re-fired the same request — now with a fresh idempotency key — and charged the card again.
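To see why the fresh key per retry was fatal, here is a minimal sketch. The in-memory `seen` map stands in for the payment provider’s deduplication store, and `charge` is an illustrative stand-in — none of these names come from our real SDK:

```typescript
import { randomUUID } from 'crypto';

// Fake dedup store standing in for the payment provider's: one charge per key.
const seen = new Map<string, string>();

function charge(idempotencyKey: string): string {
  // A correct provider replays the stored result for a repeated key
  const existing = seen.get(idempotencyKey);
  if (existing) return existing;
  const id = `ch_${randomUUID()}`;
  seen.set(idempotencyKey, id);
  return id;
}

// ❌ Our bug: a fresh key per attempt — the "retry" creates a second charge
const first = charge(randomUUID());
const retryWithFreshKey = charge(randomUUID());

// ✅ Correct: one key per logical payment — the retry is a no-op
const key = randomUUID();
const attempt1 = charge(key);
const attempt2 = charge(key);

console.log(first !== retryWithFreshKey, attempt1 === attempt2); // prints: true true
```

The idempotency key must be derived from the logical operation (cart ID, payment intent), never regenerated per HTTP attempt.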

Three incidents in six weeks. All rooted in one decision: “Let’s make HTTP status codes optional.”

That wasn’t laziness. It was over-engineering disguised as empathy.

We built layers — custom error wrappers, OpenAPI-driven client generators, status-code-to-error-class mappers — all to avoid using the protocol correctly. And in doing so, we broke caching, broke observability, broke retries, broke CDNs, and broke trust between services.

This isn’t theoretical. I’ve shipped APIs at four companies where ignoring HTTP semantics caused real financial loss, regulatory risk, or customer churn. Let me tell you exactly what went wrong — and how to fix it tomorrow, not “in Q3”.

The Real Cost of Treating HTTP Like a Dumb Pipe

At a travel platform, our payments team launched a new fraud-scoring service. It accepted POSTs to /v1/transactions and returned 200 OK with { "decision": "reject", "reason": "velocity_too_high" } for most requests — including ones with invalid JSON, missing fields, or expired tokens.

Why? Because the engineer who owned the service said, “Frontend folks get confused by 4xx vs 5xx. Let’s just always return 200 and let them check .decision.”

Six weeks later, their iOS app started crashing on launch. Why? Their Swift client used URLSession.dataTaskPublisher(for:) with Combine, with a tryMap that failed the pipeline on any non-2xx status (dataTaskPublisher itself emits for every HTTP response, so the filtering was theirs). Every non-2xx response triggered receive(completion: .failure(...)) — but they’d wrapped the entire pipeline in a tryCatch that swallowed the error and returned an empty Result. So the app tried to render nil transaction data. Crash.

They fixed it by adding mapError { _ in MyCustomError() }. That took three days.

Meanwhile, our CDN (Cloudflare) cached every 200 OK response — including the ones with "decision": "reject" — for 24 hours. So when a legitimate user submitted a valid transaction right after a rejected one from the same IP, Cloudflare served the cached rejection. Users saw “Transaction declined” with no explanation — and called support. We burned $84k in support labor that month.

The irony? If we’d returned 400 Bad Request for malformed input and 403 Forbidden for policy rejections, Cloudflare wouldn’t have cached them (it doesn’t cache 400 or 403 responses by default), our Swift client would’ve handled errors natively, and our observability tools would’ve flagged the spike in 403s before users noticed.
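A belt-and-braces version of that fix pairs the right status with an explicit no-store directive, so no cache layer can replay a rejection. A framework-agnostic sketch using plain Node http (the hard-coded `rejected` flag stands in for the real fraud decision):

```typescript
import { createServer } from 'http';

// Sketch: policy rejections get a real 403 AND an explicit Cache-Control,
// so neither a CDN nor a browser cache can serve them to the next user.
const server = createServer((req, res) => {
  const rejected = true; // placeholder for the real fraud-scoring decision
  res.writeHead(rejected ? 403 : 200, {
    'content-type': 'application/json',
    'cache-control': 'no-store', // decisions about THIS request must never be cached
  });
  res.end(
    JSON.stringify(
      rejected
        ? { decision: 'reject', reason: 'velocity_too_high' }
        : { decision: 'accept' },
    ),
  );
});
```

The status code carries the semantics; the Cache-Control header removes any reliance on CDN defaults.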

HTTP status codes aren’t legacy cruft. They’re structured signals. A 429 Too Many Requests tells proxies to throttle. A 410 Gone tells CDNs to purge. A 503 Service Unavailable tells Kubernetes to stop routing traffic. When you ignore them, you force every downstream component to rebuild that logic — badly.
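A client that actually honors those signals needs only a few lines. A sketch — `fetchWithRetryAfter` is an illustrative name, not a library function, and the injectable `fetchFn` exists only to make the example testable:

```typescript
type FetchLike = (url: string) => Promise<Response>;

// Sketch: back off per the server's Retry-After header on 429/503
// instead of hammering it with immediate retries.
async function fetchWithRetryAfter(
  url: string,
  fetchFn: FetchLike = fetch,
  attempts = 3,
): Promise<Response> {
  let res!: Response;
  for (let i = 0; i < attempts; i++) {
    res = await fetchFn(url);
    // 429 and 503 are the statuses that standardly carry Retry-After
    if (res.status !== 429 && res.status !== 503) return res;
    const delaySec = Number(res.headers.get('retry-after') ?? '1');
    await new Promise((r) => setTimeout(r, delaySec * 1000));
  }
  return res; // out of attempts: surface the last throttled response
}
```

None of this is possible when every outcome arrives as a 200.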

And yes, frontend engineers can handle status codes. At a streaming service, our React apps use this small hook — zero dependencies:

```typescript
// hooks/useApi.ts (React 18.3, TypeScript 5.3)
import { useState, useEffect } from 'react';

class HttpError extends Error {
  constructor(public status: number, public statusText: string) {
    super(`${status} ${statusText}`);
    this.name = 'HttpError';
  }
}

export function useApi<T>(url: string) {
  const [data, setData] = useState<T | null>(null);
  const [error, setError] = useState<Error | null>(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    const controller = new AbortController();
    setLoading(true);
    fetch(url, { signal: controller.signal })
      .then(async (res) => {
        if (!res.ok) {
          // This is the critical part: don't parse the body unless you must.
          // The status code alone tells you everything you need in most cases.
          throw new HttpError(res.status, res.statusText);
        }
        return res.json();
      })
      .then(setData)
      .catch((err) => {
        if (err.name !== 'AbortError') {
          setError(err);
        }
      })
      .finally(() => setLoading(false));

    return () => controller.abort();
  }, [url]);

  return { data, error, loading };
}
```

This doesn’t require engineers to memorize RFC 7231. It just forces them to confront the status before touching the body. And it works — our frontend latency dropped 19% because we stopped waiting for full JSON parsing on every 4xx.

But here’s the brutal truth I learned debugging that fintech incident: you cannot rely on clients to do the right thing. You have to enforce correctness at the server boundary — before business logic runs.

Enforce HTTP Semantics at the Framework Boundary — Not in Business Logic

At a travel platform, we had 18 microservices handling payments, bookings, and listings. Every one had its own way of returning errors:

  • Service A: return res.status(400).json({ error: "Invalid date", field: "check_in" })
  • Service B: return res.status(400).json({ code: "invalid_date", message: "Check-in must be after today", meta: { field: "check_in" } })
  • Service C: throw new Error("Invalid date") → caught by generic 500 handler
  • Service D: return res.status(200).json({ success: false, error: { code: "invalid_date" } })

We spent 3 months building a “unified error schema” tool that generated OpenAPI components.schemas.Error definitions and client-side validators. It reduced inconsistency — but didn’t fix the root problem. Engineers still wrote if (req.body.price < 0) return res.status(400)... inside route handlers. Which meant:

  • Validation logic leaked into controllers (violating separation of concerns)
  • Every service reimplemented status-to-body mapping (42K lines across repos)
  • Observability tools couldn’t correlate 400s with specific validation failures (no consistent error.code)
  • New engineers copied the wrong pattern from Stack Overflow

Then we hired a staff engineer from a big tech company’s ads division who’d worked on its gRPC-to-HTTP gateway. She asked one question: “Why are you throwing strings instead of typed errors?”

We switched to domain-specific HTTP error classes — and enforced them at the framework level, not in routes.

The Fix: Typed Errors + Global Handler

We adopted express-problem-details v2.1.0 (Express v4.18.2) and defined these classes:

```typescript
// errors/http-errors.ts
export class BadRequestError extends Error {
  status = 400;
  type = 'bad-request';
  title = 'Bad Request';

  constructor(
    public detail: string,
    public extra: Record<string, unknown> = {}
  ) {
    super(detail);
  }
}

export class UnauthorizedError extends Error {
  status = 401;
  type = 'unauthorized';
  title = 'Unauthorized';

  constructor(
    public detail: string,
    public extra: Record<string, unknown> = {}
  ) {
    super(detail);
  }
}

export class ForbiddenError extends Error {
  status = 403;
  type = 'forbidden';
  title = 'Forbidden';

  constructor(
    public detail: string,
    public extra: Record<string, unknown> = {}
  ) {
    super(detail);
  }
}

// ... and so on for 404, 409, 422, 429, 500, 503
```
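The per-status boilerplate can be collapsed into one abstract base. This refactor is our sketch, not part of express-problem-details — and it lets the global handler check `err instanceof HttpError` instead of duck-typing on a `status` property:

```typescript
// Sketch: one abstract base, subclasses declare only what differs.
abstract class HttpError extends Error {
  abstract readonly status: number;
  abstract readonly type: string;
  abstract readonly title: string;

  constructor(
    public detail: string,
    public extra: Record<string, unknown> = {}
  ) {
    super(detail);
  }
}

class NotFoundError extends HttpError {
  readonly status = 404;
  readonly type = 'not-found';
  readonly title = 'Not Found';
}

class ConflictError extends HttpError {
  readonly status = 409;
  readonly type = 'conflict';
  readonly title = 'Conflict';
}
```

Adding a new error class (like the ConsentRequiredError mentioned below) then becomes three lines.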

Then installed a single middleware — applied globally — that catches only instances of Error with a status property:

```typescript
// middleware/http-error-handler.ts
import express from 'express';

const app = express();

// Parse JSON early — fail fast on invalid syntax
app.use(express.json({ limit: '10mb', type: ['application/json', 'application/*+json'] }));

// ...application routes are registered here, before the error handler...

// Our global error handler — the four-argument signature marks it as such
app.use((err: any, req: express.Request, res: express.Response, next: express.NextFunction) => {
  const reqId = (req as any).id || 'unknown'; // injected by our tracing middleware

  // Only handle our typed errors
  if (err instanceof Error && typeof err.status === 'number') {
    const extra = err.extra ?? {};
    // RFC 7807 compliance: application/problem+json
    res.status(err.status)
      .type('application/problem+json')
      .json({
        type: `https://api.airbnb.com/errors/${err.type}`,
        title: err.title,
        status: err.status,
        detail: err.detail,
        instance: reqId,
        ...(Object.keys(extra).length > 0 && {
          extensions: extra
        })
      });
    return;
  }

  // Everything else is a 500 — but log the real error
  console.error('Unhandled error:', {
    timestamp: new Date().toISOString(),
    reqId,
    method: req.method,
    url: req.url,
    error: {
      name: err.name,
      message: err.message,
      stack: process.env.NODE_ENV === 'development' ? err.stack : undefined,
      cause: err.cause?.stack ? { stack: err.cause.stack } : undefined
    }
  });

  res.status(500).json({
    type: 'https://api.airbnb.com/errors/internal-server-error',
    title: 'Internal Server Error',
    status: 500,
    detail: 'Something went wrong. Our team has been notified.',
    instance: reqId
  });
});

// Catch-all 404 — after all routes; requests matching nothing land here
app.use('*', (req, res) => {
  res.status(404).json({
    type: 'https://api.airbnb.com/errors/not-found',
    title: 'Not Found',
    status: 404,
    detail: `Cannot ${req.method} ${req.url}`,
    instance: (req as any).id
  });
});
```

Now route handlers look like this:

```typescript
// routes/bookings.ts
import { BadRequestError, ForbiddenError } from '../errors/http-errors';
import { validateBookingInput } from '../validators/booking-validator';
import { createBooking } from '../services/booking-service';

// NOTE: Express 4 doesn't forward errors thrown in async handlers to the
// error middleware on its own — wrap handlers (e.g. with express-async-errors)
// or use Express 5, which does this natively.
app.post('/bookings', async (req, res) => {
  // 1. Validate before touching business logic
  const input = validateBookingInput(req.body);

  // 2. Throw typed errors — no status codes in route logic
  if (input.check_in < new Date()) {
    throw new BadRequestError('Check-in date must be in the future', {
      param: 'check_in',
      value: input.check_in.toISOString()
    });
  }

  if (!req.user?.is_premium) {
    throw new ForbiddenError('Premium membership required to book', {
      required_tier: 'premium'
    });
  }

  // 3. Business logic — clean, focused, testable
  const booking = await createBooking(input, req.user);
  res.status(201).json(booking);
});
```

Why This Works (and What It Fixed)

  • Observability: Our Datadog dashboards now show http.status:400 and error.code:invalid_check_in as separate tags. We can alert on spikes in 400 + param:check_in — which we did, catching a broken date-picker bug before it hit production.
  • Client Safety: Our Swift SDK auto-generates error types from OpenAPI. When BadRequestError is thrown with param: "check_in", the SDK exposes ValidationError.checkIn — no string parsing.
  • Testing: Unit tests for createBooking() no longer need to mock res.status(). They just assert expect(() => handler()).toThrow(BadRequestError).
  • Maintenance: When we added GDPR consent checks, we added one new error class (ConsentRequiredError) and updated the middleware once — no search-and-replace across 18 services.
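Here is what that testing win looks like in practice — a minimal sketch, where `validateCheckIn` is a hypothetical extraction of the route’s guard clause (the Jest/Vitest form of the assertion is shown in the trailing comment):

```typescript
// Self-contained sketch: the guard clause throws a typed error,
// so tests assert on the error class — no res.status() mocks.
class BadRequestError extends Error {
  status = 400;
  constructor(public detail: string, public extra: Record<string, unknown> = {}) {
    super(detail);
  }
}

// Hypothetical extraction of the route's date check
function validateCheckIn(checkIn: Date): void {
  if (checkIn < new Date()) {
    throw new BadRequestError('Check-in date must be in the future', {
      param: 'check_in'
    });
  }
}

// Jest/Vitest style:
// expect(() => validateCheckIn(new Date('2000-01-01'))).toThrow(BadRequestError);
```

The test exercises exactly the contract clients see: which error class, which status, which extra fields.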

Insider tip #1: Never log err.stack in production error responses — but do log err.cause?.stack if present. Most devs forget that BadRequestError should wrap original validation errors. For example:

```typescript
import { ZodError } from 'zod';

// ✅ Correct: preserves root cause
try {
  bookingSchema.parse(req.body); // Zod validates via schema.parse(data)
} catch (cause) {
  throw new BadRequestError('Invalid booking data', {
    zod_issues: (cause as ZodError).issues,
    cause // attach the original ZodError
  });
}

// ❌ Wrong: loses validation context
throw new BadRequestError('Invalid booking data');
```

Our logging pipeline extracts cause.stack only when cause exists — giving SREs the exact Zod issue and the line number in booking-schema.ts.

Insider tip #2: Set the Content-Type before you send the body. Express’s res.json() defaults Content-Type to application/json only when no type has been set yet — and once the body has gone out, the headers are frozen, so a res.type() call after res.json() is silently ignored. The fix is trivial but cost us 2 days:

```typescript
// ❌ Broken — the response has already gone out as application/json;
// setting the type afterwards does nothing
res.status(400).json({ ... });
res.type('application/problem+json');

// ✅ Correct — type is set before the body is serialized
res.status(400)
  .type('application/problem+json')
  .json({ ... });
```

Tradeoff note: This approach assumes your framework supports error-first middleware (Express, Fastify, Hono). If you’re on Next.js App Router, you must use notFound() and redirect() — but you can still throw typed errors in route handlers and catch them in error.tsx with error.status. Don’t try to force Express patterns onto Next.js — adapt the principle, not the code.

Version Your Media Types — Not Your URLs

At a streaming service, our /v1/play endpoint served 4.2 billion requests/day. When we launched /v2/play with HAL-style _links and longer token expiry, Akamai cache miss rate spiked from 8% to 41%. Support tickets flooded in: “Why is playback slower?” “Why does my app crash on new devices?”

We blamed the new token format — until our infra team showed us the cache logs:

```
GET /v1/play → HIT   (cache-key: "/v1/play")
GET /v2/play → MISS  (cache-key: "/v2/play")
GET /v1/play → HIT
GET /v2/play → MISS
...
```

Akamai treats /v1/play and /v2/play as completely different resources — even though 92% of responses were identical. We’d broken cache coherency by versioning the path, not the representation.

The fix wasn’t rolling back v2. It was switching to content negotiation.

The Fix: Accept Header Versioning + Vary Headers

We moved to Accept: application/vnd.netflix.play+json; version=2 and taught Akamai to vary cache keys on Accept and Accept-Version.

Here’s the exact Fastify v4.25.3 setup that cut cache misses to 4.3%:

```typescript
// plugins/accept-version.ts
import { FastifyPluginAsync } from 'fastify';
import fp from 'fastify-plugin';

// Teach TypeScript about the custom request property
declare module 'fastify' {
  interface FastifyRequest {
    version: string;
  }
}

const acceptVersionPlugin: FastifyPluginAsync = async (fastify) => {
  // Register the property so Fastify can pre-shape the request object
  fastify.decorateRequest('version', '1');

  fastify.addHook('onRequest', async (req) => {
    // Parse the Accept header manually — the accepts plugin is too slow at scale
    const accept = req.headers.accept || '';
    const versionMatch = accept.match(/version=(\d+)/);
    req.version = versionMatch ? versionMatch[1] : '1';
  });

  // Set Vary headers before the response is sent
  fastify.addHook('onSend', async (req, reply, payload) => {
    reply.header('Vary', 'Accept, Accept-Version');
  });
};

export default fp(acceptVersionPlugin);
```

Then in routes:

```typescript
// routes/play.ts
import { FastifyInstance } from 'fastify';
import { generateV1Token, generateV2Token } from '../services/token-service';
import { fetchTitleMetadata } from '../services/title-service'; // import was missing in the original snippet; path assumed

export async function playRoutes(fastify: FastifyInstance) {
  fastify.post('/play', {
    schema: {
      body: {
        type: 'object',
        required: ['title_id', 'device_id'],
        properties: {
          title_id: { type: 'string' },
          device_id: { type: 'string' }
        }
      },
      response: {
        200: {
          type: 'object',
          oneOf: [
            { $ref: '#/components/schemas/PlayResponseV1' },
            { $ref: '#/components/schemas/PlayResponseV2' }
          ]
        }
      }
    }
  }, async (req, reply) => {
    const { title_id, device_id } = req.body as { title_id: string; device_id: string };

    // Business logic is version-agnostic
    const commonData = await fetchTitleMetadata(title_id);

    // Version-specific serialization
    if (req.version === '2') {
      return {
        play_token: generateV2Token({ title_id, device_id, metadata: commonData }),
        expires_in: 600, // v2: 10 min
        _links: {
          self: { href: '/play' },
          title: { href: `/titles/${title_id}` }
        }
      };
    }

    // v1: minimal response
    return {
      play_token: generateV1Token({ title_id, device_id }),
      expires_in: 300 // v1: 5 min
    };
  });
}
```

OpenAPI spec snippet (openapi.yaml):

```yaml
components:
  schemas:
    PlayResponseV1:
      type: object
      properties:
        play_token:
          type: string
        expires_in:
          type: integer
          example: 300
    PlayResponseV2:
      type: object
      properties:
        play_token:
          type: string
        expires_in:
          type: integer
          example: 600
        _links:
          type: object
          properties:
            self:
              type: object
              properties:
                href:
                  type: string
            title:
              type: object
              properties:
                href:
                  type: string
```

Why This Beats URL Versioning

  • Cache Efficiency: /play is one cache key. Akamai stores v1 and v2 representations separately under the same key, varying only on Accept.
  • Client Flexibility: Frontend can send Accept: application/vnd.netflix.play+json; version=1;q=0.8, application/vnd.netflix.play+json; version=2;q=1.0 — letting the server choose best match.
  • Gradual Rollout: We deployed v2 behind a feature flag that set Accept: ...; version=2 only for internal apps. External partners kept using v1 — no breaking changes.
  • Tooling Compatibility: curl -H "Accept: application/vnd.netflix.play+json; version=2" works. Postman collections work. Swagger UI renders both schemas.

Insider tip #3: Use Vary: Accept, Accept-Version — not just Vary: Accept. Cloudflare and Fastly ignore Accept alone for cache key derivation unless explicitly told to vary on it. We missed this and spent 11 hours debugging why Accept: application/json and Accept: application/vnd.netflix.play+json were sharing cache entries.

Tradeoff note: Media type versioning requires clients to send Accept headers — which browsers don’t do for