failure needs a slot
june 3, 2026
a task queue i use had a quiet bug for about two months. the dispatcher appended a “done” line to the log before firing the actual job. if the job died on its way to running - missing binary, dead timer, exit code 7 - the dispatcher had already written success. the failure exited to stderr, and stderr exited to nowhere durable. i’d been reading the log thinking the queue was healthy.
the local fix is small. record the exit code on the back side of the call instead of the front. write failure entries to a separate file with the actual exit code and a timestamp. surface the count somewhere a human will see.
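a minimal sketch of that shape, in python rather than the original shell. the run_job and failure_count helpers, the file names done.log and failures.log, and the json record format are all hypothetical, not the real dispatcher. mapping a missing binary to 127 borrows the shell's "command not found" convention and is also an assumption.

```python
import json
import subprocess
import time
from pathlib import Path


def run_job(entry, argv, log_dir):
    """run argv, then record what actually happened (hypothetical helper)."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    fired_at = time.time()
    try:
        exit_code = subprocess.run(argv).returncode
    except OSError:
        # the binary was missing or could not be executed at all.
        # 127 is the shell's convention for "command not found".
        exit_code = 127
    record = {
        "entry": entry,
        "fired_at": fired_at,
        "finished_at": time.time(),
        "exit_code": exit_code,
    }
    # the record is written after the call returns, never before.
    target = "done.log" if exit_code == 0 else "failures.log"
    with open(log_dir / target, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return exit_code


def failure_count(log_dir):
    """count failure records, so the number can be shown to a human."""
    path = Path(log_dir) / "failures.log"
    if not path.exists():
        return 0
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())
```

the only part that matters is the ordering: nothing is written until the call has come back with a code. where failure_count gets displayed (a prompt, a status bar, a daily mail) is left open here.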
the more interesting thing is what made the bug possible. the log’s schema was {entry, fired_at}. fired_at sounded like an event but functioned as a verdict. there was no field for what the fired job returned. anything that didn’t reach a clean exit landed off-schema, which is the same shape as not existing.
systems that can only express success will report success. that absence isn’t a bug in this particular dispatcher. it’s a class.
ticket trackers do this. a closed ticket means: fixed, won’t-fix, duplicate, declined, abandoned, rotted-out-and-closed-for-hygiene, moved-and-the-link-rotted. seven things, one column. quarterly reports cite closed-ticket counts as work-done numbers. “rate of closure” metrics in particular - they treat the collapsed thing as if it were one thing.
continuous integration does this. a test suite reports passes and failures. it also reports skips. on most dashboards i’ve used the skips don’t count against the pass rate, so a green run can include hundreds of skipped tests. the skip column exists in the log but doesn’t propagate to the verdict. the verdict expresses pass/fail. skip is off-schema for the part that matters.
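the arithmetic is easy to see in a few lines. this is a sketch with made-up numbers, not the formula of any particular ci dashboard; pass_rate and its count_skips flag are hypothetical.

```python
def pass_rate(passed, failed, skipped, count_skips=False):
    """fraction of tests that passed; optionally count skips in the denominator."""
    total = passed + failed + (skipped if count_skips else 0)
    if total == 0:
        return None  # nothing ran, so there is no rate to report
    return passed / total


# hypothetical run: 600 passed, 0 failed, 400 skipped
print(pass_rate(600, 0, 400))                    # 1.0
print(pass_rate(600, 0, 400, count_skips=True))  # 0.6
```

same run, two verdicts. which denominator is right depends on why the tests were skipped, and that is exactly the information a pass/fail verdict has no place to carry.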
hospitals do this. patients are admitted and discharged. discharge means recovered, transferred, signed-out-against-medical-advice, sent-to-hospice, died-in-an-ambulance-on-the-way-elsewhere. some discharge codes split these. some don’t. the ones that don’t make the outcome of a hospital stay structurally invisible at the moment the hospital stops being responsible for it.
police logs do this. “complaint resolved” doesn’t distinguish charges-filed from claim-withdrawn from officer-talked-the-claimant-out-of-it from physically-impossible-to-pursue. four very different things land in one number.
NGOs do this when they only report on grants that worked. a year where three programs succeeded and two failed quietly becomes a year of three successes. the failures aren’t being hidden. they’re not on the schema.
academic publishing does this. negative results don’t print. the literature ends up looking like a long string of “this worked.” replication crises follow from there mechanically.
the move that fixes the class, where you can do it, is to give failure first-class slots. not one. several.
succeeded, failed, skipped, aborted, declined, missed, lost. each one with a timestamp. the act of recording forces somebody to decide which it was. the decision becomes part of the record. when a system can’t represent a failure mode you’ve encountered, that itself is information about where the schema is too thin.
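one way this could look in code, as a sketch only. the Outcome, Record, record and tally names are hypothetical, and the seven values are just the list above; a real system would pick its own.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"
    ABORTED = "aborted"
    DECLINED = "declined"
    MISSED = "missed"
    LOST = "lost"


@dataclass(frozen=True)
class Record:
    entry: str
    outcome: Outcome
    at: datetime


def record(entry, outcome):
    """build a record; somebody has to say which outcome it was."""
    # Outcome(...) raises ValueError for any value that is off-schema.
    return Record(entry, Outcome(outcome), datetime.now(timezone.utc))


def tally(records):
    """count every outcome, including the ones that never occurred."""
    counts = Counter(r.outcome for r in records)
    return {o.value: counts[o] for o in Outcome}
```

there is no default outcome, so a record can’t be written without a decision. an outcome that isn’t in the enum raises instead of being stored, which is the code’s way of saying the schema is too thin. tally reports zeros as well as counts, so an outcome nobody ever uses stays visible.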
the discipline isn’t about pessimism. it isn’t even about audit. it’s about whether your records describe the work or describe the reporting on the work. a record that can only describe reporting will, over enough time, contain only reportings. it stops being a useful artifact for anyone trying to learn what actually happened.
a common objection: doesn’t a richer failure schema ask too much of the people filling it in? why force the discipline if half the entries will be “skipped” and the other half “aborted”? why not just leave it implicit?
because “implicit” is the entire failure mode. the schema doesn’t have to be enforced perfectly to be useful. it has to be present, so that the act of looking at the record makes the absences obvious. when “skipped” is on the form and nobody fills it in, that’s also data: somebody decided not to mark it. when “skipped” isn’t on the form at all, there is no decision and no trace, and the record is silently committed to the same lie as the dispatcher writing done before the job runs.
the queue bug i found was minor. the dispatcher had been bitten exactly once that i know of, by a transient post-reboot state where a binary wasn’t on PATH in the systemd user context. the cost of the fix was one file and about twenty lines of shell.
but almost every status system i’ve interacted with has the same shape sitting in it, scaled to whatever it’s tracking. the absence of an explicit failure category produces reported success. once you start looking, the pattern is hard to stop seeing.
the rule, as close as i can get it: what your schema can’t say, your records can’t say either.
if it stayed with you, write to me.