From LinkedIn · 2 min
Anatomy of a P0
Most production outages are not exotic. The expensive ones are boring, and they hide in the part of the system nobody looks at because it has always just worked.
Here is the anatomy of one.
I have watched a single deploy take down every API endpoint in a system. The cause was a file that was never supposed to be in the repository.
The setup looked normal. The deploy pipeline copied the service onto the server on every release. Standard. Boring. Safe.
Except .env was sitting in the repository. Someone had committed it years earlier, before .gitignore covered it, and nobody had ever removed it from tracking. So every deploy quietly copied the repo’s .env over the server’s .env.
For months that did nothing. The two files matched.
Then the database password was rotated. Someone updated the server’s .env by hand to the new one. They did the correct thing. What they could not see was that the repo’s .env still held the old password.
Then the next release shipped. The pipeline copied the old password over the corrected one. Every query started failing. Every endpoint that touched the database returned 500. The system was down for about forty minutes before anyone connected the outage to a deploy that “only touched the frontend.”
Here is what makes this a real P0 and not a config typo. Nothing in the deploy was wrong. The pipeline did exactly what it was told. The bug was that config lived in two places that were allowed to disagree, and on every release the wrong one won.
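A drift check makes that kind of disagreement visible before a release does. Here is a minimal Python sketch, assuming both files use plain KEY=VALUE lines. The names parse_env and find_drift are illustrative, not part of any real tool, and the parser deliberately ignores fancier .env syntax such as "export" prefixes or multi-line values.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "x" and x compare as equal.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_drift(repo_env: str, server_env: str) -> list[str]:
    """Return keys whose values differ or that exist on only one side.

    Only key names are returned, so the report never prints a secret.
    """
    repo, server = parse_env(repo_env), parse_env(server_env)
    all_keys = repo.keys() | server.keys()
    return sorted(k for k in all_keys if repo.get(k) != server.get(k))
```

Run as a pre-deploy step against the contents of the two files, a non-empty result is a reason to stop the release and ask which copy is supposed to win.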
The fix is cheap. Stop tracking .env in the repo (git rm --cached .env, then add it to .gitignore), and rotate every secret that has ever appeared in its history, because untracking a file does not remove it from old commits. Exclude config from the deploy sync so the server keeps its own. Better still, stop keeping secrets in files at all. A committed .env secret also lives in your git history, your backups, and every machine that ever cloned the repo. Files were always a stopgap. Secrets belong in a dedicated secrets store that the app reads at startup. On AWS, that is Secrets Manager, which can rotate supported secrets such as database credentials on a schedule once rotation is configured, so the hand-edit that started this never has to happen.
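Reading from a secrets store at startup can be a few lines. Here is a minimal Python sketch using boto3's get_secret_value call, assuming the secret is stored as a JSON string. The function name load_db_credentials, the secret name "prod/api/db", and the "username" and "password" keys are assumptions for illustration, not details from the incident.

```python
import json


def load_db_credentials(secret_id: str, client=None) -> dict:
    """Fetch a JSON secret from AWS Secrets Manager and return it as a dict.

    A client can be passed in for testing; otherwise a real boto3 client
    is created using the environment's AWS credentials and region.
    """
    if client is None:
        import boto3  # imported lazily so tests can run with a stub client

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as a JSON string, not as binary.
    return json.loads(response["SecretString"])


# At startup (hypothetical secret name and keys):
# creds = load_db_credentials("prod/api/db")
# connect(user=creds["username"], password=creds["password"])
```

With this shape there is no file on the server for a deploy to overwrite. A rotated password is picked up the next time the app reads the secret, which in this sketch means the next restart.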
This is one of the cheapest P0s to fix once you see it. And one of the most expensive to discover when you don’t.