Every five minutes, BangOn runs a cron job. It checks whether any matches are currently in play, fetches the latest scores from an external API, updates the database, and recalculates the leaderboard. When it works, nobody notices. When it doesn't, the leaderboard freezes mid-match and I notice.
This time, it failed at around 10:30pm.
This is the story of finding out why.
THE WRONG ANSWER
The first thing Claude suspected was rate limiting. The external API has a daily request cap. If the cron job had been firing too often, or if something had been making duplicate calls, we might have been hitting the ceiling and getting rejected.
I checked the API dashboard. We had used 70 requests out of a daily limit of 7,500. It was not rate limiting.
The second thing Claude suspected was a deployment issue — maybe a recent change had introduced a bug that only showed up under certain conditions. I got Claude to review what had changed. Nothing obvious.
I noticed the first failures late in the evening. My immediate assumption was a blip — a brief API hiccup, nothing to worry about. So I went to bed.
THE ACTUAL ANSWER
When I woke up the next morning and checked the logs, it had been failing more than half the time all night. That concentrated the mind.
I looked more carefully at the timing. The failures had started at the exact point the last match of the game window ended — and that match had been cancelled.
The cron job decides whether to call the external API by checking if any fixtures are currently in play. The original check was simple: has this fixture finished? If not, it must still be active. Go and fetch the score.
The problem was that "finished" was the only terminal state the code knew about. A cancelled match wasn't finished. So as far as the cron was concerned, it was still live. Every five minutes, all night, it looked at the cancelled fixture, concluded there was a match in progress, and called the API for a score update.
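The post doesn't show BangOn's real code, so here is a simplified TypeScript sketch of the original check as described above. The `Fixture` shape, the status strings other than FINISHED and CANCELLED, and the function names are hypothetical. Kick-off times are left out so the gap is easy to see.

```typescript
// Hypothetical fixture shape; field names and status values are assumptions,
// not BangOn's real schema.
type FixtureStatus = "IN_PLAY" | "FINISHED" | "CANCELLED" | "POSTPONED";

interface Fixture {
  id: number;
  status: FixtureStatus;
}

// The original rule, roughly: anything that hasn't finished counts as live.
function isStillActiveOriginal(fixture: Fixture): boolean {
  return fixture.status !== "FINISHED";
}

// Call the score API if any fixture in the window looks live.
function shouldCallApiOriginal(fixtures: Fixture[]): boolean {
  return fixtures.some(isStillActiveOriginal);
}

const windowFixtures: Fixture[] = [
  { id: 1, status: "FINISHED" },
  { id: 2, status: "FINISHED" },
  { id: 3, status: "CANCELLED" }, // will never reach FINISHED
];

// Logs true: the cancelled fixture keeps the "match in progress" branch alive.
console.log(shouldCallApiOriginal(windowFixtures));
```

Nothing in this check can ever move a cancelled fixture out of the "live" bucket, so every run keeps concluding that a match is in progress.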
The API had nothing to return. The endpoint for a competition with no active matches would hang and then time out. The cron threw a 500 error and tried again five minutes later.
The fix was straightforward once the cause was clear. Treat CANCELLED and POSTPONED as terminal states, the same as FINISHED — if every fixture in the window is in one of those states, there is nothing to sync and the API should not be called at all. Then wrap the API call in a try/catch so that if it times out for any reason, the cron returns a clean skip with a logged reason rather than a wall of 500 errors that looks like something serious has gone wrong.
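The try/catch half of the fix might look something like this minimal sketch. It assumes a runtime with a global `AbortController` (modern Node or a browser-like environment). `syncWithTimeout`, `ScoreFetcher` and `SyncResult` are hypothetical names, and `fetchScores` stands in for whatever code actually calls the external API. For the timeout to bite, that code has to pass the signal through; the built-in `fetch` rejects when the `signal` in its options is aborted.

```typescript
// Hypothetical result of one cron run: either scores were synced,
// or the run was skipped with a human-readable reason for the logs.
type SyncResult =
  | { status: "synced"; updated: number }
  | { status: "skipped"; reason: string };

// Stand-in for the real API call. It receives an AbortSignal so the
// request can be cancelled when the timeout fires, and resolves to the
// number of fixtures updated (an assumption for this sketch).
type ScoreFetcher = (signal: AbortSignal) => Promise<number>;

async function syncWithTimeout(
  fetchScores: ScoreFetcher,
  timeoutMs: number,
): Promise<SyncResult> {
  const controller = new AbortController();
  // Abort the in-flight request if it hangs for longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const updated = await fetchScores(controller.signal);
    return { status: "synced", updated };
  } catch (err) {
    // Turn a hang or an API error into a logged skip instead of a 500.
    const reason = controller.signal.aborted
      ? `score API timed out after ${timeoutMs}ms`
      : `score API error: ${err instanceof Error ? err.message : String(err)}`;
    console.warn(`cron skipped: ${reason}`);
    return { status: "skipped", reason };
  } finally {
    // Always clear the timer so it can't fire after the run has finished.
    clearTimeout(timer);
  }
}
```

Returning a `skipped` result with a reason, rather than letting the exception escape, is what keeps a hung request down to one explanatory log line per run.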
The diagnosis took most of the morning. The fix took twenty minutes.
WHAT THE CRON JOB LOOKS LIKE NOW
The logic now runs roughly like this: are there any matches in this window that have kicked off and aren't in a terminal state — finished, cancelled, or postponed? If yes, fetch scores and update. If no, return early and log a reason.
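That decision can be sketched the same way. Again this is an illustration under assumptions, not BangOn's code: the `kickoff` field, the `SCHEDULED` and `IN_PLAY` statuses, and the `decideSync` and `SyncDecision` names are hypothetical. The current time is passed in rather than read inside the function, which keeps it easy to test.

```typescript
// Hypothetical fixture shape; field names and status values are assumptions.
type FixtureStatus = "SCHEDULED" | "IN_PLAY" | "FINISHED" | "CANCELLED" | "POSTPONED";

interface Fixture {
  id: number;
  status: FixtureStatus;
  kickoff: Date;
}

// Every state that means "this fixture will not produce more score updates".
const TERMINAL_STATES: ReadonlySet<FixtureStatus> = new Set<FixtureStatus>([
  "FINISHED",
  "CANCELLED",
  "POSTPONED",
]);

type SyncDecision =
  | { sync: true; liveIds: number[] }
  | { sync: false; reason: string };

function decideSync(fixtures: Fixture[], now: Date): SyncDecision {
  // Live = has kicked off AND is not in any terminal state.
  const live = fixtures.filter(
    (f) => f.kickoff.getTime() <= now.getTime() && !TERMINAL_STATES.has(f.status),
  );
  if (live.length > 0) {
    return { sync: true, liveIds: live.map((f) => f.id) };
  }
  // Nothing to sync: return early and record why, without calling the API.
  const reason =
    fixtures.length === 0
      ? "no fixtures in this window"
      : "no fixture has kicked off and is still in a non-terminal state";
  console.info(`cron skipped: ${reason}`);
  return { sync: false, reason };
}
```

Keeping the terminal states in one set means that if another "done" state turns up later, recognising it is a one-line change.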
It sounds simple. It is simple, once you know what the problem is. The tricky part was that the original version also felt simple — it just had one gap in its understanding of what "done" looks like.
The lesson I took from it: when something starts failing, the first question isn't what changed in the code. It's what changed in the world. The cancelled match was the answer. It was sitting there in the data all along.
