Now that all of our services (including backend-main, née “monorepo”) use Trunk-based development and most use Continuous Deployment, we’ve increased the responsibility of individual developers to shepherd their changes into production and into the hands of our customers. This is in sharp contrast to previous deployment approaches, where a “release owner” was responsible for making sure that a release went out without incident. Now, every developer is a “release owner” for their individual changes.
This role comes with significant responsibility. Changes to code (and configuration) are risky and are a significant source of incidents for us: over the last two months, 25% of all incidents and 50% of Sev-1 incidents were attributed to new code. Developers therefore need to monitor the changes they release effectively, so that they can quickly detect and mitigate any negative customer impact.
We’ve seen a significant failure of this process. In the last two months, we had a 6-hour Sev-1 incident caused by new code. The issue was apparent after only a few minutes, but the developer did not detect it for ~5 hours.
While it is tempting to blame “attention to detail” or “ownership” (and certainly, those things can be improved), it is more productive to view this as a systems failure: developers who deploy code:
There are a few rough outlines of approaches we might consider to solve this problem:
I’m proposing the third approach: build an automation that notifies developers via Slack as their code starts to deploy, with links to all the relevant monitoring guides and tools, for example the Datadog dashboard showing our monitors for critical pathways.
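To make the proposal concrete, here is a minimal sketch of what the notification step might look like. It assumes the deploy pipeline can call a small Python hook and that we post through a Slack incoming webhook. The `Deploy` fields, the `MONITORING_LINKS` URLs, and the way the author's Slack ID is obtained are all hypothetical placeholders, not a description of our existing tooling.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class Deploy:
    service: str          # e.g. "backend-main"
    commit_sha: str
    author_slack_id: str  # hypothetical: looked up from the commit author


# Hypothetical placeholder links; real dashboard and guide URLs would go here.
MONITORING_LINKS = {
    "Critical-pathway monitors (Datadog)": "https://example.com/datadog/critical-pathways",
    "Post-deploy monitoring guide": "https://example.com/guides/monitoring-a-deploy",
}


def build_deploy_message(deploy: Deploy, links: dict[str, str]) -> dict:
    """Build a Slack payload telling the author their change is rolling out."""
    # Slack's mrkdwn link syntax is <url|label>; <@ID> mentions a user.
    link_lines = "\n".join(f"• <{url}|{label}>" for label, url in links.items())
    text = (
        f"<@{deploy.author_slack_id}> your change `{deploy.commit_sha[:8]}` "
        f"is starting to deploy to *{deploy.service}*.\n"
        f"Please watch these while it rolls out:\n{link_lines}"
    )
    return {"text": text}


def send_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

An incoming webhook posts to a fixed channel, so this sketch mentions the author there rather than sending a direct message. Delivering a DM would need a Slack app with the appropriate permissions, which is a design choice left open here.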
There are three key reasons why this third approach is preferred:
This would be most impactful for backend-main, but could be re-used for all of our services and apps.
Initially, the included guidance would focus on: