Problem

Now that all of our services (including backend-main, née “monorepo”) use trunk-based development and most use continuous deployment, we’ve shifted more responsibility onto individual developers to shepherd their changes into production and into the hands of our customers. This is a sharp contrast to our previous deployment approach, where a designated “release owner” was responsible for making sure a release went out without incident. Now, every developer is the “release owner” for their individual changes.

This role comes with significant responsibility: changes to code (and configuration) are risky and are a major source of incidents for us. Over the last two months, 25% of all incidents and 50% of Sev-1 incidents were attributed to new code. Developers therefore need to monitor the changes they release so they can quickly detect any negative customer impact and mitigate it.

We’ve already seen a significant failure of this process. In the last two months, we had a Sev-1 incident caused by new code that lasted six hours. The issue was apparent within minutes of the deployment, but the developer did not detect it for roughly five hours.

While it is tempting to blame “attention to detail” or “ownership” (and certainly, those can be improved), it is more productive to view this as a systems failure. Developers who deploy code:

  1. Do not reliably know when their code goes to production
  2. Do not know what to monitor, or how, to catch issues with their changes

Proposal

There are a few broad approaches we might consider to solve this problem:

  1. Training - train developers on how to do post-deployment monitoring
  2. Process - institute a process in which developers outline, subject to review, the monitoring they will do for each individual change
  3. Automation - notify developers when their deployment is finished with links to relevant resources

I’m proposing the third approach: build automation that notifies developers via Slack when their code begins deploying to production, with links to the relevant monitoring resources - for example, the Datadog dashboard showing our monitors for critical pathways.
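
As a rough sketch of how this could work (not a final design): a post-deploy hook could look up the author of the commit being deployed, resolve their Slack account, and send them a direct message with the monitoring links. The dashboard and guide URLs, the SLACK_BOT_TOKEN and DEPLOY_COMMIT_SHA environment variables, and the assumption that git author emails match Slack profile emails are all illustrative placeholders, not decisions.

    # Sketch of a post-deploy notification hook (assumptions noted above).
    import os
    import subprocess

    from slack_sdk import WebClient

    DASHBOARD_URL = "https://app.datadoghq.com/dashboard/<critical-pathways>"  # placeholder
    MONITORING_GUIDE_URL = "https://wiki.example.com/post-deploy-monitoring"   # placeholder

    def notify_author(commit_sha: str) -> None:
        # Resolve the author's email from the commit being deployed.
        author_email = subprocess.check_output(
            ["git", "show", "-s", "--format=%ae", commit_sha], text=True
        ).strip()

        client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

        # Map the git author email to a Slack user (assumes the emails match).
        user_id = client.users_lookupByEmail(email=author_email)["user"]["id"]

        # DM the author that their change is going out, with monitoring links.
        client.chat_postMessage(
            channel=user_id,
            text=(
                f"Your commit `{commit_sha[:8]}` is starting to deploy to production.\n"
                f"• Critical-pathway monitors: {DASHBOARD_URL}\n"
                f"• Post-deploy monitoring guide: {MONITORING_GUIDE_URL}"
            ),
        )

    if __name__ == "__main__":
        # DEPLOY_COMMIT_SHA would be set by the deploy pipeline (assumption).
        notify_author(os.environ["DEPLOY_COMMIT_SHA"])

Because something like this runs from the deploy pipeline itself, it fires at exactly the moment the developer needs the information, independent of team, tenure, or onboarding.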

There are three key reasons why this third approach is preferred:

  1. It is universal - it does not depend on each team’s onboarding process and applies to all engineers (new hires, existing engineers, engineers new to a specific repo, etc.)
  2. It does not slow us down - unlike an added process and review step (which usually involves a costly handoff), it delivers the right information at the right time without getting in the way of shipping in the first place
  3. It solves the problem of knowing when your changes reach production - all the training in the world will not ensure that developers know exactly when their changes are going out, and any developer may get sidetracked with other work while their changes are being deployed

This would be most impactful for backend-main, but the same automation could be reused for all of our services and apps.

Initially, the included guidance would focus on: