How we handle incidents

Resend is committed to providing reliable and consistent service to our customers. However, we recognize that incidents might occur. This document outlines our approach to incident management, including detection, response, communication, and post-incident review.

When to declare an incident

An incident is declared if there is customer impact or degraded performance. An internal incident can also be declared if we anticipate customer impact or degraded performance. This allows us to be proactive in our response and minimize the impact on our customers.
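
As a minimal sketch, the declaration rule above can be read as a small decision function. The names `Declaration` and `decide_declaration` are hypothetical and exist only for illustration; they are not part of incident.io or of our tooling.

```python
from enum import Enum


class Declaration(Enum):
    NONE = "no incident"
    INTERNAL = "internal incident"
    INCIDENT = "incident"


def decide_declaration(
    customer_impact: bool,
    degraded_performance: bool,
    impact_anticipated: bool = False,
) -> Declaration:
    """Hypothetical helper mirroring the declaration rule in prose form."""
    # Actual customer impact or degraded performance always means an incident.
    if customer_impact or degraded_performance:
        return Declaration.INCIDENT
    # Anticipated impact is enough to declare an internal incident.
    if impact_anticipated:
        return Declaration.INTERNAL
    return Declaration.NONE
```

For example, `decide_declaration(False, False, impact_anticipated=True)` returns `Declaration.INTERNAL`, which reflects the proactive case described above.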

We believe in transparency. If an incident has customer impact, it should be declared. No incident is ignored for being too small. This is the only way we can build trust with our customers.

We also do not believe in shifting blame. If one of our providers is experiencing issues, we declare an incident and work with them to resolve it. Uptime is our responsibility, whether the issue lies in our own service or in a provider's.

If an incident is declared, we will communicate it to our customers via our status page. In situations where a single customer is impacted and we have a direct relationship with them, we work with them via Slack.

Severity levels

We have three severity levels for incidents:

  • Critical: all customers affected, critical service degradation or interruption
  • Major: large subset of customers affected, critical degradation of service
  • Minor: small subset of customers affected, no critical degradation of service
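
The three levels can be sketched as a lookup from customer scope and degradation to severity. This is an illustration under assumptions: `Scope`, `Severity`, and `classify` are hypothetical names, and the list above does not say how to treat other combinations (for example, all customers affected without critical degradation), so the sketch returns `None` for those and leaves the call to the Incident Lead.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    ALL = "all customers"
    LARGE_SUBSET = "large subset of customers"
    SMALL_SUBSET = "small subset of customers"


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"


# Only the three combinations listed above are encoded here.
# The boolean means "critical degradation (or interruption) of service".
_SEVERITY_TABLE = {
    (Scope.ALL, True): Severity.CRITICAL,
    (Scope.LARGE_SUBSET, True): Severity.MAJOR,
    (Scope.SMALL_SUBSET, False): Severity.MINOR,
}


def classify(scope: Scope, critical_degradation: bool) -> Optional[Severity]:
    """Return the listed severity, or None when the combination is not listed."""
    return _SEVERITY_TABLE.get((scope, critical_degradation))
```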

Detection and declaring

We manage incidents through incident.io, integrated with Slack.

Incidents can be created in two ways:

  • Automatically based on our monitors
  • Manually by anyone via incident.io

Roles

Every incident has clear roles. One person can hold multiple roles early on, but we separate them as soon as possible.

  • Incident Lead (IL): Owns coordination, priorities, and decisions (mitigation first, then diagnosis)
  • Communication Lead (CL): Owns updates to the status page and internal communication

First response

The primary on-call engineer is the first responder and is paged.

If the alert is not acknowledged within our escalation window, it escalates to the secondary on-call.
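
The escalation rule can be sketched as follows. The length of the escalation window is not stated here, so it is a parameter; `current_responder` is a hypothetical name, and treating the exact end of the window as the moment of escalation is an assumption.

```python
def current_responder(
    acknowledged: bool,
    seconds_since_page: float,
    escalation_window_seconds: float,
) -> str:
    """Hypothetical helper: who should be handling the page right now."""
    # An acknowledged alert stays with the primary on-call engineer.
    if acknowledged:
        return "primary"
    # Unacknowledged once the window has elapsed: escalate to the secondary.
    if seconds_since_page >= escalation_window_seconds:
        return "secondary"
    # Still inside the window: the primary has time left to acknowledge.
    return "primary"
```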

Mitigation-first mindset

Our first priority is to restore service quickly. Common mitigations:

  • roll back to the last known good deploy
  • disable the relevant feature flag

Diagnosis and a “proper fix” can come after stability is restored.

Communication

We communicate early and clearly.

  • Internal updates happen continuously in the incident channel.
  • Status page updates are owned by the Communication Lead.

Closing an incident

An incident is closed when:

  • customer impact is resolved, and
  • the system is stable (monitoring is green, backlog drained, no ongoing degradation)
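
The closing criteria amount to a conjunction of checks, sketched below. `IncidentState` and `can_close` are hypothetical names; how each signal is actually measured (which monitors, which backlog) is not specified here and is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    customer_impact_resolved: bool
    monitoring_green: bool
    backlog_drained: bool
    ongoing_degradation: bool


def can_close(state: IncidentState) -> bool:
    """An incident closes only when impact is resolved AND the system is stable."""
    system_stable = (
        state.monitoring_green
        and state.backlog_drained
        and not state.ongoing_degradation
    )
    return state.customer_impact_resolved and system_stable
```

In this sketch, an incident whose customer impact is resolved but whose backlog is still draining stays open.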

Post incident

We treat every incident as an opportunity to learn, grow, and take ownership. Getting to the root cause helps us avoid repeat incidents and continuously improve product quality and reliability.

For every incident, the IL prepares the incident report, which includes a timeline of events, impact assessment, contributing factors, resolution steps, and follow-up actions.
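
As an illustration, the report sections listed above could be modeled as a simple record with a completeness check. This is a sketch only: `IncidentReport` and `missing_sections` are hypothetical names and do not describe the incident.io report format.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class IncidentReport:
    timeline: List[str] = field(default_factory=list)
    impact_assessment: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    resolution_steps: List[str] = field(default_factory=list)
    follow_up_actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A report is ready to share once `missing_sections()` returns an empty list.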

Once completed, the incident report is shared with the team and discussed in a blameless post-mortem meeting. The focus is on learning and improvement, not blame. Follow-up actions are created in Linear, assigned to teams, and tracked against deadlines based on the severity of the incident.

For a more detailed breakdown of our post-incident management process, please refer to "How we handle post incident reviews".