Resend is committed to providing reliable and consistent service to our customers. However, we recognize that incidents might occur. This document outlines our approach to incident management, including detection, response, communication, and post-incident review.
An incident is declared if there is customer impact or degraded performance. An internal incident can also be declared if we anticipate customer impact or degraded performance. This allows us to be proactive in our response and minimize the impact on our customers.
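The declaration rule above can be sketched as a small decision function. This is only an illustration: the names `IncidentKind` and `classify_incident` and the three boolean inputs are hypothetical, not part of any real tooling.

```python
from enum import Enum


class IncidentKind(Enum):
    """Hypothetical labels for the outcomes described above."""
    NONE = "none"
    INTERNAL = "internal"   # impact is anticipated but not yet observed
    DECLARED = "declared"   # customers are impacted or performance is degraded


def classify_incident(customer_impact: bool,
                      degraded_performance: bool,
                      impact_anticipated: bool) -> IncidentKind:
    """Apply the declaration rule: observed impact first, anticipated impact second."""
    if customer_impact or degraded_performance:
        return IncidentKind.DECLARED
    if impact_anticipated:
        return IncidentKind.INTERNAL
    return IncidentKind.NONE
```

Checking observed impact before anticipated impact means an internal incident is only the outcome when nothing customer-facing has happened yet.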
We believe in transparency. If an incident has customer impact, it should be declared. No incident is ignored for being too small. This is the only way we can build trust with our customers.
We also do not believe in shifting blame. In the unfortunate case that our providers are experiencing issues, we declare an incident and work with them to resolve it. Uptime is our responsibility, whether the issue lies in our own service or in a provider's.
If an incident is declared, we will communicate it to our customers via our status page. In situations where a single customer is impacted and we have a direct relationship with them, we work with them via Slack.
We have three severity levels for incidents:
We manage incidents through incident.io, integrated with Slack.
Incidents can be created in two ways:
Every incident has clear roles. One person can hold multiple roles early on, but we separate them as soon as possible.
The primary on-call engineer is the first responder and is paged.
If the alert is not acknowledged within our escalation window, it escalates to the secondary on-call.
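As a rough sketch of the escalation rule, the function below decides who should hold the page at a given moment. The function name and parameters are hypothetical, and the length of the escalation window is passed in as an argument because it is not stated here. The sketch also assumes that an alert acknowledged by the primary stays with the primary.

```python
def current_responder(acknowledged: bool,
                      seconds_since_page: float,
                      escalation_window_seconds: float) -> str:
    """Return which on-call engineer the alert belongs to right now.

    Assumption: an acknowledged alert never escalates, and an
    unacknowledged alert escalates only once the window has fully elapsed.
    """
    if acknowledged:
        return "primary"
    if seconds_since_page <= escalation_window_seconds:
        return "primary"
    return "secondary"
```

In practice this logic lives in the paging tool's escalation policy rather than in application code; the sketch only makes the rule explicit.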
Our first priority is to restore service quickly. Common mitigations:
Diagnosis and a “proper fix” can come after stability is restored.
We communicate early and clearly.
An incident is closed when:
We treat every incident as an opportunity to learn, grow, and take ownership. Getting to the root cause helps us avoid repeat incidents and continuously improve product quality and reliability.
For every incident, the IL prepares the incident report, which includes a timeline of events, impact assessment, contributing factors, resolution steps, and follow-up actions.
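One way to picture the report is as a record with one field per section listed above. The sketch below is an assumption for illustration only: the `IncidentReport` class, its field types, and the `missing_sections` helper are hypothetical and do not describe the actual report template.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IncidentReport:
    """Hypothetical container mirroring the report sections named above."""
    timeline: list[str] = field(default_factory=list)
    impact_assessment: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    resolution_steps: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)


def missing_sections(report: IncidentReport) -> list[str]:
    """List the sections that are still empty, in declaration order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# Usage: a draft with only the first two sections filled in.
draft = IncidentReport(
    timeline=["alert fired", "mitigation applied"],
    impact_assessment="placeholder text",
)
print(missing_sections(draft))
```

A completeness check like this could run before the report is shared, so that no section is skipped by accident.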
Once completed, the incident report is shared with the team and discussed in a blameless post-mortem meeting, where the focus is on learning and improvement. Follow-up actions are created in Linear, assigned to teams, and tracked with deadlines based on the severity of the incident.
For a more detailed breakdown of our post-incident management process, please refer to "How we handle post incident reviews".