Post-Mortem Template¶
Post-mortems are blameless by design. The goal is to understand what happened, why the system allowed it to happen, and what changes will reduce the likelihood or impact of recurrence. Individual actions are recorded as facts, not as failures.
Template¶
Incident Summary¶
| Field | Value |
|---|---|
| Incident ID | |
| Date / Time (UTC) | |
| Duration | |
| Severity | P1 / P2 / P3 |
| Service(s) affected | |
| Author | |
| Review date | |
One-paragraph summary: What happened, when it was detected, and how it was resolved.
Timeline¶
All times in UTC. Record facts as they occurred — do not reconstruct or tidy the sequence.
| Time (UTC) | Event |
|---|---|
Contributing Factors¶
List the conditions that made this incident possible. Each factor should be a system or process property, not a person. A contributing factor explains why the system behaved the way it did, not who caused it.
- Factor 1: ...
- Factor 2: ...
- Factor 3: ...
Customer Impact¶
Describe what customers or internal users experienced. Be specific: which journeys were affected, whether data was lost or delayed, and whether the impact was visible to end users.
- User-facing impact: ...
- Data integrity impact: None / Partial / Full — describe
- Estimated users/requests affected: ...
Detection Gap¶
How long between the fault occurring and the team becoming aware? What surfaced the issue — an alert, a customer report, a manual check? What would have caught it sooner?
- Time to detection: ...
- Detection mechanism: ...
- Gap identified: ...
Action Items¶
Each action item should have a single owner and a committed due date. Items without owners are not action items.
| # | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| 1 | | | | Open |
| 2 | | | | Open |
| 3 | | | | Open |
Worked Example (Redacted)¶
Incident Summary¶
| Field | Value |
|---|---|
| Incident ID | INC-2024-047 |
| Date / Time (UTC) | 2024-09-12, 09:14 UTC |
| Duration | 2 hours 11 minutes |
| Severity | P2 |
| Service(s) affected | Order Fulfilment Background Processor |
| Author | [Redacted] |
| Review date | 2024-09-16 |
One-paragraph summary: Between 09:14 and 11:25 UTC on 12 September 2024, the Order Fulfilment Background Processor silently stopped processing messages from its Azure Service Bus queue. Affected messages were moved to the dead-letter sub-queue after exceeding the MaxDeliveryCount threshold, which had been incorrectly set to 1 during a configuration change deployed the previous evening. No alert fired during the two-hour window. The incident was identified via a customer support report at 11:19 UTC. The service was restored by correcting the configuration value and replaying dead-lettered messages.
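The resolution amounted to changing a single queue property. For background, a minimal sketch of reading and correcting that property with the Python `azure-servicebus` management client; the connection string and queue name are placeholders, not values from this incident:

```python
from azure.servicebus.management import ServiceBusAdministrationClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "order-fulfilment"                    # hypothetical queue name

admin = ServiceBusAdministrationClient.from_connection_string(CONN_STR)
queue = admin.get_queue(QUEUE)                           # fetch current properties
print(f"MaxDeliveryCount = {queue.max_delivery_count}")  # read 1 during the incident
queue.max_delivery_count = 10                            # the intended value
admin.update_queue(queue)                                # write corrected properties back
```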
Timeline¶
| Time (UTC) | Event |
|---|---|
| 2024-09-11 18:42 | Configuration change deployed to production; MaxDeliveryCount set to 1 (intended value: 10) |
| 2024-09-12 09:14 | First message delivery attempted post-restart; message immediately dead-lettered after single processing attempt |
| 09:14 – 11:19 | Messages continued to arrive and dead-letter silently; no alert triggered |
| 11:19 | Customer support receives report of unprocessed orders; ticket raised internally |
| 11:31 | On-call engineer identifies dead-letter queue depth at 847 messages |
| 11:38 | Root cause identified: MaxDeliveryCount = 1 in production configuration |
| 11:44 | Configuration corrected and service restarted |
| 11:52 | Dead-lettered messages replayed via Service Bus Explorer |
| 11:58 | Message processing confirmed healthy; dead-letter queue draining |
| 12:25 | Queue fully drained; incident closed |
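The replay at 11:52 used Service Bus Explorer. For teams without that tooling to hand, a scripted equivalent is sketched below with the Python `azure-servicebus` SDK; the connection string and queue name are placeholders, and the sketch copies message bodies only (a production replay would also carry over application properties):

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage, ServiceBusSubQueue

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "order-fulfilment"                    # hypothetical queue name

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    receiver = client.get_queue_receiver(QUEUE, sub_queue=ServiceBusSubQueue.DEAD_LETTER)
    sender = client.get_queue_sender(QUEUE)
    with receiver, sender:
        while True:
            batch = receiver.receive_messages(max_message_count=50, max_wait_time=5)
            if not batch:
                break  # dead-letter sub-queue is drained
            for msg in batch:
                # Re-enqueue the body on the main queue, then settle the
                # dead-lettered copy so it is not replayed twice.
                sender.send_messages(ServiceBusMessage(str(msg)))
                receiver.complete_message(msg)
```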
Contributing Factors¶
- No dead-letter queue depth alert existed. The monitoring setup covered queue lag on the main topic but not dead-letter accumulation. A spike in dead-letter depth was not observable without a manual check.
- The configuration change was not reviewed against the service's processing contract. MaxDeliveryCount controls how many times a message is attempted before dead-lettering. Setting it to 1 was valid Service Bus configuration syntax, so no validation error was raised at deployment time.
- The deployment pipeline did not include a post-deploy smoke test that would have surfaced the misconfiguration. A single end-to-end message processed successfully after deploy would have caught this within minutes (a sketch of such a test follows this list).
- The dead-letter replay path was not documented. The engineer on call had to locate the correct tooling and procedure during the incident, adding time to resolution.
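One possible shape for that post-deploy smoke test, sketched with the Python `azure-servicebus` SDK: publish a synthetic message, then poll the queue's runtime properties until it is consumed or dead-lettered. The names, the `synthetic` property convention, and the assumption that the queue is otherwise idle during the check are all illustrative, not the team's actual pipeline:

```python
import time
import uuid

from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.servicebus.management import ServiceBusAdministrationClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "order-fulfilment"                    # hypothetical queue name


def smoke_test(timeout_s: int = 60) -> bool:
    """Publish one synthetic message and check the processor consumes it."""
    probe = ServiceBusMessage(
        str(uuid.uuid4()),
        application_properties={"synthetic": True},  # assumed: processor no-ops these
    )
    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_queue_sender(QUEUE) as sender:
            sender.send_messages(probe)

    admin = ServiceBusAdministrationClient.from_connection_string(CONN_STR)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        runtime = admin.get_queue_runtime_properties(QUEUE)
        if runtime.dead_letter_message_count > 0:
            return False  # probe dead-lettered: the exact failure mode of this incident
        if runtime.active_message_count == 0:
            return True   # probe consumed within the window
        time.sleep(5)
    return False
```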
Customer Impact¶
- User-facing impact: Orders submitted between 09:14 and 11:52 UTC were not fulfilled. Users received no error — the checkout flow completed normally, but fulfilment was not triggered.
- Data integrity impact: No data was lost. All messages were recoverable from the dead-letter sub-queue and successfully replayed.
- Estimated users/requests affected: 847 orders affected. All subsequently processed.
Detection Gap¶
- Time to detection: 2 hours 5 minutes
- Detection mechanism: Customer support report, not automated alerting
- Gap identified: No alert existed for dead-letter queue depth. The system failed silently within its operational boundary: Service Bus behaved correctly for the configuration it was given. The gap was in observability, not in the message broker.
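The Azure Monitor alert in action item 1 closes this gap at the platform level. The same depth is also directly queryable, which is useful as a cross-check or a stopgap while the alert is rolled out; a minimal sketch, with the threshold mirroring action item 1 and the names as placeholders:

```python
from azure.servicebus.management import ServiceBusAdministrationClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "order-fulfilment"                    # hypothetical queue name
THRESHOLD = 10                                # mirrors the alert in action item 1

admin = ServiceBusAdministrationClient.from_connection_string(CONN_STR)
runtime = admin.get_queue_runtime_properties(QUEUE)
if runtime.dead_letter_message_count > THRESHOLD:
    # In a real check this would page the on-call; print stands in for the hook.
    print(f"DLQ depth {runtime.dead_letter_message_count} exceeds {THRESHOLD}")
```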
Action Items¶
| # | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| 1 | Add Azure Monitor alert for dead-letter queue depth > 10 on all production Service Bus queues | [Redacted – Platform Engineer] | 2024-09-19 | Completed |
| 2 | Add post-deploy smoke test to fulfilment processor pipeline: publish one synthetic message and assert it is consumed within 60 seconds | [Redacted – Backend Engineer] | 2024-09-26 | Completed |
| 3 | Document dead-letter replay procedure in the service runbook; link from the team wiki | [Redacted – Tech Lead] | 2024-09-20 | Completed |
| 4 | Add configuration schema validation step to deployment pipeline to check MaxDeliveryCount is within expected range (1–10 is valid; flag anything below 3 for manual review) | [Redacted – Platform Engineer] | 2024-10-04 | In Progress |
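A sketch of the validation rule in action item 4, assuming the pipeline can run a Python check against the rendered configuration; the dictionary key is illustrative:

```python
def check_max_delivery_count(config: dict) -> list[str]:
    """Validate MaxDeliveryCount per action item 4's rule: 1-10 is
    syntactically valid, but anything below 3 is flagged for manual review
    because a low count dead-letters messages after very few attempts."""
    findings = []
    value = config.get("MaxDeliveryCount")
    if value is None:
        findings.append("MaxDeliveryCount missing from configuration")
    elif not 1 <= value <= 10:
        findings.append(f"MaxDeliveryCount={value} outside valid range 1-10")
    elif value < 3:
        findings.append(f"MaxDeliveryCount={value} requires manual review")
    return findings


# The incident's configuration would have been flagged at deploy time:
assert check_max_delivery_count({"MaxDeliveryCount": 1}) == [
    "MaxDeliveryCount=1 requires manual review"
]
```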