
Reason for Outage tracking

history of production issues

Thulani S. Chivandikwa

27 Aug 2022


Reason For Outage (RFO) tracking

An RFO (Reason for Outage) log is a body of knowledge that tracks known outages so that the team can learn from them and refer back to them later. It creates an actionable history of what has gone wrong in production environments and helps with decision-making and with resolving similar issues in the future.

Outages include

Resources that become unavailable. This could be due to many kinds of issues, such as network problems, infrastructure failures, or faulty applications
Failed deployments
Manual application restarts necessitated by something going wrong

Outages do not include

Expected application or infrastructure restarts for maintenance
Application downtime due to deployments. This is not ideal, however, and can be avoided by adopting a different deployment strategy.
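One way to keep these rules unambiguous is to encode them. The following is a minimal sketch in Python; the `EventKind` categories and the `is_outage` helper are hypothetical names introduced only for illustration and simply mirror the two lists above.

```python
from enum import Enum, auto


class EventKind(Enum):
    """Hypothetical event categories mirroring the two lists above."""

    RESOURCE_UNAVAILABLE = auto()  # network, infrastructure, faulty application
    FAILED_DEPLOYMENT = auto()
    UNPLANNED_RESTART = auto()     # manual restart because something went wrong
    PLANNED_MAINTENANCE = auto()   # expected application/infrastructure restart
    DEPLOYMENT_DOWNTIME = auto()   # downtime caused by a regular deployment


# Events that warrant an RFO entry, per the "Outages include" list.
OUTAGE_KINDS = frozenset({
    EventKind.RESOURCE_UNAVAILABLE,
    EventKind.FAILED_DEPLOYMENT,
    EventKind.UNPLANNED_RESTART,
})


def is_outage(kind: EventKind) -> bool:
    """Return True when the event should be recorded as an RFO."""
    return kind in OUTAGE_KINDS
```

A check such as `is_outage(EventKind.FAILED_DEPLOYMENT)` would then decide whether an event needs an entry at all.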

Template:

# Reason for Outage

# Outage

ISO Date

Description

Resolution

Example:

# Reason for Outage

# Surge in application Kubernetes pod restarts

2022-08-27

There was an increase in the number of Kubernetes pod restarts starting at ... and flagged by Datadog at ... . See the screenshot below.

The surge in pod restarts was caused by a faulty configuration of ... that increased memory usage and eventually led to Out of Memory exceptions, which in turn caused the pods to restart. See the ticket ... for details about the misconfiguration and its resolution.

Datadog monitors were also updated by applying ... so that if something like this happens again, we are notified within a shorter window.
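If entries are kept as structured data rather than hand-written text, the template can be rendered consistently. The following is a minimal sketch in Python; the `RfoEntry` class, its field names, and the `render` method are assumptions made for illustration and not part of any existing tool. The elided details from the example stay elided.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RfoEntry:
    """Hypothetical record holding the fields of the RFO template."""

    title: str
    occurred_on: date
    description: str
    resolution: str

    def render(self) -> str:
        """Render the entry in the template's layout, one field per paragraph."""
        parts = [
            "# Reason for Outage",
            f"# {self.title}",
            self.occurred_on.isoformat(),  # ISO date, e.g. 2022-08-27
            self.description,
            self.resolution,
        ]
        return "\n\n".join(parts) + "\n"


entry = RfoEntry(
    title="Surge in application Kubernetes pod restarts",
    occurred_on=date(2022, 8, 27),
    description="There was an increase in the number of Kubernetes pod restarts starting at ...",
    resolution="The surge in pod restarts was due to a faulty configuration of ...",
)
print(entry.render())
```

Storing the date as a `date` object rather than a string keeps the ISO format consistent and makes it straightforward to sort or filter the history later.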