Log Management

Facebook, Instagram, and Whatsapp's Outage: Understanding MTTR

Deepa Ramachandra
Deepa Ramachandra
Share:

Understanding Mean Time To Resolution (MTTR) from the Facebook, Instagram, and Whatsapp’s Outage

Yesterday, the most used social media platforms worldwide were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is that outages and performance issues are bound to happen in any network. It is more important how quickly we react and resolve incidents. This post takes you through the most critical factors in your incident response – Mean Time To Resolution (MTTR) and Mean Time To Detect(MTTD).

What is MTTD?

Mean Time To Detect is a key performance indicator in any incident response plan. MTTD is the average time the SRE takes to detect an incident from the time of its occurrence. The mathematical formula used to calculate this value is:

MTTD = Total time taken to detect incidents over some time/ number of incidents.

As in the case of Facebook, incident reports and outage notifications can come from end-users, or application monitoring and management tools, such as observIQ, can send you alerts immediately when something is not working as expected. Businesses aim to lower their MTTR as much as they can. Enterprises invest in observability tools that can provide granular-level insights into the event logs to correlate data from various application sources and infrastructure.

What is MTTR?

Mean Time To Resolution is the time taken to resolve an application issue or an outage and get the application functioning as per the set KPI. MTTR is arrived at by dividing the total time to resolve incidents by the number of incidents that occurred during that period. It is important to note that this is only a statistical average. There are possible deviations to this value based on other contributing factors. The main objective of having a good MTTR is to make an outage less impactful on the end-users. The best-case scenario is to avoid the disruption of system usage. A good system architecture makes the system resilient to performance issues and outages. In a hybrid system architecture, the application’s code resides separately from the web service calls to keep each component’s upgrades or deployments unique. MTTR cannot be maintained low when the resolution, deployment, or upgrade to one component creates the dependency to update the application’s other features.

What factors influence MTTR and MTTD?

Incident detection and resolution strategy:

All incidents that occur in the application are not similar. Therefore, MTTR calculated without factoring in the severity levels of the incidents would give only a generic value for the MTTR and not specific to a severity level. Another critical factor that many MTTR calculations need to factor in is the differences in user traffic. For instance, e-commerce applications see heavy site traffic during the Black Friday sale season. The MTTR calculation in such a scenario should be specifically for that period and not year-round.

You'll need to frame a plan knowing the varied levels of incidents. Strategically laying out a lean ITSM plan would save your business time and resources on incident response. There cannot be a one-size-fits-all approach to your incident response. The cause and effect of every action are gauged, the responses calibrated, and the inferences drawn to make plans for effective responses. An ad-hoc approach to incident response is a recipe for failure. An ad-hoc response or fix is bound to cause unforeseen issues in other KPIs.

Modern businesses adopt a more streamlined approach to incident response. To form a response, the incidents in a system are recorded in a controlled environment, and the values are calibrated. Not only does this help create a picture of a real-time occurrence, but it also helps assign roles to team members during the response.

Using an efficient log management tool:

Monitoring is critical for incident response. If data accumulates over time, arriving at essential SLAs and KPIs would be possible. An application monitoring tool gives you a real-time read on the application’s health over time. Functionalities like Live Tail in observIQ make troubleshooting collaborative and straightforward. The theory that monitoring is for intrusion only has been debunked, with points of vulnerability extending to every corner of the system. If there is a failure caused by an intrusion, without monitoring, it may take days for the SRE to assess the site of intrusion. Monitoring tools provide alerting options for service level indicator SLIs such as performance and security-related anomalies. When an incident occurs, the monitoring tool alerts all configured SRE resources to draw attention to the incident right after it happens, drastically reducing the MTTD and MTTR. You can set alerts for anything from a latency issue to a throughput drop below the threshold value.

Although the Facebook outage is much discussed today, businesses of all sizes struggle with performance and outage issues. The best way to stay prepared is to monitor. We offer a free plan at observIQ. You can try it out to see how you can tackle your MTTR.

Deepa Ramachandra
Deepa Ramachandra
Share:

Related posts

All posts

Get our latest content
in your inbox every week

By subscribing to our Newsletter, you agreed to our Privacy Notice

Community Engagement

Join the Community

Become a part of our thriving community, where you can connect with like-minded individuals, collaborate on projects, and grow together.

Ready to Get Started

Deploy in under 20 minutes with our one line installation script and start configuring your pipelines.

Try it now