Yesterday the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network. It is more important how quickly we react and resolve incidents. This post takes you through the most critical factors in your incident response – Mean Time To Resolution (MTTR) and Mean Time To Detect(MTTD).
Mean Time To Detect is a key performance indicator in any incident response plan. MTTD is the average time taken by the SRE to detect an incident from the time of its occurrence. The mathematical formula used to calculate this value is:
MTTD = Total time taken to detect incidents over a period of time/ number of incidents.
Incident reports and outage notifications can come from end-users, as in the case of Facebook, or application monitoring and management tools, such as observIQ, can send you alerts immediately when something is not working as expected. Businesses aim to lower their MTTR as much as they can. Businesses invest in observability tools that can provide granular level insights into the event logs to correlate data from the various sources of the application and infrastructure.
Mean Time To Resolution is the time taken to resolve an application issue or an outage and get the application functioning as per the set KPI. MTTR is arrived at by dividing the total of the time to resolution of incidents over a period of time by the number of incidents that occurred during that period. It is important to note that this is only a statistical average. There are possible deviations to this value based on other contributing factors. The main objective of having a good MTTR is to make an outage less impactful on the end-users. The best-case scenario is to avoid the disruption of system usage. A good system architecture makes the system resilient to performance issues and outages. In a hybrid system architecture, the application’s code resides separately from the web service calls to keep each component’s upgrades or deployments unique. MTTR cannot be maintained low when the resolution, deployment, or upgrade to one component creates the dependency to update the application’s other features.
All incidents that occur in the application are not similar. Therefore, MTTR calculated without factoring in the severity levels of the incidents would give only a generic value for the MTTR and not specific to a severity level. Another critical factor that many MTTR calculations fail to factor in is the differences in user traffic. For instance, e-commerce applications see heavy site traffic during the Black Friday sale season. The MTTR calculation in such a scenario should be for that period of time specifically and not a year-round calculation.
You must frame a plan knowing the varied levels of incidents. Strategically laying out a lean ITSM plan would save your business time and resources spent on incident response. There cannot be a one size fits all approach to your incident response. The cause and effect of every action is gauged, the responses calibrated, and the inferences drawn to make plans for effective response. An ad-hoc approach for incident response is a recipe for failure. An ad-hoc response or fix is bound to cause unforeseen issues in other KPIs.
Modern businesses adopt a more streamlined approach to incident response. To form a response, the incidents in a system are recorded in a controlled environment and the values are calibrated. Not only does this help form a picture of a real-time occurrence, but it also helps assign roles to individuals in the team during the response.
Monitoring is critical for incident response. If there is no data accumulation over a period of time, arriving at key SLAs and KPIs would become impossible. Having an application monitoring tool gives you a real-time read on the application’s health over a period of time. Functionalities like Live Tail in observIQ make troubleshooting collaborative and simple. The theory that monitoring is for intrusion only has been debunked, with points of vulnerability now extending to every corner of the system. If there is a failure caused by an intrusion, without monitoring it may take days for the SRE to assess the site of intrusion. Monitoring tools come equipped with alerting options for service level indicator SLIs such as performance and security-related anomalies. When an incident occurs, the monitoring tool alerts all configured SRE resources to draw attention to the incident right after it occurs, drastically reducing the MTTD and MTTR. You can set alerts for anything from a latency issue to a throughput drop below the threshold value.
Although the Facebook outage is much talked about today, businesses of all sizes struggle with performance and outage issues. The best way to stay prepared is to monitor. We offer a free plan at observIQ. Try it out to see how you can tackle your MTTR.