Visualizing log data is one of the biggest perks of using good log management software. Data is many businesses’ most critical asset. But without proper use, a business’s data becomes just an artifact and is no longer an asset. Visualization and analysis are the end goals of collating log data from their sources. The need for visualization arises from the fact that we intuitively process visual information faster than a random jumble of numbers and letters. Visualizing log data gives a view of the application and infrastructure that is easy to read, decipher, and react to. Dashboards display information in the form of pie charts, geo charts, histograms, and so on. Within a dashboard, businesses see complex log data in a simplified visual form; clicking through the visuals displays the finer details of the data.
Though most applications offer dashboards, not all of them do it well. Some applications reset the dashboard settings when they release new versions, and others provide too many generic pre-built dashboard templates instead of taking the time to create one good pre-built option for each use case. Even though a solid log agent, like our very own Stanza, should be your primary consideration when choosing log management software, a good and reliable dashboard should be a critical factor as well. In this post, we take you through the various uses of visualizing your log data via dashboards. The information presented should help you gauge whether you are using the right log management software for your business.
Application and Web Usage stats:
In the virtual business space, companies value any insight that they can get from their data to understand the end-user personas. A good practice would be to set up a dashboard specifically for usage metrics such as:
- top URLs accessed,
- devices used to access the application,
- the location, language, and local time of the end-user.
A usage dashboard can also be used to track network-based events that are logged from network devices such as firewalls, routers, and switches. It gives an overview of all the requests sent to and received from the network, requests that were denied, and a list of all devices that are monitored in the network. This helps businesses spot anomalies in network behavior by comparing a baseline graph with a graph that deviates from the expected pattern. Network events such as unusual user activity, high volumes of application requests or denials, and failures in one or more network devices are easier to capture.
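As a rough illustration of what sits behind a usage widget such as "top URLs accessed", the sketch below counts field values across parsed log events. The event records and their field names (`url`, `device`, `country`) are made up for this example and are not a format defined by any particular log management tool.

```python
from collections import Counter

# Hypothetical parsed access-log events; the field names are assumptions.
EVENTS = [
    {"url": "/home", "device": "mobile", "country": "US"},
    {"url": "/pricing", "device": "desktop", "country": "DE"},
    {"url": "/home", "device": "desktop", "country": "US"},
    {"url": "/home", "device": "mobile", "country": "IN"},
]


def top_values(events, field, n=3):
    """Return the n most common values of `field` as (value, count) pairs."""
    counts = Counter(event[field] for event in events if field in event)
    return counts.most_common(n)


if __name__ == "__main__":
    for field in ("url", "device", "country"):
        print(field, top_values(EVENTS, field))
```

A dashboard does the same aggregation continuously and renders the result as a chart instead of a printed list.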
Correlating Data to study trends:
In the past, when all we read were system logs, reading logs didn’t require visualizations. But in the era of containerized, microservices-based applications, reading logs manually is impractical, and skimming through all the logs ingested into a log management tool could take hours. In addition, in a landscape where logs originate from a dozen or more sources, an engineer is often left wondering which of the disparate systems in the network is causing the error. That’s why businesses turn to visualization. With visualization, businesses can correlate the various log sources to arrive at a common picture across all of them. Logs are visually correlated on the dashboard based on sequences, event patterns, and expected results. So when something is of concern, businesses have the events from all the sources to compare and can identify the problem areas quickly. In some scenarios, it may not be necessary to read through every log event; instead, a dashboard could convey the information businesses are looking for. For instance, a dashboard showing delayed response times from one application component does not necessarily mean the flaw lies in that component, so a parallel chart of all components could present a clearer picture.
Easier and quicker troubleshooting:
The biggest advantage of having a dashboard view is identifying issues even before they are reported. Monitoring the application against its SLAs and focusing on the KPIs set for the application is easier in the dashboard view. Businesses can react and fix an issue before it trickles down to the end-user. Often, the high-cardinality data that systems generally log is overlooked. Having this visualized on the dashboard makes troubleshooting more streamlined. When an issue is reported, businesses can begin with a check on the basics such as load volume, CPU usage, etc., before they move further into a detailed analysis. In some cases, the issue can be something as simple as high CPU usage. Dashboards also give a clear picture of connectivity issues, helping businesses identify the areas that need some fine-tuning and avoid service disruptions.
observIQ offers a pre-built dashboard based on the logs ingested into your account. You can also build custom dashboards or clone an existing dashboard to create a new one. The visualization capabilities and dashboards are available to ALL users of observIQ. We do not restrict any of our users from taking advantage of this great functionality. The dashboards in observIQ are highly malleable; you can add, edit, delete, and position the visualization widgets based on your needs.
Try using our dashboards and send us your comments.
With the advent of IaaS (Infrastructure as a service) and IaC (Infrastructure as Code), it is now possible to manage versioning, code reviews, and CI/CD pipelines at the infrastructure level through resource provisioning and on-demand service routing. Kubernetes is the indisputable choice for container orchestration. At this point, globally, most DevOps teams turn to Kubernetes to orchestrate and automate their software development processes, reducing the web server provisioning cost to the bare minimum.
Although Kubernetes offers an unparalleled solution to container orchestration, DevOps and development teams describe observability for Kubernetes applications as a constant challenge. Kubernetes as a platform is dynamic and expansive, with several components that are functionally unique and behave differently during implementation. As a result, teams formulate their own solutions to their application logging challenges, even though Kubernetes offers a logging architecture out of the box. In addition, Kubernetes applications produce a large volume of logs, making manual management a practical impossibility. The absence of a monitoring solution can leave network intrusions undetected, as in the case of Tesla. In 2018, attackers gained access to Tesla’s network through a Kubernetes admin console that was not password protected.
We wrote about simple logging techniques for Kubernetes applications. In this post, we look at how you can troubleshoot issues in pods using the live tail feature in observIQ.
Kubernetes components to monitor:
- Clusters: There are two primary components in a working Kubernetes cluster: nodes and a control plane. The control plane maintains the cluster in the desired state as per the DevOps team’s configuration. The nodes run the application workload.
- Pods: Every node in a working Kubernetes cluster runs one or more pods. A pod is a group of one or more containers that share network resources, storage, and a namespace.
- Applications: A software application and its dependencies are packaged into containers, with Kubernetes serving as the container orchestration tool.
- Containers: A container packages a software application together with its libraries, dependencies, and configuration.
Why monitoring pod activity is critical
Pods are the simplest deployable units of a containerized application. The most common use case for Kubernetes-orchestrated applications is the one-container-per-pod model. Pod statuses are transient, and pod health reflects how the application is functioning. So it is vital to constantly keep up with the pod’s activities, statuses, and events. Check Kubernetes’ documentation to know more about using pods.
To get the statuses of all the pods in your cluster, use kubectl get pods --all-namespaces (plain kubectl get pods lists only the current namespace). A pod can be in one of the following statuses:
- Running: When a pod is in the running status, it denotes that it is assigned to a node and it has one or more containers that are operating as expected.
- Pending: A pod moves to the pending status if one or more of its containers are in the waiting status or if the pod cannot be scheduled. A container could move to the waiting status in the following scenarios:
  - When the image defined for that container is unavailable. This may arise from a misspelled image name or tag, or from an authentication failure.
  - When there is a delay in downloading the container’s image due to its size.
  - When a readiness probe is set in the container spec, the container is marked ready only once the conditions in the probe are met. Note that such a pod typically shows as Running but not ready, rather than Pending.
  - When the pod fails to mount all the volumes specified in the spec. This could be due to a failed dynamic volume request or because the requested volume is already in use.
Use the kubectl describe pod <podname> command to check the changes in container states.
- Succeeded: All the containers in the pod have exited successfully and will not restart.
- Failed: A pod can be in the failed status for a number of reasons. Some of the most common causes for a pod failure are:
  - When a container runs out of memory, it is restarted as per the pod’s restart policy. However, if the restarts are continuous, Kubernetes backs off restarting the container. To check whether this is the cause of the pod failure, review the memory request and limit set in the spec.
  - When containers restart continuously due to either memory or CPU usage issues. A good way to check this is to run a command inside the container with kubectl exec -it <podname> -c <containername> -- <command>, for example a shell if the image ships one, and inspect resource usage.
  - When a pod terminates because the node it runs on is removed from service, and the pods attached to that node are not cleaned up by the cluster scheduler and the controller manager.
  - When the node has insufficient resources for the pod, or there is insufficient persistent volume capacity.
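The status checks above can also be scripted. The sketch below is a minimal example, not an official tool: it takes the JSON printed by kubectl get pods -o json, which is a list object whose items each carry metadata.name and status.phase, and returns the pods that are in neither the Running nor the Succeeded phase. The sample payload is made up and heavily trimmed.

```python
import json

# Phases that usually need no action; anything else is worth a closer look.
HEALTHY_PHASES = {"Running", "Succeeded"}


def pods_needing_attention(pod_list_json):
    """Return (name, phase) pairs for pods that are not Running or Succeeded."""
    pod_list = json.loads(pod_list_json)
    flagged = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            flagged.append((name, phase))
    return flagged


# Made-up sample in the shape of `kubectl get pods -o json` output.
SAMPLE = json.dumps({
    "kind": "List",
    "items": [
        {"metadata": {"name": "web-1"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "web-2"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "job-1"}, "status": {"phase": "Failed"}},
    ],
})

if __name__ == "__main__":
    for name, phase in pods_needing_attention(SAMPLE):
        print(f"{name}: {phase}")
```

Each flagged pod is then a candidate for kubectl describe pod <podname> to see why it is stuck.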
Why a log management tool is necessary for troubleshooting Kubernetes pod issues
Troubleshooting any application environment works best when you can live tail the logs. To live tail a pod’s logs in Kubernetes, use the command kubectl logs -f <podname> to stream the logs as the pod emits them.
But when you’re tired of troubleshooting by hand, a tool that offers a 30-second agent installation to collate your Kubernetes logs works better. In the video below, the logs from a specific pod in a Kubernetes application are live tailed.
If you are impressed by what you see, try out observIQ for free today. The steps to install the observIQ agent and live tail logs are available in our documentation. We don’t restrict your trial with a credit card or endless setup processes. Sign up and get the logs flowing in minutes.
Understanding Mean Time To Resolution (MTTR) from the Facebook, Instagram, and WhatsApp Outage
Yesterday, the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has a rigorous incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network. What matters more is how quickly we react to and resolve incidents. This post takes you through the most critical factors in your incident response: Mean Time To Resolution (MTTR) and Mean Time To Detect (MTTD).
What is MTTD?
Mean Time To Detect is a key performance indicator in any incident response plan. MTTD is the average time taken by the SRE to detect an incident from the time of its occurrence. The mathematical formula used to calculate this value is:
MTTD = (total time taken to detect incidents over a period of time) / (number of incidents in that period)
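To make the formula concrete, here is a minimal sketch that computes MTTD from pairs of timestamps. The incident data and the function name are illustrative assumptions; real values would come from your incident tracker or monitoring tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average detection delay over (occurred_at, detected_at) datetime pairs."""
    if not incidents:
        raise ValueError("no incidents in the period")
    total = sum(
        (detected - occurred for occurred, detected in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up incidents: detected 10 minutes and 30 minutes after they occurred.
INCIDENTS = [
    (datetime(2021, 10, 4, 15, 40), datetime(2021, 10, 4, 15, 50)),
    (datetime(2021, 10, 9, 8, 0), datetime(2021, 10, 9, 8, 30)),
]

if __name__ == "__main__":
    print(mean_time_to_detect(INCIDENTS))
```

With the two sample incidents, the total detection time is 40 minutes over 2 incidents, so the MTTD is 20 minutes.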
Incident reports and outage notifications can come from end-users, as in the case of Facebook, or from application monitoring and management tools, such as observIQ, which can alert you immediately when something is not working as expected. Businesses aim to lower their MTTD as much as they can. To that end, they invest in observability tools that provide granular insights into event logs and correlate data from the various sources of the application and infrastructure.
What is MTTR?
Mean Time To Resolution is the average time taken to resolve an application issue or an outage and get the application functioning as per the set KPI. MTTR is calculated by dividing the total time to resolution of incidents over a period of time by the number of incidents that occurred during that period. It is important to note that this is only a statistical average; individual incidents can deviate from this value based on other contributing factors. The main objective of having a good MTTR is to make an outage less impactful on the end-users. The best-case scenario is to avoid disrupting system usage at all. A good system architecture makes the system resilient to performance issues and outages. In a hybrid system architecture, the application’s code resides separately from the web service calls so that each component can be upgraded or deployed independently. MTTR cannot be kept low when the resolution, deployment, or upgrade of one component creates a dependency to update the application’s other features.
What factors influence MTTR and MTTD?
Incident detection and resolution strategy:
Not all incidents that occur in an application are alike. Therefore, an MTTR calculated without factoring in the severity levels of the incidents gives only a generic value rather than one specific to a severity level. Another critical factor that many MTTR calculations fail to account for is the variation in user traffic. For instance, e-commerce applications see heavy site traffic during the Black Friday sale season. The MTTR calculation in such a scenario should cover that period specifically rather than the whole year.
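One way to avoid a single generic number is to compute MTTR per severity level. The sketch below groups resolution times by severity before averaging; the severity labels and durations are invented for illustration. The same grouping idea applies to time windows such as a peak sales season.

```python
from collections import defaultdict
from datetime import timedelta


def mttr_by_severity(incidents):
    """Average time to resolution per severity.

    `incidents` is an iterable of (severity, time_to_resolution) pairs,
    where time_to_resolution is a timedelta.
    """
    buckets = defaultdict(list)
    for severity, time_to_resolution in incidents:
        buckets[severity].append(time_to_resolution)
    return {
        severity: sum(times, timedelta()) / len(times)
        for severity, times in buckets.items()
    }


# Made-up incidents with hypothetical severity labels.
INCIDENTS = [
    ("sev1", timedelta(hours=6)),
    ("sev1", timedelta(hours=2)),
    ("sev3", timedelta(minutes=30)),
]

if __name__ == "__main__":
    for severity, mttr in sorted(mttr_by_severity(INCIDENTS).items()):
        print(severity, mttr)
```

Here the overall average would be 2 hours 50 minutes, which describes neither the sev1 incidents (4 hours on average) nor the sev3 incident (30 minutes).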
You must frame a plan knowing the varied levels of incidents. Strategically laying out a lean ITSM plan would save your business time and resources spent on incident response. There cannot be a one size fits all approach to your incident response. The cause and effect of every action is gauged, the responses calibrated, and the inferences drawn to make plans for effective response. An ad-hoc approach for incident response is a recipe for failure. An ad-hoc response or fix is bound to cause unforeseen issues in other KPIs.
Modern businesses adopt a more streamlined approach to incident response. To form a response, the incidents in a system are recorded in a controlled environment and the values are calibrated. Not only does this help form a picture of a real-time occurrence, but it also helps assign roles to individuals in the team during the response.
Using an efficient log management tool:
Monitoring is critical for incident response. If there is no data accumulated over a period of time, arriving at key SLAs and KPIs becomes impossible. An application monitoring tool gives you a real-time read on the application’s health as well as its history over time. Functionalities like Live Tail in observIQ make troubleshooting collaborative and simple. The theory that monitoring is only for intrusion detection has been debunked, with points of vulnerability now extending to every corner of the system. If a failure is caused by an intrusion, without monitoring it may take days for the SRE to locate the site of intrusion. Monitoring tools come equipped with alerting options for service level indicators (SLIs) such as performance and security-related anomalies. When an incident occurs, the monitoring tool alerts all configured SRE resources right after it occurs, drastically reducing the MTTD and MTTR. You can set alerts for anything from a latency issue to throughput dropping below a threshold value.
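The threshold alerts described above boil down to comparing each metric with a limit. The sketch below shows that comparison in isolation; the metric names (`latency_ms`, `throughput_rps`), the limits, and the rule format are assumptions made for this example and do not reflect how any specific monitoring product configures alerts.

```python
def breached_rules(sample, rules):
    """Return the names of metrics in `sample` that violate their rule.

    `rules` maps a metric name to ("max", limit) or ("min", limit).
    Metrics missing from the sample are skipped.
    """
    breached = []
    for metric, (kind, limit) in rules.items():
        value = sample.get(metric)
        if value is None:
            continue
        if kind == "max" and value > limit:
            breached.append(metric)
        elif kind == "min" and value < limit:
            breached.append(metric)
    return breached


# Hypothetical rules: alert on high latency or on low throughput.
RULES = {"latency_ms": ("max", 500), "throughput_rps": ("min", 100)}

if __name__ == "__main__":
    print(breached_rules({"latency_ms": 750, "throughput_rps": 240}, RULES))
```

A real tool evaluates rules like these continuously against incoming data and routes the result to the on-call team.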
Although the Facebook outage is much talked about today, businesses of all sizes struggle with performance and outage issues. The best way to stay prepared is to monitor. We offer a free plan at observIQ. Try it out to see how you can tackle your MTTR.
Our take on New Relic’s observability projections for 2021
New Relic recently published a survey on the observability projections and trends for the year 2021. At observIQ, an emerging platform of choice in the observability space, we decided to give our users a rundown of what we agreed and disagreed with in their survey results.
Push to implement observability:
The survey presents a convincing argument that organizations need to implement observability solutions earlier than they currently tend to. Logs help DevOps teams understand the system’s behavior through the system’s outputs, and log management has been common practice since the outset of terminals and highly hierarchical computing language. In the early 2010s, companies released application libraries that are installed within the application to track the application’s performance and events. This led to observability as we know it now. The difference between application monitoring and observability is cardinality. Cardinality refers to the number of distinct values a field in a data set can take; a high-cardinality field can uniquely identify individual records. For instance, in a database of travelers on a flight, the field with the highest cardinality would be the passport number, followed by the first name and last name combination. With high cardinality, modern observability tools make debugging and system performance analysis far easier. This could be the reason why more and more organizations and individuals are seeing the true value in implementing log management for their businesses.
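Treating cardinality as the number of distinct values a field takes, the traveler example can be checked in a few lines. The passenger records below are invented for illustration.

```python
def field_cardinality(records):
    """Return {field: number of distinct values seen} across all records."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(value)
    return {field: len(values) for field, values in seen.items()}


# Made-up passenger records for a single flight.
PASSENGERS = [
    {"passport": "P001", "last_name": "Lee", "seat_class": "economy"},
    {"passport": "P002", "last_name": "Lee", "seat_class": "economy"},
    {"passport": "P003", "last_name": "Ortiz", "seat_class": "business"},
    {"passport": "P004", "last_name": "Khan", "seat_class": "economy"},
]

if __name__ == "__main__":
    for field, count in field_cardinality(PASSENGERS).items():
        print(field, count)
```

The passport field has one distinct value per record, so it can pinpoint an individual, while seat class has only two values and can only describe groups. The same holds for log fields such as a request ID versus a log level.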
Effective monitoring of containerized applications is the need of the hour:
The survey reiterated that observability for Kubernetes and containerized applications is trending towards becoming a necessity. This is a no-brainer. At observIQ, we predicted this and built one of the simplest ways to plug our agent into Kubernetes to gain visibility into all events. It is good that New Relic factored this in, because companies like Splunk and honeycomb.io did not include any stats related to Kubernetes adoption among businesses and the increasing need for observability among Kubernetes applications. As more and more organizations embrace containerized and microservices-based architecture models for their applications and infrastructure, incorporating observability into their new applications becomes easier.
Observability awareness is on the rise.
Earlier definitions of observability referred to the three pillars of observability, metrics, traces and logs, very theoretically. But any DevOps engineer following best practices would point out that metrics, traces and logs are merely three sets of observability data. The true sense of a functioning observability practice is using these sets of data efficiently within a tool that can parse, enrich, and visualize the data.
There is no perfect observability method, probably there never will be.
With distributed, containerized, and cloud applications, everything is transitory. Platforms such as OpenTelemetry have spelled out what has always been known. As software systems change, processes for managing those systems are compelled to keep up with those changes. Hence, a prebuilt model that fits all businesses for the next 5 years does not exist. Observability is a space that will evolve as it needs to. For businesses to keep up with this evolution, there must be a common ground formed based on processes. As per the survey, the general consensus is that there is an immense opportunity to streamline and mature the observability practice, which is a statement we agree with.
Organizations are ready to invest more in observability.
This brings up a seldom-talked-about question. Is an expensive solution really better? We started offering a free tier starting this June. We did this to offer our platform to users who don’t have an exhaustive list of items in their observability wishlist. Our free tier offers short-term log data storage while maintaining your access to all existing and new features of the system. Many tools in the market offer users observability that is often tied down by the cost. Companies are limited by the cost when implementing a detailed observability practice. Instead of pushing for organizations to invest more in observability, the best way to make organizations embark on their observability journey would be to offer meaningful pricing. The verbiage in the survey refers to the C-suite executives making higher budgetary allocations, and it does not factor other user personas in the budgetary study. This is an evident gap in analysis. This disparity extends to the survey’s data breakup of the type of subscription/licensing models. Without having an inclusive user persona, these observations speak only to one specific category of users.
Lack of engineers with implementation skills is a dealbreaker in adoption.
We agree with this statement. Companies that are still in the code library-based application monitoring practice find the code level retracing and modifying very intimidating. Although some businesses see that the pros of this transition outweigh the cons, the fear of issues that may arise during implementation in a functioning system stops businesses from pursuing modern-day observability solutions. Solutions like observIQ can help users overcome the need to fix or repatch. Instead, the agents are installed onto the application stacks via preconfigured plugins. If you would like to explore more about this, speak to our implementation support team.
Dev teams agree that observability should be implemented across the board
We have written in the past about how observability is now an integral part of every phase in the SDLC. The survey confirms this, with a majority of development teams believing that they need observability across the board in every phase of application development, quality assurance, implementation, and maintenance. An interesting finding from the survey is that users in the APAC region felt more compelled to incorporate observability through the SDLC than their North American counterparts.
Simple observability platforms to promote faster implementation
The survey results tied adaptability to the skill level of an implementer. We have a different take on this. Instead of scaling up the implementation abilities of observability users, we believe it is more efficient to build simpler observability platforms. Platforms such as observIQ have a growing list of pre-built log sources. This makes it easier for organizations with smaller teams to implement observability. With large, complicated observability platforms, evaluating engineers are thrown off by the complexity of the application.
Wider user personas for observability
The sampling size, occupational diversity, and geographic spread of responders used in the survey were good. However, the sample personas used seemed very old school. Observability is no longer a restrictive space. There is an identified observability need from businesses, public entities, professional gamers, independent contractors, homeowners using IoT for home security, technical influencers – the list of possible user personas is endless. So having a restrictive sample pool of individuals working in similar work streams makes this survey extremely unidirectional.
Fragmented monitoring is still common practice.
One-third of the users are still working with their observability data spread across multiple systems. With their analysis data housed in different systems, most developers must stitch together siloed data manually to debug or gauge the system’s performance.
In conclusion, this survey highlights what has been known for the past few years. Observability is a vertical in the tech space that is gaining momentum. The earlier the adoption, the better for organizations of all sizes. This is also the era of cloud-native application emergence. Our recommendation would be for organizations to incorporate observability as part of their cloud-native migration.
I want to be remembered. I think a lot of us do.
At least, that’s what I used to think. Now I am not so sure. I have a bad habit of looking at the universe through an existential lens where value is measured by impact. Impact, meaning the measurable change created by specific action. Since everything physical ultimately decays, the longest lasting impacts are those that linger in our collective memory. Great works, great triumphs, great discoveries, and great inventions – great impacts.
We remember the names behind great impacts for a long time – Einstein, Shakespeare, Lincoln, Tesla, the like – but even names fade into myth, leaving only the impact. Ramses II, one of the greatest conquerors in history, and the inventor of countless military tactics still used today, is now better known by the name Ozymandias, given to him by an English poet three thousand years after his death. Beowulf is taught in English classes around the world, and no original author is recorded. Ragnar, the most vicious, most successful Viking in history, is remembered more as a concept than a person (and may not have been one person.)
As our collective perspective on reality changes, so does our perspective on the old historical characters that impacted our present. George Washington, long celebrated as the father of the world’s oldest surviving democracy, is now referenced more and more as an idol of a shameful past, rather than a harbinger of freedom and progress. He owned slaves. Most were inherited from his wife’s father, and while the eastern states were still British colonies, it was illegal to free them. He freed them upon his wife’s death, after the United States was established, but in the lens of our time, owning another person even for a second, let alone many years, is unforgivable.
On the flip side, William Shakespeare, it is thought by some scholars, was Catholic. Others also think he was gay. (And others think he was more than one person – so a gay person and a Catholic person may very well have been part of the “Shakespeare group”.) Today we say “so what?”, but back then, and for many hundreds of years after his time, both were considered shameful, if not illegal. Remember, when JFK was elected president, his Catholicism was a contentious subject. That was only 1960. Gay marriage, of course, is a very recent institution in the United States, and broader LGBT rights are still an ongoing battle and debate around the world. Are people who shamed Shakespeare’s religious and romantic preferences bad people? Are people who celebrated George Washington worse? My intention is not to make any judgments either way, but to note on the point of being remembered that everyone is a slave to their time and place, and everyone who looks back at them a slave to theirs.
I am far from the first to put these ideas into words. The takeaway I want to illustrate, though, is that none of these characters are remembered as people. We don’t remember them as human beings with thoughts, desires, needs, wants, regrets, and an ever-changing, but somehow persistent, self identity. We can’t. We didn’t know them. After everyone who knew a person is gone, we can only remember the impact, not the human being. There is a danger to that kind of memory. We tend either to evangelize or demonize the figures of our past. Rarely do we land anywhere in between. Rarely do we recognize the flaws in our heroes, or the good in our enemies. Impacts do not capture character, so our memories become static while the world keeps moving. There is no consideration for who someone in the past might have been today. My romantic desire to be remembered is fundamentally flawed. We don’t get to choose how we are remembered.
Now we live in a world where everything is recorded. Everyone is remembered like a caricature of a long gone historical figure, only while they are still alive. This whole episode of amateur philosophy and self-reflection was brought on by log management, of all things. It’s essentially just a form of data collection, and my job is to sell it. Before the 1990’s, collecting and storing information was a challenge. You had to have some idea ahead of time about what information you needed and how you were going to use it before going through the trouble of physically collecting and writing it down. The tech boom made it possible to collect more information every day than we had recorded in all of our previous history. We live our lives in the digital world, and we record everything. More than 99% of it, no one will ever come looking for, but it’s still there, waiting. Our entire lives laid out like a biography we had no say in writing. Privacy concerns have become the hottest topic in tech, but why are we so concerned about it? Don’t we want to be remembered?
I’m steering around the seductive idea that tech is the problem and privacy concerns boil down to companies acting slimy with our data. Companies do act slimy with our data, but that reflects on them and not the data collection itself. Our log management platform, like most digital data collection methods, is just a tool. It was built for the purpose of making it possible for a small number of people to monitor large development infrastructures, secure networks, and cloud applications (among other things). That’s a valuable service that has nothing to do with personal activity. It can technically be used to reverse-engineer individual activity within any given system, but only by someone with the explicit intention of doing so and at great effort. Ultimately, there are far superior tools for ‘spying’ than ours, so I am not concerned with an affront to privacy on our part, but the mere possibility still intrigues me. Every mouse click is an impact etched in stone. Every angry message. Every embarrassing photo. Every sarcastic tweet gone wrong. We no longer have the right to make mistakes.
I don’t think the problem is tech. I think the problem is that the digital footprint, like the memory of a historical figure, is detached from the human being that made it. People are dynamic and their impacts are singular, static events. In the worst case scenarios, people in their thirties find themselves suffering for posts they made when they were fifteen. I don’t know about you, but I am not the same person now that I was at fifteen. In fact, I bet he and I wouldn’t even get along that well. He exists, somewhere. If you went looking, you could put him back together. Every short-sighted political post, comically unaware text meant to impress a girl, and enough photos to reconstruct a 3D model. Fifteen year old Paul is alive and well in the vinyl etches of Google’s internet backups, Facebook’s servers, and AT&T’s phone records. Each of us has that digital clone, representing who we were, but not who we are. A lot of people are pretty upset about that.
This is my profile picture from 2008. My first ever, apparently. I don’t even remember taking it. Was I embarrassed to show my face? Was I naively thinking I could keep my internet activity anonymous? Did I just think that the visual effect was cool? There is no doubt that the thirteen year old in that photograph is me, but I can only speculate as to what was in his head at that moment. But here it is, frozen in time forever.
The idea of “the right to be forgotten” has taken off in frontline politics. The first time I heard that phrase, I was astonished. Who would ever want such a thing? I thought it was just about advertisers and general disdain for big tech, but I get it now. Growth is a process riddled with errors. If we can’t make mistakes, we can’t grow. If everything we ever do is recorded forever, we can’t make mistakes. In some way, our entire society has stopped maturing. We sit in whatever corner we found ourselves in, too afraid to move, posting a re-skinned version of whatever got the best response last time.
Solutions, though, are elusive. We could place a mandatory erasure date on personal internet activity, but some things we want to keep forever, and homing in on exactly the right amount of time to retain angsty statuses and socially incriminating Instagram posts is impossible. Some data is necessary for technological advancement and has nothing to do with personal privacy. Most legislators are moving towards the most pragmatic solution: giving people the legal protection they need to take control of their personal data. That’s an important step, but it doesn’t solve the fact that no one is going back to vet their teenage selves for impacts they no longer agree with. In many cases, like mine, they don’t identify with that person enough to put in the effort. It doesn’t feel connected to us any more.
Maybe, in the end, this is all a non-issue. The internet is still young, after all. We haven’t really figured it out yet, culturally. On the one hand, people are posting more and more, and on the other, they want more autonomy over what happens to that data. Governments are pushing for more personal privacy in the hands of the user, but also pushing for back doors and direct feeds into government agencies. There is a tug and pull at play that will take decades to sort out. I guess all we can do is wait, watch, and try not to worry too much about what we’re remembered for. At least, in that regard, my greatest wish will be fulfilled.
*Views presented by Paul Stefanski are not necessarily those of observIQ