The Observability Blog

Categories:
  • Uncategorized

Finding Unexpected Development Solutions Through Log Management

Paul Stefanski headshot
by Paul Stefanski on
July 20, 2021

This is a personal story from before I worked at observIQ. I am not a technical person in any professional sense. I have no direct training and my coding experience is limited to front-end web design and some indie game development. Before observIQ, all I knew about log management was that it has something to do with tracking computer performance and behavior, and I associated it mostly with DevOps and the cloud. I never imagined it would play any valuable role in my professional endeavors. 

I was working on a chess-based indie game called Chess Heroes. Only three of us developed the game, so time and resources were spread thin. Early on we tested the game’s multiplayer functionality, and noticed that some moves took too long to reach the opposing player – seconds instead of milliseconds, and sometimes minutes. We assumed the problem was somewhere in our Kubernetes cluster.

The front-end of the game is built in Unity, which acts mainly as a visualizer for the actual logic of the game. The rules and moves run on a C# chess engine that we installed on a Kubernetes cluster through DigitalOcean. The Unity instance on each player’s phone is only responsible for sending and receiving moves, and translating the board position it receives from the server on the visual interface, which is just a normal chess board with some power cards and a Hero character image. 

Diagnosing an issue like this can be challenging, even seem impossible. There are no error codes to track down because everything thinks it’s working as intended, and dozens of interconnecting components – many built from open-source code that none of us were familiar with. Anything could be causing the delay. Digging through the code was a non-starter. We needed a way to get more visibility into the problem. That’s where, to my surprise, log management came in.

I had heard of the big, enterprise level log management companies like Splunk and Datadog, but our project was small, and we were really only looking to diagnose a small issue, not jump into a log management endeavor that could easily spiral into a larger time sink than the problem itself. A free(ish) option, New Relic, popped up in my search, but they limit free accounts to one user. The Chess Heroes team lives in California, Michigan, and Washington. One account was no good. 

observIQ had the answer. Free accounts with unlimited users, up to 3 gigabytes of logs per day, 3 day retention, and no hidden charges whatsoever. It didn’t even ask for a credit card. This sounds like an infomercial now, but trust me, I was not sold at the time. I had never installed a logging agent on anything, let alone in Kubernetes, which I had only just learned how to use. To my sincere relief, installing observIQ was just a matter of copying and pasting an agent into Kubernetes, creating a .yml component and viola – logs started appearing in observIQ’s dashboard. 

It was time to play a game and look for the cause of our turn delay. We were looking for anything that could lead us to the source. We made the first move in our dummy game. All the appropriate logs flew across the Live Tail real time log feed. It looked normal, but there was no move delay in the game either. We tried again, hoping to reproduce the delay. It took about seven moves until we encountered the issue. One player move, and the new board position didn’t appear on the other player’s board for several seconds. We jumped back into the logs and began searching for an issue, but there was nothing wrong. Every move, including the delayed one, was sent, processed, recorded, and sent back out to the opponent in less than a hundredth of a second every time. 

Log management didn’t show us where the problem was, but it did the next most valuable thing. It showed us where the problem wasn’t. We were previously convinced that the problem had to be on the server. We would have rebuilt the Kubernetes cluster from the ground up before checking the Unity project. Once we knew the server was good to go, we turned to Unity. Thanks to the new insight from observIQ, we knew the server sent new game states flawlessly, so how were they received? Since there were no errors in the code and the moves always displayed (eventually), we knew the board positions arrived and processed properly. No one could think of a good reason why the game took so long to process and display a new board state, so we popped open the hood and looked for something that could be holding the board state back from appearing on the players’ screens. 

Early in the game’s construction we had pasted in a free turn-based-multiplayer API from the Unity asset store that was responsible for sending and receiving messages from our server. It had a line buried somewhere in the stack that restricted the game from checking for new board states more than once every two minutes. It made sense, once we found it. The API was intended for a game where turns happened simultaneously and were restricted to 2 minutes, so it was a resource saving technique by the designers of the API. Smart for them, bad for us. In chess, every millisecond matters. We flipped it so that the board state updated any time the server sent a new position, and the problem finally resolved.

As a small development team focussed on a pet project, log management was not on our radar. Before finding observIQ, we all assumed that the cost and overhead of dealing with logs would be more effort than benefit. Yet, when we encountered an issue during development that we couldn’t solve, it was log management that played the most critical role in the diagnosis. It isn’t just for big enterprise organizations. I can honestly say I am a believer in the value of observability for small teams and independent developers.