Finding Unexpected Development Solutions Through Log Management

This is a personal story from before I worked at observIQ. I am not a technical person in any professional sense. I have no direct training, and my coding experience is limited to front-end web design and some indie game development. Before observIQ, all I knew about log management was that it has something to do with tracking computer performance and behavior. I associated it mainly with DevOps and the cloud. I never imagined it would play any valuable role in my professional endeavors.

I was working on a chess-based indie game called Chess Heroes. Only three of us developed the game, so time and resources were thin. We tested the game’s multiplayer functionality early on and noticed that some moves took too long to reach the opposing player – seconds instead of milliseconds and sometimes minutes. We assumed the problem was somewhere in our Kubernetes cluster.

The front end of the game is built in Unity, which acts mainly as a visualizer for the actual logic of the game. The rules and moves run on a C# chess engine installed on a Kubernetes cluster through DigitalOcean. The Unity instance on each player’s phone is only responsible for sending and receiving moves and translating the board position it receives from the server on the visual interface. It is a standard chess board with power cards and a Hero character image.

Diagnosing an issue like this can be challenging, even if it seems impossible. There are no error codes to track down because everything thinks it’s working as intended, and dozens of interconnecting components – many built from open-source code that none of us were familiar with. Anything could be causing the delay. Digging through the code was a non-starter. We needed a way to get more visibility into the problem. That’s where, to my surprise, log management came in.

I had heard of the big, enterprise-level log management companies like Splunk and Datadog. Still, our project was small, and we were only looking to diagnose a minor issue, not jump into a log management endeavor that could quickly spiral into a more significant time sink than the problem itself. A free(ish) option, New Relic, popped up in my search, but they limit free accounts to one user. The Chess Heroes team lives in California, Michigan, and Washington. One account could have been better.

observIQ had the answer. Free accounts with unlimited users, up to 3 gigabytes of daily logs, three-day retention, and no hidden charges. It didn’t even ask for a credit card. This sounds like an infomercial now, but trust me, I was not sold then. I had never installed a logging agent on anything, let alone in Kubernetes, which I had only just learned how to use. To my sincere relief, installing observIQ was just a matter of copying and pasting an agent into Kubernetes, creating a .yml component, and viola – logs started appearing in observIQ’s dashboard.

It was time to play a game and look for the cause of our turn delay. We were looking for anything that could lead us to the source. We made the first move in our dummy game. All the appropriate logs flew across the Live Tail real-time log feed. It looked normal, but the game had no move delay, either. We tried again, hoping to reproduce the delay. It took about seven moves until we encountered the issue. One player moved, and the new board position didn’t appear on the other player’s board for several seconds. We returned to the logs and searched for an issue, but nothing was wrong. Every move, including the delayed one, was sent, processed, recorded, and sent back to the opponent in less than a hundredth of a second.

Log management didn’t show us where the problem was, but it did the next most valuable thing. It showed us where the problem wasn’t. We were previously convinced that the problem had to be on the server. We would have rebuilt the Kubernetes cluster from the ground up before checking the Unity project. Once we knew the server was good to go, we turned to Unity. Thanks to the new insight from observIQ, we knew the server sent new game states flawlessly, so how were they received? Since there were no errors in the code and the moves always displayed (eventually), we knew the board positions arrived and processed correctly. No one could think of a good reason why the game took so long to process and display a new board state, so we popped open the hood and looked for something to keep the board state from appearing on the players’ screens.

Early in the game’s construction, we pasted in a free turn-based multiplayer API from the Unity asset store responsible for sending and receiving messages from our server. It had a line buried somewhere in the stack that restricted the game from checking for new board states more than once every two minutes. It made sense once we found it. The API was intended for a game where turns happened simultaneously and were restricted to 2 minutes, so it was a resource-saving technique by the designers of the API. Smart for them, bad for us. In chess, every millisecond matters. We flipped it so that the board state updated when the server sent a new position, and the problem was finally resolved.

Log management was off our radar as a small development team focused on a pet project. Before finding observIQ, we all assumed that the cost and overhead of dealing with logs would be more effort than benefit. Yet, when we encountered an issue during development that we couldn’t solve, log management played the most critical role in the diagnosis. It isn’t just for big enterprise organizations. I can honestly say I believe in the value of observability for small teams and independent developers.