The very act of writing software implies that we (unintentionally) introduce bugs. It's possible to reduce the number of bugs you introduce. But it's definitely not possible to bring it down to zero. This means that most software we work with, build, or use will fail, at times in interesting ways. And then you, the software developer, will have to go in and figure out where things are going wrong.
After spending a few years building, breaking, and then fixing software, I've found a couple of things to keep in mind that help me when I'm trying to debug something. They work on all kinds of bugs - from broken libraries where there's a difference between documented and actual behavior, to distributed systems where touching one component affects a different one in ways that one would normally not expect.
In this article, I'd like to describe some of these approaches that I've found useful in discovering the root causes of issues in production and how to resolve them.
1. Read and understand what's happening
The first approach is also the most obvious one.
We need to read and, more importantly, understand all of the information available to us. And when I use the word "understand", I mean really give it a long, hard look and try to get to the bottom of it.
When you're trying to fix something that's broken, it's critical to build a mental model of what the expected behavior is and what is happening instead. Once you have that in place, reading and understanding the information available to you (in the form of an error message or a log entry, or whatever you may have at hand) goes a long way in figuring out what exactly is going wrong.
Often, I talk to engineers who are stuck on some exception in their code, unable to make progress. In almost every case, I've noticed that they skip reading the stack trace in front of them. The general practice is to paste the entire exception into the Google search box, in the hope of finding a Stack Overflow answer with the exact same exception. At times this works. When it doesn't, though, it's pretty much a dead-end debugging strategy.
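To make this concrete, here is a minimal sketch of what reading a stack trace gives you. The functions `load_config`, `start_server`, and `summarize_exception` are hypothetical names made up for this illustration. The sketch uses Python's standard `traceback` module to pull out the exception type, its message, and the innermost frame, which is usually the first place worth looking.

```python
import traceback


def load_config(settings):
    # Hypothetical helper: assumes a "timeout" key is always present.
    return settings["timeout"]


def start_server(settings):
    # Hypothetical entry point that calls the helper above.
    timeout = load_config(settings)
    return f"server started with timeout={timeout}"


def summarize_exception(exc):
    """Collect the parts of a stack trace that are worth reading first."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1]  # the last frame is where the exception was raised
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "function": innermost.name,
        "call_chain": [frame.name for frame in frames],
    }


try:
    start_server({"port": 8080})
except KeyError as exc:
    summary = summarize_exception(exc)
    print(summary)
```

Here the trace already says most of what we need: a `KeyError` for `timeout`, raised inside `load_config`, reached through `start_server`. That points at a missing setting rather than at anything a web search for the raw exception text would reveal.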
2. Think using first principles
Software is generally pretty deterministic (note the emphasis on the word generally).
Things tend to have a well-defined flow. There are obviously huge exceptions to this rule. But if your work involves developing user-facing products on the web (like mine), you're most likely working at an abstraction level where the flow tends to be pretty well-defined.
As an example, a SaaS product generally has a frontend component, a backend component, and then the infrastructure that the whole thing is running on. There are other things as well. For instance, you likely have a load balancer layer that all user traffic goes through before hitting the frontend. Or you might have some queues that the backend uses to perform background processing.
Each of those components has a specific purpose in this chain. If, for instance, you're facing a bug that spans multiple layers, it's generally helpful to think from first principles about what each component is supposed to do and what it might be doing instead.
For instance, if you notice that the data the backend is receiving is not something that it's supposed to, it'll help to build a mental model of the complete user request - what the frontend sends to the backend, if the load balancer can modify the request headers in some way, stuff like that. Visualizing the stack like this goes a long way in being able to pinpoint the problematic layer(s).
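One way to turn that mental model into something you can run is to pass a request through each layer and record what every layer changed. The sketch below is only an illustration under made-up assumptions: `frontend`, `load_balancer`, and `backend` are hypothetical stand-ins written as plain functions, and the bug (a load balancer that drops the `Authorization` header) is planted on purpose.

```python
import copy


def frontend(request):
    # Hypothetical frontend: attaches an auth header before sending.
    request["headers"]["Authorization"] = "Bearer example-token"
    return request


def load_balancer(request):
    # Hypothetical load balancer: adds a forwarding header...
    request["headers"]["X-Forwarded-For"] = "10.0.0.1"
    # ...and, as the planted bug, drops the auth header.
    request["headers"].pop("Authorization", None)
    return request


def backend(request):
    # Hypothetical backend: only reads the request, changes nothing.
    return request


def trace_layers(request, layers):
    """Run the request through each layer and record header changes."""
    changes = []
    for layer in layers:
        before = copy.deepcopy(request)
        request = layer(request)
        after = request["headers"]
        added = {k: v for k, v in after.items() if k not in before["headers"]}
        removed = [k for k in before["headers"] if k not in after]
        changes.append((layer.__name__, added, removed))
    return request, changes


initial = {"headers": {}, "body": {"user_id": 42}}
final_request, changes = trace_layers(
    initial, [frontend, load_balancer, backend]
)
for name, added, removed in changes:
    print(f"{name}: added={added} removed={removed}")
```

In a real system the layers are separate services, so the equivalent is logging the request at each boundary and comparing the logs. The idea is the same: compare what each component received with what it passed on, and the problematic layer stands out.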
3. Avoid trying out random things
One of the rather interesting things I've seen software developers (including me) do when trying to fix a problem is trying out random things. I will try to illustrate this with a very basic example. Consider the following function:
def add(a, b):
    return a - b
We have a function called add which is supposed to add both its arguments and return the result. Instead, the function returns their difference. What I've often seen folks do is try out random approaches. One example would be changing the definition to something like:
def add(a, b):
    return a * b
... in the hopes that the final result will be what we expect.
This urge is natural. And at times this may certainly work. But it's a bit like throwing spaghetti at the wall. It's not particularly efficient. It's much better to spend a quiet moment with the code at hand and try to get to the bottom of it. The success rate will be much higher that way.
Again, I've been guilty of this myself as well. There are times when in a desperate attempt to fix things, I try out solutions that are more or less random relative to the problem at hand. These are definitely times when the "hey I wonder what happens if I do this" thought process kicks the rational part of the brain out. Over time though, I've noticed that this is not particularly effective. So now whenever I feel this urge coming up, I try to keep it in check and start by really trying to understand what's wrong.
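A more deliberate alternative is to write down what the function is expected to do and let the failures guide you. The sketch below reuses the buggy `add` from the example. The `check` helper, the `multiply_guess` function, and the list of cases are assumptions made up for this illustration.

```python
def add(a, b):
    return a - b  # the buggy implementation from the example


def multiply_guess(a, b):
    return a * b  # the "random" fix from the example


def check(func, cases):
    """Return every case where func's result differs from the expected one."""
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Expected behavior, written down before touching the code.
cases = [((2, 3), 5), ((0, 0), 0), ((2, 2), 4), ((-1, 1), 0)]

buggy_failures = check(add, cases)
guess_failures = check(multiply_guess, cases)

for args, expected, actual in buggy_failures:
    print(f"add{args}: expected {expected}, got {actual}")
print(f"multiply guess still fails {len(guess_failures)} case(s)")
```

Note that the multiplication guess passes the `(2, 2)` case by coincidence, which is exactly how a random change can look like a fix. Reading the failures instead (expected 5, got -1 for `add(2, 3)`) points straight at the wrong operator.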
4. Ask someone else
One of the best things to do is to ask someone! This could be a coworker, or someone on the Internet in some online community you're a part of.
Asking a coworker is definitely the more helpful option though, because coworkers are likely to have more context on what you're working on. Ideally this is someone with more experience than you. Senior engineers have run into plenty of problems over their careers, and over time they develop an ability to pattern-match problems. So your bug might remind them of something similar (if not identical) that they worked on a few years ago. Having someone give you this perspective is invaluable and can go a long way in solving the problem at hand.
Of course, this is not to say that only senior folks can help. Good engineers, regardless of whether they're junior or senior, tend to ask very interesting questions. They might help you see the problem from a different angle, or follow a line of questioning that you weren't even considering earlier. Such conversations are super helpful.
And if you don't have any of those available, you can always try talking to a rubber duck. The first time I tried it out, I was extremely surprised at how effective it was.
5. Step out and take a walk
If you've done all of the above and still don't know how to proceed, the best way forward might just be to take a break. Sometimes our brains have had enough. When you reach that point, there's really no way that you can consume and/or process more information.
In such situations, just stop working. Get away from your desk, ideally go out and get some fresh air. Go for a run. Jump in the shower. Anything that keeps you away from the problem at hand so that your body gets the break it needs.
You may feel guilty about not "working" when there are problems to solve. But what's really happening is that you're letting your brain work in the background. So technically speaking, you're still making progress. And sooner or later, the answer will hit you out of nowhere.