I have designed my own web reader. With the web reader, I can subscribe to blogs and websites I like. The reader updates once a day to show all the web pages published in the previous day. I can also see pages published in the last seven days.
Yesterday, I found an edge case: the title of an article that used curly quotes was not displayed properly. The curly quotes appeared as “mojibake”: characters that look garbled because the text was decoded with a different character encoding from the one it was written in. It can happen when, for example, you convert a piece of text to the wrong encoding.
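To make mojibake concrete, here is a minimal sketch in Python. The title string is made up for illustration, and Windows-1252 (`cp1252`) is only an assumed "wrong" encoding; the encoding involved in my reader may well have been a different one.

```python
# A made-up title containing a curly apostrophe (U+2019).
title = "It\u2019s here"

# A server would send the title as bytes; here they are UTF-8 encoded.
raw = title.encode("utf-8")

# Decoding those bytes with the wrong encoding produces mojibake:
# the three UTF-8 bytes of the apostrophe become three separate characters.
garbled = raw.decode("cp1252")
print(garbled)

# Decoding with the encoding the bytes were written in recovers the title.
restored = raw.decode("utf-8")
print(restored)
```

The bytes never change in this example. Only the encoding used to read them does, which is why the same feed can look fine in one program and broken in another.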
I started to think about how to fix the error. First, I opened my web reader code. Then, I ran a test on the exact feed that was causing problems. I have come to realise that focusing on the specific case you want to fix is essential in debugging. It is much more efficient to test only the broken feed than to run the whole script with all feeds and wait for the error to come up.
Then, I started to ask: “what is the nature of the problem?” In discussion with the maintainer of a package I was using, it was noted that the issue was mojibake. This information gave me a clue as to what was going wrong and where to look next. I looked up mojibake and learned the problem came from character encoding. I didn’t know too much about character encoding, so I knew I’d have to do more research to help figure out what was wrong.
Knowing the error, I could then ask “what is breaking?” Initially, I thought the problem could be the library that converts feeds from one format to another, but that did not hold up once I checked what the feed looked like before it was converted: the mojibake was already there. To help figure out what was breaking, I created a new Python file containing a minimal example of the feed request. This helped me decouple the bug from the rest of my code. I isolated the problem to the web request.
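A standalone file for this kind of check might look roughly like the sketch below. It is not the code from my reader: the function names `fetch_raw` and `first_title`, the placeholder URL, and the use of `urllib` are all assumptions for illustration, and the regular expression is a rough stand-in for a real feed parser.

```python
import re
import urllib.request


def fetch_raw(url: str) -> bytes:
    """Return the response body as bytes, before any decoding happens."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def first_title(raw: bytes, encoding: str) -> str:
    """Decode the body with an explicit encoding and return the first <title> text."""
    text = raw.decode(encoding, errors="replace")
    match = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


# Usage (the URL is a placeholder, not a real feed):
# raw = fetch_raw("https://example.com/feed.xml")
# for encoding in ("utf-8", "cp1252"):
#     print(encoding, repr(first_title(raw, encoding)))
```

Keeping the fetch and the decode as two separate steps makes it possible to see whether the bytes from the server are already wrong or whether the damage happens when they are turned into text.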
Then, I opened the feed in my browser to validate how the response from the server looked. All the characters looked good; there was no mojibake. Thus, the published feed was okay – the problem was something to do with how I retrieved the feed.
The fix turned out to be converting the feed contents to UTF-8 encoding. When I made this conversion, the previously malformed curly quotes were rendered as proper characters.
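As a hedged sketch of what such a conversion can look like, the snippet below decodes the raw response bytes as UTF-8 explicitly instead of relying on a default. The helper name `decode_feed` and the Windows-1252 fallback are assumptions for illustration, not the exact change I made.

```python
def decode_feed(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode feed bytes as UTF-8, using a fallback encoding only if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8, so try the (assumed) fallback encoding
        # and replace any byte that still cannot be decoded.
        return raw.decode(fallback, errors="replace")


# A made-up title with curly quotes, encoded the way a UTF-8 feed would send it.
raw_title = "\u201cHello\u201d".encode("utf-8")
print(decode_feed(raw_title))
```

Trying UTF-8 first is a reasonable default for feeds because valid UTF-8 is rarely produced by accident, but a feed that declares another encoding should be decoded with the one it declares.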
I have a few reflections from this experience that I am starting to connect with some of my prior experiences with debugging. First, you need to know what the problem is and have an example of the problem. Then, you can test the example that caused the problem. From there, you can isolate what caused the problem. You cannot simply assume what causes a problem: you have to test to find out. Then, when you know what’s wrong, you can start to fix it.
Breaking out some of my logic into a separate file and writing a minimal reproducible example was helpful. This let me test my assumptions about the fundamental primitives – the web request and the feed conversion – that I was using in my code. I found that the web request was the problem, then went on to test several potential fixes before finding one that worked.
When I have been debugging and improving my search engine, I have found myself breaking out logic into files a lot too. I find it useful to test small examples of code to validate if something works and how fast it is before I incorporate it into my larger code.
If you write code, I would love to hear stories about how you debug code and what you have learned. I haven’t consciously reflected on debugging procedure too much until today. I’m keen to read more!