
Chapter 11: Case Studies and War Stories #

Noah Gift

This section contains a mixture of testing case studies and war stories. Items that didn’t cleanly fit in one part or another got placed here.

Testing Click Commandline Tools #

Writing command-line tools is one of my favorite ways to write code. A command-line tool is one of the simplest ways to take programming logic and create a flexible tool. Let’s walk through a simple click tool.

The following is an intentionally simple command-line tool that takes a flag --path (path to a file) and a flag --ftype (file type). If it runs without these flags specified, it will prompt the user for both of them.

{caption: "Simple click-based CLI that searches for files"}

#!/usr/bin/env python
import click
import glob

# this is bad code intentionally
# varbad=


@click.command()
@click.option(
    "--path",
    prompt="Path to search for files",
    help="This is the path to search for files: /tmp",
)
@click.option(
    "--ftype", prompt="Pass in the type of file", help="Pass in the file type:  i.e csv"
)
def search(path, ftype):
    results = glob.glob(f"{path}/*.{ftype}")
    click.secho("Found Matches:", fg="red")
    for result in results:
        click.secho(f"{result}", bg="blue", fg="white")


if __name__ == "__main__":
    # pylint: disable=no-value-for-parameter
    search()
Another useful feature of click is that the --help flag comes for free and is autogenerated from the options. The following is the output of ./gcli --help.

{caption: "Output of ./gcli --help"}

(.tip) ➜  click-testing git:(master) ✗ ./gcli --help
Usage: gcli [OPTIONS]

Options:
  --path TEXT   This is the path to search for files: /tmp
  --ftype TEXT  Pass in the file type:  i.e csv
  --help        Show this message and exit.

Next up is to run ./gcli and let it prompt us for both the path and the file type.

{caption: “Output of search with prompts”}

(.tip) ➜  click-testing git:(master) ✗ ./gcli
Path to search for files: .
Pass in the type of file: py
Found Matches:

Another run of ./gcli shows how it can be run by passing in the flags ahead of time.

{caption: “Output of search with flags”}

(.tip) ➜  click-testing git:(master) ✗ ./gcli --path . --ftype py
Found Matches:

So how can we test this command-line tool? Fortunately, the authors of click have an easy solution for this as well. In a nutshell, the line from click.testing import CliRunner provides a test runner that invokes the command as shown.

{caption: "Test file"}

from click.testing import CliRunner
from gcli import search

# search(path, ftype):

def test_search():
    runner = CliRunner()
    result = runner.invoke(search, ["--path", ".", "--ftype", "py"])
    assert result.exit_code == 0
    assert ".py" in result.output

Just like pytest, a simple assert statement is all that is needed to create a test. Two different types of assert statements are shown. The first assert checks the result of the call to the command-line tool and ensures that it returns a shell exit status of 0. The second assert parses the returned output and ensures that .py appears in it, since the file type passed in is py.

When the command make test is run, it generates the following output. Again, take note of how the convention make test removes the burden of remembering which exact flags to run. This step can be set up once and then forgotten.

{caption: "Output of make test"}

(.tip) ➜  click-testing git:(master) make test                               
python -m pytest -vv --cov=gcli
========================================= test session starts ==========================================
platform darwin -- Python 3.7.6, pytest-5.3.2, py-1.8.1, pluggy-0.13.1 -- /Users/noahgift/.tip/bin/python
cachedir: .pytest_cache
rootdir: /Users/noahgift/testing-in-python/chapter11/click-testing
plugins: cov-2.8.1
collected 1 item

PASSED                                                                 [100%]

---------- coverage: platform darwin, Python 3.7.6-final-0 -----------
Name      Stmts   Miss  Cover
-----------------------------
gcli.py      11      1    91%

========================================== 1 passed in 0.03s ===========================================

War Story: The Health Check That Wasn’t Wrong #

Working in the film industry has a glamorous side. It is fun telling people I worked on the movie Avatar. It also has a dark side: it may be the poster child for the naughtiest software development practices. I was told many times, “this isn’t a software company,” while a film studio paid software developers by the hour, including overtime, to patch a constant stream of nonsense bugs in code. Unsurprisingly, that studio later shut down because it became too expensive for it to make a movie.

I remember a major film studio with hundreds of employees running workloads through a logging server that an artist had set up under their desk. The entire render farm (thousands of incredibly high-performance and expensive servers) was sending network traffic to a random workstation that could get kicked over or unplugged. The best comparison I can make is that working in film during that era was a lot like living in the dorms in college. You would see things that would always make you shake your head: someone pulling the fire alarm at midnight during final exams, or a room that smelled like it needed cleaning by a hazmat crew. The common thread between living in a college dorm and working as a software engineer in film was that nihilism was the norm.

The peak film-nihilism moment for me came after I spent a year of my life painstakingly tracking down the root cause of mysterious all-day outages that stopped the entire film studio (hundreds of employees) from working. They were triggered by a piece of code written with EXPONENTIAL BACK ON (the opposite of exponential back-off). A specific department would increase the frequency of job submissions to a troubled database to “lock out” all of the other departments from using the database. After I “caught them,” they nonchalantly said that shutting down an entire film studio through a denial-of-service attack was within their rights. They were the most critical department, after all.
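For contrast, the correct pattern, exponential back-off, can be sketched in a few lines. This is a hedged illustration, not the studio's code; the function name, defaults, and jitter range are assumptions.

```python
# Sketch of exponential back-off: the delay DOUBLES after each failure,
# easing pressure on a struggling service. "Back on" inverted this,
# submitting faster the more the database struggled.
import random
import time


def submit_with_backoff(job, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `job`, doubling the wait after each failed attempt (plus jitter)."""
    for attempt in range(max_retries):
        try:
            return job()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Passing `sleep` as a parameter also makes the retry logic testable without real waiting.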

I naively thought that making an efficient system that worked correctly was more important, just as I had thought that studying for final exams was more significant than getting drunk during finals week and failing out of school (which happened to many people in my dorms in college). Even though the film studio ultimately shut down, and the team partially responsible for the problems was never held accountable, there is an interesting testing story to discuss.

After our team had substantially cleaned up a critical system by adding continuous integration, code reviews, load-testing, and logging, we felt pretty good. We felt like we had tamed the beast and most likely stopped any critical issues. A final best practice that we implemented was to create a health-check that would “check-in” an asset into our asset management system. The idea was that it would be a tremendous functional test of our policy.
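The health-check idea can be sketched as follows. The class and method names are assumptions for illustration; a stub stands in for the real asset management system.

```python
# A functional health check exercises the real workflow end to end
# (here, "check in an asset") rather than just pinging the server.
import time


class AssetStore:
    """Stand-in for the asset management system (hypothetical API)."""

    def __init__(self):
        self.assets = {}

    def check_in(self, name, payload):
        self.assets[name] = payload
        return name in self.assets


def health_check(store, deadline_seconds=5.0):
    """Perform a real check-in and verify it succeeds within a deadline."""
    start = time.monotonic()
    ok = store.check_in("healthcheck-asset", b"probe")
    elapsed = time.monotonic() - start
    return ok and elapsed < deadline_seconds
```

Because it drives the real code path, this kind of check can surface performance regressions a simple liveness ping would miss.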

To our surprise, soon afterward, our system started to grind to a halt. The core logic of the asset management system was a “home-grown ORM (object-relational mapper)” created by a developer who no longer worked there. It generated exponentially more SQL with each new check-in. After ten-plus check-ins, there were something like 80K lines of SQL hitting the database. Health checks can expose things you wouldn’t expect!
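A toy reconstruction shows how fast that failure mode compounds. This is not the studio's actual ORM; it is an invented model where each check-in replays everything generated so far.

```python
# Toy model of exponential SQL growth: each check-in re-emits all
# previously generated statements plus one new one, so the statement
# count roughly doubles per call.
class NaiveORM:
    def __init__(self):
        self.sql = ["INSERT INTO assets VALUES ('seed');"]

    def check_in(self, name):
        # Bug: replay the whole accumulated SQL log on every call.
        self.sql = self.sql + self.sql + [
            f"INSERT INTO assets VALUES ('{name}');"
        ]
        return len(self.sql)


orm = NaiveORM()
for i in range(10):
    count = orm.check_in(f"asset-{i}")
print(count)  # 2047 statements after only ten check-ins
```

With doubling per call, the count follows 2^(n+1) - 1, which is why a handful of check-ins is enough to bury a database.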

War Story: The Nine Circles of Hell While Parsing XML #

I remember moving to a new department, and a friendly manager asked for help with a GUI he developed. It performed a vital automation process at this film company, and it just wouldn’t work anymore. I started to dive into the code, and I quickly discovered one of the problems. The data format that it used was highly nested XML. Much of the system looked like the following code sample.

xx = x[0][0][0][1]
yy = xx[3][1]

The code would get more and more nested as I read through it. It was quite literally nine circles of hell. Luckily, there was a comprehensive test suite that I could run that helped me quickly get to the root problem [Warning: sarcasm alert]. What I did to fix this problem was to tackle one mystery at a time.

First, I made the code “lintable” using pylint --disable=R,C. This step is a great place to start when faced with a giant mess. It is a bit like using a mop on a giant food mess on the kitchen floor: you have to start somewhere.

Once that was working, I moved these nested XML statements into functions and wrote tests for them. I also added logging statements for each service. Eventually, I got to the point where I identified a bizarre behavior. The program, a GUI, would suddenly quit, and there was no traceback. This step is what testing does for a legacy mess of spaghetti code; it creates an infrastructure that will reveal bugs.
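That refactor can be sketched as follows. The XML shape and element names here are invented for illustration; the point is replacing opaque chains like x[0][0][0][1] with small named functions that each get a unit test.

```python
# Replace index-chain spelunking with named, testable accessors.
import xml.etree.ElementTree as ET

SAMPLE = """\
<scene>
  <shots>
    <shot name="sh010">
      <frames start="1" end="24"/>
    </shot>
  </shots>
</scene>
"""


def first_shot(root):
    """Named accessor instead of root[0][0]."""
    return root.find("./shots/shot")


def frame_range(shot):
    """Named accessor instead of more index juggling."""
    frames = shot.find("frames")
    return int(frames.get("start")), int(frames.get("end"))


root = ET.fromstring(SAMPLE)
shot = first_shot(root)
print(shot.get("name"))   # sh010
print(frame_range(shot))  # (1, 24)
```

Each accessor is now a one-line test target, and a schema change breaks one function instead of every index chain in the program.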

Next, I added the following debug code to the point in the system where the logging statements stopped.

import pdb;pdb.set_trace()

This step is the tool of last resort that I use when both tests and logging don’t reveal the issue. When I reran the code, I noticed that it imported a library and then quit. When I looked at the library, it appeared to do many things, and those things were wrapped in the following.

    import sys;sys.exit(0)

I had finally found the devil himself. Someone had written a library with zero logging, zero tests, and that would silently fail with an exit status of 0 if an exception happened. There is no worse anti-pattern in Python than this. Hiding all Exceptions and then killing a program if there is an error is pure, unvarnished evil.

What are the takeaways?

  • At least have basic linting of your code working.
  • Why not write a few tests?
  • Why not automatically run linting and testing on each check-in?
  • Why not add logging?
  • Never hide errors; log them instead.

War Story: The Mysterious Vanishing Servers #

The Oxford dictionary defines zealotry as “fanatical and uncompromising pursuit of religious, political, or other ideals; fanaticism.” This story tells a tale of technical zealotry gone bad. Zealotry is part of our DNA as humans, and when building software systems, technical zealotry can cause unintended consequences.

At a SaaS company I worked at, we would frequently get machines that would “disappear” from our Amazon infrastructure. It was frustrating because there were no artifacts. A Linux server would simply vanish. It was infrequent enough that it got ignored for quite a while. Eventually, this problem occurred about once a week. I set out to resolve the issue once and for all.

Digging into a software mystery is a lot like being a murder detective on a TV show. It helps to be skeptical of humans and to figure out their motives. The first part of the mystery was discovering how each Linux machine was assembled. The servers had the following features:

  • Ubuntu Operating System
  • No Swap partition (because “Google doesn’t use swap”)
  • Ran a Java application
  • Ran Apache ZooKeeper, a distributed synchronization component for the Java application
  • Ran Supervisor, a client/server system that can restart a failed process
  • The use of ephemeral disks on EC2 (these don’t save your data when they crash)

As I dug into the problem further, I was curious about the “Google doesn’t use swap” argument. There are good reasons not to configure swap, because it can cause a machine to slow down as the slow physical disk starts to be used as memory. The upside of using swap is that at least you can investigate what is going on and fix it before the machine goes away.

The first thing I did was enable swap because I didn’t care about the theory of a “perfect” server that didn’t need swap. I wanted to stop these machines from crashing. I am not sure why no one had tried this before, but perhaps they were afraid of the blowback from the “zealot” who would tell them the real problem was the software, not his server.

Lo and behold, this immediately paid dividends. Several machines started to slow down a bit later, and when I logged into one, I could see that it was running out of memory. What was happening was that the Java process used more RAM than the machine had available. This caused the Linux kernel to trigger an OOM (“Out of Memory”) condition, in which the kernel finds the process with the largest memory usage and kills it.

There was a twist, though. These machines were also running Supervisor, which would automatically restart a Java process that died. Ultimately, this became a race condition in which the Linux kernel fought against the process manager, starting and restarting the process over and over again. To further add to the excitement, the machines were running Apache ZooKeeper, which performs poorly on highly utilized machines. This caused a ripple effect of instability, high CPU usage, and high memory usage. Because the machines had no swap and used ephemeral disks, the evidence of this error could never surface.

The ultimate fix was a methodical reconfiguration of the machines with the following fixes:

  • Configure Java to use less memory than the computers had
  • Enable swap
  • Move Apache ZooKeeper to dedicated machines (as the documentation had said)
  • Add better logging and instrumentation
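For reference, enabling swap on an Ubuntu machine can be sketched with the commands below. This is a generic configuration fragment, not taken from the incident; the 4 GB size and the -Xmx value are illustrative assumptions.

```shell
# Create and enable a swap file (run as root; size is an example).
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Persist the swap file across reboots.
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# Cap the JVM heap below physical RAM so the OOM killer stays out of it
# (flag value is an example):
# java -Xmx2g -jar app.jar
```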

Zealotry is a dangerous way to approach technology. There is no perfect answer or perfect methodology. What works well in one situation may not work well in another. Skilled testers debugging complex problems should never take anything for granted.