
Chapter 4: Integrating Linux Commands with Click #

Alfredo Deza

Python comes with lots of different utilities to interact with a system: listing directories, getting file information, and even lower-level operations like socket communication. There are situations where these are not sufficient or just not solving the right problem. I remember a time when I worked with Ceph (a distributed file storage system) and had to interact with different disk utilities. There are quite a few tools to retrieve device information, like blkid, lsblk, and parted. They all have some overlap and some distinct features. I had to retrieve some specific information from a device that one tool wouldn’t have, and then go to a different tool to retrieve the rest.

To make matters worse, and because Ceph supports various Linux distributions, some tools didn’t have the features I needed on older versions of a particular Linux distro. What a problem. I ended up creating utilities that would try one tool first and then fall back to the others if it failed, with an order of preference. The result, however, ended up being very robust and resilient to all of these differences. A few essential pieces need to be in place along the way, though, and in this chapter, I go through the pieces that make the interaction seamless, practical, and extraordinarily resilient.

Understand subprocess #

If you search the internet for how to run a system command from Python, you shouldn’t be surprised to find hundreds (thousands?) of examples that look something like this one:

>>> import subprocess
>>> subprocess.call(['ls', '-l'])
total 13512
drwxrwxr-x    9 root  admin      288 Feb 11 13:13 itcl4.1.1
-rwxrwxr-x    1 root  admin  2752568 Dec 18 14:06 libcrypto.1.1.dylib
-rwxrwxr-x    1 root  admin    88244 Dec 18 14:07 libformw.5.dylib
-rwxrwxr-x    1 root  admin    43080 Dec 18 14:07 libmenuw.5.dylib
-rwxrwxr-x    1 root  admin   408344 Dec 18 14:07 libncursesw.5.dylib
-rwxrwxr-x    1 root  admin    25924 Dec 18 14:07 libpanelw.5.dylib
-rwxrwxr-x    1 root  admin   529676 Dec 18 14:06 libssl.1.1.dylib
-r-xrwxr-x    1 root  admin  1441716 Dec 18 14:06 libtcl8.6.dylib
drwxrwxr-x    5 root  admin      160 Dec 18 14:06 tcl8
drwxrwxr-x   17 root  admin      544 Feb 11 13:13 tcl8.6
-rw-rw-r--    1 root  admin     8275 Dec 18 14:06 tclConfig.sh
drwxrwxr-x    5 root  admin      160 Feb 11 13:13 thread2.8.2
-rw-rw-r--    1 root  admin     4351 Dec 18 14:07 tkConfig.sh
0

If the example is trying to capture the output and assign it to a variable, it may instead use something like this:

>>> from subprocess import Popen, PIPE
>>> process = Popen(['ls', '-l'], stdout=PIPE)
>>> output = process.stdout.read()
>>> for line in output.decode('utf-8').split('\n'):
...     print(line)
...
total 13512
drwxrwxr-x    9 root  admin      288 Feb 11 13:13 itcl4.1.1
-rwxrwxr-x    1 root  admin  2752568 Dec 18 14:06 libcrypto.1.1.dylib
-rwxrwxr-x    1 root  admin    88244 Dec 18 14:07 libformw.5.dylib
-rwxrwxr-x    1 root  admin    43080 Dec 18 14:07 libmenuw.5.dylib
-rwxrwxr-x    1 root  admin   408344 Dec 18 14:07 libncursesw.5.dylib
-rwxrwxr-x    1 root  admin    25924 Dec 18 14:07 libpanelw.5.dylib
-rwxrwxr-x    1 root  admin   529676 Dec 18 14:06 libssl.1.1.dylib
-r-xrwxr-x    1 root  admin  1441716 Dec 18 14:06 libtcl8.6.dylib
drwxrwxr-x    5 root  admin      160 Dec 18 14:06 tcl8
drwxrwxr-x   17 root  admin      544 Feb 11 13:13 tcl8.6
-rw-rw-r--    1 root  admin     8275 Dec 18 14:06 tclConfig.sh
drwxrwxr-x    5 root  admin      160 Feb 11 13:13 thread2.8.2
-rw-rw-r--    1 root  admin     4351 Dec 18 14:07 tkConfig.sh

The output variable saves the bytes returned from process.stdout.read(), which then get decoded (read() returns bytes, not a string), and finally, each line gets printed. This is not very useful, except for demonstrating how to keep the output around for processing.

These examples are everywhere. Some go further into checking exit status codes and waiting for the command to finish, but they are missing crucial aspects of correctly (and safely) interacting with system commands. These are a few questions that come to mind and need answering when crafting these types of interactions:

  • What happens if the tool does not exist or is not in the path?
  • What to do when the tool has an error?
  • If the tool takes too long to run, how do I know if it is hanging or doing actual work?

Running system commands looks easy, but building resilient interfaces around them is vital so that failures are easier to address.
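
As a sketch of how those three questions might be answered, here is a hypothetical wrapper built on subprocess.run (the capture_output and text keyword arguments need Python 3.7 or newer):

import subprocess


def checked_run(command):
    try:
        # A timeout distinguishes a hung tool from one doing real work
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=60
        )
    except FileNotFoundError:
        # The tool does not exist or is not in the path
        raise RuntimeError(f'{command[0]} is not installed or not in $PATH')
    except subprocess.TimeoutExpired:
        raise RuntimeError(f'{command[0]} did not complete in 60 seconds')
    if result.returncode != 0:
        # The tool exists and ran, but reported a failure
        raise RuntimeError(f'{command[0]} failed: {result.stderr.strip()}')
    return result.stdout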

There are, primarily, two types of system calls you interact with: one where you don’t care about the output, like starting a web server, and one that produces useful output that needs to be processed. There are strategies you need to apply to each, depending on the use case, to ensure a transparent interface. As long as there is consistency, these system interactions are easy to work with.
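
A sketch of the two styles, with a hypothetical service start for the first and the blkid call used later in this chapter for the second:

import subprocess

# First style: only success or failure matters, the output is irrelevant
subprocess.call(['systemctl', 'start', 'nginx'])  # hypothetical web server

# Second style: the output is the point, so capture it for processing
process = subprocess.Popen(['sudo', 'blkid', '/dev/sda1'], stdout=subprocess.PIPE)
output = process.stdout.read().decode('utf-8')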

Parsing Results #

When the output of a system command gets saved for post-processing, parsing code must be implemented. The parsing can be as easy as checking whether a specific word is in the output as a whole. For more involved output, line-by-line parsing has to be crafted. The more difficult the output, the more chance there is for brittleness in the handling. The parsing should apply strategies from simple (more robust) to complex (prone to breakage). One common thought is to apply a regular expression to anything coming out of a system command, but this is usually my last resort, as it is incredibly hard to debug compared to non-regular-expression approaches.

If the tool you are calling from Python has a way to produce a machine-readable format, use it. It is almost always your best chance at an easy path to parsing results. The machine-readable format can be many things; most commonly, it is JSON or CSV (Comma-Separated Values). Python has native support for loading these and interacting with them easily.
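
For example, when a tool offers JSON output (like the lsblk --json flag discussed later in this chapter), the json module loads it straight into dictionaries and lists; a minimal sketch:

import json
import subprocess

# Ask the tool for JSON and load it into native Python structures
process = subprocess.Popen(['lsblk', '--json'], stdout=subprocess.PIPE)
devices = json.loads(process.stdout.read().decode('utf-8'))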

I find that working directly with subprocess.Popen takes too much boilerplate code, so I usually create a utility that runs a command and always returns the stdout, stderr, and exit code. It looks like this:

import subprocess


def run(command):
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

    stdout_stream = process.stdout.read()
    stderr_stream = process.stderr.read()
    returncode = process.wait()
    if not isinstance(stdout_stream, str):
        stdout_stream = stdout_stream.decode('utf-8')
    if not isinstance(stderr_stream, str):
        stderr_stream = stderr_stream.decode('utf-8')
    stdout = stdout_stream.splitlines()
    stderr = stderr_stream.splitlines()

    return stdout, stderr, returncode

It accepts the command (as a list), reads all the output produced on both stdout and stderr, and decodes it if necessary. The utility splits the output into lines, making it easier to consume later.
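
A quick check in the interpreter shows the shape of what run() returns, reusing the directory listing from earlier:

>>> stdout, stderr, code = run(['ls', '-l'])
>>> code
0
>>> stdout[0]
'total 13512'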

Simple Parsing #

The simplest parsing, using the run() example utility, is checking whether a specific line or string is in the output. For example, imagine you need to check if a given device is using the XFS file system. A utility needs to check whether the report mentions XFS, and nothing more. On my system, I can do this with the following:

$ sudo blkid /dev/sda1
/dev/sda1: UUID="8ac075e3-1124-4bb6-bef7-a6811bf8b870" TYPE="xfs"

So all I need to do is check if TYPE="xfs" appears in the output. Since run() returns the output as a list of lines, the utility checks each line:

def is_xfs(device):
    stdout, stderr, code = run(['sudo', 'blkid', device])
    # run() returns stdout as a list of lines, so check each one
    return any('TYPE="xfs"' in line for line in stdout)

Interacting with this utility is easy enough, even when the blkid tool reports errors that don’t matter for determining whether a device is using the XFS filesystem:

>>> is_xfs('/dev/sda1')
True
>>> is_xfs('/dev/sda2')
False
>>> is_xfs('/dev/sdfooobar')
False

In some cases, no parsing needs to happen at all because the exit code gives you everything you need to know. The system call and its subsequent exit code answer the question: “Was the command successful?” A quick example of this is the Docker command-line tool. Have you ever tried stopping a container? It is pretty simple: first, check if the container is running:

$ docker ps | grep pytest
542818cd6d7f        anchore/inline-scan:latest   "docker-entrypoint.sh"   14 minutes ago      Up 14 minutes (healthy)   5000/tcp, 5432/tcp, 0.0.0.0:8228->8228/tcp   pytest_inline_scan

I’ve confirmed it is running, so now I stop the pytest_inline_scan container and then check its exit code:

$ docker stop pytest_inline_scan
pytest_inline_scan
$ echo $?
0

The container is stopped, and its exit status code is 0. Although I highly recommend using the Python Docker SDK, you can use this minimal example to guide you when you only need to check for an exit status:

def stop_container(container):
    stdout, stderr, code = run(['docker', 'stop', container])
    if code != 0:
        raise RuntimeError(f'Unable to stop {container}')
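
Calling it is then a one-liner; assuming the pytest_inline_scan container from the example above is still running, a non-zero exit code surfaces as a RuntimeError:

>>> stop_container('pytest_inline_scan')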

Advanced Parsing #

There are multiple levels of painful parsing for command-line output. As I’ve mentioned earlier in this chapter, you should always try to tackle the problem with the simple approaches first: machine-readable output if possible, or just checking the exit code. This section dives into some of the advanced parsing I’ve implemented in production, where I avoid regular expressions until I’ve exhausted every other option.

Even though I highly recommend configuring tools to produce machine-readable formats like CSV or JSON, sometimes this is not possible. One time, I saw that the lsblk tool (another device-inspection tool, like blkid) had a --json flag to produce output easily consumable in Python. After creating the implementation, I realized that this wouldn’t work on older Linux distributions because that flag didn’t exist there; it is a somewhat new feature. To have one implementation that would work for older Linux versions as well as new ones, I had to do the hard thing: parsing.

The first thing to do is to separate the implementation into two parts: one that runs the command, and another that parses the output. This is crucial because testing, maintaining, and fixing any problems is easier when the pieces are isolated, with lots of tests to ensure expected behavior. I first start by running the command to produce the output I’m going to work with in the parser. I know what the command is and how the flags should be issued, so I want to verify the output before writing the parsing:

$ lsblk -P -p -o NAME,PARTLABEL,TYPE /dev/sda
NAME="/dev/sda" PARTLABEL="" TYPE="disk"
NAME="/dev/sda1" PARTLABEL="" TYPE="part"

The tool allows a little bit of machine-readable friendliness with the -P flag, which produces pairs of values, as if the output were going to be read by a BASH script. The -p flag uses absolute paths in the output, and finally, -o specifies which device labels I’m interested in. With that output, the parsing can get started. I want the parsing to return a dictionary, so I need to extract the label (NAME, for example) as a key, and then the value within the quotes. A good way to tinker with the parsing is to do it in the Python shell:

>>> line = 'NAME="/dev/sda" PARTLABEL="" TYPE="disk"'
>>> line
'NAME="/dev/sda" PARTLABEL="" TYPE="disk"'
>>> line.split(' ')
['NAME="/dev/sda"', 'PARTLABEL=""', 'TYPE="disk"']
>>> line.split('" ')
['NAME="/dev/sda', 'PARTLABEL="', 'TYPE="disk"']

I try two ways of splitting the line and decide to use the last one, which splits on the double quote, because it partially cleans up the values. These are minor implementation details, and you can try other ways of splitting that may be more efficient. The important part is to separate the parsing from the command execution function, add tests, and narrow down the path by playing with the output in a Python shell.

Now that I’m happy with the splitting, I do another pass, splitting each item to produce the pairs:

>>> for item in line.split('" '):
...     item.split('="')
...
['NAME', '/dev/sda']
['PARTLABEL', '']
['TYPE', 'disk"']
>>>

Very close. There is a trailing quote that needs cleaning in the last item, but the good thing is that the items are now paired nicely, so the parsing doesn’t need to guess which value goes with which key. Even if a value is empty, or contains spaces that would throw off the splitting, the grouping makes it easier to recover. Let’s try again, adding items into a dictionary and cleaning up further:

>>> parsed = {}
>>> for item in line.split('" '):
...     key, value = item.split('="')
...     parsed[key] = value.strip('"')
...
>>> parsed
{'NAME': '/dev/sda', 'PARTLABEL': '', 'TYPE': 'disk'}

Excellent. The parsing side is complete, as I’m happy with the result in the shell. Before moving forward, I write a dozen unit tests to make sure I got this right. However, I know you are thinking about doing this with regular expressions, because that would be super easy, and you think I’m plain silly for not writing a simple regular expression that splits on a group of uppercase letters. Regular expressions are not straightforward, and they are hard to test (no if or else conditions, impossible to get a sense of coverage). Let’s try a couple of rounds of regular expressions to achieve the same result.

First, and somewhat in cheating mode here, I split on whitespace:

>>> import re
>>> line = 'NAME="/dev/sda" PARTLABEL="" TYPE="disk"'
>>> line
'NAME="/dev/sda" PARTLABEL="" TYPE="disk"'
>>> re.split(r'\s+', line)
['NAME="/dev/sda"', 'PARTLABEL=""', 'TYPE="disk"']

I now have each part with its pairs. Many regular expressions could be thrown at the items in this list to produce what we want. I’m the person at work who tries the simplest approach first and ensures that it works. Simple always wins for me. Instead of splitting further, I use a regular expression to get rid of the characters I don’t need:

>>> for item in re.split(r'\s+', line):
...      re.sub('("|=)', ' ', item)
...
'NAME  /dev/sda '
'PARTLABEL   '
'TYPE  disk '

It looks broken without the delimiters or quotes it had before, but this is fine because, in the next step, I split on whitespace, which is the default for the .split() string method:

>>> for item in re.split(r'\s+', line):
...      result = re.sub('("|=)', ' ', item)
...      result.split()
...
['NAME', '/dev/sda']
['PARTLABEL']
['TYPE', 'disk']

In this case, result is each string, separated by whitespace, produced by replacing the characters I don’t want. Then .split() removes that whitespace, creating the pairs. The final piece of code that produces the dictionary mapping is now easy:

>>> parsed = {}
>>> for item in re.split(r'\s+', line):
...     result = re.sub('("|=)', ' ', item)
...     key, value = result.split()
...     parsed[key] = value
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
ValueError: not enough values to unpack (expected 2, got 1)

Oh no. What happened here? I was distracted and didn’t realize that replacing the double quotes along with the equals character made some values disappear. That is why PARTLABEL="" produces just one item, not two. The fix is to remove only the = character and clean up the quoted value later:

>>> for item in re.split(r'\s+', line):
...     result = re.sub('=', ' ', item)
...     key, value = result.split()
...     parsed[key] = value.strip('"')
...
>>> parsed
{'TYPE': 'disk', 'NAME': '/dev/sda', 'PARTLABEL': ''}

I mentioned I cheated because, in the end, I’m still following the pattern of splitting and then doing further cleanup. Separating the parsing process this way makes it easier to understand and improve when the output doesn’t conform to expectations. This chapter doesn’t cover the testing part, but it is imperative to add as many tests as possible to ensure the parsing is correct.

The end result of these approaches gives us a nice API to work with:

def _lsblk_parser(lines):
    parsed = {}
    for line in lines:
        for item in line.split('" '):
            key, value = item.split('="')
            parsed[key] = value.strip('"')

    return parsed


def lsblk(device):
    command = [
        'lsblk',
        '-P',   # Produce pairs of key/value
        '-p',   # Return absolute paths
        '-o',   # Define the labels we are interested in
        'NAME,PARTLABEL,TYPE',
        device
    ]

    stdout, stderr, code = run(command)
    return _lsblk_parser(stdout)

And the resulting API interaction with the functions:

>>> lsblk('/dev/sda')
{'NAME': '/dev/sda1', 'PARTLABEL': '', 'TYPE': 'part'}
>>> lsblk('/dev/sda1')
{'NAME': '/dev/sda1', 'PARTLABEL': '', 'TYPE': 'part'}
>>> lsblk('/dev/sdb')
{}
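
Because the parser is isolated from the function that executes the command, it can be tested without a real device. A minimal sketch of such tests, using pytest-style assertions:

def test_lsblk_parser_handles_disk():
    lines = ['NAME="/dev/sda" PARTLABEL="" TYPE="disk"']
    result = _lsblk_parser(lines)
    assert result == {'NAME': '/dev/sda', 'PARTLABEL': '', 'TYPE': 'disk'}


def test_lsblk_parser_handles_no_output():
    assert _lsblk_parser([]) == {}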

Shell Safety #

One of the things that prevented me from loving Python at first (I was coming from BASH) was that subprocess.Popen defaults to accepting a list for the command to run. This can get tiring and repetitive quite fast, but it is generally safer to use a list. That doesn’t mean that using a plain string (accepted when shell=True is passed to Popen) is always inadequate or unsafe; it depends on where the string is coming from. In all the examples in this chapter, the input was curated and carefully constructed by the functions. Still, if the interfaces in a command-line tool accept input from a user, that is a security concern. With shell=True, Python spawns a sub-shell that evaluates the input first, expanding variables, for example, which can have undesired effects. In security, this is called shell injection, where input can result in arbitrary command execution.

Even when you aren’t accepting input from external sources, you are at the mercy of the system and how the sub-shell interprets (or expands) variables and other behavior.
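
A contrived sketch of the difference, with a hypothetical malicious value standing in for user input:

import subprocess

# Hypothetical malicious input: a separator followed by a destructive command
user_input = 'somefile; rm -rf /tmp/important'

# Unsafe: the sub-shell sees two commands and runs both
subprocess.Popen(f'ls {user_input}', shell=True)

# Safer: the list form passes everything as a single argument to ls,
# so no sub-shell gets involved and nothing extra runs
subprocess.Popen(['ls', user_input])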

In short: don’t use shell=True to pass a whole string, and always sanitize input coming from the user.