Chapter02 Store Data

    Chapter 2: Learn to store data #

    Alfredo Deza

    The first time I got into dealing with data in Python was when Noah gave me a cool idea to try out on my own: “you should write a tool to find duplicates”. We were both working at a media agency that had a large shared storage server. Finding duplicates was a neat idea to reduce usage.

    I was learning Python still, but having the chance to think about how to solve a problem thoroughly was very enticing. To tackle the project, I had to understand how to store data. This can get tricky pretty fast; tutorials for beginners don’t explain this problem as storing data, but rather, they use a more engineering-friendly terminology: data structures.

    Data structures are things that you use to store information. This information can be anything really, and I demonstrate a few different types of information you can store. The next time you hear “data structure”, you can be sure that it just refers to a place where you store items.

    It’s all about state #

    Another word that is thrown around to describe items or pieces of information that define the environment of a program is state. One of the common pieces of state is time. Many scripts tend to keep track of time while they do work. Use the time module to calculate the time elapsed:

    >>> import time
    >>> now = time.time()
    >>> now
    1583884231.5750232
    >>> time.time() - now
    17.257133722305298
    

    17 seconds passed since I created the now variable. Although no items were stored, the program was keeping track of time to report it later. This is what state is all about: allow the program to gather information and use it at any given point in time. In the next sections, I explain in detail the actual data structures that can hold information. These data structures are the core of this chapter.

    A program’s state can be many different things. One common piece of state that I deal with almost daily is configuration. Some command-line tools, for example, can be configured with files in the file system. The idea is that you set some values in the file, and when the tool starts, it reads those values and stores them to be retrieved later. A usual parameter in the configuration is the verbosity of the output, for example, debug or info levels. A tool would read that setting and adjust accordingly. Inside the program, different pieces can consult the value of that setting to decide whether they need to log output or not.
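
    As a rough illustration of configuration as state, the following sketch reads a verbosity setting once and lets another piece of the program consult it later. The section name (logging), the option name (verbosity), and the log_debug helper are assumptions made up for this example, and the configuration comes from a string rather than a real file:

```python
import configparser

# Hypothetical configuration contents; a real tool would read this from a file.
CONFIG_TEXT = """
[logging]
verbosity = debug
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Read the setting once and keep it around: this value is part of the state.
verbosity = parser.get('logging', 'verbosity', fallback='info')


def log_debug(message):
    """Return a formatted message only when the stored verbosity is 'debug'."""
    if verbosity == 'debug':
        return 'DEBUG: {0}'.format(message)
    return None


print(log_debug('reading configuration'))
```

    The fallback argument means a missing option quietly becomes info instead of raising an error, which is one possible design choice, not the only one.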

    To read configuration files and store that information is a somewhat advanced feature of building command-line tools. Probably the first type of data structure you will deal with is the list, followed by one of my favorite ones: the dictionary.

    Lists #

    Lists are easy to encounter and easy to abuse. Lists hold individual items, keeping a specific order. To access them, treat the order like an index. The index starts at 0, and it increases by one every time a new item gets added. Going through a list with a loop (often called a “for loop”) is the most common operation you will encounter. In its simplest form it looks like this:

    >>> directories = ['Documents', 'Music', 'Desktop', 'Downloads', 'Pictures', 'Movies']
    >>> for directory in directories:
    ...     print(directory)
    ...
    Documents
    Music
    Desktop
    Downloads
    Pictures
    Movies
    

    The example is cheating because it is looping over a pre-made list of directories. Python can list directories with the os module, so let’s improve the simplistic example. The constraint is that I want to list directories only, and the os.listdir function lists everything in a given path, so the loop must detect if the item processed is a directory or not so that it can decide to print:

    >>> import os
    >>> for item in os.listdir('.'):
    ...     if os.path.isdir(item):
    ...         print(item)
    ...
    .packer.d
    .getmail
    vim
    .config
    Music
    .distlib
    ssl
    .virtualenvs
    .vim
    .nvim
    .gnupg
    dotfiles
    VirtualBox VMs
    .tox
    bin
    python
    .vagrant.d
    vpn
    .pyvim
    .local
    Pictures
    .pylint.d
    .ipython
    Desktop
    Library
    Downloads
    .cache
    .aws
    .python-eggs
    .zsh
    

    I had to cut the example short because I have too many directories, and a lot of them start with a dot. I could further process the items by checking if they start with a dot.
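
    A minimal sketch of that extra check follows. It uses a handful of names copied from the output above instead of calling os.listdir, so the result does not depend on any particular machine, and is_hidden is a helper name invented for this example:

```python
# A few names copied from the earlier output, standing in for os.listdir('.')
found = ['.packer.d', 'vim', '.config', 'Music', 'ssl', '.vim', 'dotfiles']


def is_hidden(name):
    """Names that start with a dot are treated as hidden."""
    return name.startswith('.')


for item in found:
    # Skip hidden names and print everything else
    if is_hidden(item):
        continue
    print(item)
```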

    But a list is useful for more than consuming items that are produced by a function or defined in a variable. A list can be used to store items, and add more items to it. You can add items to a list by appending (the item goes at the end of the list), inserting at a given position, or extending using another list. There are many options to manipulate a list!

    In the previous example, I was checking if an item produced by os.listdir was a directory and printing it immediately. Storing items in a list lets you split the work into smaller steps:
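
    The three ways of adding items can be sketched side by side. The directory names are just sample values:

```python
directories = ['Documents', 'Music']

# append() places a single item at the end of the list
directories.append('Pictures')

# insert() places an item at the given index, shifting the rest to the right
directories.insert(0, 'Desktop')

# extend() adds every item from another list, in order
directories.extend(['Movies', 'Downloads'])

print(directories)
# ['Desktop', 'Documents', 'Music', 'Pictures', 'Movies', 'Downloads']
```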

    >>> import os
    >>> important_directories = []
    >>> for item in os.listdir('.'):
    ...     if os.path.isdir(item):
    ...         important_directories.append(item)
    ...
    >>> for directory in important_directories:
    ...     if directory[0].isupper():
    ...         print(directory)
    ...
    Music
    Lightroom4
    VirtualBox VMs
    Pictures
    Desktop
    Library
    Public
    Movies
    Applications
    Documents
    Downloads
    

    You can always add if and else conditions to loops, but sometimes when the lists are no bigger than a few items, it is helpful to store what you need (important_directories in this case) and then process the items captured in the first loop. I’ve encountered loops in production code that were hundreds of lines long, which makes it very hard to understand what is going on and to make changes with confidence. It is always a win when code is reduced to improve clarity.

    There are a couple of other interesting operations with lists that you can use that are especially useful when using a list for storing and retrieving data. Just like you can insert items at a given position (called index), you can retrieve items as well from a position:

    >>> items = ['first', 'second', 'third']
    >>> items[0]
    'first'
    >>> items[2]
    'third'
    

    The example uses 0 because lists (and tuples) start at that number, so a 0 indicates that you are requesting the first item in the list at position 0.

    In some situations, you don’t know the index. Instead of going through every item in a list yourself, you can use the .index() method to find the item and report its index. Reusing the previous example:

    >>> items.index('third')
    2
    

    Very useful! But watch out, if the item is not found a ValueError exception is raised:

    >>> items.index('foo')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: 'foo' is not in list
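
    One way to deal with that exception is to catch it and return a value that signals “not found”. This is only a sketch, and find_index is a hypothetical helper, not something the list type provides:

```python
items = ['first', 'second', 'third']


def find_index(sequence, wanted):
    """Return the index of wanted, or None when it is not present."""
    try:
        return sequence.index(wanted)
    except ValueError:
        return None


print(find_index(items, 'third'))  # 2
print(find_index(items, 'foo'))    # None
```

    Returning None keeps the caller from crashing, but the caller then has to check for None before using the result as an index.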
    

    Tuples #

    This section covers both lists and tuples, and I haven’t mentioned tuples until now because lists are more commonly found in production code. Many features apply to both. For example, you retrieve and find items in the same way, and errors like the ValueError above happen in the same circumstances. There is one main difference: tuples are immutable. Once created, items can’t be removed or added. If you need that kind of guarantee, then tuples are what you are looking for. Aside from this difference, you can tell them apart in code because lists use square brackets, and tuples use parentheses.

    This is how tuples behave. You can see the similarities:

    >>> ro_items = ('first', 'second', 'third')
    >>> ro_items.index('second')
    1
    >>> ro_items[1]
    'second'
    >>> ro_items.index('foo')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: tuple.index(x): x not in tuple
    

    There are only two methods for tuples: count() and index(), making them a read-only data structure (or immutable, as I mentioned before). I’ve used tuples in a few situations. In one of them, a global variable was defined as a list at the top of a module, and the application was manipulating it in a way that made it inconsistent because items got removed, or the order kept changing. This was unacceptable, and there wasn’t a clear way to ensure that every interaction with that globally-defined list behaved correctly and did not change it at all. The solution? Make the list a tuple. At runtime, any piece of code that tried to alter the contents broke with an exception.

    This is how tuples complain with exceptions when trying to manipulate them:

    >>> ro_items.append('fourth')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'tuple' object has no attribute 'append'
    

    Since tuples have only two methods (count() and index()), all other methods that exist in lists fail with an AttributeError. There is one case where it might look like a tuple is being modified, but it isn’t. You need to be careful!

    >>> ro_items + ('fourth', 'fifth')
    ('first', 'second', 'third', 'fourth', 'fifth')
    >>> ro_items
    ('first', 'second', 'third')
    

    Using the addition operator (+) is supported, but that creates a new tuple. The original one stays the same. A TypeError is raised when attempting to use the subtraction operator (-):

    >>> ro_items - ('first', 'second')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
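
    The count() method has not been shown yet, and there is still a way to get a tuple without a given item: build a list copy, change the copy, and convert it back. This sketch assumes a tuple with a repeated value so that count() has something to report:

```python
ro_items = ('first', 'second', 'third', 'first')

# count() reports how many times a value appears
print(ro_items.count('first'))  # 2

# To "remove" an item, build a list copy, change it, and convert it back
as_list = list(ro_items)
as_list.remove('second')
without_second = tuple(as_list)

print(without_second)  # ('first', 'third', 'first')
print(ro_items)        # the original tuple is unchanged
```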
    

    List comprehensions #

    List comprehensions are useful and incredibly powerful, but a word of warning: like everything in life, anything in excess is not good. Drinking water is a healthy habit for staying hydrated, but you can die if you attempt to drink ten gallons at once. Although I like them and use them when I find a good fit, they are very odd if you have never used them before or don’t know the syntax that well. Even today, I have to double-check I’m doing it correctly, and more often than not, I have exceptions raised because I get the ordering wrong.

    The ordering is the important item to understand to get list comprehensions right. It is like a loop written backwards. To illustrate this, the following two examples produce the same result, one with a loop and the other one with a list comprehension:

    >>> items = ['a', '1', '23', 'b', '4', 'c', 'd']
    >>> numeric = []
    >>> for item in items:
    ...     if item.isnumeric():
    ...         numeric.append(item)
    ...
    >>> numeric
    ['1', '23', '4']
    

    The for loop goes through each item in the items list and checks if it is a string that looks numeric. If it is, it gets appended to the numeric list that was defined before the loop started. The list comprehension to create the same list would look like this:

    >>> numeric = [item for item in items if item.isnumeric()]
    >>> numeric
    ['1', '23', '4']
    

    Adding the if condition there at the end makes the example a bit more complicated. There is no appending, and everything happens in-line. Note how the order differs from a regular loop: the statement doesn’t start with for ...; it starts with the value that ends up in the resulting list.

    When not to use them #

    Many times, I’ve seen developers use them to avoid a loop. Don’t do this. Loops are fine, and using a list comprehension just to save the code from a loop is a terrible idea. This habit is so common that list comprehensions aren’t the only option developers use to avoid loops. A lot of the helpers in the functools module, sometimes mixed with map(), are abused in such a way that makes code very obtuse and hard to understand. That is my prime concern with abusing the syntax to avoid loops: it makes code difficult to understand and debug. Even worse, if you are used to testing, these shortcuts make it much harder to measure coverage accurately, because several branches end up on a single line.

    In short: don’t use them to avoid loops and use them sparingly. Not everything needs to go in a list comprehension. If the usage is readable, not overly long, and is not trying to avoid loops like a disease, then go for it.

    In addition to avoiding loops, developers try to do nested iterations in a list comprehension. Things get really complicated when doing nested loops because the result comes first, followed by every loop and condition in a single expression, and it takes quite the effort to try and follow what is going on. To complicate things, in the next example, I nest the items list:

    >>> nested_items = [items, items]
    >>> nested_items
    [['a', '1', '23', 'b', '4', 'c', 'd'], ['a', '1', '23', 'b', '4', 'c', 'd']]
    

    In the scenario above we have a nested list that I want to process just like before. The result should be a flat list containing only numeric strings. I will process them with a normal loop first:

    >>> numeric = []
    >>> for parent in nested_items:
    ...     for item in parent:
    ...         if item.isnumeric():
    ...             numeric.append(item)
    ...
    >>> numeric
    ['1', '23', '4', '1', '23', '4']
    

    Now the same goal with a list comprehension:

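    >>> numeric = [item for parent in nested_items for item in parent if item.isnumeric()]
    >>> numeric
    ['1', '23', '4', '1', '23', '4']
    

    The for clauses keep the same order as in the nested loop; only item moves to the front. Writing them the other way around (for item in parent before for parent in nested_items) raises a NameError in a fresh session because parent is not defined yet, and silently produces a differently ordered list if parent is left over from an earlier loop. Sometimes, breaking up the loop statements within the list comprehension is done to improve readability:

    >>> numeric = [
    ...     item
    ...     for parent in nested_items
    ...     for item in parent
    ...     if item.isnumeric()
    ... ]
    >>> numeric
    ['1', '23', '4', '1', '23', '4']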
    

    I still think this is too complicated to parse even though the example is trivial. Imagine doing this in production code with more complex data. If I were reviewing proposed code changes, I would frown upon this.

    The fantastic dictionary #

    After using lists for a while, I found myself struggling to create mappings of items. I didn’t come from a formal education in computer science, and I had not done any other programming aside from Bash (the Bourne Again Shell), so I didn’t have a concept of a dictionary at all. A helpful way to explain it is by thinking of dictionaries like the cellphone application used to store your contacts. You probably don’t remember every single phone number, although you know the names. Your contacts app behaves like a dictionary because it consists of a mapping between a name (the key) and a phone number (the value). Dictionaries are a mapping between some key and a value.

    There are some restrictions on what can be the key, but a string is the most common. In the cellphone contacts example, it would look like this:

    >>> contacts = {
    ...     'alfredo': '+3 678-677-0000',
    ...     'noah': '+3 707-777-9191'
    ... }
    

    Once defined, the dictionary has a few helper methods to interact with it. You can ask for the keys() and a list-like object will be returned, which can be iterated over. And just like you can ask for the keys, it is possible to retrieve the values from the dictionary by using values():

    >>> contacts.keys()
    dict_keys(['alfredo', 'noah'])
    >>> contacts.values()
    dict_values(['+3 678-677-0000', '+3 707-777-9191'])
    

    The objects returned look a bit odd because they are called dict_keys and dict_values respectively. When looping, treat them like a regular list (they can’t be indexed like one, though):

    >>> for name in contacts.keys():
    ...     print(name)
    ...
    alfredo
    noah
    >>> for phone in contacts.values():
    ...     print(phone)
    ...
    +3 678-677-0000
    +3 707-777-9191
    

    Some try to force these objects to be lists, thinking it is a requirement for iterating over them:

    >>> for name in list(contacts.keys()):
    ...     print(name)
    

    This isn’t necessary at all, and looping over these objects returned works fine as it is. There is no need to be redundant here.
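
    Two related cases are worth sketching. The items() method, which is not used elsewhere in this chapter, yields the key and the value together. And a list copy of the keys does become useful in one situation: when the dictionary is changed inside the loop, which is not allowed while iterating over the dictionary itself:

```python
contacts = {
    'alfredo': '+3 678-677-0000',
    'noah': '+3 707-777-9191',
}

# items() yields (key, value) pairs, so both are available in one loop
for name, phone in contacts.items():
    print('{0}: {1}'.format(name, phone))

# A list copy is only needed when the dictionary changes inside the loop
for name in list(contacts.keys()):
    if name.startswith('n'):
        del contacts[name]

print(contacts)  # {'alfredo': '+3 678-677-0000'}
```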

    Before understanding dictionaries and wanting to map two items together, I used a list of pairs (which is really a list of lists). If you ever encounter a stream of information that might be returning pairs of items and you need to work with a dict, you can convert that into an actual dictionary. This is a neat trick I do:

    >>> list_of_lists = [['a', 1], ['b', 2], ['c', 3]]
    >>> list_of_lists
    [['a', 1], ['b', 2], ['c', 3]]
    >>> dict(list_of_lists)
    {'a': 1, 'b': 2, 'c': 3}
    

    Calling dict() on the list of lists produces a dictionary with the mappings. There are a couple of caveats with this approach. One is that you must be sure that the keys are unique. If they aren’t, only the last value for each repeated key is kept, which is not what you might be expecting to happen:

    >>> list_of_lists = [['a', 1], ['a', 2], ['a', 3]]
    >>> dict(list_of_lists)
    {'a': 3}
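
    If every value for a repeated key has to survive, one possible approach (a sketch, not the only way) is to collect the values in a list per key. The grouped name is made up for this example:

```python
list_of_lists = [['a', 1], ['a', 2], ['b', 3]]

grouped = {}
for key, value in list_of_lists:
    # setdefault() returns the existing list for the key, or stores a new one
    grouped.setdefault(key, []).append(value)

print(grouped)  # {'a': [1, 2], 'b': [3]}
```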
    

    The other potential issue you need to be aware of is that the nested lists must be pairs. That is, they need to be lists (or tuples!) of two items. Anything more, or less, will cause a ValueError with a message that doesn’t tell us much about what is going on:

    >>> list_of_lists = [['a', 1], ['b', 2], ['c', 3], ['d', 4, 5]]
    >>> dict(list_of_lists)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: dictionary update sequence element #3 has length 3; 2 is required
    

    Dictionary as a database #

    You can think of a dictionary as a nice little database. You store items in it with a key, and you can retrieve them later. There are two ways of retrieving values from a dictionary: one is using square brackets, and the other one is using the .get() method. They both work the same except for their error handling. To retrieve the value of a key, you must pass the key:

    >>> user = {'name': 'alfredo', 'surname': 'deza', 'username': 'alfredodeza'}
    >>> user['name']
    'alfredo'
    >>> user['surname']
    'deza'
    

    An error (KeyError) is raised if the key doesn’t exist:

    >>> user['address']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'address'
    

    The get() method works in a similar way, except it doesn’t produce an exception if the key is not found. It simply returns None, with the option of defining a fallback return value:

    >>> user.get('name')
    'alfredo'
    >>> user.get('surname')
    'deza'
    >>> user.get('address')
    >>>
    >>> user.get('address') is None
    True
    >>> user.get('address', 'Unknown address!')
    'Unknown address!'
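
    The fallback value is handy for more than messages. As a small sketch, counting how many times each name appears works without checking first whether the key exists:

```python
names = ['alfredo', 'noah', 'alfredo', 'alfredo']

counts = {}
for name in names:
    # A missing name falls back to 0 instead of raising KeyError
    counts[name] = counts.get(name, 0) + 1

print(counts)  # {'alfredo': 3, 'noah': 1}
```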
    

    At the beginning of this chapter, I explained how Noah and I were discussing a cool idea to try out, which ended up being a small command-line tool to find duplicates. Playing off of that idea, let’s create a small script that traverses the current path and finds the largest files. Python has a useful function for this: os.walk. When looped over, it produces a tuple with three items for every directory it visits: the directory being visited, a list with all the directories found in it, and a list with all the files in that directory. This example calls it with the current path and stops the iteration on the first loop, showing the contents of what we are dealing with:

    import os
    
    for path_info in os.walk('.'):
        print(path_info)
        break
    

    Save this as a file called sizes.py and run it. In my case, I’m in a Python virtual environment, and I get information about the top-level directory:

    $ python sizes.py
    ('.', ['bin', 'include', 'lib'], ['pyvenv.cfg', 'sizes.py'])
    

    The first item is the current working directory, which has three directories, as shown by the list in the second position. Two files are there as well, including sizes.py, which I just executed. This means that at least another nested loop needs to happen. With the information provided, the tool can show absolute paths, and since one of the constraints of paths in a filesystem is that they need to be unique, I can use them as keys in a dictionary. Modify the script to provide absolute paths instead of printing the whole path information tuple:

    import os
    from os.path import abspath, join
    
    for top_dir, directories, files in os.walk('.'):
        for directory in directories:
            print(abspath(join(top_dir, directory)))
        for _file in files:
            print(abspath(join(top_dir, _file)))
        break
    

    With more information available, the script imports a couple of utilities: abspath and join, so that the top directory can be joined with the current directory or file, and the absolute path can be retrieved from the result. A break is in place at the end so that it only processes the first directory level and avoids printing every single file within. Rerun it to see what it does:

    $ python sizes.py
    /private/tmp/venv/bin
    /private/tmp/venv/include
    /private/tmp/venv/lib
    /private/tmp/venv/pyvenv.cfg
    /private/tmp/venv/sizes.py
    

    Great, we now have absolute paths to directories and files. To simplify things a little bit, let’s concentrate on files only, and calculate their size. This gets rid of the directory loop and requires an additional call to a helper that calculates the size of the file:

    import os
    from os.path import abspath, join, getsize
    
    for top_dir, directories, files in os.walk('.'):
        for _file in files:
            full_path = abspath(join(top_dir, _file))
            size = getsize(full_path)
            print('Full path: {0}, size: {1}'.format(full_path, size))
        break
    

    Execute the script once more and check the output:

    $ python sizes.py
    Full path: /private/tmp/venv/pyvenv.cfg, size: 75
    Full path: /private/tmp/venv/sizes.py, size: 288
    

    The sizes are in bytes, which seems fine for these small files but can get out of hand with larger files. Let’s not worry about that for now, and create a dictionary to have each key be the full path, and the value be the size. The idea is to capture the ten largest files, so the script should not report anything more than that:

    import os
    from os.path import abspath, join, getsize
    
    sizes = {}
    
    for top_dir, directories, files in os.walk('.'):
        for _file in files:
            full_path = abspath(join(top_dir, _file))
            size = getsize(full_path)
            sizes[full_path] = size
    

    Now that the dictionary sizes is defined, it is used instead of printing the information. The sizes[full_path] = size line means that the path is the key, and the size is the value. The break is gone as well, so the script, as it is, no longer prints information and processes all of the paths under the current directory. Modify it one last time by adding these lines at the end to sort by size and print the top ten results:

    sorted_results = sorted(sizes, key=sizes.get, reverse=True)
    
    for path in sorted_results[:10]:
        print("Path: {0}, size: {1}".format(path, sizes[path]))
    

    The sorted built-in looks into the dictionary and sorts the keys based on the values, as defined by key=sizes.get. Use reverse=True so that the largest values are first. In the final loop, instead of going through every path, slice the list for the first ten items, and lastly, when printing, use the path provided but query the dictionary again for the size value of that path. Asking the dictionary again for the value is needed because sorted provides a list with the paths in the order needed, but it doesn’t provide the values. Those need to be asked for again.

    Running the script again gives this result:
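
    An alternative that avoids the second lookup is to sort the (path, size) pairs that items() provides. The following is a sketch under that assumption, with the whole script wrapped in a function; largest_files and its parameters are names made up for this example and are not part of the original script:

```python
import os
from os.path import abspath, getsize, join


def largest_files(root='.', limit=10):
    """Return (path, size) pairs for the largest files under root."""
    sizes = {}
    for top_dir, directories, files in os.walk(root):
        for _file in files:
            full_path = abspath(join(top_dir, _file))
            sizes[full_path] = getsize(full_path)
    # items() yields (path, size) pairs; sort on the size, largest first
    pairs = sorted(sizes.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]
```

    Looping over largest_files('.') gives both the path and the size in each iteration, so nothing has to be asked of the dictionary a second time.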

    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/certifi/cacert.pem, size: 282085
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/pyparsing.py, size: 245385
    Path: /private/tmp/venv/lib/python3.8/site-packages/setuptools/_vendor/pyparsing.py, size: 232055
    Path: /private/tmp/venv/lib/python3.8/site-packages/pkg_resources/_vendor/pyparsing.py, size: 232055
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/idna/uts46data.py, size: 198292
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/html5lib/html5parser.py, size: 118963
    Path: /private/tmp/venv/lib/python3.8/site-packages/pkg_resources/__init__.py, size: 108309
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/pkg_resources/__init__.py, size: 107910
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/distlib/t64.exe, size: 102912
    Path: /private/tmp/venv/lib/python3.8/site-packages/pip/_vendor/distlib/w64.exe, size: 99840
    

    The biggest file is cacert.pem, which is about 275 kilobytes (counting 1024 bytes per kilobyte) on my system.
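
    Converting bytes into something easier to read can be sketched with a small helper. The name human_size, the units chosen, and the use of 1024 as the step between units are all assumptions for this example:

```python
def human_size(size_in_bytes):
    """Format a byte count using steps of 1024 (hypothetical helper)."""
    size = float(size_in_bytes)
    for unit in ['bytes', 'KB', 'MB', 'GB']:
        if size < 1024:
            return '{0:.1f} {1}'.format(size, unit)
        size = size / 1024
    # Anything that survived four divisions is expressed in terabytes
    return '{0:.1f} TB'.format(size)


print(human_size(282085))  # 275.5 KB
print(human_size(75))      # 75.0 bytes
```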

    Sets #

    Sets are only useful when trying to ensure unique items are preserved. Before sets were available, it was common to process items and check if they existed in a list (or dictionary) before adding them. For example:

    >>> unique = []
    >>> for name in ['alfredo', 'alfredo', 'noah', 'alfredo']:
    ...     if name not in unique:
    ...         unique.append(name)
    ...
    >>> unique
    ['alfredo', 'noah']
    

    There is no need to do this when using sets. Create a set, and instead of appending, add to it:

    >>> unique = set()
    >>> for name in ['alfredo', 'alfredo', 'noah', 'alfredo']:
    ...     unique.add(name)
    ...
    >>> unique
    {'noah', 'alfredo'}
    

    Compared to tuples and lists, sets have some differences in how you access their items. You can’t index them like lists and tuples, and they don’t keep any particular order, but you can iterate over them without issues. The only reason I use sets is to ensure there aren’t any duplicates. If that is not needed, a list is preferable.
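
    Tying this back to the duplicate-finding idea from the start of the chapter, here is a brief sketch of two common uses: removing duplicates in one step by passing a list to set(), and tracking what was already seen in order to report the repeated names. The seen and duplicates names are made up for this example:

```python
names = ['alfredo', 'alfredo', 'noah', 'alfredo']

# Passing a list to set() removes the duplicates in one step
unique = set(names)
print(len(unique))        # 2
print('noah' in unique)   # True

# Track what was already seen to report which names are repeated
seen = set()
duplicates = set()
for name in names:
    if name in seen:
        duplicates.add(name)
    seen.add(name)

print(duplicates)  # {'alfredo'}
```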