Chapter 12: Essays #

Noah Gift

This section contains essays I have written at various points about testing. Essays written many years ago are good historical records of the ideas, but the libraries they mention may differ from the libraries referred to in this book.

Writing clean, testable, high quality code in Python #

And how to measure scientifically the difference

originally published on IBM Developerworks, 2010

Introduction #

Writing software is among the most complicated endeavors a human can undertake. Brian Kernighan, co-author of the AWK programming language and “K and R C”, summed up the true nature of software development in the book Software Tools when he stated, “Controlling complexity is the essence of software development.” The harsh reality of real-world software development is that software is often created with intentional, or unintentional, complexity and a disregard for maintainability, testability, and quality. The end result of this unfortunate reality is software that becomes increasingly difficult and expensive to maintain and that fails sporadically and even spectacularly.

The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.

A successful software developer will devise a way to run the tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable. Although Python the language, and Python the community, are heavily influenced by a desire to write clean, maintainable code that works, it is still quite easy to do the exact opposite. In this article, we will tackle this problem head on and explore how to write clean, testable, high quality code in Python.

A clean code hypothetical problem #

The best way to demonstrate this style of development is to solve a hypothetical problem. Let’s suppose you are a back-end web developer at a company that allows users to generate reviews, and you need to come up with a way to show and highlight small snippets of those reviews. One way to approach the problem would be to write a large function that takes a snippet of text and query parameters, and returns a character-limited snippet with the query parameters highlighted. All of the logic needed to solve the problem would be included in the one “mega” function, and you would simply keep rerunning your script until you got the result you wanted. The format would probably look like the code example below and would often be developed with a combination of print statements, or logging statements, and an interactive shell.

{caption: “Listing 1. Messy code”}

def my_mega_function(snippet, query):
    """This takes a snippet of text, and a query parameter and returns """

    # Logic goes here, and often runs on for several hundred lines
    # There are often deeply nested conditional statements and loops
    # Function could reach several hundred, if not thousands of lines

    return result

With a dynamic language like Python, Perl, or Ruby, it is easy to develop software by simply banging away at the problem, often interactively, until you get what seems to be the correct result and calling it a day. Unfortunately, this approach, while tempting, often leads to a false sense of accomplishment that is fraught with danger. Much of the danger lies in not designing a solution to be testable, and part lies in not properly controlling the complexity of the software written.

How can you say this function even works? You can have faith that it works because it worked the last time you ran it during development, but are you sure it doesn’t contain subtle errors of logic or syntax? What happens if you need to change the code? Would it still work, and how would you know it still worked? What if that code needed to be maintained by another developer, and he needed to make changes to it? How would he know his changes didn’t cause something subtle to break? How hard would it be for him to understand what the code does?

The short answer is: if you don’t have tests, you don’t know if your software works. If you stack together enough guesses, you may eventually build something that appears to function, but that no human could say with certainty ever worked properly. This is a bad place to be, and I have both written this kind of software and helped debug software written this way. Fortunately, this condition is easily avoidable. Writing tests before you write your logic, as in Test Driven Development, or while you write it, actually shapes the way code is written. It leads to modular, extensible code that is easy to test, understand, and maintain. It is immediately apparent to the experienced developer when software was developed with testing in mind, and when it was not. The software itself looks dramatically different to the trained eye.

Without simply taking my word for it, or visually inspecting code, there are ways to measure scientifically the difference between these two styles. The first way is to actually measure the lines of code that are tested. Nose is a popular extension of Python’s unit test framework that includes an easy way to automatically run a batch of tests, along with plug-ins such as code coverage. By measuring code coverage during development, it quickly becomes apparent that it is almost impossible to get 100 percent test coverage for code that is composed of large functions, with highly nested logic, built in an ad hoc manner.

The second way to measure the difference is to use static analysis tools. There are several popular Python tools that measure various metrics for Python developers, ranging from general code quality to specific metrics, like duplicate code or complexity. You can measure the cyclomatic complexity of your code with either pygenie or pymetrics (see resources on the right). Here is an example of what it looks like when we run pygenie on “clean” code that is relatively simple:

{caption: “Listing 2. Pygenie output of cyclomatic complexity”}

% python pygenie.py complexity --verbose highlight.py
File: /Users/ngift/Documents/src/highlight.py
Type Name                                                          Complexity
------------------------------------------------------------------------------
M    HighlightDocumentOperations._create_snippit                            3
M    HighlightDocumentOperations._reconstruct_document_string               3
M    HighlightDocumentOperations._doc_to_sentences                          2
M    HighlightDocumentOperations._querystring_to_dict                       2
M    HighlightDocumentOperations._word_frequency_sort                       2
M    HighlightDocumentOperations.highlight_doc                              2
X    /Users/ngift/Documents/src/highlight.py                                1
C    HighlightDocumentOperations                                            1
M    HighlightDocumentOperations.__init__                                   1
M    HighlightDocumentOperations._custom_highlight_tag                      1
M    HighlightDocumentOperations._score_sentences                           1
M    HighlightDocumentOperations._multiple_string_replace                   1

As you can tell from the example, every method is extremely simple and has a complexity rating under 10, which is desirable according to McCabe’s research. In my experience, I have seen “mega” functions written without testing that had complexity ratings over 140 and stretched over 1,200 lines. Suffice it to say, it is effectively impossible to test code like this: there is no way to ever know it works, and refactoring it is equally out of the question. If the author of the code had kept testing in mind, and written the same logic with 100 percent test coverage, it is highly unlikely it would have such a high complexity rating.

What is cyclomatic complexity? #

Cyclomatic complexity is a software metric, developed by Thomas J. McCabe in 1976, to determine a program’s complexity. The metric measures the number of linearly independent paths, or branches, through source code. According to McCabe, it is best to keep the complexity of a method below 10. This is important because research into human memory has determined that 7 (plus or minus 2) is the magical number of items that a human can hold in short term memory.

If a developer is working on code that has 50 linearly independent paths, they are exceeding roughly five times the capacity of short term memory while keeping track of what is occurring in that method. Simpler methods that don’t tax all of a human’s short term memory are easier to work with and have been shown to be less error prone. A 2008 study by Enerjy found a strong correlation between cyclomatic complexity and faultiness: classes with a complexity of 11 had a 0.28 probability of being fault-prone, and that probability rose to 0.98 for classes with a complexity of 74.
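
To make the metric concrete, McCabe’s number can be computed as V(G) = E - N + 2P, where E is the number of edges in the control-flow graph, N is the number of nodes, and P is the number of connected components; in practice it works out to one plus the number of decision points. The small function below is my own illustration, not code from the article, and has a cyclomatic complexity of 3:

def classify_score(score):
    """Two decision points, so cyclomatic complexity is 1 + 2 = 3."""
    if score > 90:       # first decision point
        return "excellent"
    elif score > 50:     # second decision point
        return "average"
    return "poor"        # fall-through path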

A clean code hypothetical solution #

Let’s now take a look at a complete source code example with accompanying unit tests and functional tests and see what it actually does, and why this code is considered clean. One reasonable definition of clean, using strictly metrics, is that it fulfills the following requirements: it has close to 100 percent test coverage; it has a cyclomatic complexity rating of under 10 for all classes and methods; and it scores close to a 10.0 rating with pylint. Here is an example of using nose to run the unit tests and doctests, with coverage reporting, on the highlight module:

{caption: “Listing 3. Running nosetests with coverage reporting: 100 percent coverage”}

% nosetests -v --with-coverage --cover-package=highlight --with-doctest\
     --cover-erase --exe

Doctest: highlight.HighlightDocumentOperations._custom_highlight_tag ... ok
test_functional.test_snippit_algorithm ... ok
test_custom_highlight_tag (test_highlight.TestHighlight) ... ok
Consumes the generator, and then verifies the result[0] ... ok
Verifies highlighted text is what we expect ... ok
test_multi_string_replace (test_highlight.TestHighlight) ... ok
Verifies the yielded results are what is expected ... ok

Name        Stmts   Exec  Cover   Missing
-----------------------------------------
highlight      71     71   100%   
----------------------------------------------------------------------
Ran 7 tests in 4.223s

OK

As you can see from the above snippet, the nosetests command was run with several options, and there was 100 percent test coverage for the highlight.py script. The only thing of real note to point out is that --cover-package=highlight tells nose to show the coverage report only for the specified module. This is very useful to isolate the output of a coverage report to the module or packages you want to observe coverage reporting on. One thing you may want to try is to download the source code from this article and comment out some of the tests to see how the coverage reporting mechanism really works.

{caption: “Listing 4. highlight.py”}

#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
:mod:`highlight` -- Highlight Methods
===================================

.. module:: highlight
   :platform: Unix, Windows
   :synopsis: highlight document snippets that match a query.
.. moduleauthor:: Noah Gift <noah.gift@gmail.com>

Requirements::
    1.  You will need to install the nltk library to run this code.
        http://www.nltk.org/download
    2.  You will need to download the data for nltk:
        See http://www.nltk.org/data::
        
        import nltk
        nltk.download()

"""

import re
import logging

import nltk

# Globals
logging.basicConfig()
LOG = logging.getLogger("highlight")
LOG.setLevel(logging.INFO)


class HighlightDocumentOperations(object):

    """Highlight Operations for a Document"""

    def __init__(self, document=None, query=None):
        """
        Kwargs:
            document (str):
            query (str):
            
        """
        self._document = document
        self._query = query

    @staticmethod
    def _custom_highlight_tag(phrase, start="<strong>", end="</strong>"):

        """Injects an open and close highlight tag after a word

        Args:
            phrase (str) - A word or phrase.
        Kwargs:
            start (str) - An opening tag.  Defaults to <strong>
            end (str) - A closing tag.  Defaults to </strong>
        Returns:
            (str) word or phrase with custom opening and closing tags
            
        >>> h = HighlightDocumentOperations()
        >>> h._custom_highlight_tag("foo")
        '<strong>foo</strong>'
        >>>
        
        """
        tagged_phrase = "{0}{1}{2}".format(start, phrase, end)
        return tagged_phrase

    def _doc_to_sentences(self):
        """Takes a string document and converts it into a list of sentences
        
        Unfortunately, this approach might be a tad naive for production
        because some segments that are split on a period are really an
        abbreviation, and to make things even more complicated, an
        abbreviation can also be the end of a sentence::
            http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
        
        Returns:
            (generator) A generator object of a tokenized sentence tuple,
            with the list position of sentence as the first portion of
            the tuple, such as:  (0, "This was the first sentence")
        
        """

        tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
        sentences = tokenizer.tokenize(self._document)
        for sentence in enumerate(sentences):
            yield sentence

    @staticmethod
    def _score_sentences(sentence, querydict):
        """Creates a scoring system for each sentence by substitution analysis
        
        Tokenizes each sentence, counts characters
        in sentence, and pass it back as nested tuple
    
        Returns:
            (tuple) - (score (int), (count (int), position (int),
                    raw sentence (str))
            
        """

        position, sentence = sentence
        count = len(sentence)
        regex = re.compile("|".join(map(re.escape, querydict)))
        score = len(re.findall(regex, sentence))
        processed_score = (score, (count, position, sentence))
        return processed_score

    def _querystring_to_dict(self, split_token="+"):
        """Converts query parameters into a dictionary
        
        Returns:
            (dict)- dparams, a dictionary of query parameters
            
        """

        params = self._query.split(split_token)
        dparams = dict([(key, self._custom_highlight_tag(key)) for key in params])
        return dparams

    @staticmethod
    def _word_frequency_sort(sentences):
        """Sorts sentences by score frequency, yields sorted result
        
        This will yield the highest score count items first.
        
        Args:
            sentences (list) - a nested tuple inside of list
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]

        """

        sentences.sort()
        while sentences:
            yield sentences.pop()

    def _create_snippit(self, sentences, max_characters=175):
        """Creates a snippet from a sentence while keeping it under max_chars 
        
        Returns a sorted list with max characters.  The sort is an attempt
        to rebuild the original document structure as close as possible,
        with the new sorting by scoring and the limitation of max_chars.
        
        Args:
            sentences (generator) - sorted object to turn into a snippit
            max_characters (int) - optional max characters of snippit
           
        Returns:
            snippit (list) - returns a sorted list with a nested tuple that
            has the first index holding the original position of the list::
            
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
            
        """

        snippit = []
        total = 0
        for sentence in self._word_frequency_sort(sentences):
            LOG.debug("Creating snippit %s", sentence)
            score, (count, position, raw_sentence) = sentence
            total += count
            if total < max_characters:
                # position now gets converted to index 0 for sorting later
                snippit.append(((position), score, count, raw_sentence))

        # try to reassemble document by original order by doing a simple sort
        snippit.sort()
        return snippit

    @staticmethod
    def _multiple_string_replace(string_to_replace, dict_patterns):
        """Performs a multiple replace in a string with dict pattern.
        
        Borrowed from Python Cookbook.
        
        Args:
            string_to_replace (str) - String to be multi-replaced
            dict_patterns (dict) - A dict full of patterns
            
        Returns:
            (str) - Multiple replaced string.
        
        """

        regex = re.compile("|".join(map(re.escape, dict_patterns)))

        def one_xlat(match):
            """Closure that is called repeatedly during multi-substitution.
            
            Args:
                match (SRE_Match object)
            Returns:
                partial string substitution (str)
            
            """

            return dict_patterns[match.group(0)]

        return regex.sub(one_xlat, string_to_replace)

    def _reconstruct_document_string(self, snippit, querydict):
        """Reconstructs string snippit, build tags, and return string
        
        A helper function for highlight_doc.
        
        Args:
            string_to_replace (list) - A list of nested tuples, containting
            this pattern::
            
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
            
            dict_patterns (dict) - A dict full of patterns
        
        Returns:
            (str) The most relevant snippet with the query terms highlighted.
        
        """

        snip = []
        for entry in snippit:
            score = entry[1]
            sent = entry[3]
            # if we have matches, now do the multi-replace
            if score:
                sent = self._multiple_string_replace(sent, querydict)
            snip.append(sent)
        highlighted_snip = " ".join(snip)

        return highlighted_snip

    def highlight_doc(self):
        """Finds the most relevant snippit with the query terms highlighted
        
        Returns:
            (str) The most relevant snippet with the query terms highlighted.
        
        """

        # tokenize to sentences, and convert query to a dict
        sentences = self._doc_to_sentences()
        querydict = self._querystring_to_dict()

        # process and score sentences
        scored_sentences = []
        for sentence in sentences:
            scored = self._score_sentences(sentence, querydict)
            scored_sentences.append(scored)

        # fit into max characters, and sort by original position
        snippit = self._create_snippit(scored_sentences)
        # assemble back into string
        highlighted_snip = self._reconstruct_document_string(snippit, querydict)

        return highlighted_snip

{caption: “Listing 6. test_functional.py”}

"""Functional Test That Performs Some Basic Sanity Checks"""

from highlight import HighlightDocumentOperations


def test_snippit_algorithm():
    document1 = """
        This place has awesome deep dish pizza.
        I have been getting delivery through Waiters on wheels for years.
        It is classic, deep dish  Chicago style pizza.
        Now I found out they also have half-baked to pick-up and cook at home.
        This is a great benefit. I am having it tonight. Yum.
        """
    document2 = """Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.  
Decor = freakin' dark.  I'm not sure how people can see their food.  
Parking = a real pain.  Good luck."""

    h1 = HighlightDocumentOperations(document1, "deep+dish+pizza")
    actual = h1.highlight_doc()
    print "Raw Document1: %s" % document1
    print " Formatted Document1: %s" % actual
    assert len(actual) < 500
    assert "<strong>" in actual

    h2 = HighlightDocumentOperations(document2, "deep+dish+pizza")
    actual = h2.highlight_doc()
    print "Raw Document2: %s" % document2
    print " Formatted Document2: %s" % actual
    assert len(actual) < 500
    assert "<strong>" in actual


if __name__ == "__main__":
    test_snippit_algorithm()

{caption: “Listing 7. test_unittest.py”}

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Tests that this query searches a document, highlights a snippet, and returns it
http://www.example.com/search?find_desc=deep+dish+pizza&ns=1&rpp=10&find_loc=\
                                                        San+Francisco%2C+CA

Contains both unit and functional tests.

"""


import unittest
from highlight import HighlightDocumentOperations


class TestHighlight(unittest.TestCase):
    def setUp(self):

        self.document = """
Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.  
Decor = freakin' dark.  I'm not sure how people can see their food.  
Parking = a real pain.  Good luck.        
        
        """
        self.query = "deep+dish+pizza"
        self.hdo = HighlightDocumentOperations(self.document, self.query)

    def test_custom_highlight_tag(self):

        actual = self.hdo._custom_highlight_tag("foo", start="[BAR]", end="[ENDBAR]")
        expected = "[BAR]foo[ENDBAR]"
        self.assertEqual(actual, expected)

    def test_query_string_to_dict(self):
        """Verifies the yielded results are what is expected"""

        result = self.hdo._querystring_to_dict()
        expected = {
            "deep": "<strong>deep</strong>",
            "dish": "<strong>dish</strong>",
            "pizza": "<strong>pizza</strong>",
        }

        self.assertEqual(result, expected)

    def test_multi_string_replace(self):

        query = """pizza = I've had better"""
        expected = """<strong>pizza</strong> = I've had better"""
        query_dict = self.hdo._querystring_to_dict()
        result = self.hdo._multiple_string_replace(query, query_dict)
        self.assertEqual(expected, result)

    def test_doc_to_sentences(self):
        """Consumes the generator, and then verifies the result[0]"""

        results = []
        expected = (0, "\nReview for their take-out only.")

        for sentence in self.hdo._doc_to_sentences():
            results.append(sentence)
        self.assertEqual(results[0], expected)

    def test_highlight(self):
        """Verifies highlighted text is what we expect"""

        expected = """Tried their large Classic (sausage, mushroom, peppers and onions) <strong>deep</strong> <strong>dish</strong>;and their large Pesto Chicken thin crust <strong>pizza</strong>s."""
        actual = self.hdo.highlight_doc()
        self.assertEqual(expected, actual)

    def tearDown(self):

        del self.query
        del self.hdo
        del self.document


if __name__ == "__main__":
    unittest.main()

Concerning the above code samples, if you would like to run them, you will need to install the Natural Language Toolkit (nltk) and download the nltk data according to the instructions. Since this article is not about the code sample shown but about how it was created, and how to test it, I won’t go into any detail explaining what the code actually does. Instead, let’s finish up by running the static code analysis tool pylint on our source code:

{caption: “Listing 8. Pylint”}

% pylint highlight.py
No config file found, using default configuration
************* Module highlight
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'unicode' has no 
    'tokenize' member (but some types could not be inferred)
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'ContextFreeGrammar' 
    has no 'tokenize' member (but some types could not be inferred)
W:108:HighlightDocumentOperations._score_sentences: Used builtin function 'map'
W:192:HighlightDocumentOperations._multiple_string_replace: Used builtin function 'map'
R: 34:HighlightDocumentOperations: Too few public methods (1/2)

Report
======
69 statements analysed.

Global evaluation
‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑
Your code has been rated at 8.12/10 (previous run: 8.12/10)

The code scored an 8.12 out of 10 and was docked for a few items. Pylint is configurable, so it is very likely that you will need to configure it to meet the needs of your project. You can refer to the official pylint documentation (see resources on the right). For this specific example, there are two errors on line 89 that can be attributed to the external library nltk, and there are two warnings that could be removed by a configuration change to pylint. In general, you will never want to allow pylint errors in your source code, but there are times, such as in the example above, when you may need to make an executive decision. It isn’t a perfect tool, but I have found it to be very useful in the real world.
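
If you do decide specific warnings are acceptable, pylint’s --disable flag (or a pylintrc file) can silence an entire message category or a specific message ID. The command below is only a sketch of the mechanism, assuming the module lives in highlight.py; it is not a recommendation to ignore warnings wholesale:

% pylint --disable=W highlight.py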

Conclusion #

In this article, we explored how merely thinking about testing influences the structure of software, and how a lack of thought toward testing can prove fatally harmful to a project. We showed a complete code example that included both functional and unit tests, and ran it through code coverage analysis with nose and two static analysis tools, pylint and pygenie. One thing we didn’t have time to cover was how to automate this with some form of continuous integration testing. Fortunately, this is quite simple with the open source Java™ Continuous Integration System, Hudson. I would encourage you to consult the Hudson documentation (see resources on the right) and experiment with setting up an automated build for your project that runs all of your tests, including static code analysis.

Finally, testing isn’t a panacea, nor are static analysis tools. Software development is hard work. To even have a chance of being successful, we have to stay mindful of the real goal, which is not only to solve a problem, but also to create something we can prove works. If you agree with this premise, then overly complex code, arrogance in design, and a lack of respect for the power of Python directly interfere with this goal.

Thanks to Kennedy Behrman, of Imagemovers Digital, for the technical review of this article.

Increase reliability in data science and machine learning projects with CircleCI #

Data science is no longer a niche topic at companies. Everyone from the CEO to the intern knows how valuable it is to take a scientific approach to dealing with data. Consequently, many people not directly in software engineering fields are starting to write more code, often in the form of interactive notebooks such as Jupyter. Software engineers have typically been huge advocates of build systems, static analysis of code, and repeatable processes that enforce quality. What about business people who are writing code in Jupyter notebooks? What processes can they use to make their data science, machine learning, and AI code more reliable?

Data science project quality #

originally published on CircleCI blog, Nov 1, 2018

One way to improve software quality for Data Science is to create a project structure that ensures quality and repeatability. To do this, some ideas can be taken from the traditional software engineering world. Brian Kernighan, co-author of the AWK programming language and “K and R C”, summarized the true nature of software development in the book Software Tools when he stated, “Controlling complexity is the essence of software development.”

In a previous article I wrote about software engineering project quality in Python, I said the following:

“The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.

A successful software developer will devise a way to run tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable.”

The same statement is true of data science projects; there needs to be an automated way to ensure quality is enforced. Fortunately, with a service like CircleCI and open source libraries this is easily achievable. In the sections below, this will be demonstrated step by step.

Data science project automated testing setup #

One of the best ways to have a proper automated testing setup for a Data Science project is to set it up properly from the start. What does that look like?

  • Create a GitHub project, like this example repo.
  • Create a .circleci directory with a config.yml file in it. This is an example config.yml you could refer to, and a minimal sketch is shown after this list.
  • Create a .gitignore file. It is important to ignore non-essential files.
  • Create a README.md file. A good README.md should be able to show how a user builds the project and what the project does. Including a badge that shows the status of the CircleCI build is very helpful as well, like this example.
  • Create a Makefile. A Makefile is a common way to run steps in a build process and has been around for decades… for a reason… they just work. We will cover how to set this up for a data science project.
  • Other important, though optional, files and directories are: a library directory, command-line tools, a requirements.txt file, and a tests directory.
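
As referenced in the list above, here is a minimal sketch of what a .circleci/config.yml could look like for this kind of project. It is only an illustration, not the exact file from the example repo; it assumes the Makefile targets described below and a Python 3 Docker image:

version: 2
jobs:
  build:
    docker:
      # assumed image tag; any recent Python 3 image works
      - image: circleci/python:3.6.4
    working_directory: ~/repo
    steps:
      - checkout
      - run:
          name: install dependencies
          command: |
            make setup
            . ~/.myrepo/bin/activate
            make install
      - run:
          name: run lint
          command: |
            . ~/.myrepo/bin/activate
            make lint
      - run:
          name: run tests
          command: |
            . ~/.myrepo/bin/activate
            make test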

A good place to start is to look at a Makefile as a template. The contents of myrepo/master/Makefile are shown below and can be found here:

setup:
	python3 -m venv ~/.myrepo

install:
	pip install -r requirements.txt

test:
	python -m pytest -vv --cov=myrepolib tests/*.py
	python -m pytest --nbval notebook.ipynb

lint:
	pylint --disable=R,C myrepolib cli web

all: install lint test

The key steps are: setup, install, test, lint, and all (runs everything). The setup step creates an optional virtual environment, which could later be sourced by running the command:

source ~/.myrepo/bin/activate

The install step, which can be run as make install, installs the packages listed in the requirements.txt file. An example is found here.
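
As an illustration only (the actual example repo may list different packages), a requirements.txt for this layout would name the tools the Makefile targets depend on:

pylint
pytest
pytest-cov
nbval
jupyter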

The lint step, which makes sense if libraries, command-line tools, or web apps are created, can be run with: make lint. It wouldn’t make sense to run it on just Jupyter notebooks, but it does help to maintain the quality of the code associated with the project. Below is an example output from lint that can also be found here:

(.myrepo) ➜  myrepo git:(master) ✗ make lint
pylint --disable=R,C myrepolib cli web
No config file found, using default configuration

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

The final and most important step is running make test. This uses pytest along with the nbval plugin. The output is shown below and can also be found here.

(.myrepo) ➜  myrepo git:(master) ✗ make test
python -m pytest -vv --cov=myrepolib tests/*.py
============================================================ test session starts ============================================================
platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0 -- /Users/noahgift/.myrepo/bin/python
cachedir: .cache
rootdir: /Users/noahgift/src/myrepo, inifile:
plugins: cov-2.5.1, nbval-0.7
collected 1 item                                                                                                                            

tests/test_myrepo.py::test_func PASSED                                                                                                [100%]

---------- coverage: platform darwin, python 3.6.4-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
myrepolib/__init__.py       1      0   100%
myrepolib/repomod.py       11      4    64%
-------------------------------------------
TOTAL                      12      4    67%

========================================================= 1 passed in 0.02 seconds ==========================================================
python -m pytest --nbval notebook.ipynb
============================================================ test session starts ============================================================
platform darwin -- Python 3.6.4, pytest-3.3.0, py-1.5.2, pluggy-0.6.0
rootdir: /Users/noahgift/src/myrepo, inifile:
plugins: cov-2.5.1, nbval-0.7
collected 4 items                                                                                                                           

notebook.ipynb ....                                                                                                                                 [100%]

===================================================================== warnings summary ======================================================================
notebook.ipynb::Cell 0
  /Users/noahgift/.myrepo/lib/python3.6/site-packages/jupyter_client/connect.py:157: RuntimeWarning: Failed to set sticky bit on '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T': [Errno 1] Operation not permitted: '/var/folders/vl/sskrtrf17nz4nww5zr1b64980000gn/T'
    RuntimeWarning,

-- Docs: http://doc.pytest.org/en/latest/warnings.html
=========================================================== 4 passed, 1 warnings in 2.08 seconds ============================================================

It is worth explaining how the nbval plugin works. In a nutshell, it runs the Jupyter notebook for you and ensures that all of the cells execute. There are two modes that can be used: one that checks the output of each cell, and one that doesn’t. The mode that checks the output of each cell can be tricky to get working, because cells often contain random images or output, and the tests will then fail on each subsequent run.
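
As a sketch of those two modes, using nbval’s flags: the default mode re-runs the notebook and compares each cell’s new output to the output saved in the file, while the lax mode only requires that every cell executes without error.

# strict: re-run the notebook and compare each cell's output to what is saved
python -m pytest --nbval notebook.ipynb

# lax: re-run the notebook and only require that every cell executes cleanly
python -m pytest --nbval-lax notebook.ipynb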

With all of that out of the way, there is very little left to do to get CircleCI running. That information is covered in the CircleCI docs. A final cherry on the sundae would be to get a badge working, which is also covered in the official documentation here, and there is an example in the repo shared for this article.

Summary #

This article showed how to bootstrap a data science project, set up the GitHub structure, run tests, and then send it off to CircleCI to do the build. There is a video I created on YouTube that shows exactly how to set up and test this project here. Another great resource would be to read about how I use CircleCI throughout the book Pragmatic AI: An Introduction to Cloud-based Machine Learning. Links to that are in the references.