Chapter 07: Managed Machine Learning Systems #
Jupyter Notebook Workflow #
Jupyter notebooks are increasingly the hub of both Data Science and Machine Learning projects. All major vendors have some form of Jupyter integration. Some tasks orient in the direction of engineering, and others in the direction of science.
An excellent example of a science-focused workflow is the traditional notebook-based Data Science workflow. First, data is collected; it could be anything from a SQL query to a CSV file hosted in GitHub. Next comes EDA (exploratory data analysis), using visualization, statistics, and unsupervised machine learning. Finally, an ML model is built, and conclusions are drawn.
This methodology often fits very well into a Markdown-based workflow where each section is a Markdown heading. Often that Jupyter notebook is kept in source control. Is this notebook source code or a document? This distinction is an important consideration, and it is best to treat it as both.
DevOps for Jupyter Notebooks #
DevOps is a popular technology best practice, and it is often used in combination with Python. The center of the universe for DevOps is the build server. This guardian of the software galaxy facilitates automation. This automation includes linting, testing, reporting, building, and deploying code. This process is called continuous delivery.
The benefits of continuous delivery are many. First, tested code is always in a deployable state. Automation of best practices then creates a cycle of constant improvement in a software project. If you are a data scientist, a question should crop up: isn’t a Jupyter notebook source code too? Wouldn’t it benefit from these same practices? The answer is yes.
This diagram shows a proposed best-practices directory structure for a Jupyter-based project in source control. The `Makefile` holds the recipes to build, run, and deploy the project via `make` commands: `make test`, etc. The `Dockerfile` contains the actual runtime logic, which makes the project genuinely portable.
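As a concrete sketch, the `Makefile` recipes described above might look like the following (the targets and filenames are assumptions for illustration, not a definitive layout):

```make
install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt

lint:
	hadolint Dockerfile
	pylint --disable=R,C *.py

test:
	python -m pytest --nbval notebook.ipynb

all: install lint test
```

With this in place, `make all` reproduces locally what the build server does on every change.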
```dockerfile
FROM python:3.7.3-stretch

# Working Directory
WORKDIR /app

# Copy source code to working directory
COPY . app.py /app/

# Install packages from requirements.txt
# hadolint ignore=DL3013
RUN pip install --upgrade pip &&\
    pip install --trusted-host pypi.python.org -r requirements.txt

# Logic to run Jupyter could go here...
# Expose port 8888
#EXPOSE 8888

# Run app.py at container launch
#CMD ["jupyter", "notebook"]
```
The Jupyter notebook itself can be tested via the `nbval` plugin, as shown:

```shell
python -m pytest --nbval notebook.ipynb
```
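The `nbval` plugin re-executes each notebook cell and compares the fresh output with the stored output, failing the build on a mismatch or exception. Independent of `nbval`, assertion cells make a notebook self-checking; here is a minimal sketch (the `validate` helper and the sample rows are hypothetical):

```python
def validate(records):
    """Sanity-check a notebook's data-ingest step; raise on bad data."""
    assert records, "ingest produced no rows"
    assert all("value" in r for r in records), "missing 'value' field"
    return round(sum(r["value"] for r in records), 1)

# Hypothetical ingested rows; a real notebook would load these from disk.
records = [{"id": 1, "value": 3.5}, {"id": 2, "value": 7.1}]
print(validate(records))  # 10.6
```

A cell like this raises immediately when upstream data drifts, so the build server catches the problem instead of a reader.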
The requirements for the project are in a `requirements.txt` file. Every time the project is changed, the build server picks up the change and runs tests on the Jupyter notebook cells themselves.
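One way to wire this up on a build server is a minimal GitHub Actions workflow; this is a sketch, and the workflow name, Python version, and filenames are assumptions:

```yaml
name: notebook-ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python -m pytest --nbval notebook.ipynb
```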
DevOps isn’t just for software-only projects. DevOps is a best practice that fits well with the ethos of Data Science. Why guess whether your notebook works, whether your data is reproducible, or whether your project can deploy?
AWS SageMaker Overview #
One of the most prevalent managed ML systems is AWS SageMaker. This platform is a complete solution for an organization that wants to build and maintain large-scale Machine Learning projects. SageMaker makes heavy use of the concept of MLOps (Machine Learning Operations).
AWS SageMaker Elastic Architecture #
There is a lot involved in a large-scale SageMaker architecture. Take a look at how each component in the diagram serves a purpose in a Cloud-native fashion.
For further analysis of this architecture, use this reference to analyze US census data for population segmentation using Amazon SageMaker.
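The census segmentation reference reduces county features and then clusters them. The same pattern can be sketched locally with scikit-learn in place of SageMaker's built-in estimators; the synthetic data, shapes, and cluster count below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for county feature rows

# Scale, reduce dimensionality, then cluster into segments.
X_scaled = MinMaxScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_scaled)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels.shape)  # (200,)
```

Each row ends up with a segment label from 0 to 3, which is the shape of output the segmentation workflow produces.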
Finally, you can learn to use AWS SageMaker to perform County Census Clustering in the following screencast.
Video Link: https://www.youtube.com/watch?v=H3AcLM_8P4s
Topic: Build a Sagemaker based Data Science project
Estimated time: 45 minutes
People: Individual or Final Project Team
Slack Channel: #noisy-exercise-chatter
Part A: Get the airline data into your own SageMaker.
Part B: Perform the Data Science workflow:
- Ingest: Process the data
- EDA: Visualize and Explore data
- Model: Create some form of a model
Part C: Consider trying multiple visualization libraries: Plotly, Vega, Bokeh, and Seaborn
Part D: Download the notebook and upload it into Colab, then check the notebook into a GitHub portfolio repo. *Hint: You may want to truncate the data and upload a small sample to GitHub using the Unix `shuf` command:

```shell
shuf -n 100000 en.openfoodfacts.org.products.tsv > 10k.sample.en.openfoodfacts.org.products.tsv
```
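If you prefer to stay in Python, the same down-sampling can be done with pandas; the DataFrame here is a synthetic stand-in, and with the real dataset you would use `pd.read_csv` on the TSV file instead:

```python
import pandas as pd

# Synthetic stand-in for the full dataset.
full = pd.DataFrame({"id": range(1000), "value": range(1000)})

# Keep a small, reproducible random sample that fits in a GitHub repo.
sample = full.sample(n=100, random_state=42)
sample.to_csv("sample.tsv", sep="\t", index=False)
print(len(sample))  # 100
```

Fixing `random_state` makes the sample reproducible, which matters if the notebook's EDA conclusions are checked on the build server.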
Azure ML Studio Overview #
Another ML platform that has compelling features is Azure ML Studio. It shares many of the same ideas as AWS Sagemaker.
Learn to use Azure ML Studio to perform AutoML in the following screencast.
Video Link: https://www.youtube.com/watch?v=bJHk0ZOVm4s
Google AutoML Computer Vision #
Another compelling managed platform is Google AutoML Computer Vision. It can automatically classify images and then export the model in `tflite` format to an edge device like the Coral TPU USB stick.
Learn to use Google AutoML to perform Computer Vision in the following screencast.
Video Link: https://www.youtube.com/watch?v=LjERy-I5lyI
This chapter dives into the role of platform technology in Machine Learning, especially at scale. All major cloud vendors have compelling Machine Learning platforms that can significantly reduce the complexity of problems for organizations doing Machine Learning. An emerging trend is the use of MLOps and DevOps in solving these Machine Learning problems.