Chapter 5: Cloud Storage #

If you find this content useful, consider buying this book:

  • Take the Duke/Coursera specialization

    Unlike a desktop computer, the cloud offers many choices for storage. These options range from object storage to flexible network file systems. This chapter covers these different storage types as well as methods to deal with them.

    Learn why Cloud Storage is essential in the following screencast.

    why Cloud Storage is important

    Video Link: https://www.youtube.com/watch?v=4ZbPAzlmpcI

    Cloud Storage Types #

    AWS is an excellent starting point to discuss the different storage options available in the cloud. You can see a list of the various storage options they provide here. Let’s address these options one by one.

    Object Storage #

    Amazon S3 is object storage with eleven 9’s (99.999999999%) of durability, meaning the chance of losing an object in a given year is vanishingly small. It is ideal for storing large objects like files, images, videos, or other binary data. It is often the central location used in a Data Processing workflow. A synonym for an object storage system is a “Data Lake.”

    Learn how to use Amazon S3 in the following screencast.

    Use Amazon S3

    Video Link: https://www.youtube.com/watch?v=BlWfOMmPoPg
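
    As a minimal sketch of working with S3 from Python, the following uses the boto3 library; the bucket name and object keys are hypothetical and assume credentials are already configured.

    import boto3

    s3 = boto3.client("s3")

    # Upload a local file to a bucket (bucket name is hypothetical)
    s3.upload_file("payload.csv", "my-data-lake-bucket", "raw/payload.csv")

    # List the objects stored under the "raw/" prefix
    response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])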

    Learn what a Data Lake is in the following screencast.

    what a Data Lake is

    Video Link: https://www.youtube.com/watch?v=fmsG91EgbBk

    File Storage #

    Many cloud providers now offer scalable, elastic file systems. AWS provides the Amazon Elastic File System (EFS), and Google offers Filestore. These file systems provide high-performance, fully managed file storage that can be mounted by multiple machines. They can serve as the central component of NFSOPS, or Network File System Operations, where the file system stores the source code, the data, and the runtime.

    Learn about Cloud Databases and Cloud Storage in the following screencast.

    Cloud Databases and Cloud Storage

    Video Link: https://www.youtube.com/watch?v=-68k-JS_Y88

    Another option with Cloud Databases is to use serverless databases, such as AWS Aurora Serverless. Many databases in the Cloud work in a serverless fashion, including Google BigQuery and AWS DynamoDB. Learn to use AWS Aurora Serverless in the following screencast.

    AWS Aurora Serverless

    Video Link: https://www.youtube.com/watch?v=UqHz-II2jVA
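
    For a sense of how serverless querying looks in code, here is a minimal sketch using the Aurora Serverless Data API through boto3; the cluster ARN, secret ARN, database name, and table are all hypothetical.

    import boto3

    rds_data = boto3.client("rds-data")

    # The Data API lets you run SQL without managing database connections
    response = rds_data.execute_statement(
        resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-serverless-cluster",
        secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret",
        database="mydb",
        sql="SELECT COUNT(*) FROM users",
    )
    print(response["records"])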

    Block Storage #

    Block storage is similar to the hard drive storage on a workstation or laptop, but virtualized. This virtualization allows the storage to grow in size and performance. It also means a user can “snapshot” storage and use it for backups or operating system images. Amazon offers block storage through a service called Amazon Elastic Block Store (EBS).
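
    A minimal sketch of snapshotting an EBS volume with boto3 follows; the volume ID is hypothetical.

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot an existing EBS volume for backup (volume ID is hypothetical)
    snapshot = ec2.create_snapshot(
        VolumeId="vol-0abcd1234efgh5678",
        Description="Nightly backup of the data volume",
    )
    print(snapshot["SnapshotId"], snapshot["State"])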

    Other Storage #

    There are various other storage types in the cloud, including backup systems, data transfer systems, and edge computing services. For example, AWS Snowmobile can transfer 100 PB (yes, petabytes) of data in a shipping container.

    Data Governance #

    What is Data Governance? It is the ability to “govern” the data. Who can access the data, and what can they do with it, are essential questions in data governance. Data Governance is an emerging discipline, and job title, due to the importance of storing data securely in the cloud.

    Learn about Data Governance in the following screencast.

    Data Governance

    Video Link: https://www.youtube.com/watch?v=cCUiHBP7Bts

    Learn about AWS Security in the following screencast.

    AWS Security

    Video Link: https://www.youtube.com/watch?v=I8FeP_FY9Rg

    Learn about AWS Cloud Security IAM in the following screencast.

    AWS Cloud Security IAM

    Video Link: https://www.youtube.com/watch?v=_Xf93LSCECI

    Highlights of a Data Governance strategy include the following.

    PLP (Principle of Least Privilege) #

    Are you limiting permissions by default, or giving access to everything? This security principle is called PLP, and it refers to providing a user only what they need. An excellent real-life analogy is not giving the mail delivery person access to your house, but only to the mailbox.

    Learn about PLP in the following screencast.

    PLP

    Video Link: https://www.youtube.com/watch?v=cIRa4P24sf4
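
    A hedged sketch of PLP in practice with boto3 follows: instead of granting s3:* on every bucket, the policy below allows read-only access to a single prefix. The bucket and policy names are hypothetical.

    import json

    import boto3

    iam = boto3.client("iam")

    # A least-privilege policy: read-only access to one S3 prefix only
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": "arn:aws:s3:::my-data-lake-bucket/raw/*",
            }
        ],
    }

    iam.create_policy(
        PolicyName="read-only-raw-data",
        PolicyDocument=json.dumps(policy_document),
    )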

    Audit #

    Is there an automated auditing system? How do you know when a security breach has occurred?
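
    One building block, sketched below with boto3, is querying AWS CloudTrail for recent events; a real auditing system would feed these events into an automated alerting pipeline rather than printing them.

    import datetime

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Look back over the last day of management events for console logins
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=1),
        EndTime=datetime.datetime.utcnow(),
    )
    for event in events["Events"]:
        print(event["EventTime"], event.get("Username"))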

    PII (Personally Identifiable Information) #

    Is the system avoiding the storage of Personally Identifiable Information?

    Data Integrity #

    How are you ensuring that your data is valid and not corrupt? Would you know if tampering occurred?
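
    One common building block, shown as a sketch below, is storing a cryptographic checksum next to the data and recomputing it on read; the file name is hypothetical.

    import hashlib

    def sha256_checksum(path, chunk_size=65536):
        """Compute a SHA-256 checksum so the file can be verified later."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Store the checksum alongside the data; recompute and compare on read
    # to detect corruption or tampering.
    print(sha256_checksum("payload.csv"))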

    Disaster Recovery #

    What is your disaster recovery plan, and how do you know it works? Did you test the backups through a recurring restore process?

    Encrypt #

    Do you encrypt data in transit and at rest? Who has access to the encryption keys? Do you audit encryption events, such as the decryption of sensitive data?
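
    As a minimal sketch, the following asks S3 to encrypt an object at rest with a KMS key via boto3; the bucket name and key alias are hypothetical, and boto3 itself uses TLS for data in transit.

    import boto3

    s3 = boto3.client("s3")

    # Server-side encryption with a customer-managed KMS key (names are hypothetical)
    s3.put_object(
        Bucket="my-data-lake-bucket",
        Key="sensitive/report.csv",
        Body=open("report.csv", "rb"),
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/my-data-key",
    )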

    Model Explainability #

    Are you sure you could recreate the model? Do you know how it works, and is it explainable?

    Data Drift #

    Do you measure the “drift” of the data used to create Machine Learning models? Microsoft Azure has a good set of documentation about data drift, which is a helpful starting point for learning about the concept.
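
    As a rough illustration of the idea (not Azure’s implementation), a drift check can be as simple as comparing the distribution of a feature in the training data against newly arriving data; the feature values below are simulated.

    import numpy as np
    from scipy import stats

    # Simulated feature values: what the model trained on vs. what arrives today
    training_feature = np.random.normal(loc=50, scale=10, size=10_000)
    todays_feature = np.random.normal(loc=58, scale=10, size=10_000)

    # A two-sample Kolmogorov-Smirnov test flags a shift in the distribution
    statistic, p_value = stats.ks_2samp(training_feature, todays_feature)
    if p_value < 0.01:
        print(f"Possible data drift detected (KS statistic={statistic:.3f})")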

    Cloud Databases #

    A big takeaway in the cloud is that you don’t have to start with a relational database. Werner Vogels, the CTO of Amazon, brings up some of the options available in the blog post A one size fits all database doesn’t serve anyone.

    Source: allthingsdistributed.com

    Learn about one size doesn’t fit all in the following screencast.

    one size doesn’t fit all

    Video Link: https://www.youtube.com/watch?v=HkequkfOIE8

    Key-Value Databases #

    An excellent example of a serverless key/value database is DynamoDB. Another famous NoSQL example is MongoDB, which is a document database.


    How could you query this type of database in pure Python?

    import logging

    import boto3

    # Module-level setup assumed by this snippet (region and table name are examples)
    REGION = "us-east-1"
    POLICE_DEPARTMENTS_TABLE = "police_departments"
    log = logging.getLogger(__name__)


    def dynamodb_resource():
        """Create a DynamoDB resource handle in the configured region"""

        return boto3.resource("dynamodb", region_name=REGION)


    def query_police_department_record_by_guid(guid):
        """Gets one record in the PD table by guid

        In [5]: rec = query_police_department_record_by_guid(
            "7e607b82-9e18-49dc-a9d7-e9628a9147ad"
            )

        In [7]: rec
        Out[7]:
        {'PoliceDepartmentName': 'Hollister',
         'UpdateTime': 'Fri Mar  2 12:43:43 2018',
         'guid': '7e607b82-9e18-49dc-a9d7-e9628a9147ad'}
        """

        db = dynamodb_resource()
        extra_msg = {"region_name": REGION, "aws_service": "dynamodb",
            "police_department_table": POLICE_DEPARTMENTS_TABLE,
            "guid": guid}
        log.info("Get PD record by GUID", extra=extra_msg)
        pd_table = db.Table(POLICE_DEPARTMENTS_TABLE)
        response = pd_table.get_item(
            Key={
                'guid': guid
            }
        )
        return response['Item']
    

    Notice that, excluding the logging code, it takes only a couple of lines to retrieve data from the database!

    Learn to use AWS DynamoDB in the following screencast.

    AWS DynamoDB

    Video Link: https://www.youtube.com/watch?v=gTHE6X5fce8

    Graph Databases #

    Another specialty database is a Graph Database. When I was the CTO of a Sports Social Network, we used a Graph Database, Neo4J, to make social graph queries more feasible. It also allowed us to build products around data science more quickly.

    Why Not Relational Databases instead of a Graph Database? #

    Relationship-heavy data is not a good fit for relational databases. Here are some examples (ideas credit to Joshua Blumenstock, UC Berkeley).

    • Think about the SQL query needed to select all third-degree connections of an individual in a social network.

      • Imagine the number of joins needed.
    • Think about the SQL query needed to retrieve the full social network of an individual.

      • Imagine the number of recursive joins required.

    Relational databases are good at representing one-to-many relationships, in which one table connects to multiple tables. Mimicking real-life relationships, like friends or followers in a social network, is much more complicated and a better fit for a Graph Database.
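
    To make the join explosion concrete, here is a small sketch using Python’s built-in sqlite3 module; a second-degree (“friend of a friend”) query already requires a self-join, and each additional degree adds another one. The table and data are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE friendships (person TEXT, friend TEXT);
        INSERT INTO friendships VALUES
            ('Emil', 'Johan'), ('Johan', 'Ian'), ('Ian', 'Allison');
    """)

    # Second-degree connections of Emil need one self-join already;
    # third-degree would require joining the table a third time.
    query = """
        SELECT f2.friend
        FROM friendships AS f1
        JOIN friendships AS f2 ON f1.friend = f2.person
        WHERE f1.person = 'Emil';
    """
    print(conn.execute(query).fetchall())  # [('Ian',)]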

    AWS Neptune #

    The Amazon Cloud also has a Graph database called Amazon Neptune, which has similar properties to Neo4J.

    Neptune

    Neo4j #

    You can learn more about Neo4j by experimenting in the sandbox they provide. The following graph tutorial is HEAVILY based on their official documentation, which you can find in the link below.

    Graph Database Facts #

    Let’s dive into some of the critical Graph Database facts.

    A Graph Database can store:

    • Nodes - graph data records
    • Relationships - connect nodes
    • Properties - named data values

    Simplest Graph #

    The Simplest Graph is as follows.

    • One node
    • Has some properties
    1. Start by drawing a circle for the node
    2. Add the name, Emil
    3. Note that he is from Sweden
    • Nodes are the name for data records in a graph
    • Data is stored as Properties
    • Properties are simple name/value pairs


    Labels #

    Nodes can be grouped together by applying a Label to each member. In our social graph, we’ll label each node that represents a Person.

    1. Apply the label “Person” to the node we created for Emil
    2. Color “Person” nodes red
    • A node can have zero or more labels
    • Labels do not have any properties

    Nodes

    More Nodes #

    Like any database, storing data in Neo4j can be as simple as adding more records. We’ll add a few more nodes:

    1. Emil has a Klout score of 99
    2. Johan, from Sweden, who is learning to surf
    3. Ian, from England, who is an author
    4. Rik, from Belgium, has a cat named Orval
    5. Allison, from California, who surfs
    • Similar nodes can have different properties
    • Properties can be strings, numbers, or booleans
    • Neo4j can store billions of nodes

    more_nodes

    Relationships #

    The real power of Neo4j is in connected data. To associate any two nodes, add a Relationship that describes how the records are related.

    In our social graph, we simply say who KNOWS whom:

    1. Emil KNOWS Johan and Ian
    2. Johan KNOWS Ian and Rik
    3. Rik and Ian KNOW Allison
    • Relationships always have direction
    • Relationships always have a type
    • Relationships form patterns of data

    relationships

    Relationship Properties #

    In a property graph, relationships are data records that can also contain properties. Looking more closely at Emil’s relationships, note that:

    • Emil has known Johan since 2001
    • Emil rates Ian 5 (out of 5)
    • Everyone else can have similar relationship properties

    relationships

    Key Graph Algorithms (With neo4j) #

    An essential part of graph databases is that they support their own set of descriptive statistics. Here are the key categories.

    • Centrality - What are the most critical nodes in the network? PageRank, Betweenness Centrality, Closeness Centrality

    • Community detection - How can the graph be partitioned? Union Find, Louvain, Label Propagation, Connected Components

    • Pathfinding - What are the shortest paths or best routes available given the cost? Minimum Weight Spanning Tree, All Pairs and Single Source Shortest Path, Dijkstra’s algorithm

    Let’s take a look at the Cypher code that lists the available graph algorithm procedures.

    CALL dbms.procedures()
    YIELD name, signature, description
    WITH * WHERE name STARTS WITH "algo"
    RETURN *
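
    Neo4j runs these algorithms inside the database itself. As a rough, hypothetical illustration of what the statistics compute, the same ideas can be sketched in Python with the networkx library on the toy social graph from earlier.

    import networkx as nx

    # Toy social graph mirroring the earlier example
    G = nx.Graph()
    G.add_edges_from([
        ("Emil", "Johan"), ("Emil", "Ian"),
        ("Johan", "Ian"), ("Johan", "Rik"),
        ("Rik", "Allison"), ("Ian", "Allison"),
    ])

    print(nx.pagerank(G))                     # centrality: most "important" nodes
    print(nx.betweenness_centrality(G))       # centrality: bridge nodes
    print(list(nx.connected_components(G)))   # community detection (simplest form)
    print(nx.shortest_path(G, "Emil", "Allison"))  # pathfinding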
    

    Russian Troll Walkthrough [Demo] #

    One of the better sandbox examples on the Neo4j website is the Russian Troll dataset. To run through an example, run this Cypher code in their sandbox.

    :play https://guides.neo4j.com/sandbox/twitter-trolls/index.html
    

    Finding top Trolls with Neo4J #

    You can proceed to find the “trolls”, i.e., foreign actors causing trouble in social media, in the example below.

    The list of prominent people who tweeted out links from the account @Ten_GOP, which Twitter shut down in August, includes political figures such as Michael Flynn and Roger Stone, celebrities such as Nicki Minaj and James Woods, and media personalities such as Ann Coulter and Chris Hayes. Note that at least two of these people were also convicted of a felony and then pardoned, making the data set even more enjoyable.

    A screenshot of the Neo4J interface for the phrase “thanks obama.”

    Pagerank score for Trolls #

    Here is a walkthrough of the code in a Colab notebook called “social network theory,” which you can reference.

    def enable_plotly_in_cell():
      """Allow Plotly figures to render inside a Colab notebook cell."""
      from IPython.display import display, HTML
      from plotly.offline import init_notebook_mode
      # Load require.js so the Plotly JavaScript bundle can initialize in the cell
      display(HTML('''
            <script src="/static/components/requirejs/require.js"></script>
      '''))
      init_notebook_mode(connected=False)
    

    The troll data is exported from Neo4j and loaded into Pandas.

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv("https://raw.githubusercontent.com/noahgift/essential_machine_learning/master/pagerank_top_trolls.csv")
    df.head()
    


    Next, the data is graphed with Plotly.

    import plotly.offline as py
    import plotly.graph_objs as go
    
    from plotly.offline import init_notebook_mode
    enable_plotly_in_cell()
    init_notebook_mode(connected=False)
    
    
    fig = go.Figure(data=[go.Scatter(
        x=df.pagerank,
        text=df.troll,
        mode='markers',
        marker=dict(
            color=np.log(df.pagerank),
            size=df.pagerank*5),
    )])
    py.iplot(fig, filename='3d-scatter-colorscale')
    


    Top Troll Hashtags #

    import pandas as pd
    import numpy as np
    
    df2 = pd.read_csv("https://raw.githubusercontent.com/noahgift/essential_machine_learning/master/troll-hashtag.csv")
    df2.columns = ["hashtag", "num"]
    df2.head()
    


    Now plot these troll hashtags.

    import plotly.offline as py
    import plotly.graph_objs as go
    
    from plotly.offline import init_notebook_mode
    enable_plotly_in_cell()
    init_notebook_mode(connected=False)
    
    
    fig = go.Figure(data=[go.Scatter(
        x=df2.num,  # use the hashtag counts from df2, not the troll pagerank
        text=df2.hashtag,
        mode='markers',
        marker=dict(
            color=np.log(df2.num),
            size=df2.num),
    )])
    py.iplot(fig)
    

    You can see these trolls love to use the hashtag #maga.


    Graph Database References #

    The following are additional references.

    The Three “V’s” of Big Data: Variety, Velocity, and Volume #

    There are many ways to define Big Data. One way of describing it is data that is too large to process on your laptop. Your laptop is not the real world, and it often comes as a shock to students when they get a job in industry and find that the approach they learned in school doesn’t work in the real world!

    Learn what Big Data is in the following screencast.

    what Big Data is

    Video Link: https://www.youtube.com/watch?v=2-MrUUj0E-Q

    Another method is the Three “V’s” of Big Data: Variety, Velocity, and Volume.

    Big Data Challenges

    Learn the three V’s of Big Data in the following screencast.

    the three V’s of Big Data

    Video Link: https://www.youtube.com/watch?v=qXBcDqSy5GY

    Variety #

    Dealing with many types of data is a massive challenge in Big Data. Here are some examples of the types of files dealt with in a Big Data problem.

    • Unstructured text
    • CSV files
    • binary files
    • big data files: Apache Parquet
    • Database files
    • SQL data

    Velocity #

    Another critical problem in Big Data is the velocity of the data. Some questions to ask include the following. Are data streams written at tens of thousands of records per second? Are there many streams of data written at once? Does the velocity of the data cause performance problems on the nodes collecting it?

    Volume #

    Is the actual size of the data larger than what a workstation can handle? Perhaps your laptop cannot load a CSV file into the Python pandas package. That problem is Big Data, i.e., it doesn’t work on your laptop. One petabyte is Big Data, and 100 GB could be Big Data, depending on how it is processed.
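
    For example, a CSV file that will not fit in memory can sometimes still be processed on a laptop by streaming it in chunks with pandas; the file name and the numeric “amount” column below are hypothetical.

    import pandas as pd

    total_rows = 0
    running_sum = 0.0

    # Read the file 100,000 rows at a time instead of loading it all at once
    for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
        total_rows += len(chunk)
        running_sum += chunk["amount"].sum()  # assumes a numeric "amount" column

    print(f"rows={total_rows}, mean amount={running_sum / total_rows:.2f}")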

    Batch vs. Streaming Data and Machine Learning #

    One critical technical concern is Batch data versus Stream data. If data processing occurs in a Batch job, it is much easier to architect and debug Data Engineering solutions. If the data is streaming, it increases the complexity of architecting a Data Engineering solution and limits its approaches.

    Impact on ML Pipeline #

    One aspect of Batch vs. Stream is that batch gives more control over model training (you can decide when to retrain). On the other hand, continuously retraining the model could provide better, or worse, prediction results. For example, did the input stream suddenly get more users or fewer users? How does an A/B testing scenario work?

    Batch #

    What are the characteristics of Batch data?

    • Data is batched at intervals
    • Simplest approach to creating predictions
    • Many services on AWS are capable of batch processing, including AWS Glue, AWS Data Pipeline, AWS Batch, and EMR (see the sketch below).
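
    As a minimal, hypothetical sketch, kicking off a batch job on AWS Glue from Python looks like this; the job name is assumed to already exist.

    import boto3

    glue = boto3.client("glue")

    # Start a pre-defined Glue ETL job on a schedule or from an orchestrator
    run = glue.start_job_run(JobName="nightly-etl-job")
    print(run["JobRunId"])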

    Streaming #

    What are the characteristics of Streaming data?

    • Continuously polled or pushed
    • More complex method of prediction
    • Many services on AWS are capable of streaming, including Kinesis, AWS IoT, and Spark on EMR (see the sketch below).
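
    Here is a hedged sketch of pushing records onto an Amazon Kinesis stream with boto3; the stream name and event payload are hypothetical.

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    # Continuously produced events are pushed to the stream as they occur
    event = {"user_id": 42, "action": "click"}
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )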

    Cloud Data Warehouse #

    The advantage of the cloud is effectively unlimited compute and storage. Cloud-native data warehouse systems also allow for serverless workflows that can directly integrate Machine Learning on the data lake. They are also ideal for developing Business Intelligence solutions.

    GCP BigQuery #

    There is a lot to like about GCP BigQuery. It is serverless, it has integrated Machine Learning, and it is easy to use. This next section has a walkthrough of a k-means clustering tutorial.

    The interface intuitively gives back results when queried. A key reason for this is the use of SQL and the direct integration with both Google Cloud and Google Data Studio.


    Learn to use Google BigQuery in the following screencast.

    Google BigQuery

    Video Link: https://www.youtube.com/watch?v=eIec2DXqw3Q

    Even better, you can directly train Machine Learning models using a SQL statement. This workflow shows an emerging trend with Cloud Database services: they let you both query the data and train the model. In this example, the k-means section is where the magic happens.

    CREATE OR REPLACE MODEL
      bqml_tutorial.london_station_clusters OPTIONS(model_type='kmeans',
        num_clusters=4) AS
    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
      IF
        (EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 1
          OR EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 7,
          "weekend",
          "weekday") AS isweekday,
        h.duration,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        isweekday,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name, isweekday)
    SELECT
      * EXCEPT(station_name, isweekday)
    FROM
      stationstats
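
    The same kind of query can also be submitted from Python with the google-cloud-bigquery client library; this is a sketch and assumes the library is installed and GCP credentials are already configured.

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are configured

    # Run SQL against the same public dataset programmatically instead of in the console
    sql = """
        SELECT start_station_name, COUNT(*) AS trips
        FROM `bigquery-public-data.london_bicycles.cycle_hire`
        GROUP BY start_station_name
        ORDER BY trips DESC
        LIMIT 5
    """
    for row in client.query(sql).result():
        print(row.start_station_name, row.trips)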
    

    Finally, when the k-means clustering model finishes training, the evaluation metrics also appear in the console.


    Often a meaningful final step is to take the result and export it to a Business Intelligence (BI) tool, such as Data Studio.


    The following is an excellent example of what a cluster visualization could look like when Google BigQuery results are exported to Google Data Studio.

    You can view the report using this direct URL.

    Summary of GCP BigQuery #

    In a nutshell, GCP BigQuery is a useful tool for Data Science and Business Intelligence. Here are the key features.

    • Serverless
    • Large selection of Public Datasets
    • Integrated Machine Learning
    • Integration with Data Studio
    • Intuitive
    • SQL based

    AWS Redshift #

    AWS Redshift is a Cloud data warehouse designed by AWS. Its key features include the ability to query exabyte-scale data in seconds through its columnar design. In practice, this means excellent performance regardless of the size of the data.

    Learn to use AWS Redshift in the following screencast.

    AWS Redshift

    Video Link: https://www.youtube.com/watch?v=vXSH24AJzrU

    Key actions in a Redshift Workflow #

    In general, the key actions are described in the Redshift getting started guide. These are the critical steps to set up a workflow.

    • Cluster Setup

    • IAM Role configuration (what can the role do?)

    • Setup Security Group (i.e. open port 5439)

    • Setup Schema

      create table users(
      userid integer not null distkey sortkey,
      username char(8),
    
    • Copy data from S3
      copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
      credentials 'aws_iam_role=<iam-role-arn>'
      delimiter '|' region 'us-west-2';
    
    • Query
      SELECT firstname, lastname, total_quantity
      FROM
      (SELECT buyerid, sum(qtysold) total_quantity
      FROM  sales
      GROUP BY buyerid
      ORDER BY total_quantity desc limit 10) Q, users
      WHERE Q.buyerid = userid
      ORDER BY Q.total_quantity desc;
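
    The same query can also be issued programmatically. Below is a hedged sketch using the Redshift Data API via boto3; the cluster, database, and user names are hypothetical, and the cluster must allow temporary credentials for the given database user.

    import boto3

    redshift_data = boto3.client("redshift-data")

    SQL = """
    SELECT firstname, lastname, total_quantity
    FROM (SELECT buyerid, sum(qtysold) total_quantity
          FROM sales
          GROUP BY buyerid
          ORDER BY total_quantity desc limit 10) Q, users
    WHERE Q.buyerid = userid
    ORDER BY Q.total_quantity desc;
    """

    # Submit the top-buyers query asynchronously through the Data API
    response = redshift_data.execute_statement(
        ClusterIdentifier="examplecluster",
        Database="dev",
        DbUser="awsuser",
        Sql=SQL,
    )
    print(response["Id"])  # fetch rows later with get_statement_result(Id=...)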
    

    Summary of AWS Redshift #

    The high-level takeaway for AWS Redshift is the following.

    • Mostly managed
    • Deep Integration with AWS
    • Columnar
    • Competitor to Oracle and GCP Big Query
    • Predictable performance on massive datasets

    Summary #

    This chapter covers storage, including object, block, and file system storage, as well as databases. A unique characteristic of Cloud Computing is the ability to use many tools at once to solve a problem. This advantageous trait is heavily at play with Cloud Storage and Cloud Databases.