Chapter 7: Do some data science with Pandas #
Noah Gift
As I write this, much of the United States is locked down due to the coronavirus pandemic (Covid-19). One thing is clear about this pandemic: making decisions with data, not a hunch, has life-and-death consequences. This chapter focuses on giving you the skills to learn the basics. Everyone can learn the foundational components of Data Science by following the simple YAGNI (You Ain’t Gonna Need It) approach.
Data Science Workflow #
The way to simplify doing data science is to start with a streamlined workflow. I call the following steps the “data science workflow”: Ingest, EDA (Exploratory Data Analysis), Modeling, and Conclusion. You can see the workflow in the screenshot, as well as in the Colab notebook for this chapter here.
Let’s briefly discuss what each step does.
- Ingest: This is where data is pulled into a project. It could be as simple as reading a file from a GitHub project or as complex as a SQL query against a cloud provider.
- EDA (Exploratory Data Analysis): This is where you ask general questions like “what is the highest number of cases?” or generate a plot. Often this is where much of the real investigation in Data Science takes place.
- Modeling: This is where you create some type of prediction. It could involve Machine Learning, or it could mean a simple time-series plot. It is also optional; you don’t always want to create a prediction, and sometimes you only want to understand the data.
- Conclusion: This is where you generate a conclusion and give the results to a stakeholder.
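The four phases can be sketched end to end as a short script. The inline CSV sample and the derived death rate below are made-up illustrations standing in for a real dataset, not the chapter’s actual data:

```python
import io
import pandas as pd

# Ingest: read CSV data (an in-memory sample standing in for a real URL)
csv_data = io.StringIO(
    "date,state,cases,deaths\n"
    "2020-04-03,Alabama,1535,38\n"
    "2020-04-03,Alaska,156,2\n")
df = pd.read_csv(csv_data)

# EDA: ask general questions about the data
summary = df.describe()
max_cases = df["cases"].max()

# Modeling (optional): even a simple aggregate can stand in for a prediction
deaths_per_case = df["deaths"].sum() / df["cases"].sum()

# Conclusion: summarize a result for a stakeholder
print(f"Highest case count in sample: {max_cases}")
```

In a notebook, each phase often gets its own cell, which keeps the workflow easy to rerun and extend.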
Let’s walk through each phase using a dataset from the New York Times on the coronavirus.
Ingest #
First, the data is grabbed from GitHub and pulled into a library called Pandas, which can operate on CSV-style data. Notice that the variable df refers to a DataFrame, the value returned by pd.read_csv. A DataFrame is, roughly speaking, a spreadsheet-like structure built from a CSV. A Python dictionary operates on key/value pairs, but a DataFrame allows you to act on rows and columns, just as you would by clicking around in an Excel file.
import pandas as pd

url = "https://raw.githubusercontent.com/"\
      "nytimes/covid-19-data/master/us-states.csv"
df = pd.read_csv(url)
df.head()
The output of df.head() is the first few rows of the CSV file, now in the form of a DataFrame.
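To make the dictionary comparison concrete, here is a small sketch (with made-up numbers) contrasting key/value lookup with row/column access:

```python
import pandas as pd

# A dictionary maps one key to one value
record = {"state": "Alabama", "cases": 1535}
print(record["cases"])  # single value lookup

# A DataFrame acts on whole rows and columns, like a spreadsheet
df = pd.DataFrame({"state": ["Alabama", "Alaska"],
                   "cases": [1535, 156]})
print(df["cases"])   # an entire column
print(df.iloc[0])    # an entire row
print(df.head(1))    # the first n rows, like df.head() above
```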
TIP: Notice how you can refer to a CSV file in GitHub and directly load it into your notebook. This workflow is often one of the most straightforward ways to load data. To get the full path to the URL of any CSV file in a public GitHub repo, first click on the file, then view the raw version; this gives you a link that starts with “raw.githubusercontent.com”. This is the type of URL that will work when you refer to it from Pandas.
What about the last five rows? You can grab those by using the built-in tail method.
df.tail()
EDA #
Next up is to “describe” the data. EDA (Exploratory Data Analysis) is about exploring: no idea is a bad idea to try in an EDA workflow.
df.describe()
Now it can be plotted. This example shows the spread of cases and deaths by state. Note that the hue and size keyword arguments color the points and scale them by the number of deaths.
import seaborn as sns

sns.scatterplot(x="cases", y="deaths",
                hue="deaths", size="deaths", data=df)
Date-based EDA #
What about using time? One way to do this in Python is to use an index specifically dedicated to dealing with times and dates. This is very similar to working with a date column in Excel.
To do that, first apply df["date"] = pd.to_datetime(df["date"]), which rewrites the date column into a proper datetime format, and then call df.set_index("date").
Dealing with dates by setting the index
url = "https://raw.githubusercontent.com/"\
"nytimes/covid-19-data/master/us-states.csv"
df = pd.read_csv(url)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
print(f"Columns: {df.columns}")
df.index
The pandas DataFrame now has an index, which shows the format of the dates when printed out via df.index.
Columns: Index(['state', 'fips', 'cases', 'deaths'], dtype='object')
DatetimeIndex(['2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
'2020-01-24', '2020-01-25', '2020-01-25', '2020-01-25',
'2020-01-26', '2020-01-26',
...
'2020-04-04', '2020-04-04', '2020-04-04', '2020-04-04',
'2020-04-04', '2020-04-04', '2020-04-04', '2020-04-04',
'2020-04-04', '2020-04-04'],
dtype='datetime64[ns]', name='date', length=1829, freq=None)
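One benefit of a DatetimeIndex is partial-string slicing, where a whole month can be selected by its label. Here is a sketch with a small made-up frame, not the full dataset:

```python
import pandas as pd

# Toy frame with a DatetimeIndex, mimicking the chapter's structure
df = pd.DataFrame(
    {"state": ["Washington", "Washington", "New York"],
     "cases": [1, 2, 10]},
    index=pd.to_datetime(["2020-01-21", "2020-02-29", "2020-03-01"]))
df.index.name = "date"

# Select every row in January 2020 with a partial date string
january = df.loc["2020-01"]
print(january)
```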
What works differently now? One useful thing is the ability to filter by date ranges using language familiar to a non-programmer. This example shows a variable today and a variable daybefore used to filter via a time delta.
Filter results by date range
from datetime import date, timedelta
today = date.today()
daybefore = today - timedelta(days = 2)
print(f"Today {today}")
print(f"Two days ago {daybefore}")
Today 2020-04-05
Two days ago 2020-04-03
Additionally, you can filter the entire pandas DataFrame now by using the following syntax.
df.loc[daybefore:today].head()
The ability to slice the DataFrame by a specific date range is incredibly useful in exploring a dataset.
state fips cases deaths
date
2020-04-03 Alabama 1 1535 38
2020-04-03 Alaska 2 156 2
2020-04-03 Arizona 4 1769 41
2020-04-03 Arkansas 5 738 12
2020-04-03 California 6 12569 282
Another thing you can do is plot the distribution of the data for a date range. This can be helpful for getting a feel for which states have a more significant problem with deaths. It shows that during this particular date range, some states have dramatically more deaths.
sns.kdeplot(df.loc[daybefore:today]["deaths"], shade=True)
Now that it is apparent that some states have a worse initial outbreak, let’s sort the DataFrame and grab the top ten states.
Sort DataFrame in place by states with highest deaths and cases and show first 10 results
current_df = df.loc[daybefore:today].sort_values(
by=["deaths", "cases"], ascending=False)
current_df.head(10)
There are a few things to be aware of. First, the output is sorted by deaths, then by cases. Also, because there are two days of data, states can appear twice if they lead in deaths on both dates.
state fips cases deaths
date
2020-04-04 New York 36 114996 3568
2020-04-03 New York 36 102870 2935
2020-04-04 New Jersey 34 34124 847
2020-04-03 New Jersey 34 29895 647
2020-04-04 Michigan 26 14225 540
2020-04-03 Michigan 26 12670 478
2020-04-04 Louisiana 22 12496 412
2020-04-03 Louisiana 22 10297 370
2020-04-04 California 6 13796 323
2020-04-04 Washington 53 7498 318
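Since states can appear twice in the sorted output, one way to keep only the most recent row per state (a sketch using drop_duplicates, not a step from the chapter) relies on the frame already being sorted:

```python
import pandas as pd

# Toy stand-in for current_df: sorted by deaths descending, two dates per state
current_df = pd.DataFrame({
    "state": ["New York", "New York", "New Jersey", "New Jersey"],
    "deaths": [3568, 2935, 847, 647],
})

# Keep the first (highest-deaths) row for each state
latest_df = current_df.drop_duplicates(subset="state", keep="first")
print(latest_df)
```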
Each date (2020-04-04 and 2020-04-03) should contribute roughly 50 rows, one per state; the extra rows come from US territories.
current_df.shape
(110, 4)
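To sanity-check the shape, value_counts on the index shows how many rows each date contributes. This quick check (a sketch with a toy frame, not from the chapter) works because the dates are the index:

```python
import pandas as pd

# Toy stand-in for current_df with a date index
idx = pd.to_datetime(["2020-04-04"] * 3 + ["2020-04-03"] * 3)
current_df = pd.DataFrame({"deaths": [1, 2, 3, 4, 5, 6]}, index=idx)

# Count the rows contributed by each date
counts = current_df.index.value_counts()
print(counts)
```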
State-Based Analysis #
Another way to do exploration is to look closely at states.
Get the top ten occurrences by state and subset
top_ten_states = list(current_df["state"].head(10).values)
top_ten_states
['New York',
'New York',
'New Jersey',
'New Jersey',
'Michigan',
'Michigan',
'Louisiana',
'Louisiana',
'California',
'Washington']
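Since top_ten_states contains duplicates, an order-preserving dedup is sometimes handy. One sketch (using dict.fromkeys, an approach not shown in the chapter) looks like this:

```python
top_ten_states = ['New York', 'New York', 'New Jersey', 'New Jersey',
                  'Michigan', 'Michigan', 'Louisiana', 'Louisiana',
                  'California', 'Washington']

# dict.fromkeys preserves insertion order and drops repeated keys
unique_states = list(dict.fromkeys(top_ten_states))
print(unique_states)
```

Unlike set, this keeps the states in their original (deaths-sorted) order.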
Now drop the fips column.
top_states_df = df[df['state'].isin(top_ten_states)].drop(columns="fips")
Verify which unique states are left

The set function deduplicates the names of the states, showing which unique values remain.
set(top_states_df.state.values)
{'California', 'Louisiana', 'Michigan', 'New Jersey', 'New York', 'Washington'}
Next up, with this information, a faceted plot of these top states can be created using seaborn. Note that sns.lmplot takes keyword arguments that change the style of the plot. In particular, col="state" creates a separate plot for each state.
import seaborn as sns

# lmplot creates its own figure, so no plt.figure call is needed
grid = sns.lmplot(x="cases", y="deaths",
                  height=3.5,
                  col="state",
                  col_wrap=5,
                  hue="state",
                  palette="Set2",
                  data=top_states_df)
grid.fig.subplots_adjust(wspace=.2)
Notice also that hue="state" gives each state its own color.
Next up, an interactive plot is created using the plotly library. The example filters the dates to around 2020-03-08, when cases start to appear.
Interactive plot of top states
top_states_march_df = top_states_df.loc["2020-03-08":today].sort_values(
by=["deaths", "cases"], ascending=True)
top_states_march_df.head()
state cases deaths
date
2020-03-09 Louisiana 1 0
2020-03-10 Michigan 2 0
2020-03-11 Michigan 2 0
2020-03-08 New Jersey 6 0
2020-03-10 Louisiana 6 0
The plotly graph can be generated as follows in a Colab notebook.
import plotly.io as pio
pio.renderers.default = 'colab'
import plotly.express as px
fig = px.scatter(
    top_states_march_df,
    x=top_states_march_df.index,
    y="cases", size="deaths", color="state", facet_col="state")
fig.show()
The chart shows this dramatic rise in cases in New York, and the size of the bubble reflects the total death count.
Search for Features: Political, Health and Finance #
What about a different approach? Often in the exploratory phase, it can be useful to try many different ideas. Let’s look at some other variables like sugar intake and political affiliation.
Sugar Intake By State
Let’s read in some CDC (Centers for Disease Control) data on sugar intake by state. Note that in the real world, you often need to rename columns. In this case, this is done using the .rename method.
url ="https://raw.githubusercontent.com/"\
"noahgift/sugar/master/data/education_sugar_cdc_2003.csv"
cdc_2013 = pd.read_csv(url)
cdc_2013.set_index("State", inplace=True)
for column in cdc_2013.columns:
    cdc_2013[column] = cdc_2013[column].str.replace(r"\(.*\)", "", regex=True)
    cdc_2013[column] = pd.to_numeric(cdc_2013[column])
cdc_2013.reset_index(inplace=True)
cdc_2013.rename(
columns={
"State": "state",
"Employed": "employed-sugar-intake-daily"
},
inplace=True)
cdc_2013.head()
Combine Sugar Data and Covid-19 Data
Next, the sugar data is merged with the original data.
cdc_employed_df = cdc_2013[["employed-sugar-intake-daily", "state"]]
sugar_covid_df = df.merge(cdc_employed_df, how="inner", on="state")
sugar_covid_df.head()
state fips cases deaths employed-sugar-intake-daily
0 California 6 1 0 22.9
1 California 6 2 0 22.9
2 California 6 2 0 22.9
3 California 6 2 0 22.9
4 California 6 2 0 22.9
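The how="inner" argument keeps only states present in both frames. A small sketch with made-up values shows the behavior:

```python
import pandas as pd

covid = pd.DataFrame({"state": ["California", "Guam"],
                      "cases": [12569, 93]})
sugar = pd.DataFrame({"state": ["California"],
                      "employed-sugar-intake-daily": [22.9]})

# Inner merge: Guam is dropped because it has no matching sugar row
merged = pd.merge(covid, sugar, how="inner", on="state")
print(merged)
```

A left merge would instead keep Guam with a missing (NaN) sugar value.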
What about data from the 2016 Election?
Next, load an additional dataset on which party each state voted for in the 2016 election into a pandas DataFrame. Note that the pd.get_dummies method splits the Winning Party values into two indicator columns.
election_df = pd.read_csv(
"https://raw.githubusercontent.com/"\
"paiml/minimal-python/master/data/2016-Electoral-Votes.csv")
election_df.rename(columns={"State": "state"},inplace=True)
election_df.drop(columns="Votes", inplace=True)
election_df = pd.concat(
[election_df, pd.get_dummies(election_df["Winning Party"])], axis=1)
election_df.head()
The output is as follows.
state Winning Party Democrats Republicans
0 Alabama Republicans 0 1
1 Alaska Republicans 0 1
2 Arizona Republicans 0 1
3 Arkansas Republicans 0 1
4 California Democrats 1 0
Let’s look to see if we can find any features by checking which variables correlate with each other. First, merge the election data into sugar_covid_df so that the political columns are available, then generate the correlation matrix.

Generate Heatmap

sugar_covid_df = sugar_covid_df.merge(election_df, how="inner", on="state")
sugar_covid_df.corr()
The output shows the correlated values. A positive correlation suggests a feature that could later be used to create a predictive model.
Let’s improve the output by making an interactive heatmap with Plotly.
import plotly.graph_objects as go
cols = ['cases', 'deaths',
        'employed-sugar-intake-daily',
        'Democrats', 'Republicans']
fig = go.Figure(data=go.Heatmap(
    z=sugar_covid_df[cols].corr().values,
    x=cols,
    y=cols,
    hoverongaps=False))
fig.show()
fig.show()
The output of the chart shows that so far, Democratic-leaning states are leading the way in cases and deaths. Will this hold up?