Chapter 7: Do some data science with Pandas #
Noah Gift
As I write this, much of the United States is locked down due to the coronavirus pandemic (Covid-19). One thing is clear about this pandemic: making decisions with data, not a hunch, has life-and-death consequences. This chapter focuses on giving you the skills to learn the basics. Everyone can learn the foundational components of Data Science by following the simple YAGNI (You Ain’t Gonna Need It) approach.
Data Science Workflow #
The way to simplify doing data science is to start with a streamlined workflow. I call the following steps the “data science workflow”: Ingest, EDA (Exploratory Data Analysis), Modeling, and Conclusion. You can see the workflow in the screenshot, as well as in the Colab notebook for this chapter here.
Let’s briefly discuss what each step does.
- Ingest: This is where data is pulled into a project. It could be as simple as reading a file from a GitHub project or as complex as a SQL query against a cloud provider.
- EDA (Exploratory Data Analysis): This is where you ask general questions like “what is the highest number of cases?” or generate a plot. Often this is where much of the real investigation in Data Science takes place.
- Modeling: This is where you create some type of prediction. It could involve Machine Learning, or it could mean a simple time-series plot. It is also optional; you don’t always want to create a prediction, and sometimes you only want to understand the data.
- Conclusion: This is where you generate a conclusion and give the results to a stakeholder.
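The four phases can be sketched end to end as a short script. The inline CSV sample and the derived death rate below are made-up illustrations standing in for a real dataset, not the chapter’s actual data:

```python
import io
import pandas as pd

# Ingest: read CSV data (an in-memory sample standing in for a real URL)
csv_data = io.StringIO(
    "date,state,cases,deaths\n"
    "2020-04-03,Alabama,1535,38\n"
    "2020-04-03,Alaska,156,2\n")
df = pd.read_csv(csv_data)

# EDA: ask general questions about the data
summary = df.describe()
max_cases = df["cases"].max()

# Modeling (optional): even a simple aggregate can stand in for a prediction
deaths_per_case = df["deaths"].sum() / df["cases"].sum()

# Conclusion: summarize a result for a stakeholder
print(f"Highest case count in sample: {max_cases}")
```

In a notebook, each phase often gets its own cell, which keeps the workflow easy to rerun and extend.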
Let’s walk through each phase using a dataset from the New York Times on the coronavirus.
Ingest #
First, the data is grabbed from GitHub and pulled into a library called Pandas, which can operate on CSV-style data. Notice that the variable df refers to a DataFrame, the value returned by pd.read_csv. A DataFrame is, roughly speaking, a spreadsheet-like structure built from a CSV. A Python dictionary operates on key/value pairs, but a DataFrame allows you to act on rows and columns, just as you would by clicking around in an Excel file.
import pandas as pd

url = "https://raw.githubusercontent.com/"\
      "nytimes/covid-19-data/master/us-states.csv"
df = pd.read_csv(url)
df.head()
The output of df.head() is the first few rows of the CSV file, now in the form of a DataFrame.
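To make the dictionary comparison concrete, here is a small sketch (with made-up numbers) contrasting key/value lookup with row/column access:

```python
import pandas as pd

# A dictionary maps one key to one value
record = {"state": "Alabama", "cases": 1535}
print(record["cases"])  # single value lookup

# A DataFrame acts on whole rows and columns, like a spreadsheet
df = pd.DataFrame({"state": ["Alabama", "Alaska"],
                   "cases": [1535, 156]})
print(df["cases"])   # an entire column
print(df.iloc[0])    # an entire row
print(df.head(1))    # the first n rows, like df.head() above
```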
TIP: Notice how you can refer to a CSV file in GitHub and directly load it into your notebook. This workflow is often one of the most straightforward ways to load data. To get the full path to the URL of any CSV file in a public GitHub repo, first click on the file, then view the raw version; this gives you a link that starts with “raw.githubusercontent.com”. This is the type of URL that will work when you refer to it from Pandas.
What about the last five rows? You can grab those by using the built-in tail method.
df.tail()
EDA #
Next up is to “describe” the data. EDA (Exploratory Data Analysis) is about exploring: no idea is a bad idea to try in an EDA workflow.
df.describe()
Now it can be plotted. This example shows the spread of cases and deaths by state. Note that the hue and size keyword arguments color the points and scale them by the number of deaths.
import seaborn as sns

sns.scatterplot(x="cases", y="deaths",
                hue="deaths", size="deaths", data=df)
Date-based EDA #
What about using time? One way to do this in Python is to use an index specifically dedicated to dealing with times and dates. This is very similar to working with a date column in Excel.
To do that, first apply df["date"] = pd.to_datetime(df["date"]), which rewrites the date column into a proper datetime format, and then call df.set_index("date").
Dealing with dates by setting the index
url = "https://raw.githubusercontent.com/"\
"nytimes/covid-19-data/master/us-states.csv"
df = pd.read_csv(url)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
print(f"Columns: {df.columns}")
df.index
The pandas DataFrame now has an index, which shows the format of the dates when printed out via df.index.
Columns: Index(['state', 'fips', 'cases', 'deaths'], dtype='object')
DatetimeIndex(['2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
'2020-01-24', '2020-01-25', '2020-01-25', '2020-01-25',
'2020-01-26', '2020-01-26',
...
'2020-04-04', '2020-04-04', '2020-04-04', '2020-04-04',
'2020-04-04', '2020-04-04', '2020-04-04', '2020-04-04',
'2020-04-04', '2020-04-04'],
dtype='datetime64[ns]', name='date', length=1829, freq=None)
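One benefit of a DatetimeIndex is partial-string slicing, where a whole month can be selected by its label. Here is a sketch with a small made-up frame, not the full dataset:

```python
import pandas as pd

# Toy frame with a DatetimeIndex, mimicking the chapter's structure
df = pd.DataFrame(
    {"state": ["Washington", "Washington", "New York"],
     "cases": [1, 2, 10]},
    index=pd.to_datetime(["2020-01-21", "2020-02-29", "2020-03-01"]))
df.index.name = "date"

# Select every row in January 2020 with a partial date string
january = df.loc["2020-01"]
print(january)
```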
What works differently now? One useful thing is the ability to filter by date ranges using language familiar to a non-programmer. This example shows a variable today and a variable daybefore used to filter via a time delta.
Filter results by date range
from datetime import date, timedelta
today = date.today()
daybefore = today - timedelta(days = 2)
print(f"Today {today}")
print(f"Two days ago {daybefore}")
Today 2020-04-05
Two days ago 2020-04-03
Additionally, you can filter the entire pandas DataFrame now by using the following syntax.
df.loc[daybefore:today].head()
The ability to slice the DataFrame by a specific date range is incredibly useful in exploring a dataset.
state fips cases deaths
date
2020-04-03 Alabama 1 1535 38
2020-04-03 Alaska 2 156 2
2020-04-03 Arizona 4 1769 41
2020-04-03 Arkansas 5 738 12
2020-04-03 California 6 12569 282
Another thing you can do is plot the distribution of the data for a date range. This can be helpful for getting a feel for which states have a more significant problem with deaths. It shows that during this particular date range, some states have dramatically more deaths.
sns.kdeplot(df.loc[daybefore:today]["deaths"], shade=True)
Now that it is apparent that some states have a worse initial outbreak, let’s sort the DataFrame and grab the top ten states.
Sort DataFrame in place by states with highest deaths and cases and show first 10 results
current_df = df.loc[daybefore:today].sort_values(
by=["deaths", "cases"], ascending=False)
current_df.head(10)
There are a few things to be aware of. First, the output is sorted by deaths, then by cases. Also, because there are two days of data, states can appear twice if they lead in deaths on both dates.
state fips cases deaths
date
2020-04-04 New York 36 114996 3568
2020-04-03 New York 36 102870 2935
2020-04-04 New Jersey 34 34124 847
2020-04-03 New Jersey 34 29895 647
2020-04-04 Michigan 26 14225 540
2020-04-03 Michigan 26 12670 478
2020-04-04 Louisiana 22 12496 412
2020-04-03 Louisiana 22 10297 370
2020-04-04 California 6 13796 323
2020-04-04 Washington 53 7498 318
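Since states can appear twice in the sorted output, one way to keep only the most recent row per state (a sketch using drop_duplicates, not a step from the chapter) relies on the frame already being sorted:

```python
import pandas as pd

# Toy stand-in for current_df: sorted by deaths descending, two dates per state
current_df = pd.DataFrame({
    "state": ["New York", "New York", "New Jersey", "New Jersey"],
    "deaths": [3568, 2935, 847, 647],
})

# Keep the first (highest-deaths) row for each state
latest_df = current_df.drop_duplicates(subset="state", keep="first")
print(latest_df)
```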
Each date (2020-04-04 and 2020-04-03) should contribute roughly 50 rows, one per state; the extra rows come from US territories.
current_df.shape
(110, 4)
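To sanity-check the shape, value_counts on the index shows how many rows each date contributes. This quick check (a sketch with a toy frame, not from the chapter) works because the dates are the index:

```python
import pandas as pd

# Toy stand-in for current_df with a date index
idx = pd.to_datetime(["2020-04-04"] * 3 + ["2020-04-03"] * 3)
current_df = pd.DataFrame({"deaths": [1, 2, 3, 4, 5, 6]}, index=idx)

# Count the rows contributed by each date
counts = current_df.index.value_counts()
print(counts)
```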
State-Based Analysis #
Another way to do exploration is to look closely at states.
Get the top ten occurrences by state and subset
top_ten_states = list(current_df["state"].head(10).values)
top_ten_states
['New York',
'New York',
'New Jersey',
'New Jersey',
'Michigan',
'Michigan',
'Louisiana',
'Louisiana',
'California',
'Washington']
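Since top_ten_states contains duplicates, an order-preserving dedup is sometimes handy. One sketch (using dict.fromkeys, an approach not shown in the chapter) looks like this:

```python
top_ten_states = ['New York', 'New York', 'New Jersey', 'New Jersey',
                  'Michigan', 'Michigan', 'Louisiana', 'Louisiana',
                  'California', 'Washington']

# dict.fromkeys preserves insertion order and drops repeated keys
unique_states = list(dict.fromkeys(top_ten_states))
print(unique_states)
```

Unlike set, this keeps the states in their original (deaths-sorted) order.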
Now drop the fips column.
top_states_df = df[df['state'].isin(top_ten_states)].drop(columns="fips")
Verify which unique states are left

The set function deduplicates the names of the states, showing which unique values remain.
set(top_states_df.state.values)
{'California', 'Louisiana', 'Michigan', 'New Jersey', 'New York', 'Washington'}
Next up, with this information, a faceted plot of these top states can be created using seaborn. Note that sns.lmplot takes keyword arguments that change the style of the plot. In particular, col="state" creates a separate plot for each state.
import seaborn as sns

# lmplot creates its own figure, so no plt.figure call is needed
grid = sns.lmplot(x="cases", y="deaths",
                  height=3.5,
                  col="state",
                  col_wrap=5,
                  hue="state",
                  palette="Set2",
                  data=top_states_df)
grid.fig.subplots_adjust(wspace=.2)
Notice also that hue="state" gives each state its own color.
Next up, an interactive plot is created using the plotly library. The example filters the dates to around 2020-03-08, when cases start to appear.
Interactive plot of top states
top_states_march_df = top_states_df.loc["2020-03-08":today].sort_values(
by=["deaths", "cases"], ascending=True)
top_states_march_df.head()
state cases deaths
date
2020-03-09 Louisiana 1 0
2020-03-10 Michigan 2 0
2020-03-11 Michigan 2 0
2020-03-08 New Jersey 6 0
2020-03-10 Louisiana 6 0
The plotly graph can be generated as follows in a Colab notebook.
import plotly.io as pio
pio.renderers.default = 'colab'
import plotly.express as px
fig = px.scatter(
    top_states_march_df,
    x=top_states_march_df.index,
    y="cases", size="deaths", color="state", facet_col="state")
fig.show()
The chart shows this dramatic rise in cases in New York, and the size of the bubble reflects the total death count.
Search for Features: Political, Health and Finance #
What about a different approach? Often in the exploratory phase, it can be useful to try many different ideas. Let’s look at some other variables like sugar intake and political affiliation.
Sugar Intake By State
Let’s read in some CDC (Centers for Disease Control) data on sugar intake by state. Note that in the real world, you often need to rename columns. In this case, this is done using the .rename method.
url ="https://raw.githubusercontent.com/"\
"noahgift/sugar/master/data/education_sugar_cdc_2003.csv"
cdc_2013 = pd.read_csv(url)
cdc_2013.set_index("State", inplace=True)
for column in cdc_2013.columns:
    cdc_2013[column] = cdc_2013[column].str.replace(r"\(.*\)", "", regex=True)
    cdc_2013[column] = pd.to_numeric(cdc_2013[column])
cdc_2013.reset_index(inplace=True)
cdc_2013.rename(
columns={
"State": "state",
"Employed": "employed-sugar-intake-daily"
},
inplace=True)
cdc_2013.head()
Combine Sugar Data and Covid-19 Data
Next, the sugar data is merged with the original data.
cdc_employed_df = cdc_2013[["employed-sugar-intake-daily", "state"]]
sugar_covid_df = df.merge(cdc_employed_df, how="inner", on="state")
sugar_covid_df.head()
state fips cases deaths employed-sugar-intake-daily
0 California 6 1 0 22.9
1 California 6 2 0 22.9
2 California 6 2 0 22.9
3 California 6 2 0 22.9
4 California 6 2 0 22.9
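The how="inner" argument keeps only states present in both frames. A small sketch with made-up values shows the behavior:

```python
import pandas as pd

covid = pd.DataFrame({"state": ["California", "Guam"],
                      "cases": [12569, 93]})
sugar = pd.DataFrame({"state": ["California"],
                      "employed-sugar-intake-daily": [22.9]})

# Inner merge: Guam is dropped because it has no matching sugar row
merged = pd.merge(covid, sugar, how="inner", on="state")
print(merged)
```

A left merge would instead keep Guam with a missing (NaN) sugar value.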
What about data from the 2016 Election?
Next, load an additional dataset on which party each state voted for in the 2016 election into a pandas DataFrame. Note that the pd.get_dummies method splits the Winning Party values into two indicator columns.
election_df = pd.read_csv(
"https://raw.githubusercontent.com/"\
"paiml/minimal-python/master/data/2016-Electoral-Votes.csv")
election_df.rename(columns={"State": "state"},inplace=True)
election_df.drop(columns="Votes", inplace=True)
election_df = pd.concat(
[election_df, pd.get_dummies(election_df["Winning Party"])], axis=1)
election_df.head()
The output is as follows.
state Winning Party Democrats Republicans
0 Alabama Republicans 0 1
1 Alaska Republicans 0 1
2 Arizona Republicans 0 1
3 Arkansas Republicans 0 1
4 California Democrats 1 0
Let’s look to see if we can find any features by checking which variables correlate with each other. First, merge the election data into sugar_covid_df so that the political columns are available, then generate the correlation matrix.

Generate Heatmap

sugar_covid_df = sugar_covid_df.merge(election_df, how="inner", on="state")
sugar_covid_df.corr()
The output shows the correlated values. A positive correlation suggests a feature that could later be used to create a predictive model.
Let’s improve the output by making an interactive heatmap with Plotly.
import plotly.graph_objects as go
cols = ['cases', 'deaths',
        'employed-sugar-intake-daily',
        'Democrats', 'Republicans']
fig = go.Figure(data=go.Heatmap(
    z=sugar_covid_df[cols].corr().values,
    x=cols,
    y=cols,
    hoverongaps=False))
fig.show()
fig.show()
The output of the chart shows that so far, Democratic-leaning states are leading the way in cases and deaths. Will this hold up?