This is a walkthrough of the Data Science Pipeline, using Python as the language of choice. Before we begin, let's first talk about our dependencies:
This tutorial was created on a system running macOS; however, any operating system should work.
Depending on your operating system, you may already have a version of Python installed. For more information, visit https://www.python.org/downloads/operating-systems/
If you don't already know, this GitHub page is actually a Jupyter (IPython) notebook, which allows you to document both the input and output of code with boxes of text (known as "Markdown cells" - http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).
These cells support text styling via a lightweight, easy-to-use markup language (https://daringfireball.net/projects/markdown/syntax).
Although not technically required, an application that supports interactive literate programming makes the entire pipeline far easier to document, understand, and benefit from; using one is highly recommended, especially for beginners and newcomers.
You may download Jupyter through Anaconda: http://jupyter.readthedocs.io/en/latest/install.html
I am using Python version 3.6.2 and IPython version 6.1.0. You may verify your versions via
Help ---> About
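If you'd rather verify from code, a quick sketch like the following, run in a notebook cell, prints both versions:
import sys
import IPython

# Print the Python and IPython versions of the running kernel
print('Python:', sys.version)
print('IPython:', IPython.__version__)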
Once you have your environment set up, you can retrieve the data. For this walkthrough we will use:
IGN's game review data for the last 20 years (up to 2016), found here: https://www.kaggle.com/egrinstein/20-years-of-games/
Video games with sales greater than 100,000 (from vgchartz.com), found here: https://www.kaggle.com/gregorut/videogamesales/
We start by importing the necessary packages, reading the data in, and parsing it into a dataframe. Now is a good time to perform any other sanitization tasks, before we begin to form perceptions and make assumptions about the data set. For example, we will search for missing entries (usually recorded as NaN) and impute values where necessary.
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
ign = pd.read_csv('/home/jovyan/notebooks/Final Project/dkanney.github.io/data/ign.csv')
vgsales = pd.read_csv('/home/jovyan/notebooks/Final Project/dkanney.github.io/data/vgsales_ratings.csv')
# Restrict to games made from 2006-2016 inclusive
ign = ign[ign['release_year'] > 2005]
# Join the two dataframes on video game title/Name; the inner join drops any game that appears in only one dataset
raw_data = pd.merge(ign, vgsales, left_on='title', right_on='Name', how='inner')
# Combine the release year/month/day columns into a single datetime column
raw_data['release_date'] = pd.to_datetime(
    raw_data[['release_year', 'release_month', 'release_day']]
    .rename(columns={'release_year': 'year',
                     'release_month': 'month',
                     'release_day': 'day'}))
# Drop unused/duplicate columns
raw_data.drop(['Year_of_Release', 'release_month', 'release_year',
               'Critic_Count', 'Critic_Score', 'release_day',
               'Unnamed: 0', 'User_Count', 'User_Score',
               'Developer', 'genre', 'Name', 'url'],
              axis=1, inplace=True)
# Strip whitespace from every column that isn't numerical/quantitative/datetime
text_cols = set(raw_data.columns) - set(raw_data.describe().columns) - {'release_date'}
for col in text_cols:
    raw_data[col] = raw_data[col].str.strip()
display('raw_data: ', raw_data.sample(20), '# of rows: ', len(raw_data))
Let's create a helper function to find missing (NaN) values in our dataframe:
# Check for NaNs in the dataframe, printing each column's NaN count,
# then return the rows that contain at least one NaN value.
def get_nan(df):
    print(df.isnull().sum())
    return df[df.isnull().any(axis=1)]
Running the next cell returns the remaining records with NaN entries:
get_nan(raw_data)
print('\nUnique Ratings: ', raw_data['Rating'].unique())
We now see that many of these rows contain missing/'NaN' values under 'Rating'.
According to our function, our dataset contains more than 1400 missing game ratings. Obviously, we can't simply impute each one with 100% accuracy without examining each individual row.
Therefore, I decided to replace the missing game ratings with the ESRB's "Rating Pending" (RP) category, based on this rating guide from the ESRB: https://www.esrb.org/ratings/ratings_guide.aspx
raw_data['Rating'].fillna('RP', inplace=True)
# Record the index labels of the rows that still contain NaNs
missing_set = list(get_nan(raw_data).index)
missing_set
Let's start tidying the data by looking at the missing Publishers and filling in the missing data points (see specific entries above). Since there doesn't seem to be a clear pattern or reason for why this data is missing, we will assume for now that it is Missing Completely at Random (MCAR). Read more about the three types of missing data at
https://en.wikipedia.org/wiki/Missing_data#Techniques_of_dealing_with_missing_data
Because only a small number of Publisher names are missing, we can use this moment to learn a bit more about our data set while imputing the missing data.
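Before filling anything in, we can peek at exactly which games are affected. A small illustrative check using the get_nan helper defined earlier (the 'title' and 'Publisher' columns both survive our earlier column pruning):
# Inspect the games whose Publisher is still missing before imputing
get_nan(raw_data)[['title', 'Publisher']]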
# Publisher --> 'Sega'
# (SOURCE: https://en.wikipedia.org/wiki/Sonic_the_Hedgehog_(2006_video_game))
raw_data.loc[missing_set[0], 'Publisher'] = 'Sega'
raw_data.loc[missing_set[1], 'Publisher'] = 'Sega'
# Publisher --> 'Nintendo'
# (SOURCE: https://en.wikipedia.org/wiki/Mario_Tennis)
raw_data.loc[missing_set[2], 'Publisher'] = 'Nintendo'
# Publisher --> 'Wargaming'
# (SOURCE: https://en.wikipedia.org/wiki/World_of_Tanks)
raw_data.loc[missing_set[3], 'Publisher'] = 'Wargaming' # Initial release
raw_data.loc[missing_set[4], 'Publisher'] = 'Wargaming' # Console edition
# Publisher --> '7Sixty'
# (SOURCE: https://en.wikipedia.org/wiki/Stronghold_3)
raw_data.loc[missing_set[5], 'Publisher'] = '7Sixty'
# Publisher --> 'Gearbox'
# (SOURCE: http://store.steampowered.com/app/244160/Homeworld_Remastered_Collection/)
raw_data.loc[missing_set[6], 'Publisher'] = 'Gearbox'
# Some publishers appear under similar names (e.g. 'Valve' and 'Valve Software').
# We can group these together to simplify analysis.
raw_data['Publisher'].replace('Valve Software', 'Valve', inplace=True)
Running the next cell verifies that there are no more remaining records with 'NaN' entries.
get_nan(raw_data);
Let's look at some visualizations of our data set. We can try several different plot types to gain a better understanding of the data.
display(ign['genre'].value_counts()[:15])
ign['genre'].value_counts()[:15].plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.18,0.15])
plt.title('Genres IGN Reviewed the Most')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
Although this pie chart is helpful, it's certainly not the only way we could visualize this data. Let's try a horizontal bar plot.
ign_rev = ign['genre'].value_counts()[:15].to_frame()
ax = sns.barplot(x='genre',y=ign_rev.index,data=ign_rev)
ax.set(xlabel='Total Reviews on IGN', ylabel='Genre')
plt.show()
This bar plot shows each genre's individual ranking much more clearly.
Based on these plots, the three most-reviewed genres (by IGN) appear to be Action, Sports, and Shooter.
Let's continue exploring this data by plotting other potentially important features, such as sales.
NA_Sales = raw_data.groupby('Genre').sum()['NA_Sales'].copy().sort_values()[::-1]
display(NA_Sales)
NA_Sales.plot(kind='pie',shadow=True,autopct='%1.1f%%',explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.18,0.15])
plt.title('Distribution of Sales in North America')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
NA_Sales = NA_Sales.to_frame()
ax = sns.stripplot(x=NA_Sales.index,y='NA_Sales',data=NA_Sales,size=15)
ax.set(xlabel='Genre', ylabel='North American Sales (million)')
plt.show()
ax = sns.barplot(x='NA_Sales',y=NA_Sales.index,data=NA_Sales)
ax.set(xlabel='North American Sales (million)', ylabel='Genre')
plt.show()
Let's take a look at how these genres performed globally.
world_sales = vgsales.groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
display(world_sales)
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = world_sales.to_frame().reset_index()
ax = sns.barplot(x='Global_Sales',y='Genre',data=world_sales)
ax.set(xlabel='Worldwide Sales (million)', ylabel='Genre')
plt.show()
In addition to being the three most-reviewed genres, Action, Sports, and Shooter games also appear to outsell the other genres worldwide.
Is it possible that sales and scores are related? Or are they independent of each other?
We can create a visual guide to this question by plotting the relationship between global video game sales and review scores.
We can model this relationship with a line of best fit, obtained via linear regression (http://www.statsmodels.org/stable/regression.html).
s = raw_data.sample(6000)
ax = sns.stripplot(x='score',y='Global_Sales',hue='score_phrase',data=s, alpha=.5, size=5.5)
sns.regplot(x='score',y='Global_Sales',data=s,scatter=False)
ax.set(xlabel='IGN Review Score (0 - 10)', ylabel='Global Sales (Millions)')
plt.show()
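Note that regplot fits this trend line internally; if you want its intercept and slope explicitly, a minimal sketch using the same statsmodels formula API (fit on the sample s drawn above) could look like this:
# Fit the plotted trend line explicitly to inspect its coefficients
line = sm.ols(formula='Global_Sales ~ score', data=s).fit()
print(line.params)  # intercept and slope of the line of best fit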
Interesting! The score distribution is skewed left, with much of the data residing toward the right-hand (higher) end of the scale. The fitted trend line has a positive slope, implying a positive linear relationship between review scores and global sales.
Our earlier observations suggested that a genre's sales tend to do better when that genre receives more reviews. However, each of these games carries a rating from the ESRB that effectively sets a lower bound on the age of the buyer/consumer. This rating may well affect sales (and possibly reviews), since retailers generally won't sell Rated-M games to children and teens below the age threshold.
How well do these genres sell when instead considering ESRB ratings?
Let's take a look at how genres performed in each ESRB rating category.
# Games grouped by their rating (E, E10+, T, M, AO, RP)
ratings = list(raw_data.Rating.unique())
world_sales = raw_data[raw_data['Rating'] == ratings[0]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+str(ratings[0])+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = raw_data[raw_data['Rating'] == ratings[1]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.1,0.12,0.15,0.18,0.18,0.18,0.18,0.15,0.15,0.12,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[1]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = raw_data[raw_data['Rating'] == ratings[2]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[2]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = raw_data[raw_data['Rating'] == ratings[3]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.12,0.15,0.18,0.15,0.15,0.15,0.15,0.15,0.15,0.15,0.15])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[3]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = raw_data[raw_data['Rating'] == ratings[4]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.12,0.15,0.18,0.15,0.15,0.15,0.15,0.15,0.15,0.15])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[4]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
world_sales = raw_data[raw_data['Rating'] == ratings[5]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
world_sales.plot(kind='pie',autopct='%1.1f%%',explode=[0])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[5]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
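As an aside: the six cells above differ only in the rating they select, so a loop-based sketch like the one below (using a uniform explode offset rather than the hand-tuned values above) could generate the same set of charts more compactly:
# Illustrative alternative: one pie chart per ESRB rating in a loop
for rating in ratings:
    sales = (raw_data[raw_data['Rating'] == rating]
             .groupby('Genre')['Global_Sales'].sum()
             .sort_values(ascending=False))
    if sales.empty:
        continue  # skip ratings with no matching games
    sales.plot(kind='pie', autopct='%1.1f%%', shadow=True,
               explode=[0.1] * len(sales))
    plt.title('Distribution of Global Video Game Sales (Rated ' + rating + ')')
    plt.gcf().set_size_inches(9, 9)
    plt.show()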
These charts show us that, unlike the consistent positive trend between sales and review scores, the mix of genres that sells well differs noticeably from one ESRB rating category to another.
ax = sns.swarmplot(x='Rating',y='Global_Sales',hue='Genre',data=raw_data.sample(1000),size=6, order=['E','E10+','T','M','RP'])
ax.set(xlabel='ESRB Rating', ylabel='Global Sales (Millions)')
plt.show()
A swarmplot was used here because we have a categorical x-axis and many y-axis entries; this kind of plot keeps points from overlapping too much (as they naturally would in an ordinary scatter plot). Since the individual data points remain visible, we color them by genre to assist our analysis.
This plot shows the rough distribution of game genres across their respective ESRB ratings, along with each game's global sales. We can clearly see the Sports/Racing/Platform games that are rated E. As the ESRB rating (and with it the minimum buyer age) increases, we can see a significant shift in the genres players prefer to buy.
The next factorplot corroborates the swarmplot above. Each ESRB rating gets its own panel, making it much easier to analyze each individual rating's genres.
sns.factorplot(x="Global_Sales", y="Genre",col="Rating",data=raw_data[raw_data['Rating'].isin(['E','E10+','T','M'])], kind="bar",size=10);
plt.show()
After analyzing a dataset, a natural next step is to test whether our findings generalize, using them to predict outcomes for future data that fits the same model. To accomplish this, we can create a linear regression model using the OLS function in statsmodels.api (http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html)
After creating this model, we'll be able to predict (with a reasonable amount of error) how video game sales change worldwide based solely on sales from North America (NA_Sales), Europe (EU_Sales), and Japan (JP_Sales). Furthermore, we'd like to determine whether IGN's scores are statistically significant in the context of our model. The steps to build a simple linear regression model for Mature (Rated M) Action games are shown below:
data = raw_data[raw_data['Rating'] == 'M']
data = data[data['Genre'] == 'Action']
data.head()
result = sm.ols(formula="Global_Sales ~ score + NA_Sales + EU_Sales + JP_Sales", data=data).fit()
result.summary()
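With the model fitted, predictions come straight from the results object. As a quick illustrative check, we can compare the model's fitted values to the observed Global_Sales on the same data used to fit it:
# Compare the model's fitted predictions to the observed values
comparison = pd.DataFrame({'observed': data['Global_Sales'],
                           'predicted': result.predict(data)})
comparison.head()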
Based on the OLS model we created above, you can see the relationship between score, NA_Sales, EU_Sales, JP_Sales, and Global_Sales. Determining whether the model fits our data well starts with the "R-squared" value being somewhat close to 1 (this is not the only way to assess a model's fit, so a low R-squared does not necessarily mean something is wrong).
In our case, the R-squared value is high, implying that our model explains most of the variance in Global_Sales. This means we can reject the null hypothesis of no correlation; there is clearly a correlation between score, NA_Sales, EU_Sales, JP_Sales, and Global_Sales.
As we examine the rest of the OLS Regression Results, we'll look at the coef and P>|t|, also known as the "coefficient" and the "p-value", respectively.
The sign (+/-) of each coefficient tells you whether the predictor's correlation with global sales is positive or negative.
To see whether the overall result is statistically significant, we must also examine the "F-statistic".
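These quantities can also be read directly off the fitted results object; for example:
# Extract coefficients, p-values, and the overall F-test p-value
print(result.params)     # coefficient (coef) for each predictor
print(result.pvalues)    # p-value (P>|t|) for each predictor
print(result.f_pvalue)   # p-value of the model's F-statistic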
In general, increased Action video game sales in the individual regions (North America, Europe, Japan) boost overall worldwide sales for the Action genre.
Game sales for teens and younger (ESRB ratings E through T) are influenced more significantly by other genres; sales of ESRB-rated M games rely much more heavily on Action titles.
The score feature appears to have only a small effect on global sales. Since its p-value is greater than 0.05, the added score feature was not statistically significant in our model. Thus, you may conclude that score has minimal or no significant effect on global sales.
In conclusion, we can assume that for ESRB-rated M Action games, most global sales can be attributed to just three regions: North America, Europe, and Japan. We can also conclude that, although incorporating score into our model increased its R-squared value, its p-value shows it carries much less statistical significance in our model than our three main predictors.
By calculating these statistics for the regression model, we get a better understanding of the quality of the model's fit. Previously, our plotted regression model was just that: the line of best fit. However, much more happens "behind the scenes" that makes analysis and prediction far more nuanced than simply plugging in data points and getting a perfect fit every time. Thus, an important part of the Data Science pipeline involves finding out which techniques work best with your dataset.
Thanks for reading!