The Data Science Pipeline: Exploring Video Game Genres Over the Past 10 Years

1. Introduction

Hello and welcome!

This is a walkthrough of the Data Science Pipeline, using Python as the language of choice. Before we begin, let's first talk about our dependencies:

This tutorial was created on a system running macOS.

Depending on your operating system, you may already have a version of Python installed.

However, any operating system should work. For more information, visit https://www.python.org/downloads/operating-systems/

Jupyter & IPython

If you don't already know, this GitHub page is actually a Jupyter (IPython) notebook, which allows you to document both the input and output of code alongside boxes of text (known as "Markdown cells" - http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html#).

These cells support text styling via a lightweight and easy-to-use markup language (https://daringfireball.net/projects/markdown/syntax).

Although not technically required, using an application that supports interactive literate programming makes the entire pipeline process far easier to document, understand, and benefit from; using one is highly recommended (especially for beginners and newcomers).

You may download Jupyter through Anaconda: http://jupyter.readthedocs.io/en/latest/install.html

Python and IPython Versions

I am using Python version 3.6.2 and IPython version 6.1.0. You may verify your versions via

Help ---> About

Data

Once you have your environment set up, you can retrieve the data. For this walkthrough we will use:

IGN's game review data for the last 20 years (up to 2016), found here: https://www.kaggle.com/egrinstein/20-years-of-games/

Video games with sales greater than 100,000 (from vgchartz.com), found here: https://www.kaggle.com/gregorut/videogamesales/

2. Getting Started - Looking at our data

We start by importing the necessary packages, reading the data in, and parsing it into a dataframe. Now is a good time to perform any other sanitization-related tasks before we begin to form perceptions and make assumptions about the data set. For example, we will search for missing entries (usually recorded as 'nan') and impute values (if necessary).

In [3]:
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm


ign = pd.read_csv('/home/jovyan/notebooks/Final Project/dkanney.github.io/data/ign.csv')
vgsales = pd.read_csv('/home/jovyan/notebooks/Final Project/dkanney.github.io/data/vgsales_ratings.csv')


# Restrict to games made from 2006-2016 inclusive
ign = ign[ign['release_year'] > 2005]


# Join the two dataframes on the video game title/Name; the inner join drops any game that doesn't appear in both datasets.
raw_data = pd.merge(ign, vgsales, left_on='title', right_on='Name', how='inner')


# Using release year, release month, and release day columns to create a new column for release date as a datetime object
for i, row in raw_data.iterrows():
    raw_data.loc[i, 'release_date'] = pd.to_datetime(str(row['release_year']) + '-' + str(row['release_month']) + '-' + str(row['release_day']))

    
# Getting rid of unused/duplicate columns
del raw_data['Year_of_Release']
del raw_data['release_month']
del raw_data['release_year']
del raw_data['Critic_Count']
del raw_data['Critic_Score']
del raw_data['release_day']
del raw_data['Unnamed: 0']
del raw_data['User_Count']
del raw_data['User_Score']
del raw_data['Developer']
del raw_data['genre']
del raw_data['Name']
del raw_data['url']


# Get a list of all columns that aren't numerical/quantitative/datetime, and strip them of whitespace
for col in list(set(raw_data.columns.ravel()) - set(raw_data.describe().columns.ravel()) - set(['release_date'])):
    raw_data[col] = raw_data[col].str.strip()


display('raw_data: ', raw_data.sample(20), '# of rows: ', len(raw_data))
'raw_data: '
score_phrase title platform score editors_choice Platform Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Rating release_date
611 Awful Final Fight: Streetwise PlayStation 2 3.6 N PS2 Action Capcom 0.03 0.02 0.00 0.01 0.05 M 2006-02-28
11297 Great LEGO Harry Potter: Years 5-7 PC 8.0 N PSP Action Warner Bros. Interactive Entertainment 0.10 0.10 0.00 0.06 0.26 E10+ 2011-11-18
7837 Good G-Force Wii 7.3 N DS Action Disney Interactive Studios 0.26 0.21 0.00 0.05 0.53 E10+ 2009-07-21
13543 Great TrackMania Turbo PlayStation 4 8.4 N XOne Action Ubisoft 0.02 0.04 0.00 0.00 0.06 E 2016-03-24
1961 Mediocre Avatar: The Last Airbender PlayStation 2 5.1 N DS Adventure THQ 0.30 0.01 0.00 0.03 0.33 E 2006-10-16
214 Great Machinarium iPad 8.5 N PC Adventure Daedalic 0.00 0.09 0.00 0.02 0.11 NaN 2011-09-14
6818 Bad Six Flags Fun Park Wii 4.5 N Wii Misc Ubisoft 0.27 0.09 0.00 0.04 0.40 E10+ 2009-04-14
4669 Painful The History Channel: Battle for the Pacific Wii 2.5 N X360 Shooter Activision 0.05 0.00 0.00 0.00 0.06 T 2008-01-03
9299 Great Green Day: Rock Band Xbox 360 8.3 N X360 Misc MTV Games 0.24 0.13 0.00 0.04 0.41 T 2010-06-08
3085 Good Castlevania NES 7.5 N NES Platform Konami Digital Entertainment 0.54 0.06 0.62 0.01 1.23 NaN 2007-04-30
6841 Bad Secret Service Xbox 360 4.5 N X360 Action Activision Value 0.04 0.00 0.00 0.00 0.05 T 2008-12-08
3504 Good Madden NFL 08 PlayStation 2 7.9 N PS3 Sports Electronic Arts 0.89 0.00 0.00 0.08 0.97 E 2007-08-14
2834 Good Virtua Tennis 3 PlayStation 3 7.3 N PS3 Sports Sega 0.23 0.58 0.03 0.21 1.05 E 2007-03-21
7666 Great Red Faction: Guerrilla Xbox 360 8.0 N PS3 Shooter THQ 0.33 0.27 0.02 0.12 0.74 M 2009-05-28
7087 Great Street Fighter IV PC 8.9 Y PC Fighting Capcom 0.07 0.05 0.00 0.02 0.13 T 2009-07-15
11931 Mediocre Remember Me Xbox 360 5.9 N PC Action Capcom 0.00 0.05 0.00 0.00 0.06 M 2013-06-02
10529 Okay Bust-A-Move Universe Nintendo 3DS 6.0 N 3DS Puzzle Square Enix 0.08 0.15 0.06 0.03 0.31 E 2011-04-01
11155 Amazing Rayman Origins Wii 9.5 Y PC Platform Ubisoft 0.00 0.06 0.00 0.01 0.08 E10+ 2011-11-09
6335 Great Tom Clancy's EndWar PlayStation 3 8.0 N X360 Strategy Ubisoft 0.58 0.24 0.01 0.09 0.91 T 2008-11-04
5530 Good Guitar Hero: Aerosmith PlayStation 3 7.6 N X360 Misc Activision 1.00 0.13 0.00 0.10 1.23 T 2008-06-29
'# of rows: '
13646
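As an aside, the row-by-row loop used above to build release_date works, but iterrows is slow on large frames; pd.to_datetime can build the whole column in one vectorized call. A minimal sketch on toy data (not the actual IGN file):

```python
import pandas as pd

# Toy frame with the same three date-part columns used above
df = pd.DataFrame({'release_year':  [2006, 2016],
                   'release_month': [2, 3],
                   'release_day':   [28, 24]})

# pd.to_datetime accepts a frame whose columns are named year/month/day,
# so renaming lets us construct the whole column in one vectorized call
df['release_date'] = pd.to_datetime(
    df[['release_year', 'release_month', 'release_day']]
      .rename(columns={'release_year': 'year',
                       'release_month': 'month',
                       'release_day': 'day'}))

print(df['release_date'].tolist())
```

On a frame with thousands of rows this runs orders of magnitude faster than assigning inside an iterrows loop.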

Missing Data

Let's create a function in order to assist us with finding missing (nan) values in our dataframe:

In [4]:
# Print the number of NaN values in each column of the dataframe,
# then return the rows that contain at least one NaN.
def get_nan(raw_data):  
    print(raw_data.isnull().sum())
    
    return raw_data[raw_data.isnull().any(axis=1)]

Running the next cell returns the remaining records with 'NaN' entries

In [5]:
get_nan(raw_data)
print('\nUnique Ratings: ', raw_data['Rating'].unique())
score_phrase         0
title                0
platform             0
score                0
editors_choice       0
Platform             0
Genre                0
Publisher            7
NA_Sales             0
EU_Sales             0
JP_Sales             0
Other_Sales          0
Global_Sales         0
Rating            1441
release_date         0
dtype: int64

Unique Ratings:  ['E' 'E10+' 'T' nan 'M' 'AO']

We now see that many of these rows contain missing/'NaN' values under 'Rating'.

3. Tidying our data

Missing Ratings

According to our function, our dataset contains more than 1400 missing game ratings. Obviously, we can't simply impute each one with 100% accuracy without examining each individual row.
Therefore, I decided to replace the missing game ratings with a pending rating of "RP" based on this rating guide from the ESRB: https://www.esrb.org/ratings/ratings_guide.aspx

In [6]:
raw_data['Rating'].fillna('RP', inplace=True)
missing_set = []

for i, row in get_nan(raw_data).iterrows():
    missing_set.append(i)
    
missing_set
score_phrase      0
title             0
platform          0
score             0
editors_choice    0
Platform          0
Genre             0
Publisher         7
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
Rating            0
release_date      0
dtype: int64
Out[6]:
[2453, 2457, 9285, 10864, 10866, 11231, 13192]

Missing Publisher names

Let's start tidying the data by looking at the missing Publishers and filling in the missing data points (see specific entries above). Since there doesn't seem to be a clear pattern or reason for why this data is missing, we will assume for now that it is Missing Completely at Random (MCAR). Read more about the three types of missing data at

https://en.wikipedia.org/wiki/Missing_data#Techniques_of_dealing_with_missing_data

Because only a handful of Publisher names are missing, we can use this moment to learn a bit more about our data set while imputing the missing data.

In [7]:
# Publisher --> 'Sega'
# (SOURCE: https://en.wikipedia.org/wiki/Sonic_the_Hedgehog_(2006_video_game))
raw_data.loc[missing_set[0], 'Publisher'] = 'Sega'
raw_data.loc[missing_set[1], 'Publisher'] = 'Sega'


# Publisher --> 'Nintendo'
# (SOURCE: https://en.wikipedia.org/wiki/Mario_Tennis)
raw_data.loc[missing_set[2], 'Publisher'] = 'Nintendo'


# Publisher --> 'Wargaming'
# (SOURCE: https://en.wikipedia.org/wiki/World_of_Tanks)
raw_data.loc[missing_set[3], 'Publisher'] = 'Wargaming' # Initial release
raw_data.loc[missing_set[4], 'Publisher'] = 'Wargaming' # Console edition


# Publisher --> '7Sixty'
# (SOURCE: https://en.wikipedia.org/wiki/Stronghold_3)
raw_data.loc[missing_set[5], 'Publisher'] = '7Sixty'


# Publisher --> 'Gearbox'
# (SOURCE: http://store.steampowered.com/app/244160/Homeworld_Remastered_Collection/)
raw_data.loc[missing_set[6], 'Publisher'] = 'Gearbox'


# There appear to be publishers listed under similar names (e.g. 'Valve' and 'Valve Software').
# We can group these together to simplify analysis.
for i in raw_data[raw_data['Publisher'] == 'Valve Software'].index:
    raw_data.loc[i, 'Publisher'] = 'Valve'

Running the next cell verifies that there are no more remaining records with 'NaN' entries.

In [10]:
get_nan(raw_data);
score_phrase      0
title             0
platform          0
score             0
editors_choice    0
Platform          0
Genre             0
Publisher         0
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
Rating            0
release_date      0
dtype: int64

4. Exploratory Data Analysis - Plotting Our Data

Let's look at some visualizations of our data set. We can try several different types of plots in order to obtain a better understanding of the data.

The Most Reviewed Genres (IGN)

Pie chart

In [11]:
display(ign['genre'].value_counts()[:15])

ign['genre'].value_counts()[:15].plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.18,0.15])
plt.title('Genres IGN Reviewed the Most')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
Action               2283
Shooter              1105
Sports                911
Adventure             886
RPG                   668
Strategy              579
Puzzle                557
Racing                528
Platformer            509
Action, Adventure     443
Simulation            354
Music                 324
Fighting              312
Action, RPG           243
Puzzle, Action        170
Name: genre, dtype: int64

Although this pie chart is helpful for visualizing our data, it's certainly not the only way we could visualize this data. Let's try a horizontal bar plot.

Bar plot

In [12]:
ign_rev = ign['genre'].value_counts()[:15].to_frame()

ax = sns.barplot(x='genre',y=ign_rev.index,data=ign_rev)
ax.set(xlabel='Total Reviews on IGN', ylabel='Genre')
plt.show()

This bar plot shows each genre's individual ranking much more clearly.
Based on these plots, the three most-reviewed genres (by IGN) appear to be Action, Shooter, and Sports.
Let's continue exploring this data by plotting other potentially important features, such as sales.

The Genres with the Highest Sales in North America

Pie chart

In [13]:
NA_Sales = raw_data.groupby('Genre').sum()['NA_Sales'].copy().sort_values()[::-1]
display(NA_Sales)

NA_Sales.plot(kind='pie',shadow=True,autopct='%1.1f%%',explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.18,0.15])
plt.title('Distribution of Sales in North America')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
Genre
Action          1622.26
Sports          1057.78
Shooter         1048.23
Misc             476.00
Platform         389.72
Role-Playing     332.78
Racing           297.36
Fighting         208.28
Puzzle           168.30
Simulation       139.92
Adventure         87.36
Strategy          53.36
Name: NA_Sales, dtype: float64

Scatter plot

In [14]:
NA_Sales = NA_Sales.to_frame()
ax = sns.stripplot(x=NA_Sales.index,y='NA_Sales',data=NA_Sales,size=15)
ax.set(xlabel='Genre', ylabel='North American Sales (million)')
plt.show()

Bar plot

In [15]:
ax = sns.barplot(x='NA_Sales',y=NA_Sales.index,data=NA_Sales)
ax.set(xlabel='North American Sales (million)', ylabel='Genre')
plt.show()

Let's take a look at how these genres performed globally.

The Genres with the Highest Sales Worldwide

Pie chart

In [16]:
world_sales = vgsales.groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]
display(world_sales)

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
Genre
Action          1745.27
Sports          1332.00
Shooter         1052.94
Role-Playing     934.40
Platform         828.08
Misc             803.18
Racing           728.90
Fighting         447.48
Simulation       390.42
Puzzle           243.02
Adventure        237.69
Strategy         174.50
Name: Global_Sales, dtype: float64

Bar plot

In [17]:
world_sales = world_sales.to_frame().reset_index()

ax = sns.barplot(x='Global_Sales',y='Genre',data=world_sales)
ax.set(xlabel='Worldwide Sales (million)', ylabel='Genre')
plt.show()

In addition to being the top 3 most reviewed game genres, Action, Sports, and Shooter games seem to sell more than the other genres as well (Worldwide).

Is it possible that sales and scores are related? Or are they independent of each other?

We can visually explore this question by plotting the relationship between global video game sales and review scores.

We can model this relationship with a line of best fit (obtained via linear regression http://www.statsmodels.org/stable/regression.html).
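For intuition about what that line of best fit is doing, here is a tiny self-contained sketch (synthetic numbers, not our game data) that recovers a known slope and intercept via numpy.polyfit, which performs the same least-squares fit for a degree-1 polynomial:

```python
import numpy as np

# Synthetic data lying exactly on the line sales = 0.5 * score + 1.0
scores = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
sales = 0.5 * scores + 1.0

# A degree-1 polynomial fit is an ordinary least-squares line of best fit
slope, intercept = np.polyfit(scores, sales, 1)
print(slope, intercept)
```

On noiseless data the fit recovers the generating line exactly; with real, noisy data the same call returns the least-squares estimates of slope and intercept.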

Global Video Game Sales-Scores Relationship

In [18]:
s = raw_data.sample(6000)

ax = sns.stripplot(x='score',y='Global_Sales',hue='score_phrase',data=s, alpha=.5, size=5.5)
sns.regplot(x='score',y='Global_Sales',data=s,scatter=False)
ax.set(xlabel='IGN Review Score (0 - 10)', ylabel='Global Sales (Millions)')

plt.show()

Interesting! The scores in this plot are skewed left, with much of the data residing toward the right-hand (high-score) side of the axis. The fitted trend line increases, implying a positive linear relationship between review score and global sales.

Our earlier observations suggested that a genre's sales tend to be higher when the genre has more reviews. However, each of these games also carries an ESRB rating that sets a lower bound on the age of the buyer/consumer. This rating may well affect sales (and possibly reviews), since children and teens cannot purchase Rated-M games until they reach the set age threshold.

How well do these genres sell when instead considering ESRB ratings?

Let's take a look at how genres performed in each ESRB rating category.

Global Genre Sales by ESRB Rating

In [39]:
# Games grouped by their rating (E, E10+, T, M, AO, RP)
ratings = list(raw_data.Rating.unique())
raw_data.groupby('Rating').groups;

Pie chart

In [21]:
world_sales = raw_data[raw_data['Rating'] == ratings[0]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+str(ratings[0])+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
In [22]:
world_sales = raw_data[raw_data['Rating'] == ratings[1]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.1,0.12,0.15,0.18,0.18,0.18,0.18,0.15,0.15,0.12,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[1]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
In [23]:
world_sales = raw_data[raw_data['Rating'] == ratings[2]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.15,0.12,0.15,0.18,0.2,0.2,0.2,0.2,0.2,0.2,0.25,0.3])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[2]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
In [24]:
world_sales = raw_data[raw_data['Rating'] == ratings[3]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.12,0.15,0.18,0.15,0.15,0.15,0.15,0.15,0.15,0.15,0.15])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[3]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
In [25]:
world_sales = raw_data[raw_data['Rating'] == ratings[4]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0.1,0.12,0.15,0.18,0.15,0.15,0.15,0.15,0.15,0.15,0.15])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[4]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()
In [37]:
world_sales = raw_data[raw_data['Rating'] == ratings[5]].groupby('Genre').sum()['Global_Sales'].copy().sort_values()[::-1]

world_sales.plot(kind='pie',autopct='%1.1f%%',explode=[0])
plt.title('Distribution of Global Video Game Sales (Rated '+ratings[5]+')')
fig = plt.gcf().set_size_inches(9,9)
plt.show()

These charts show us that, unlike the consistent positive trend between sales and review scores, genre sales vary considerably across ESRB rating categories.

Global Video Game Sales by ESRB Rating

In [27]:
ax = sns.swarmplot(x='Rating',y='Global_Sales',hue='Genre',data=raw_data.sample(1000),size=6, order=['E','E10+','T','M','RP'])
ax.set(xlabel='ESRB Rating', ylabel='Global Sales (Millions)')
plt.show()

A swarmplot is used here because we have a categorical x-axis and many y-axis entries; this kind of plot keeps points from overlapping too much (as they naturally would in an ordinary scatter plot). Since the datapoints are visible individually, colors are assigned by genre to assist our analysis.

This plot shows a rough distribution of game genres among their respective ESRB ratings, as well as the game's global sales. We can clearly see the Sports/Racing/Platform games that are rated-E. As the age prerequisite required to purchase and play these games increases (ESRB rating), we can see a significant shift in the type of game (genre) players prefer to buy.

The next factorplot corroborates our swarmplot. Each ESRB rating gets its own panel, making it much clearer to analyze each individual rating's genres.

Global Video Game Sales by Genre

In [35]:
sns.factorplot(x="Global_Sales", y="Genre",col="Rating",data=raw_data[raw_data['Rating'].isin(['E','E10+','T','M'])], kind="bar",size=10);
plt.show()

5. Making Predictions

After analyzing a dataset, a natural next step is to test the results of our findings by using them to predict future observations that fit the same model. To accomplish this, we can create a linear regression model using the OLS function in statsmodels.api (http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html)

After creating this model, we'll be able to predict (with a reasonable amount of error) how video game sales change worldwide based solely on sales from North America (NA_Sales), Europe (EU_Sales), and Japan (JP_Sales). Furthermore, we'd like to determine if IGN's scores are statistically significant in the context of our model. The steps to make a simple linear regression model for Mature (Rated M) Action games are shown below:

In [117]:
data = raw_data[raw_data['Rating'] == 'M']
data = data[data['Genre'] == 'Action']
data.head()
Out[117]:
score_phrase title platform score editors_choice Platform Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Rating release_date
12 Mediocre Way of the Samurai 4 PlayStation 3 5.5 N PS3 Action Nippon Ichi Software 0.00 0.03 0.16 0.01 0.19 M 2012-09-03
94 Good Darksiders II PC 7.5 N PS3 Action THQ 0.35 0.35 0.01 0.14 0.84 M 2012-08-17
95 Good Darksiders II PC 7.5 N X360 Action THQ 0.45 0.27 0.00 0.07 0.79 M 2012-08-17
97 Good Darksiders II PC 7.5 N PC Action Nordic Games 0.03 0.10 0.00 0.02 0.15 M 2012-08-17
98 Good Darksiders II PC 7.5 N WiiU Action THQ 0.07 0.07 0.00 0.01 0.15 M 2012-08-17

OLS and Linear Regression Analysis

In [122]:
result = sm.ols(formula="Global_Sales ~ score + NA_Sales + EU_Sales + JP_Sales",  data=data).fit()
result.summary()
Out[122]:
OLS Regression Results
Dep. Variable: Global_Sales R-squared: 0.940
Model: OLS Adj. R-squared: 0.940
Method: Least Squares F-statistic: 4653.
Date: Sat, 16 Dec 2017 Prob (F-statistic): 0.00
Time: 07:13:54 Log-Likelihood: -896.94
No. Observations: 889 AIC: 1802.
Df Residuals: 885 BIC: 1821.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -0.3797 0.114 -3.329 0.001 -0.603 -0.156
score 0.0485 0.015 3.181 0.002 0.019 0.078
NA_Sales 1.8838 0.021 90.830 0.000 1.843 1.925
JP_Sales 3.0927 0.160 19.337 0.000 2.779 3.407
Omnibus: 556.165 Durbin-Watson: 2.292
Prob(Omnibus): 0.000 Jarque-Bera (JB): 16384.180
Skew: 2.338 Prob(JB): 0.00
Kurtosis: 23.505 Cond. No. 56.3

Hypothesis Testing

Based on the OLS model we created above, you can see the relationship between score, NA_Sales, EU_Sales, JP_Sales, and Global_Sales. Determining whether the model fits our data well starts with checking that the "R-squared" value is reasonably close to 1 (though this is not the only measure of fit, so a low R-squared does not necessarily mean something is wrong).

In our case, the R-squared value is high -- implying that our model explains most of the variance in Global_Sales.

This means we can reject the null hypothesis of no correlation: there is clearly a correlation between score, NA_Sales, EU_Sales, JP_Sales, and Global_Sales.
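To make the R-squared and p-value machinery concrete, here is a self-contained toy fit on synthetic data (not our game dataset); it uses the same formula API we imported as sm earlier, aliased smf here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a strong linear signal plus a little noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=200)})
toy['y'] = 2.0 * toy['x'] + rng.normal(scale=0.1, size=200)

fit = smf.ols('y ~ x', data=toy).fit()
print(round(fit.rsquared, 3))   # near 1: x explains almost all the variance
print(fit.pvalues['x'] < 0.05)  # the slope on x is statistically significant
```

Because the noise is small relative to the signal, the R-squared comes out near 1 and the p-value on x is far below 0.05; shrinking the signal or growing the noise pushes both in the other direction.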



As we examine the rest of the OLS Regression Results, we'll look at the coef and P>|t|, also known as the "coefficient" and the "p-value", respectively.

You can determine the correlation(positive/negative) on global sales by the parity of the coefficient term (+/-).

In order to see if the result is statistically significant, we must also examine the "F-statistic".

  • In general, increased Action video game sales in individual regions (North America, Europe, Japan) boost overall worldwide sales for the Action genre.

  • Game sales for teens and younger (ESRB ratings E to T) are influenced more significantly by other genres; sales of ESRB-Rated M games rely much more heavily on Action titles.

  • The score feature appears to have only a small effect on global sales. Its p-value (0.002) is below 0.05, so it is statistically significant in our model, but its coefficient (0.0485) is tiny compared to those of the regional sales predictors. Thus, you may conclude that score has minimal practical effect on global sales.

  • In conclusion, we can assume that for ESRB-Rated M Action games, most of the global sales can be attributed to just three regions: North America, Europe, and Japan. We can also conclude that, although incorporating score into our model increased its R-squared value, its small coefficient shows it carries far less weight in our model than our main 3 predictors.

6. Conclusion

By calculating these statistics of the regression model, we get a better understanding of the quality of the model's fit. Previously, our plotted regression model was simply that -- the line of best fit. However, much more occurs "behind the scenes," making analysis and prediction far more nuanced than plugging in data points and getting a perfect fit every time. Thus, an important part of the Data Science pipeline involves finding out which techniques work best with your dataset.

Thanks for reading!