Business Task: To analyze the trends in the video game industry from 1977-2020 by investigating the factors that impact unit sales such as platform, release year, user rating, critic rating, and combined rating (average user + critic).
This analysis explores the video game industry and its sales trends. The industry has consistently grown, peaking in 2008 with sales of 1954.78 million, but has since seen a decline. The impact of release date on game sales is significant, with games released closer to platform release date generally performing better. User and critic ratings have an impact on sales, with critic ratings having a greater influence. The top selling game of all time is Wii Sports with 82.9 million units sold.
For this project, I started with a public Kaggle data set, created formatted SQL queries in BigQuery, and then analyzed the data creating graphs in both R and Python. With additional resources, I would further explore other data sets to determine the factors contributing to the decline in sales from 2014-2019. In addition, this data set lacks data on mobile game sales. This is a large sector of the market and the limitations of the data also skew sales before mobile gaming came to prevalence.
summary(data)
## game platform games_sold year
## Length:41062 Length:41062 Min. : 0.0100 Min. :1977
## Class :character Class :character 1st Qu.: 0.0700 1st Qu.:2005
## Mode :character Mode :character Median : 0.2100 Median :2009
## Mean : 0.6189 Mean :2009
## 3rd Qu.: 0.5800 3rd Qu.:2013
## Max. :82.9000 Max. :2020
##
## critic_score user_score combined_score
## Min. : 0.8 Min. : 1.00 Min. : 2.30
## 1st Qu.: 6.3 1st Qu.: 6.10 1st Qu.: 6.55
## Median : 7.4 Median : 7.10 Median : 7.30
## Mean : 7.1 Mean : 6.86 Mean : 7.19
## 3rd Qu.: 8.2 3rd Qu.: 7.90 3rd Qu.: 8.00
## Max. :10.0 Max. :10.00 Max. :10.00
## NA's :17272 NA's :35743 NA's :35766
The first part of my project was to run a SQL query to join the two
tables game_sales.csv
and review.csv
. This
join was crucial to (1) combining the sales data with ratings data and
(2) creating a combined_score
column which averaged the two
review scores. For this analysis, I only included games with both user
and critic scores. This does add a data limitation based on the number
of null values in the user_score
column. With some
additional resources, getting a data set with additional review
information would increase the validity of this analysis.
SELECT
a.game,
a.platform,
a.games_sold,
a.year,
b.critic_score,
b.user_score,
( b.critic_score + b.user_score ) / 2 AS combined_score
FROM `portfolioproject-384218.game_sales_and_reviews.game_sales` a
LEFT JOIN
`portfolioproject-384218.game_sales_and_reviews.reviews` b
ON a.game = b.game
GROUP BY
a.game,
a.platform,
a.games_sold,
a.year,
b.critic_score,
b.user_score
Additionally, I put together the top 10 game by year for the Python project later in this portfolio. THe objective was to output long data I could use to create the animated “race” between top games by year.
WITH rank_cte AS
(
SELECT
game,
games_sold,
year,
ROW_NUMBER() OVER(
PARTITION BY year
ORDER BY games_sold DESC ) AS sold_rank
FROM `portfolioproject-384218.game_sales_and_reviews.game_sales`
)
SELECT
game,
games_sold,
year
FROM
rank_cte
WHERE
sold_rank <= 10
ORDER BY
year,
sold_rank
Furthermore, I did transform the ranked sales data in Excel. This
data transformation used VLOOKUP
to match game title by
year. I followed the below formula to transform the long data above into
the format required for the Sjvisualizer:
=IF(VLOOKUP($A2,bquxjob_7dc90f6_187f89c35f8!$A:$C,3,FALSE)<=Sheet1!B$1,VLOOKUP($A2,bquxjob_7dc90f6_187f89c35f8!$A:$C,2,FALSE),0)
Finally, I used a TRANSPOSE
function to switch the years
and game titles columns and rows to prepare for the Python code.
The video game industry has seen growth over this data set peaking with the highest number of sales in 2008 at 1954.78 million in sales. There appears to be a decline starting at 2014, however, since this data only includes release date of titles and unit sold per title, there is some data missing. The longer a game has been out, the more time there is have to sell copies. Additional data would need to be explored to determine how large of a factor this is to the decline from 2014-2019. Additionally, this data set lacks data on mobile game sales. This is a large sector of the market and the limitations of the data also skew sales before mobile gaming came to prevalence.
## Mapping year and game units sold
ggplot(data = data, aes(x = year, y = games_sold)) +
geom_col(data = subset(data, year != 2008), fill = "gray") +
geom_col(data = subset(data, year == 2008), fill = "blue") +
labs(title = "Video Game Units Sold by Year", subtitle = "1977-2020, in Millions",
caption = "Data Source: https://bit.ly/43JPCBB",
x = "Year", y = "Games Sold") +
annotate("text", x = 2015, y = 1800,
label = paste(sum(data$games_sold[data$year == 2008]), "million (2008)"),
vjust = -1.5, color = "blue", fontface = "bold") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
The first relationship I was interested in digging into more was the ratings vs. sales. I created 3 graphs that show units sold compared to user rating, critic rating, and combined rating.
The correlation between user rating and games sold appears to be the weakest of the 3.
## Mapping user rating vs units sold
ggplot( data = data ) + geom_jitter( mapping=aes( x=user_score, y=games_sold ) )+
geom_smooth( mapping=aes( x=user_score, y=games_sold ) )+
xlim(0, 10) + ylim(0, 20)+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
labs( title="Video Game Units Sold vs. User Rating", subtitle="1977-2020, in Millions",
caption="Data Source: https://bit.ly/43JPCBB",
x = "User Rating", y = "Games Sold")
The critic rating has a higher impact on units sold than user rating, showing higher sales at higher ratings (especially in the 9-10 range).
Finally, the combined user and critic rating has the strongest correlation between score and units sold.
Finally, I wanted to animate where titles landed comparing user rating to critic rating over time to see if titles skewed one way or another. Here, you can see a skew towards higher user ratings.
## Gganimate scatterplot
ggplot( data=data , aes( critic_score, user_score, size = games_sold, color = platform)) +
geom_point() +
scale_x_log10() +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none") +
labs(title = 'Year: {frame_time}', x = 'User Rating', y = 'Critic Rating') +
transition_time(year) +
ease_aes('linear')
I also created a density plot of data from all years. In this plot, you can see that a rating around 7.5 was the most common for both scores.
## Density plot of user rating v critic rating
ggplot(data=data, aes(user_score, critic_score)) + geom_density_2d_filled(show.legend = FALSE, bins=155) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
labs( title="Video Games User Rating vs. Critic Rating", subtitle="1977-2020",
caption="Data Source: https://bit.ly/43JPCBB",
x = "User Rating", y = "Critic Rating" )+
coord_cartesian(expand = FALSE)+
scale_colour_brewer(palette="Blues")
Finally, I made histograms of each of the rating scores to show the distributions.
ggplot(data=data, aes(x=critic_score)) + geom_histogram() +
labs( title="Critic Rating Histogram", subtitle="1977-2020",
caption="Data Source: https://bit.ly/43JPCBB",
x = "Critic Rating", y = "Count" )
Next, I wanted to look at the sales by platform and year. In these trend lines, you can see the increase in sales within the first year of platform release followed by a steady decline over time. Games released in the first year of their respective platform typically perform well. The exception of this is PC, since there is not one specific release date. In addition, this data set lacks mobile game sales data. I would like to see the visualization of that.
## Mapping platform units to year
ggplot(data, aes(x = year, y = games_sold)) +
geom_line() +
theme(panel.grid.major = element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
panel.grid.minor = element_blank()) +
labs( title="Video Game Sales by Platform", subtitle="1977-2020, in Millions",
caption="Data Source: https://bit.ly/43JPCBB",
x = "Year", y = "Games Sold" ) +
facet_wrap(~platform, ncol = 6)