Summary of Findings

Business Task: To analyze the trends in the video game industry from 1977-2020 by investigating the factors that impact unit sales such as platform, release year, user rating, critic rating, and combined rating (average user + critic).

This analysis explores the video game industry and its sales trends. The industry has consistently grown, peaking in 2008 with sales of 1954.78 million, but has since seen a decline. The impact of release date on game sales is significant, with games released closer to platform release date generally performing better. User and critic ratings have an impact on sales, with critic ratings having a greater influence. The top selling game of all time is Wii Sports with 82.9 million units sold.

For this project, I started with a public Kaggle data set, created formatted SQL queries in BigQuery, and then analyzed the data creating graphs in both R and Python. With additional resources, I would further explore other data sets to determine the factors contributing to the decline in sales from 2014-2019. In addition, this data set lacks data on mobile game sales. This is a large sector of the market and the limitations of the data also skew sales before mobile gaming came to prevalence.

summary(data)
##      game             platform           games_sold           year     
##  Length:41062       Length:41062       Min.   : 0.0100   Min.   :1977  
##  Class :character   Class :character   1st Qu.: 0.0700   1st Qu.:2005  
##  Mode  :character   Mode  :character   Median : 0.2100   Median :2009  
##                                        Mean   : 0.6189   Mean   :2009  
##                                        3rd Qu.: 0.5800   3rd Qu.:2013  
##                                        Max.   :82.9000   Max.   :2020  
##                                                                        
##   critic_score     user_score    combined_score 
##  Min.   : 0.8    Min.   : 1.00   Min.   : 2.30  
##  1st Qu.: 6.3    1st Qu.: 6.10   1st Qu.: 6.55  
##  Median : 7.4    Median : 7.10   Median : 7.30  
##  Mean   : 7.1    Mean   : 6.86   Mean   : 7.19  
##  3rd Qu.: 8.2    3rd Qu.: 7.90   3rd Qu.: 8.00  
##  Max.   :10.0    Max.   :10.00   Max.   :10.00  
##  NA's   :17272   NA's   :35743   NA's   :35766

Data Transformation

The first part of my project was to run a SQL query to join the two tables game_sales.csv and review.csv. This join was crucial to (1) combining the sales data with ratings data and (2) creating a combined_score column which averaged the two review scores. For this analysis, I only included games with both user and critic scores. This does add a data limitation based on the number of null values in the user_score column. With some additional resources, getting a data set with additional review information would increase the validity of this analysis.

SELECT
  a.game,
  a.platform,
  a.games_sold,
  a.year,
  b.critic_score,
  b.user_score,
  ( b.critic_score + b.user_score ) / 2 AS combined_score
FROM `portfolioproject-384218.game_sales_and_reviews.game_sales` a
LEFT JOIN
  `portfolioproject-384218.game_sales_and_reviews.reviews` b
ON a.game = b.game
GROUP BY
  a.game,
  a.platform,
  a.games_sold,
  a.year,
  b.critic_score,
  b.user_score

Additionally, I put together the top 10 game by year for the Python project later in this portfolio. THe objective was to output long data I could use to create the animated “race” between top games by year.

WITH rank_cte AS
  (
  SELECT
   game,
   games_sold,
   year,
   ROW_NUMBER() OVER(
     PARTITION BY year
     ORDER BY games_sold DESC ) AS sold_rank
  FROM `portfolioproject-384218.game_sales_and_reviews.game_sales`
  )

SELECT
  game,
  games_sold,
  year
FROM
  rank_cte
WHERE
  sold_rank <= 10
ORDER BY
  year,
  sold_rank

Furthermore, I did transform the ranked sales data in Excel. This data transformation used VLOOKUP to match game title by year. I followed the below formula to transform the long data above into the format required for the Sjvisualizer:

=IF(VLOOKUP($A2,bquxjob_7dc90f6_187f89c35f8!$A:$C,3,FALSE)<=Sheet1!B$1,VLOOKUP($A2,bquxjob_7dc90f6_187f89c35f8!$A:$C,2,FALSE),0)

Finally, I used a TRANSPOSE function to switch the years and game titles columns and rows to prepare for the Python code.

Plotting the Data

The video game industry has seen growth over this data set peaking with the highest number of sales in 2008 at 1954.78 million in sales. There appears to be a decline starting at 2014, however, since this data only includes release date of titles and unit sold per title, there is some data missing. The longer a game has been out, the more time there is have to sell copies. Additional data would need to be explored to determine how large of a factor this is to the decline from 2014-2019. Additionally, this data set lacks data on mobile game sales. This is a large sector of the market and the limitations of the data also skew sales before mobile gaming came to prevalence.

## Mapping year and game units sold
ggplot(data = data, aes(x = year, y = games_sold)) + 
  geom_col(data = subset(data, year != 2008), fill = "gray") +
  geom_col(data = subset(data, year == 2008), fill = "blue") +
  labs(title = "Video Game Units Sold by Year", subtitle = "1977-2020, in Millions", 
       caption = "Data Source: https://bit.ly/43JPCBB",
       x = "Year", y = "Games Sold") +
  annotate("text", x = 2015, y = 1800, 
           label = paste(sum(data$games_sold[data$year == 2008]), "million (2008)"),
           vjust = -1.5, color = "blue", fontface = "bold") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

The first relationship I was interested in digging into more was the ratings vs. sales. I created 3 graphs that show units sold compared to user rating, critic rating, and combined rating.

The correlation between user rating and games sold appears to be the weakest of the 3.

## Mapping user rating vs units sold
ggplot( data = data ) + geom_jitter( mapping=aes( x=user_score, y=games_sold ) )+
  geom_smooth( mapping=aes( x=user_score, y=games_sold ) )+
  xlim(0, 10) + ylim(0, 20)+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs( title="Video Game Units Sold vs. User Rating", subtitle="1977-2020, in Millions",
  caption="Data Source: https://bit.ly/43JPCBB",
  x = "User Rating", y = "Games Sold")

The critic rating has a higher impact on units sold than user rating, showing higher sales at higher ratings (especially in the 9-10 range).

Finally, the combined user and critic rating has the strongest correlation between score and units sold.

Finally, I wanted to animate where titles landed comparing user rating to critic rating over time to see if titles skewed one way or another. Here, you can see a skew towards higher user ratings.

## Gganimate scatterplot
ggplot( data=data , aes( critic_score, user_score, size = games_sold, color = platform)) +
  geom_point() +
  scale_x_log10() +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none") +
  labs(title = 'Year: {frame_time}', x = 'User Rating', y = 'Critic Rating') +
  transition_time(year) +
  ease_aes('linear')

I also created a density plot of data from all years. In this plot, you can see that a rating around 7.5 was the most common for both scores.

## Density plot of user rating v critic rating
ggplot(data=data, aes(user_score, critic_score)) + geom_density_2d_filled(show.legend = FALSE, bins=155) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs( title="Video Games User Rating vs. Critic Rating", subtitle="1977-2020",
        caption="Data Source: https://bit.ly/43JPCBB",
        x = "User Rating", y = "Critic Rating" )+
    coord_cartesian(expand = FALSE)+
    scale_colour_brewer(palette="Blues")

Finally, I made histograms of each of the rating scores to show the distributions.

ggplot(data=data, aes(x=critic_score)) + geom_histogram() +
    labs( title="Critic Rating Histogram", subtitle="1977-2020",
        caption="Data Source: https://bit.ly/43JPCBB",
        x = "Critic Rating", y = "Count" )

Next, I wanted to look at the sales by platform and year. In these trend lines, you can see the increase in sales within the first year of platform release followed by a steady decline over time. Games released in the first year of their respective platform typically perform well. The exception of this is PC, since there is not one specific release date. In addition, this data set lacks mobile game sales data. I would like to see the visualization of that.

## Mapping platform units to year
ggplot(data, aes(x = year, y = games_sold)) +
  geom_line() +
  theme(panel.grid.major = element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        panel.grid.minor = element_blank()) +
    labs( title="Video Game Sales by Platform", subtitle="1977-2020, in Millions",
        caption="Data Source: https://bit.ly/43JPCBB",
        x = "Year", y = "Games Sold" ) +
  facet_wrap(~platform, ncol = 6)