18. June 2019

Kaggle Movie data Part 1.1

What’s going to happen?

As a first real post, I want to show a short analysis I was conducting for a job interview. So, this might be interesting for those, currently applying for a job as a data analyst. I think it is a good topic to start off this blog. Although the interview was for a company, near to the construction industry, the data set I got, again, was about movies from a kaggle page. For future posts, I try to look out for data, around building and construction.

I will split this post into different pieces. For the first part, I will closely stick to the tasks I received for the Job-Interview. Let’s get started.

Task One

The first task was simply about getting a short overview about the data. The data itself is separated into five .csv files.

1
2
3
4
5
6
7


# Reading Movie-Data
library(tidyverse)
Movies <- read_csv("ressources/Movie_Movies.csv")
Actors <- read_csv("ressources/Movie_Actors.csv")
Genres <- read_csv("ressources/Movie_Genres.csv")
Writer <- read_csv("ressources/Movie_Writer.csv")
AdditionalRating <- read_csv("ressources/Movie_AdditionalRating.csv")

Except Writer and AdditionalRating the data is referencing movie observations, even though their names imply otherwise. Because it is all about movies, the Movies data frame contains the most information. The function summary() provides the first quick overview on each variable individually.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


summary(Movies)
                                                                                  
       Awards            Country              DVD                 Director        
    Length:178687      Length:178687      Length:178687         Length:178687     
    Class :character   Class :character   Class :character      Class :character  
    Mode  :character   Mode  :character   Mode  :character      Mode  :character  
                                                            
                                                            
                                                            
                                         
      Language             Plot              Poster              Production     
    Length:178687      Length:178687      Length:178687         Length:178687   
    Class :character   Class :character   Class :character      Class :character
    Mode  :character   Mode  :character   Mode  :character      Mode  :character
                                                            
                                                            
                                                            
                                                            
       Rated             Released           Runtime                Title        
    Length:178687      Length:178687      Length:178687         Length:178687   
    Class :character   Class :character   Class :character      Class :character
    Mode  :character   Mode  :character   Mode  :character      Mode  :character
                                                                                                                                                                                                                                                                                                                                                                                    
                                                            
        Type             Website               Year         imdbID         
    Length:178687      Length:178687      Min.   :1889   Length:178687     
    Class :character   Class :character   1st Qu.:1989   Class :character  
    Mode  :character   Mode  :character   Median :2007   Mode  :character  
                                          Mean   :1993                     
                                          3rd Qu.:2013                     
                                          Max.   :2023                     
                                          NAs    :5                        

      imdbRating       imdbVotes      
    Min.   : 1.00    Min.   :      5  
    1st Qu.: 5.60    1st Qu.:     12  
    Median : 6.50    Median :     39  
    Mean   : 6.42    Mean   :   7704  
    3rd Qu.: 7.30    3rd Qu.:    263  
    Max.   :10.00    Max.   :1929977  
    NAs    :116614   NAs    :116658

As could have been expected, there aren’t many variables, which are purely numeric. But besides the three variables Year, imdbRating, imdbVotes I assumed something like Released, Runtime and Awards to be numeric, so lets take a quick look, at these values.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


Movies[!is.na(Movies$Awards),"Awards"]

    ## # A tibble: 20,340 x 1
    ##    Awards                 
    ##    <chr>                  
    ##  1 1 win.                 
    ##  2 1 nomination.          
    ##  3 1 nomination.          
    ##  4 4 wins & 5 nominations.
    ##  5 1 win & 2 nominations. 
    ##  6 1 win.                 
    ##  7 Nominated for 1 Oscar. Another 34 wins & 29 nominations.      
    ##  8 Nominated for 3 Golden Globes. Another 2 wins & 3 nominations.
    ##  9 2 wins.                
    ## 10 1 nomination.          
    ## # … with 20,330 more rows

There seems to be several information inside of Awards. Specific awards are emphasized, such as Oscars, or Golden Globes. Further, there are “other” awards, which are summarised and it is always distinguished between wins and nominations. In order to have a tidy data frame, which includes all of this informations, specific variables for, e.g. Oscars, Golden Globes or Other wins and nominations should be created. Then, a more comprehensive analysis, including the awards, would be possible.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


Movies[!is.na(Movies$Released),"Released"]
    ## # A tibble: 116,974 x 1
    ##    Released   
    ##    <chr>      
    ##  1 26 Apr 2012
    ##  2 27 Mar 1926
    ##  3 27 May 2013
    ##  4 20 Mar 2014
    ##  5 07 Sep 1915
    ##  6 14 Jan 1966
    ##  7 10 Aug 2002
    ##  8 14 Feb 2015
    ##  9 01 Feb 2014
    ## 10 23 Nov 2004
    ## # … with 116,964 more rows

The releases appear to be in an inappropriate date format, this should easily be fixed, using the anytime-package. The character values, will become native date values.

1
2
3
4
5
6


rt = Movies[!is.na(Movies$Runtime),"Runtime"]
sample(rt$Runtime, 20)

    # [1] "6 min"      "89 min"     "1 h 23 min" "54 min"     "86 min"     "24 min"     "10 min"     "84 min"    
    # [9] "5 min"      "110 min"    "75 min"     "72 min"     "4 min"      "4 min"      "6 min"      "14 min"    
    # [17] "82 min"     "8 min"      "84 min"     "133 min"

For the runtime, there appears to be different formats, when a movie is longer 60 minutes. For further analysis, I should transform every entry, which explicitly considers hours into minutes. This should lead to one numeric variable, Runtime_min completely in minutes.

Additionally, there are some striking things to me in this summary.

The release year ranges up to 2023, which seems a little bit high
- take a closer look at those values, where the release date is above the current date

1
2
3
4
5


Now <-  as.numeric(format(Sys.Date(), "%Y"))
Movies %>% 
  filter(Year > Now) %>% 
  select(Title, imdbID, Year) %>% 
  arrange(desc(Year))

# A tibble: 38 x 3
   Title                                                  imdbID     Year
   <chr>                                                  <chr>     <dbl>
 1 Inside Me                                              tt6195774  2023
 2 The Zero Century: Harlock                              tt7110972  2023
 3 Austin Powers Meets Monty Python Meets the Whole World tt4977768  2022
 4 Grace AKA Bubbie                                       tt5891084  2022
 5 Fast & Furious 10                                      tt5433140  2021
 6 The Silver Unknown                                     tt5235900  2021
 7 Power Hungry Animals 3                                 tt6521788  2021
 8 Star Trek 5                                            tt4756234  2021
 9 Spooky Jack                                            tt7208370  2021
10 The Boss Baby 2                                        tt6932874  2021

When opening the corresponding imdb page of the first three entries, we can observe some peculiarities of the data. The first Movie doesn’t have a current release date and a new original title. The second movie, also doesn’t have a release date and the third movie doesn’t even exist. All in all, it doesn’t promote the quality of the date and I would exclude unreleased movies from the data.

Votes & Ratings

As observable above, there is a huge number of NA's among imdbRating and imdbVotes. I’m not sure, whether it is due to the data’s quality or if really only a smaller part of all movies are rated at all. I should take a look at non-rated movies and check with the real entries on imdb.com. For now, this appears as too much work for me. Maybe later, I will take a closer look. But for now, I will look closer at the distribution and correlation of rating and votes.

For this, I want only observation, which have been voted on. Based on the imdb rules, only movies can get a mean rating, if there are more than 25.000 votes. So lets filter the data for these requirements.

1
2
3
4
5


RatingVote <- Movies %>% 
  filter(!is.na(imdbVotes),
         imdbVotes > 25000) %>%  # no necessary to be relevant in imdb lists
  select(Title, imdbVotes, imdbRating) %>% 
  arrange(desc(imdbVotes))

A simple correlation scatter plot should reveal a possible connection between these two features. But based on the large number of votes for the top movie, I assume several outliers. I don’t want to look for them in the data separately but I’d like to see them right in the plot. The function geom_tex() from ggplot2 can put their labels next to the points. But I only want the outliers to be displayed, so I define further restrictions. Number of votes has to be above 1.500.000 and the mean rating should be higher 9.0

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


ggplot(RatingVote, aes(imdbVotes, imdbRating))+
  geom_point(alpha = 0.1)+
  geom_text(data = RatingVote[RatingVote$imdbVotes > 1500000 |
                                RatingVote$imdbRating > 9,],
            aes(label = Title),
            size = 4)+
  scale_y_continuous(breaks = seq(0,10,0.5))+
  xlim(0,2200000)+
  labs(title = "Connection between Votes and Rating",
       subtitle = paste("Correlation is ",round(cor(Movies$imdbVotes, Movies$imdbRating, use = "pairwise.complete.obs"), digits = 4), sep = ""),
       x = "Votes per Movies",
       y = "Mean Rating per Movie",
       caption = paste("Total of ", nrow(RatingVote), "Movies", sep = ""))

The correlation between votes and rating is very small, but some kind of pattern is evolving nevertheless. Movies with a very low rating (below 5.0) don’t cross the 250.000 votes mark and are generally quite low. The contrary for a high number of votes. Above 500.000 votes, the min rating is at around 6.5 Above 1.000.000, the worst rating is at around 8.25. I would interpret this with a feedback loop. If a movie has a low rating, less people care about/watch the movie and hence rate it on imdb. But if a movie already has a high rating, it is marked as a good movie, more people are attracted to watch it and give it a good rating. Based on this data, a movie can only have a high number of votes, if it has a high rating. Focused shitstorms, like on could assume with cpt. Marvel, i.e. high number of votes and a very low rating, can’t be observed.

Next to the great movies, with many votes we can also see another group, which has a good rating, but doesn’t get that many votes. I would consider The Mountain II and The Chaos Class as hidden gems. How does this fit into my feedback loop theory. There are two movies, with an outstanding rating, shouldn’t they get much more attention/votes? Here’s the thing, both movies are not listed in the top 250 charts, despite their unmatched rating. I don’t know why, maybe both movies miss some properties needed for the top list. Maybe it is a conspiracy to promote American movies ;). But i think, this is a very interesting point and should be further investigated, maybe by doing some web scraping on the imdb-database myself.

Most often occurred writers

Finally, I wanted to take a look at the writer file and see, which writers are credited the most. This is very straight forward. The only problem is, that several writers have multiple credits per movie. So I have to create a data frame with distinct combinations of writer and movie and then group by individual writers. The results can be plotted afterwards.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


Writer %>% 
  group_by(Person, imdbID) %>% 
  summarize(CreditsPerMovie= n()) %>% 
  group_by(Person) %>% 
  summarize(No_Movies = n()) %>% 
  arrange(desc(No_Movies)) %>% 
  head(n = 15) %>% 
  ggplot(aes(reorder(Person,No_Movies), No_Movies)) +
    geom_col() +
    coord_flip()+
    labs(title = "Writers in Movies",
         subtitle = "Sorted by distinct credited movies",
         y = "Number of Movies",
         x = "Writer",
         caption = paste("From a total of ", n_distinct(Writer$Person), "unique Writers and ", n_distinct(Writer$imdbID), "unique Movies"))

From a quick look, the most writer aren’t explicit writer of movie screenplay but traditional book authors. So in this dataset the most writers are credited based on their back catalogue and corresponding movie adaptations. I should validate this, by looking into some movie writers. Maybe I can find a way to filter some thing like original screenplay.

This is it, for the first task, I got for my latest job application. The task sheet contained two additional tasks, which I will show in my next post. This will be about finding the director, directing the most movies and my first touches of network analyses.

The LatestT