100 Best – Summer Songs 1/4

I just reached 50% completion on DataCamp, and to get my head around a slightly different topic I wanted to start another small project. The local Berlin radio station radioeins is running its summer Sunday specials. A jury of around 100 people voted for the best songs on a specific topic. Every member of the jury created an individual Top-10, and these lists were evaluated and combined into a final Top-100. The final #1 will be revealed at 7 p.m., but every member’s vote will be available at 9 a.m. So I thought it might be fun to do a little web scraping and figure out the top song.

Methods

rvest and the SelectorGadget are the perfect combination for identifying website elements and extracting their information. First, I have to look at the website and think about the details I want to know. Afterwards I extract the information by telling the rvest functions which elements are of interest, and transform them into a data frame that I am able to analyse.

Getting the webpages

The 100 Best website shows a list of all jury members. Each member’s Top-10 is linked behind their name. All pictures are contained inside a main CSS element .containerMain and can be identified as the element .beitrag, which translates to something like “contribution”. So after loading the packages, my first code snippet looks like this:

library(rvest)
library(tidyverse)
library(magrittr)  # attached explicitly for extract2(), which is used further below

data <- read_html("https://www.radioeins.de/musik/die-100-besten-2019/sommer-songs/") %>%
    html_nodes(".containerMain .beitrag") %>% 
    html_attr("href") %>% 
    unique() %>% 
    str_subset("/musik/")

Because I’m only interested in the link that leads to a Top-10, I extract just the href attribute. Sadly, the link sits behind both the person’s name and their picture, so I always get duplicate entries, which unique() removes. And because the page contains many more links than I’m interested in, I keep only those that lead to a chart subpage.

Getting the votes

First I want to define what the final data frame should look like. The score derives from the place in each Top-10 list. So I need the place, artist and song and, because it might be interesting for future analyses, the jury member’s name.

Results_df <- data.frame(place = numeric(),
                         artist = character(),
                         song = character(),
                         jury_member = character(),
                         stringsAsFactors = FALSE)

I want to obtain the data from each vote and pass it into a data.frame. Here I encountered a problem: the Top-10 tables aren’t always formatted the same way. In practice this means I need different functions for each formatting type, or I get error messages. This led me to a chapter of Advanced R that revolves around exactly this topic. It is a great resource for digging deeper into programming with R.

I wanted to catch the error with tryCatch() and execute alternative code. Unfortunately, this threw some other errors as well, which I couldn’t make sense of. So instead, I simply record all exception cases and extract the data for those separately.

get_df <- function(input){
  tryCatch(
    {
      output <- read_html(paste("https://www.radioeins.de", input, sep = "")) %>% 
        html_nodes("table") %>% 
        html_table() %>% 
        extract2(2) %>% 
        as.data.frame()
      
      return(output)
    },
    
    error = function(e){
      # i is the counter of the for loop that calls get_df();
      # print it to show where the exception appears and return it explicitly
      print(i)
      return(i)
    }
  )
}

jury_ColName <- names(Results_df)[-4] # I have to figure out if there is a way to leave this out. 
exception <- vector(mode = "numeric") # creating the exception vector

The strategy is this: if I get a good table, I add it to my previously defined data.frame. If not, I note which page has a different format and save its index to my exception vector.

for (i in seq_along(data)) {
  jury_i <- get_df(data[i])
  
  if (is.data.frame(jury_i)) {
    
    names(jury_i) <- jury_ColName  
    jury_i$jury_member <- read_html(paste("https://www.radioeins.de", data[i], sep = "")) %>% 
      html_nodes(".TitleText") %>% 
      html_text()
    Results_df <- rbind(Results_df, jury_i)
    
  }else{
    exception <- append(exception, jury_i)
  }
}
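The tryCatch() idea can also be written so that the error handler returns NULL and the loop keeps track of the failing index itself, which avoids relying on a global counter. The following is only a minimal sketch in base R: fetch_table() is a hypothetical stand-in for the scraping step, and the paths are made up, so it runs without touching the website.

```r
# Hypothetical stand-in for the scraping step: it fails for some inputs,
# just like a page whose tables have an unexpected layout.
fetch_table <- function(input) {
  if (grepl("odd", input)) stop("unexpected table layout")
  data.frame(place = 1:2, artist = c("A", "B"), song = c("x", "y"),
             stringsAsFactors = FALSE)
}

# Return the table on success and NULL on any error.
safe_fetch <- function(input) {
  tryCatch(fetch_table(input), error = function(e) NULL)
}

pages <- c("/musik/jury-1", "/musik/jury-odd", "/musik/jury-3")  # made-up paths
results <- list()
failed <- integer(0)

for (k in seq_along(pages)) {
  tab <- safe_fetch(pages[k])
  if (is.null(tab)) {
    failed <- c(failed, k)            # remember the index for a second pass
  } else {
    tab$jury_member <- pages[k]
    results[[length(results) + 1]] <- tab
  }
}

all_votes <- do.call(rbind, results)  # combine the good tables once at the end
```

Collecting the tables in a list and binding them once at the end also avoids growing a data.frame inside the loop.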

Exception Handling

The exceptions are mainly caused by additional tables on a jury member’s subpage. As you can see below, the code is pretty similar to the code above. After extracting the “exceptional data”, I add it to the results data.frame.

for (j in seq_along(exception)) {  # seq_along() also copes with an empty exception vector
  
  jury_j <- read_html(paste("https://www.radioeins.de", data[exception[j]], sep = "")) %>% 
    html_nodes("table") %>%
    html_table(fill = TRUE) %>% 
    extract2(3) %>% 
    as.data.frame()
  
  names(jury_j) <- jury_ColName  
  jury_j$jury_member <- read_html(paste("https://www.radioeins.de", data[exception[j]], sep = "")) %>% 
    html_nodes(".TitleText") %>% 
    html_text()
  Results_df <- rbind(Results_df, jury_j)
}

Counting the votes

Counting the votes is an interesting part, because it demands implementing the show’s complete set of rules in the summarising step. But first things first: checking for accidental NAs.

Results_df$place <- as.numeric(Results_df$place)
Results_df_nona <- na.omit(Results_df)
Results_df_nona$score = 0

If a song is listed in a Top-10, it gets a score corresponding to its place. Because this relationship isn’t linear, I define a small dictionary in which every place maps to a score: place_dict holds the place in a Top-10 and score_dict holds the corresponding score. Afterwards I filter for all songs that were voted into the same place and assign the score to the still-empty score column of the data.frame.

place_dict = c(1,2,3,4,5,6,7,8,9,10)
score_dict = c(12,10,8,7,6,5,4,3,2,1)
for (i in 1:length(place_dict)) {
  Results_df_nona[Results_df_nona$place == place_dict[i],5] = score_dict[i]
}
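The same lookup can be done without a loop. This is just a sketch of an alternative, using the place_dict and score_dict values from above; the votes data frame is made up purely to show the mechanics.

```r
place_dict <- 1:10
score_dict <- c(12, 10, 8, 7, 6, 5, 4, 3, 2, 1)

# Made-up votes, only to demonstrate the lookup.
votes <- data.frame(place = c(1, 3, 10, 2))

# match() returns the position of each place in place_dict;
# indexing score_dict with that position gives the score.
# Places that are not in the dictionary end up as NA.
votes$score <- score_dict[match(votes$place, place_dict)]
```

Addressing the column by name also removes the dependency on score being the fifth column.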

In order to generate the final Top-100 chart position, I group by each unique combination of artist and song and sum the score for each combination. In the end I order the result by score first, then by how many times the song was mentioned, and finally by artist and song, which the code below sorts in descending alphabetical order.

Charts <- Results_df_nona %>% 
  group_by(artist, song) %>% 
  summarise(avg_place = mean(place),
            score = sum(score),
            mentioned = n()) %>% 
  arrange(desc(score), desc(mentioned), desc(artist), desc(song))

Results

Finally I can present my web-scraping results for the best summer songs of all time. As I was listening to the show today, I realized I still have some problems to fix. For example, the same song is sometimes spelled differently and hence counted as different songs. Also, bands/artists are sometimes named with an article and sometimes without. Further, the article can make a difference if it is taken into account when ordering alphabetically by artist. I think I should exclude every “the”, no matter what.
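One possible way to tackle the spelling and article problem is to reduce artist and song names to a normalised key before grouping. The helper below, normalise_title(), is a hypothetical name and only a rough sketch: it lowercases the text, drops every standalone “the” and removes everything that is not a plain letter or digit. Note that this also strips umlauts and other non-ASCII letters, which may or may not be acceptable.

```r
# Hypothetical helper: build a comparison key from an artist or song name.
normalise_title <- function(x) {
  x <- tolower(x)
  x <- gsub("\\bthe\\b", "", x, perl = TRUE)  # drop every standalone "the"
  gsub("[^a-z0-9]", "", x)                    # drop spaces and punctuation
}

keys <- normalise_title(c("The Beatles", "Beatles",
                          "Here Comes the Sun", "Here comes the sun!"))
```

Grouping by such keys instead of the raw strings should merge entries that differ only in case, punctuation or the article.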

Nevertheless, I could successfully generate the #1 title. It is interesting that its average place across all Top-10s is only around 6, so it was the sheer number of mentions (26 times) that led to such a high position. It will be interesting to consider different scoring systems, for example ordering by average place for every title mentioned 5 times or more. I definitely have several questions to answer and methods to try in order to improve the predictions. But for now, have fun web scraping and listening to some summer songs in the official Spotify playlist.
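The alternative ranking by average place could look roughly like the sketch below. It uses a handful of rows copied from the result table in this post; the names charts, min_mentions and by_avg are only illustrative, and breaking ties by the number of mentions is an assumption.

```r
# A few rows copied from the result table in this post.
charts <- data.frame(
  artist = c("Lovin'Spoonful", "Beatles", "MungoJerry",
             "EllaFitzgerald&LouisArmstrong", "DJJazzyJeff&FreshPrince"),
  song = c("SummerInCity", "HereComesSun", "InSummertime",
           "Summertime", "Summertime"),
  score = c(153, 128, 120, 69, 44),
  mentioned = c(27, 17, 19, 9, 7),
  avg_place = c(5.740741, 4.058823, 5.157895, 3.888889, 6.000000),
  stringsAsFactors = FALSE
)

min_mentions <- 5  # threshold mentioned in the text

# Keep only titles that were mentioned often enough ...
eligible <- charts[charts$mentioned >= min_mentions, ]

# ... and rank them by average place (lower is better),
# breaking ties by the number of mentions (an assumption).
by_avg <- eligible[order(eligible$avg_place, -eligible$mentioned), ]
```

With this subset, the title with the lowest average place moves to the top, even though its total score is far from the highest.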

UPDATE 2019-06-30: updated the summer song top list after new insights, described in the follow up post

place  artist                         song               score  mentioned  average place
1      Lovin'Spoonful                 SummerInCity       153    27         5.740741
2      Beatles                        HereComesSun       128    17         4.058823
3      MungoJerry                     InSummertime       120    19         5.157895
4      Kinks                          SunnyAfternoon     99     14         4.428571
5      Ramones                        RockawayBeach      97     18         5.666667
6      EddieCochran                   SummertimeBlues    94     15         5.000000
7      Undertones                     HereComesSummer    86     12         4.416667
8      Weezer                         IslandInSun        77     15         6.133333
9      DonHenley                      BoysOfSummer       71     12         5.416667
10     EllaFitzgerald&LouisArmstrong  Summertime         69     9          3.888889
11     JanisJoplin                    Summertime         66     9          4.444444
12     BeachBoys                      GoodVibrations     62     10         5.300000
13     DieFantastischenVier           TagAmMeer          59     11         6.181818
14     BryanAdams                     SummerOf'69        44     10         7.100000
15     DJJazzyJeff&FreshPrince        Summertime         44     7          6.000000
16     Beatsteaks                     Summer             42     9          8.000000
17     ViolentFemmes                  BlisterInSun       41     7          6.714286
18     Caribou                        Sun                40     7          6.857143
19     LaidBack                       SunshineReggae     39     8          7.625000
20     BillWithers                    LovelyDay          39     7          6.285714
21     LeeHazlewood&NancySinatra      SummerWine         38     10         7.600000
22     BobMarley&Wailers              SunIsShining       38     10         8.100000
23     LanaDelRey                     SummertimeSadness  37     9          8.222222
24     MarthaReeves&Vandellas         DancingInStreet    34     6          10.333333
25     Bananarama                     CruelSummer        34     6          7.833333