23. June 2019

100 Best – Summer Songs 1/4

I just reached 50% completion on DataCamp, and to get my head around a slightly different topic I wanted to start another small project. The local Berlin radio station radioeins is running its summer Sunday specials. A jury of around 100 people votes for the best songs on a specific topic. Every jury member created an individual Top-10, and these were evaluated and combined into a final Top-100. The final #1 will be revealed at 7 p.m., but every member’s vote is available from 9 a.m. So I thought it might be fun to do a little web scraping and figure out the top song.

Methods

rvest and the SelectorGadget are the perfect combination for identifying website elements and extracting their information. First, I look at the website and think about the details I want to know. Afterwards, I extract the information by telling the rvest functions which elements are of interest and transform them into a data frame that I can analyse.

Getting the webpages

The 100 Best website shows a list of all jury members. Their Top-10 is hidden as a link behind their names. All pictures are contained inside a main CSS element .containerMain and can be identified by the class .beitrag, which translates roughly to “contribution”. So after loading the packages, my first code snippet looks like this:

library(rvest)
library(tidyverse)
library(magrittr)  # provides extract2(), which is used further below

data <- read_html("https://www.radioeins.de/musik/die-100-besten-2019/sommer-songs/") %>%
    html_nodes(".containerMain .beitrag") %>% 
    html_attr("href") %>% 
    unique() %>% 
    str_subset("/musik/")

Because I’m only interested in the link that leads to a Top-10, I extract just the href attribute. Sadly, the link sits behind both the person’s name and their picture, so every entry appears twice; unique() removes the duplicates. And because the page contains many more links than I’m interested in, I keep only those that lead to a chart sub-page.
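
To make the three steps (extract href, deduplicate, filter) easier to follow, here is a minimal offline sketch. The HTML snippet and its link targets are invented for illustration and are not the real radioeins markup; only the selector and the function chain mirror the snippet above.

```r
library(rvest)
library(stringr)

# Made-up page fragment: the same Top-10 link appears twice
# (once on the picture, once on the name), plus one unrelated link.
page <- read_html('
<div class="containerMain">
  <a class="beitrag" href="/musik/top10/person_a.html"><img src="a.jpg"></a>
  <a class="beitrag" href="/musik/top10/person_a.html">Person A</a>
  <a class="beitrag" href="/programm/other.html">Something else</a>
</div>')

links <- page %>%
  html_nodes(".containerMain .beitrag") %>%  # all matching elements
  html_attr("href") %>%                      # keep only the link target
  unique() %>%                               # drop the picture/name duplicate
  str_subset("/musik/")                      # keep only chart sub-pages

print(links)
```

Only one link should survive, because the duplicate is dropped by unique() and the unrelated link does not contain "/musik/".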

Getting the votes

First I want to define what the final data frame should look like. The score derives from the place in each Top-10 list. So I need the place, artist, song and, because it might be interesting for future analyses, the jury member’s name.

Results_df <- data.frame(place = numeric(),
                         artist = character(),
                         song = character(),
                         jury_member = character(),
                         stringsAsFactors = FALSE)

I want to obtain the data from each vote and pass it into a data.frame. Here I ran into a problem: the Top-10 tables aren’t always formatted the same way. In practice, this means I need different functions for each format, or I get error messages. This led me to a chapter of Advanced R that deals with exactly this topic. It is a great resource for digging deeper into programming with R.

I wanted to catch the error using tryCatch() and execute alternative code. Unfortunately, this threw some different errors as well, which I couldn’t make sense of. So instead, I simply save all exception cases and extract the data for those separately.

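get_df <- function(input){
  tryCatch(
    {
      # on a regular page, the Top-10 is the second table
      output <- read_html(paste0("https://www.radioeins.de", input)) %>% 
        html_nodes("table") %>% 
        html_table() %>% 
        extract2(2) %>% 
        as.data.frame()
      
      return(output)
    },
    
    error = function(e){
      # `i` is the counter of the calling loop, looked up in the global environment;
      # print() shows it and also returns it, so the caller receives the failing index
      print(i)
    }
  )
}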

jury_ColName <- names(Results_df)[-4] # I have to figure out if there is a way to leave this out. 
exception <- vector(mode = "numeric") # creating the exception vector

The strategy is this: if I get a good table, I add it to my previously defined data.frame. If not, I note which page has a different format and save its index to my exception vector.

for (i in seq_along(data)) {
  jury_i <- get_df(data[i])
  
  if (is.data.frame(jury_i)) {
    
    names(jury_i) <- jury_ColName  
    jury_i$jury_member <- read_html(paste("https://www.radioeins.de", data[i], sep = "")) %>% 
      html_nodes(".TitleText") %>% 
      html_text()
    Results_df <- rbind(Results_df, jury_i)
    
  }else{
    exception <- append(exception, jury_i)
  }
}
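
The same "collect the failures, handle them later" pattern can be sketched without relying on a global loop counter. This is only an illustration under stated assumptions: parse_page() and safe_parse() are hypothetical stand-ins that do not touch the website, and the failure rule (even inputs fail) is made up.

```r
# Hypothetical stand-in for a page parser: fails on inputs it cannot handle.
parse_page <- function(x) {
  if (x %% 2 == 0) stop("unexpected table layout")
  data.frame(place = 1:2, value = x)
}

# Wrapper that turns any error into NULL instead of stopping the loop.
safe_parse <- function(x) {
  tryCatch(parse_page(x), error = function(e) NULL)
}

inputs  <- 1:5
results <- list()     # successfully parsed tables
failed  <- integer()  # indices of inputs that need special treatment

for (k in seq_along(inputs)) {
  out <- safe_parse(inputs[k])
  if (is.null(out)) {
    failed <- c(failed, k)                 # remember where it went wrong
  } else {
    results[[length(results) + 1]] <- out  # keep the good table
  }
}

combined <- do.call(rbind, results)  # bind all good tables in one step
print(failed)
```

Returning NULL from the error handler keeps the success check simple, and binding the list once at the end avoids growing a data.frame inside the loop.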

Exception Handling

The exceptions mainly consist of additional tables on the jury members’ sub-pages. As you can see below, the code is very similar to the code above. After extracting the “exceptional data”, I add it to the Results data.frame.

for (j in seq_along(exception)) {  # seq_along() also copes with an empty exception vector
  
  jury_j <- read_html(paste("https://www.radioeins.de", data[exception[j]], sep = "")) %>% 
    html_nodes("table") %>%
    html_table(fill = TRUE) %>% 
    extract2(3) %>% 
    as.data.frame()
  
  names(jury_j) <- jury_ColName  
  jury_j$jury_member <- read_html(paste("https://www.radioeins.de", data[exception[j]], sep = "")) %>% 
    html_nodes(".TitleText") %>% 
    html_text()
  Results_df <- rbind(Results_df, jury_j)
}

Counting the votes

Counting the votes is an interesting part, because it requires implementing the show’s complete set of rules in the summarising step. But first things first: checking for accidental NAs.

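Results_df$place <- as.numeric(Results_df$place)  # anything non-numeric becomes NA
Results_df_nona <- na.omit(Results_df)            # drop rows containing an NA
Results_df_nona$score <- 0                        # placeholder, filled in the next step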

If a song is listed in a Top-10, it gets a score corresponding to its place. Because this relationship isn’t linear, I define a small dictionary that maps every place to a score: place_dict holds the places in a Top-10 and score_dict the matching scores. Afterwards I filter for all songs that were voted into the same place and assign the score to the still-empty score column of the data.frame.

place_dict <- c(1,2,3,4,5,6,7,8,9,10)   # place within a Top-10
score_dict <- c(12,10,8,7,6,5,4,3,2,1)  # score awarded for that place
for (i in seq_along(place_dict)) {
  Results_df_nona[Results_df_nona$place == place_dict[i], "score"] <- score_dict[i]
}
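
The same lookup can also be written without a loop. The sketch below uses match() to find each place's position in place_dict and picks the score at that position; the votes data frame is a made-up toy input, not scraped data.

```r
place_dict <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
score_dict <- c(12, 10, 8, 7, 6, 5, 4, 3, 2, 1)

# Toy input, invented for illustration
votes <- data.frame(place = c(1, 3, 10, 2))

# match() returns the position of each place in place_dict,
# which is then used to index the corresponding score
votes$score <- score_dict[match(votes$place, place_dict)]

print(votes)
```

A place that does not occur in place_dict would get NA as its score, which makes unexpected values easy to spot.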

To generate the final Top-100 chart positions, I group by each unique combination of artist and song and sum the score for each combination. In the end I order the result by score first, then by how many times the song was mentioned, and finally by artist and song (the desc() calls in the code below sort these names in descending, that is reverse alphabetical, order).

Charts <- Results_df_nona %>% 
  group_by(artist, song) %>% 
  summarise(avg_place = mean(place),
            score = sum(score),
            mentioned = n()) %>% 
  arrange(desc(score), desc(mentioned), desc(artist), desc(song))
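
To see how the tie-breaking behaves, here is a small sketch on invented toy data (artists "A" to "C" are placeholders, not real results). It follows the same group, summarise and arrange chain as above, with an added ungroup() so the result is a plain, ungrouped table.

```r
library(dplyr)

# Toy votes, invented for illustration
toy <- data.frame(
  artist = c("A", "A", "B", "C", "C"),
  song   = c("x", "x", "y", "z", "z"),
  place  = c(1, 10, 1, 2, 9),
  score  = c(12, 1, 12, 10, 2),
  stringsAsFactors = FALSE
)

toy_chart <- toy %>%
  group_by(artist, song) %>%
  summarise(avg_place = mean(place),
            score = sum(score),
            mentioned = n()) %>%
  ungroup() %>%
  arrange(desc(score), desc(mentioned), desc(artist), desc(song))

print(toy_chart)
```

"A" wins on total score; "B" and "C" tie on score, so "C" is ranked higher because it was mentioned twice.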

Results

Finally, I can present my web-scraping results for the best summer songs of all time. While listening to the show today, I realized I still have some problems to fix. For example, the same song is sometimes spelled differently and is therefore counted as several different songs. Also, bands and artists are sometimes named with an article and sometimes without. Further, the article makes a difference when it is taken into account for ordering alphabetically by artist. I think I should exclude every “the”, no matter what.
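
One possible way to tackle the article and spelling problem is to normalise the names before grouping. The helper below, normalise_name(), is a hypothetical sketch and not the cleaning used for the list further down: it lower-cases the text, removes the standalone word "the" and tidies up whitespace.

```r
# Hypothetical helper: makes differently written names comparable
normalise_name <- function(x) {
  x <- tolower(x)                          # ignore upper/lower case
  x <- gsub("\\bthe\\b", "", x, perl = TRUE)  # drop the standalone word "the"
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  trimws(x)                                # remove leading/trailing spaces
}

cleaned <- normalise_name(c("The Beatles", "Beatles ", "the  beatles"))
print(cleaned)
```

The word boundaries (\\b) make sure that only the standalone article is removed, so a name that merely contains the letters "the" stays intact. Typos and genuinely different spellings would still need a separate treatment.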

Nevertheless, I successfully worked out the #1 title. Interestingly, its average place across all Top-10s is only around place 6, so it was the sheer number of mentions (26 times) that led to such a high position. It will be interesting to consider different scoring systems, for example ordering by average place for every title mentioned 5 times or more. I definitely have several questions to answer and methods to try in order to improve the predictions. But for now, have fun web scraping and listening to some summer songs in the official Spotify playlist.

UPDATE 2019-06-30: updated the summer song top list after new insights, described in the follow up post

place	artist	song	score	mentioned	average place
1	Lovin'Spoonful	SummerInCity	153	27	5.740741
2	Beatles	HereComesSun	128	17	4.058823
3	MungoJerry	InSummertime	120	19	5.157895
4	Kinks	SunnyAfternoon	99	14	4.428571
5	Ramones	RockawayBeach	97	18	5.666667
6	EddieCochran	SummertimeBlues	94	15	5.000000
7	Undertones	HereComesSummer	86	12	4.416667
8	Weezer	IslandInSun	77	15	6.133333
9	DonHenley	BoysOfSummer	71	12	5.416667
10	EllaFitzgerald&LouisArmstrong	Summertime	69	9	3.888889
11	JanisJoplin	Summertime	66	9	4.444444
12	BeachBoys	GoodVibrations	62	10	5.300000
13	DieFantastischenVier	TagAmMeer	59	11	6.181818
14	BryanAdams	SummerOf'69	44	10	7.100000
15	DJJazzyJeff&FreshPrince	Summertime	44	7	6.000000
16	Beatsteaks	Summer	42	9	8.000000
17	ViolentFemmes	BlisterInSun	41	7	6.714286
18	Caribou	Sun	40	7	6.857143
19	LaidBack	SunshineReggae	39	8	7.625000
20	BillWithers	LovelyDay	39	7	6.285714
21	LeeHazlewood&NancySinatra	SummerWine	38	10	7.600000
22	BobMarley&Wailers	SunIsShining	38	10	8.100000
23	LanaDelRey	SummertimeSadness	37	9	8.222222
24	MarthaReeves&Vandellas	DancingInStreet	34	6	10.333333
25	Bananarama	CruelSummer	34	6	7.833333