100 Best – Hip Hop 3/4

Another summer Sunday, another day for radioeins to crown the 100 best songs. This time it is all about Hip-Hop. Personally, I’m a big fan of contemporary Hip-Hop artists such as Kendrick, and of the recently released records by Fatoni and Tua. But knowing the editors and listeners of radioeins, the top songs will come from the ’80s, the early era of Hip-Hop.

Wrap-Up

Just like in last week’s post, I will wrap up my current code and see whether I can catch some strange behavior from the radioeins.de page.

1 – Inconsistent number of jury members

Last week, I noticed that the number of jury members changed between page calls. To check for this, I wrote a small loop that requests the page several times and prints every jury member that was not present in the previous call. The code is below; nothing fancy, except for the %notin% comparison, which I like very much.

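library(rvest)    # read_html(), html_nodes(), html_attr(); also provides %>%
library(stringr)  # str_subset()

# Negated %in%: TRUE for elements that are NOT in the comparison vector
`%notin%` <- Negate(`%in%`)

# Placeholder values, so that the first call reports every jury link as new
data_prev <- as.character(seq_len(98))

for (i in 1:50) {

  # Curious: Patrick Kessler is present (or not) on different reloads of the page
  data <- read_html("https://www.radioeins.de/musik/die-100-besten-2019/hippiesongs/") %>%
    html_nodes(".containerMain .beitrag") %>%
    html_attr("href") %>%
    unique() %>%
    str_subset("/musik/")

  # Links that were absent in the previous call
  new_jury <- data[data %notin% data_prev]
  print(new_jury)

  data_prev <- data

}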
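
The %notin% comparison can be tried out without calling the website. Here is a minimal sketch with two made-up link vectors (the paths are placeholders, not real radioeins URLs). It also checks the opposite direction, which the loop above does not do:

```r
# Negated %in%: TRUE for elements that are NOT in the comparison vector
`%notin%` <- Negate(`%in%`)

# Hypothetical results of two consecutive page calls
links_prev <- c("/musik/jury/a.html", "/musik/jury/b.html")
links_now  <- c("/musik/jury/a.html", "/musik/jury/b.html", "/musik/jury/c.html")

# Links that appear now but were missing in the previous call
new_links <- links_now[links_now %notin% links_prev]
print(new_links)

# Links that disappeared compared to the previous call
gone_links <- links_prev[links_prev %notin% links_now]
print(gone_links)
```

Comparing in both directions would probably be the more complete check, since a jury member can vanish between calls as well as appear.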

2 – Empty columns

Have a closer look at Jacho. If you use the element selector, you can see a fourth, empty column in his top 10 table. I can’t tell if this happened intentionally or by accident. It doesn’t matter, I just have to check for this kind of behavior. I compare the number of columns with the expected number and remove an unnecessary fourth column.

It just occurs to me that this is very case-specific and that I should generalize this problem a little more.

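    # A jury table with an unexpected number of columns: print the page title
    # to identify the outlier, then keep only the expected columns
    if (length(names(jury_i)) != length(jury_ColName)) {
      outlier <- read_html(paste0("https://www.radioeins.de", data[i])) %>%
        html_nodes(".TitleText") %>%
        html_text() %>%
        print()

      jury_i <- jury_i[, jury_ColName]
    }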
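
One possible way to generalize this is to always select the expected columns by name, and to fail loudly if one of them is missing. This is only a sketch: the function name keep_expected_columns, the column names and the example data frame are made up for illustration, and expected plays the role of jury_ColName.

```r
# Keep only the expected columns, in the expected order.
# Stops with an error if an expected column is absent.
keep_expected_columns <- function(df, expected) {
  absent <- setdiff(expected, names(df))
  if (length(absent) > 0) {
    stop("missing expected columns: ", paste(absent, collapse = ", "))
  }
  df[, expected, drop = FALSE]
}

# Hypothetical jury table with a stray, empty fourth column
expected <- c("place", "artist", "song")
jury_example <- data.frame(
  place  = 1:2,
  artist = c("a", "b"),
  song   = c("x", "y"),
  V4     = c(NA, NA),
  stringsAsFactors = FALSE
)

cleaned <- keep_expected_columns(jury_example, expected)
print(names(cleaned))
```

Selecting by name handles any number of extra columns, not just a fourth one, and it does not depend on the column count at all.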

3 – Apostrophes

Let’s take a look at last week’s number 3 of the 100 best hippie songs: Buffalo Springfield with For What It’s Worth. Now, a big question: what do you use as your standard apostrophe? Of course, there is no other way than shift + #, but it appears that the radioeins charts use several different ones.

# A tibble: 4 x 6
# Groups:   artist [1]
  place artist             song             score mentioned avg_place
  <int> <chr>              <chr>            <dbl>     <int>     <dbl>
1     4 BuffaloSpringfield ForWhatIt'sWorth    93        14      5   
2    66 BuffaloSpringfield ForWhatIt´sWorth    20         3      4.33
3   134 BuffaloSpringfield ForWhatIt‘sWorth    11         2      5.5 
4   505 BuffaloSpringfield ForWhatIt’sWorth     1         1     10   

Exactly, there are four different ways of writing an apostrophe. I was wondering whether it makes sense to pick a single apostrophe and change all differing entries. I came to the conclusion that I can simply remove all apostrophes, because they don’t serve an identifying purpose; they only make reading easier. The same goes for other punctuation marks.

Maybe I should make an analysis of the most popular apostrophes.

# Remove apostrophe variants and other punctuation from the song titles
Results_df_nona$song <- gsub("[’‘´'`,.-]", "", Results_df_nona$song)

In addition to removing punctuation, I’ll also convert upper-case letters to lower case for songs and artists. It shouldn’t affect the identifying value, but should improve the final results.
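
To illustrate both steps on the four spellings from the table above, here is a small sketch. The helper name normalize_title is made up for this example, and the special apostrophes are written as Unicode escapes so the snippet does not depend on the file encoding:

```r
# The four spellings from the table above
songs <- c(
  "ForWhatIt'sWorth",        # ASCII apostrophe
  "ForWhatIt\u00b4sWorth",   # acute accent
  "ForWhatIt\u2018sWorth",   # left single quotation mark
  "ForWhatIt\u2019sWorth"    # right single quotation mark
)

# Hypothetical helper: strip punctuation, then convert to lower case
normalize_title <- function(x) {
  x <- gsub("[\u2019\u2018\u00b4'`,.-]", "", x)
  tolower(x)
}

normalized <- normalize_title(songs)
print(unique(normalized))
```

After normalization, all four variants collapse into a single title, so their scores can be summed under one entry.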

Similarities

Because things like punctuation don’t change the text very much, I was wondering whether there is a way to calculate the difference between two texts. I should write a more detailed blog post about this method, but for now we can inspect the results of estimating the difference between song titles and between artists. I calculate a distance value for each of them. Afterwards, I add my originally calculated score, in order to check manually whether I am missing some very significant songs or artists. The code is not very elegant, but it works for now.

library(dplyr)       # filter(), arrange(), mutate()
library(stringdist)  # stringdistmatrix()

# Pairwise string distances between all song titles and between all artists
d_song <- stringdistmatrix(Charts$song)
d_artist <- stringdistmatrix(Charts$artist)

# dist_to_df() is a helper that is not shown here; judging by the table below,
# it turns a dist object into a long data frame with the columns row, col and value
song_sim <- dist_to_df(d_song)
song_sim$artist <- dist_to_df(d_artist)$value

# Keep only close pairs and look up both chart entries of each pair
song_sim <- song_sim %>%
  filter(value < 6,
         artist < 6) %>%
  arrange(value, artist) %>%
  mutate(artist_row = Charts$artist[row],
         artist_col = Charts$artist[col],
         score_row = Charts$score[row],
         score_col = Charts$score[col],
         song_row = Charts$song[row],
         song_col = Charts$song[col],
         place_row = Charts$place[row],
         place_col = Charts$place[col],
         # Combined score of both entries
         new_score = score_row + score_col)

A peek at this table can be seen below. It basically reveals a lot of special cases of different spellings. Maybe I can include these values next week to improve my results. We can also see that this approach works pretty well for identifying similar songs or artists.

row col value artist artist_row artist_col score_row score_col song_row song_col place_row place_col new_score
85 35 0 1 snoopdoggfeatpharrellwillams snoopdoggfeatpharrellwilliams 12 24 dropitlikeitshot dropitlikeitshot 85 35 36
491 45 0 1 africabambaataa&soulsonicforce afrikabambaataa&soulsonicforce 2 20 planetrock planetrock 491 45 22
276 90 0 1 pubicenemy publicenemy 6 12 dontbelievehype dontbelievehype 276 90 18
476 186 0 1 fishmob fischmob 2 8 susannezurfreiheit susannezurfreiheit 476 186 10
293 200 0 1 käptnpeng&dietenktakelvondelphi käptnpeng&dietentakelvondelphi 6 8 deranfangistnah deranfangistnah 293 200 14
209 208 0 1 drdrefeatsnoopdoggkoruptnatedogg drdrefeatsnoopdoggkuruptnatedogg 8 8 nextepisode nextepisode 209 208 16
198 54 0 2 llcooljay llcoolj 8 17 ineedlove ineedlove 198 54 25
514 470 0 4 jurassicfive jurassic5 1 2 concreteschoolyard concreteschoolyard 514 470 3
274 238 0 5 rootsfeaturingerykahbadu rootsfeaterykahbadu 6 7 yougotme yougotme 274 238 13
217 113 0 6 cooliofeatlv coolio 8 12 gangstasparadise gangstasparadise 217 113 20
330 146 0 6 missieelliott&dabrat missyelliottfeatdabrat 5 10 sockit2me sockit2me 330 146 15
495 27 0 7 tonelōc icecube 1 30 itwasagoodday itwasagoodday 495 27 31
489 95 0 8 amine mcsolaar 2 12 caroline caroline 489 95 14
532 176 0 8 beginner absolutebeginner 1 10 hammerhart hammerhart 532 176 11
502 454 0 11 saltnpepa saltnpepafeatenvogue 1 2 whattaman whattaman 502 454 3
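
Some of the artist distances in the table above can be reproduced without extra packages. The following sketch uses base R’s adist(), which computes a generalized Levenshtein distance. Note that this is a stand-in and not the method used above: the default method of stringdistmatrix() is optimal string alignment, which also counts a swap of two adjacent characters as a single edit, so the two can differ for such pairs.

```r
# Three artist pairs taken from the table above
pairs <- data.frame(
  a = c("pubicenemy", "llcooljay", "jurassicfive"),
  b = c("publicenemy", "llcoolj", "jurassic5"),
  stringsAsFactors = FALSE
)

# adist() returns a matrix of all pairwise distances;
# the diagonal holds the distance of each pair (a[i], b[i])
pairs$distance <- diag(adist(pairs$a, pairs$b))
print(pairs)
```

For these three pairs the distances come out as 1, 2 and 4, which matches the artist column of the corresponding rows.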

This week’s results

That’s it for the new scraping insights; let’s move on to my results for this week. As usual, I’m going to show the first 25 entries according to my web-scraping script.

Have fun re-listening to the show on Spotify and see you next week.

place  artist                        song                                  score  mentioned  avg_place
    1  grandmasterflash&furiousfive  message                                 272         32   3.312500
    2  eminem                        loseyourself                            134         20   4.550000
    3  missyelliott                  geturfreakon                            125         18   4.444444
    4  sugarhillgang                 rappersdelight                          109         20   5.750000
    5  publicenemy                   fightpower                              109         18   5.277778
    6  cypresshill                   insaneinbrain                            65         13   6.153846
    7  beastieboys                   sabotage                                 59         11   5.818182
    8  nwa                           fuckthapolice                            57          9   5.000000
    9  beastieboys                   intergalactic                            55          7   3.857143
   10  2pacfeatdrdre&rogertroutman   californialove                           52         12   6.666667
   11  atribecalledquest             canikickit?                              49          8   5.125000
   12  rundmcvsaerosmith             walkthisway                              49          7   4.571429
   13  missyelliott                  workit                                   49          6   3.166667
   14  beastieboys                   (yougotta)fightforyourright(toparty)     48          6   3.666667
   15  houseofpain                   jumparound                               43          9   6.333333
   16  kendricklamar                 humble                                   43          5   3.200000
   17  delasoul                      memyself&i                               39          6   5.000000
   18  drdrefeatsnoopdogg            nuthinbutagthang                         38          5   3.400000
   19  beastieboys                   sureshot                                 38          4   2.250000
   20  nas                           nystateofmind                            37          6   5.500000
   21  saltnpepa                     pushit                                   34          6   5.500000
   22  atribecalledquest             wepeople                                 34          3   1.333333
   23  rootsfeatcodychesnutt         seed(20)                                 33          5   4.800000
   24  beastieboys                   sowhatchawant                            33          4   3.500000
   25  nwa                           straightouttacompton                     32          6   6.000000