07. July 2019

100 Best – Hip Hop 3/4

Another summer Sunday, another day for radioeins to award the 100 best songs. This time it is all about Hip-Hop. Personally, I’m a big fan of present-day Hip-Hop artists such as Kendrick, and of the recently released records by Fatoni and Tua, but knowing the editors and listeners of radioeins, the top songs will be something from the 80s, the early era of Hip-Hop.

Wrap-Up

Just like in last week’s post, I will wrap up my current code and see whether I can catch some strange behavior from the radioeins.de page.

1 – Inconsistent number of jury members

Last week, I noticed that the number of jury members changes between page calls. To check this, I wrote a small loop that calls the website several times and prints any jury members that were not present in the previous call. The code is below; nothing fancy, except for the %notin% comparison, which I like very much.

library(rvest)    # read_html(), html_nodes(), html_attr(), %>%
library(stringr)  # str_subset()

'%notin%' <- Negate('%in%')
data_prev <- as.character(1:98)

for (i in 1:50) {

  # Curious thing: Patrick Kessler is present (or not) on different website reloads
  data <- read_html("https://www.radioeins.de/musik/die-100-besten-2019/hippiesongs/") %>% 
    html_nodes(".containerMain .beitrag") %>% 
    html_attr("href") %>% 
    unique() %>% 
    str_subset("/musik/")
  
  # Links that were not part of the previous call
  new_jury <- data[data %notin% data_prev]
  print(new_jury)
  
  data_prev <- data
  
}
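
To see what the %notin% comparison does without calling the website, here is a minimal sketch with two made-up link vectors. The paths are invented for illustration and are not real radioeins URLs. It also checks that base R's setdiff() finds the same new entries.

```r
# Negated version of %in%: TRUE where an element of x is absent from the table
`%notin%` <- Negate(`%in%`)

# Hypothetical jury links from two consecutive page calls (made-up paths)
call_1 <- c("/musik/jury/a.html", "/musik/jury/b.html")
call_2 <- c("/musik/jury/a.html", "/musik/jury/b.html", "/musik/jury/c.html")

# Entries that appear in the second call but not in the first
new_jury <- call_2[call_2 %notin% call_1]
print(new_jury)

# For vectors without duplicates, setdiff() gives the same result
same_result <- identical(new_jury, setdiff(call_2, call_1))
print(same_result)
```

If the jury list is stable, new_jury is empty on every iteration after the first, so any printed link points to a member who appeared only on that reload.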

2 – Empty columns

Have a closer look at Jacho. If you use the element selector, you can see a fourth, empty column in his top 10 table. I can’t tell whether this is intentional or accidental. Either way, I have to check for this kind of behavior: I compare the number of columns with the expected number and remove the unnecessary fourth column.

It just occurs to me that this is very case-specific and that I should generalize the check a bit more.

if (length(names(jury_i)) != length(jury_ColName)) {
  # Print the page title of the outlier for manual inspection
  outlier <- read_html(paste("https://www.radioeins.de", data[i], sep = "")) %>% 
    html_nodes(".TitleText") %>% 
    html_text() %>% 
    print()

  # Keep only the expected columns
  jury_i <- jury_i[, jury_ColName]
}
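
One possible way to generalize this check is sketched below. It assumes that each jury table is a plain data frame and that "empty" means every cell is NA or an empty string. The helper drop_empty_cols() and the example data are made up for illustration; they are not part of my script.

```r
# Hypothetical helper: remove columns in which every cell is NA or ""
drop_empty_cols <- function(df) {
  is_empty <- vapply(
    df,
    function(col) all(is.na(col) | trimws(as.character(col)) == ""),
    logical(1)
  )
  df[, !is_empty, drop = FALSE]
}

# Made-up table with a stray, empty fourth column
jury_example <- data.frame(
  place  = 1:2,
  artist = c("Public Enemy", "Eminem"),
  song   = c("Fight the Power", "Lose Yourself"),
  X4     = c("", NA),
  stringsAsFactors = FALSE
)

expected_cols <- c("place", "artist", "song")
cleaned <- drop_empty_cols(jury_example)
print(names(cleaned))
```

This drops any number of empty columns regardless of their position. Comparing names(cleaned) against the expected names afterwards would still catch tables that differ in other ways.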

3 – Apostrophes

Let’s take a look at last week’s number 3 of the 100 best hippie songs: Buffalo Springfield with For What It’s Worth. Now, a big question: what do you use as your standard apostrophe? Of course, there is no other way than shift + #, but it appears that several different characters are used in the radioeins charts.

# A tibble: 4 x 6
# Groups:   artist [1]
  place artist             song             score mentioned avg_place
  <int> <chr>              <chr>            <dbl>     <int>     <dbl>
1     4 BuffaloSpringfield ForWhatIt'sWorth    93        14      5   
2    66 BuffaloSpringfield ForWhatIt´sWorth    20         3      4.33
3   134 BuffaloSpringfield ForWhatIt‘sWorth    11         2      5.5 
4   505 BuffaloSpringfield ForWhatIt’sWorth     1         1     10   

Exactly, there are four different ways of writing an apostrophe. I wondered whether it makes sense to define a single apostrophe and change all differing entries. I came to the conclusion that I can simply remove all apostrophes, because they don’t serve an identifying purpose; they only make reading easier. The same applies to other punctuation marks.

Maybe I should make an analysis of the most popular apostrophes.

Results_df_nona$song = gsub("’|‘|´|'|,|`|\\.|-","", Results_df_nona$song, fixed = FALSE)

In addition to removing punctuation, I’ll also convert upper-case letters to lower case for songs and artists. It shouldn’t affect the identifying value, but it should improve the final results.
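
As a quick check, the sketch below applies both steps to the four spellings from the table above. The helper name normalize_title() is made up for illustration, and the apostrophes are written as Unicode escapes so the snippet does not depend on the file encoding.

```r
# Hypothetical helper: strip the punctuation used in the gsub() call above
# and convert the text to lower case
normalize_title <- function(x) {
  # \u2019 and \u2018 are the curly quotes, \u00b4 is the acute accent
  tolower(gsub("\u2019|\u2018|\u00b4|'|,|`|\\.|-", "", x))
}

# The four apostrophe variants of the same song title
variants <- c(
  "ForWhatIt'sWorth",
  "ForWhatIt\u00b4sWorth",
  "ForWhatIt\u2018sWorth",
  "ForWhatIt\u2019sWorth"
)

normalized <- normalize_title(variants)
print(unique(normalized))
```

All four variants collapse into one spelling, so their scores can be summed under a single entry.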

Similarities

Because things like punctuation don’t change the text very much, I wondered whether there is a way to calculate the difference between two texts. I should write a more detailed blog post about this method, but for now we can inspect the results of estimating the difference between song titles and between artists. I calculate a distance value for both of them. Afterwards, I add my originally calculated score in order to manually check whether I missed some significant songs or artists. The code is not very elegant, but it works for now.

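library(stringdist)  # stringdistmatrix()
library(dplyr)       # %>%, filter(), arrange(), mutate()

# Pairwise string distances between all songs and between all artists
d_song <- stringdistmatrix(Charts$song)
d_artist <- stringdistmatrix(Charts$artist)

# dist_to_df() is a helper that is not shown in this post; it is expected to
# return one line per pair with the columns row, col and value
song_sim <- dist_to_df(d_song)
song_sim$artist <- dist_to_df(d_artist)$value

# Keep only pairs with similar titles and similar artists
song_sim <- song_sim %>% 
  filter(value < 6,
         artist < 6) %>% 
  arrange(value, artist) %>% 
  mutate(artist_row = 0,
         artist_col = 0,
         score_row = 0,
         score_col = 0,
         song_row = 0,
         song_col = 0,
         place_row = 0,
         place_col = 0)

# Look up artist, score, song and place for both entries of each pair
for (i in 1:nrow(song_sim)) {
  song_sim[i,"artist_row"] = Charts[song_sim[i,1],"artist"]
  song_sim[i,"artist_col"] = Charts[song_sim[i,2],"artist"]
  song_sim[i,"score_row"] = Charts[song_sim[i,1],"score"]
  song_sim[i,"score_col"] = Charts[song_sim[i,2],"score"]
  song_sim[i,"song_row"] = Charts[song_sim[i,1],"song"]
  song_sim[i,"song_col"] = Charts[song_sim[i,2],"song"]
  song_sim[i,"place_row"] = Charts[song_sim[i,1],"place"]
  song_sim[i,"place_col"] = Charts[song_sim[i,2],"place"]
}

# Combined score of both entries (columns 7 and 8: score_row and score_col)
song_sim$new_score <- rowSums(song_sim[, 7:8])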
A peek at this table can be seen below. It reveals a lot of special cases of different spellings. Maybe I can use these values next week to improve my results. We can also see that this approach works pretty well for identifying similar songs or artists.

row	col	artist	artist_row	artist_col	score_row	score_col	song_row	song_col	place_row	place_col	new_score
85	35	1	snoopdoggfeatpharrellwillams	snoopdoggfeatpharrellwilliams	12	24	dropitlikeitshot	dropitlikeitshot	85	35	36
491	45	1	africabambaataa&soulsonicforce	afrikabambaataa&soulsonicforce	2	20	planetrock	planetrock	491	45	22
276	90	1	pubicenemy	publicenemy	6	12	dontbelievehype	dontbelievehype	276	90	18
476	186	1	fishmob	fischmob	2	8	susannezurfreiheit	susannezurfreiheit	476	186	10
293	200	1	käptnpeng&dietenktakelvondelphi	käptnpeng&dietentakelvondelphi	6	8	deranfangistnah	deranfangistnah	293	200	14
209	208	1	drdrefeatsnoopdoggkoruptnatedogg	drdrefeatsnoopdoggkuruptnatedogg	8	8	nextepisode	nextepisode	209	208	16
198	54	2	llcooljay	llcoolj	8	17	ineedlove	ineedlove	198	54	25
514	470	4	jurassicfive	jurassic5	1	2	concreteschoolyard	concreteschoolyard	514	470	3
274	238	5	rootsfeaturingerykahbadu	rootsfeaterykahbadu	6	7	yougotme	yougotme	274	238	13
217	113	6	cooliofeatlv	coolio	8	12	gangstasparadise	gangstasparadise	217	113	20
330	146	6	missieelliott&dabrat	missyelliottfeatdabrat	5	10	sockit2me	sockit2me	330	146	15
495	27	7	tonelōc	icecube	1	30	itwasagoodday	itwasagoodday	495	27	31
489	95	8	amine	mcsolaar	2	12	caroline	caroline	489	95	14
532	176	8	beginner	absolutebeginner	1	10	hammerhart	hammerhart	532	176	11
502	454	11	saltnpepa	saltnpepafeatenvogue	1	2	whattaman	whattaman	502	454	3
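
The dist_to_df() helper used in the code above is not shown in this post. The following is only a sketch of how such a helper could look, assuming it receives a "dist" object and returns one line per pair. To keep the example free of extra packages it uses adist() from base R, which computes Levenshtein distances and may differ slightly from the stringdist default.

```r
# Hypothetical stand-in for dist_to_df(): turn a "dist" object into a long
# data frame with one line per pair (lower triangle, so row > col)
dist_to_df <- function(d) {
  m <- as.matrix(d)
  keep <- lower.tri(m)
  data.frame(row = row(m)[keep], col = col(m)[keep], value = m[keep])
}

# Artist spellings taken from the table above
artists <- c("publicenemy", "pubicenemy", "fischmob", "fishmob")

# as.dist() keeps only the lower triangle of the full distance matrix
d_artist <- as.dist(adist(artists))
pairs <- dist_to_df(d_artist)

# Keep only the near-identical spellings
close_pairs <- pairs[pairs$value <= 1, ]
print(close_pairs)
```

The row and col values index into the original vector, which is what allows the lookups into Charts in the loop above.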

This week’s results

So much for the new scraping insights; let’s move on to my results for this week. As usual, I’m going to show the first 25 entries according to my web-scraping script.

Have fun re-listening to the show on Spotify, and see you next week.

place	artist	song	score	mentioned	average place
1	grandmasterflash&furiousfive	message	272	32	3.312500
2	eminem	loseyourself	134	20	4.550000
3	missyelliott	geturfreakon	125	18	4.444444
4	sugarhillgang	rappersdelight	109	20	5.750000
5	publicenemy	fightpower	109	18	5.277778
6	cypresshill	insaneinbrain	65	13	6.153846
7	beastieboys	sabotage	59	11	5.818182
8	nwa	fuckthapolice	57	9	5.000000
9	beastieboys	intergalactic	55	7	3.857143
10	2pacfeatdrdre&rogertroutman	californialove	52	12	6.666667
11	atribecalledquest	canikickit?	49	8	5.125000
12	rundmcvsaerosmith	walkthisway	49	7	4.571429
13	missyelliott	workit	49	6	3.166667
14	beastieboys	(yougotta)fightforyourright(toparty)	48	6	3.666667
15	houseofpain	jumparound	43	9	6.333333
16	kendricklamar	humble	43	5	3.200000
17	delasoul	memyself&i	39	6	5.000000
18	drdrefeatsnoopdogg	nuthinbutagthang	38	5	3.400000
19	beastieboys	sureshot	38	4	2.250000
20	nas	nystateofmind	37	6	5.500000
21	saltnpepa	pushit	34	6	5.500000
22	atribecalledquest	wepeople	34	3	1.333333
23	rootsfeatcodychesnutt	seed(20)	33	5	4.800000
24	beastieboys	sowhatchawant	33	4	3.500000
25	nwa	straightouttacompton	32	6	6.000000
