23. June 2019
100 Best – Summer Songs 1/4
I just reached 50% completion on DataCamp, and to get my head around a slightly different topic I wanted to start another small project. The local Berlin radio station radioeins is running its summer Sunday specials. A jury of around 100 people votes for the best songs on a specific topic. Every jury member creates an individual Top-10, and these lists are evaluated and combined into a final Top-100. The final #1 will be revealed at 7 p.m., but every member’s vote will be available at 9 a.m. So I thought it might be fun to do a little web scraping and figure out the top song.
Methods
rvest and SelectorGadget are a perfect combination for identifying website elements and extracting their information. First, I look at the website and think about the details I want to know. Afterwards, I extract the information by telling the rvest functions which elements are of interest, and transform them into a data frame that I can analyse.
Getting the webpages
The 100 Best website shows a list of all jury members. Their Top-10 is hidden as a link behind their names. All pictures are contained inside a main CSS element .containerMain and can be identified as the element .beitrag, which can be translated as something like “contribution”. So after loading the packages, my first code snippet selects exactly these elements.
Because I’m only interested in the link that leads to a Top-10, I extract just the href attribute. Sadly, the link sits behind both the person’s name and their picture, so I always get duplicate entries, which are removed by unique(). And because there are many more links than I’m interested in, I keep only the links that lead to a chart subpage.
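A minimal sketch of this step could look like the following. The HTML is a made-up stand-in for the real overview page (which would be loaded with read_html() and its URL), and the "/charts/" path pattern is an assumption for illustration only. It uses html_elements() from rvest 1.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the jury overview page
overview <- read_html('<html><body><div class="containerMain">
  <div class="beitrag">
    <a href="/charts/jury_a.html"><img src="a.jpg"></a>
    <a href="/charts/jury_a.html">Jury A</a>
  </div>
  <div class="beitrag">
    <a href="/charts/jury_b.html"><img src="b.jpg"></a>
    <a href="/charts/jury_b.html">Jury B</a>
    <a href="/imprint.html">Imprint</a>
  </div>
</div></body></html>')

# Collect the href of every link inside a .beitrag element and drop
# the duplicates caused by the picture and the name sharing one link
links <- overview %>%
  html_elements(".containerMain .beitrag a") %>%
  html_attr("href") %>%
  unique()

# Keep only links that point to a chart subpage (assumed path pattern)
chart_links <- links[grepl("/charts/", links, fixed = TRUE)]
print(chart_links)
```

unique() keeps the first occurrence of each link, so the order of the jury members on the page is preserved.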
Getting the votes
First, I want to define what the final data frame should look like. The score derives from the place in each Top-10 list. So I need the place, artist, song and, because it might be interesting for future analyses, the jury member’s name.
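Such an empty data frame could be set up roughly like this. The column names are my assumption, chosen to match the description above, and the score column stays empty until the votes are counted.

```r
# Empty results table; column names are assumed for illustration
results <- data.frame(
  place  = integer(),
  artist = character(),
  song   = character(),
  jury   = character(),
  score  = numeric(),
  stringsAsFactors = FALSE
)
str(results)
```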
I want to obtain the data from each vote and pass it into a data.frame. Here, I encountered a problem: the Top-10 tables aren’t always formatted the same way. In practice, this means I need a different function for each formatting type, or I get error messages. This led me to a chapter of Advanced R that revolves around exactly this topic. It is a great resource for digging deeper into programming with R.
I wanted to catch the error using tryCatch() and execute the alternative code. Unfortunately, this threw some different errors as well, which I couldn’t make sense of. So instead, I simply save all exception cases and extract the data for those separately.
The strategy is this: if I get a good table, I add it to my previously defined data.frame. If not, I note which page has a different format and save it to my exceptions vector.
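The strategy can be sketched as follows. The two pages are invented HTML strings standing in for real jury subpages, and read_top10() is a hypothetical helper. I assume here that a “good” page contains exactly one table with three columns (place, artist, song); the real pages may need a different check.

```r
library(rvest)

# Hypothetical jury subpages: jury_a is well-formed, jury_b has an extra table
pages <- c(
  jury_a = "<html><body><table>
    <tr><td>1</td><td>Beatles</td><td>Here Comes The Sun</td></tr>
    <tr><td>2</td><td>Kinks</td><td>Sunny Afternoon</td></tr>
  </table></body></html>",
  jury_b = "<html><body>
    <table><tr><td>Some additional note</td></tr></table>
    <table><tr><td>1</td><td>Ramones</td><td>Rockaway Beach</td></tr></table>
  </body></html>"
)

# Returns the Top-10 as a data frame, or NULL if the page looks unusual
read_top10 <- function(html) {
  tables <- html_elements(read_html(html), "table")
  if (length(tables) != 1) return(NULL)
  tab <- as.data.frame(html_table(tables[[1]]))
  if (ncol(tab) != 3) return(NULL)
  names(tab) <- c("place", "artist", "song")
  tab
}

results <- data.frame(place = integer(), artist = character(),
                      song = character(), jury = character(),
                      stringsAsFactors = FALSE)
exceptions <- character()

for (jury in names(pages)) {
  tab <- read_top10(pages[[jury]])
  if (is.null(tab)) {
    exceptions <- c(exceptions, jury)  # remember the page for later
  } else {
    tab$jury <- jury
    results <- rbind(results, tab)
  }
}
print(results)
print(exceptions)
```

Returning NULL instead of raising an error sidesteps tryCatch() entirely: the loop never fails, it just collects the pages that need special treatment.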
Exception Handling
The exceptions mainly consist of additional tables on a jury member’s subpage. As you can see below, the code is pretty similar to the code above. After extracting the “exceptional data”, I add it to the Results data.frame.
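A sketch for one such exception, under the assumption that the actual Top-10 is the only table with three columns on the page. The page content, the jury name and the existing results row are made up for illustration.

```r
library(rvest)

# Results collected so far (made-up example row)
results <- data.frame(place = 1L, artist = "Beatles",
                      song = "Here Comes The Sun", jury = "jury_a",
                      stringsAsFactors = FALSE)

# Hypothetical exception page with an additional table before the Top-10
exception_page <- "<html><body>
  <table><tr><td>Some additional note</td></tr></table>
  <table>
    <tr><td>1</td><td>Ramones</td><td>Rockaway Beach</td></tr>
    <tr><td>2</td><td>Weezer</td><td>Island In The Sun</td></tr>
  </table>
</body></html>"

# Read all tables and pick the first one that has three columns
tables <- html_table(html_elements(read_html(exception_page), "table"))
n_cols <- vapply(tables, ncol, integer(1))
top10 <- as.data.frame(tables[[which(n_cols == 3)[1]]])

names(top10) <- c("place", "artist", "song")
top10$jury <- "jury_b"
results <- rbind(results, top10)
print(results)
```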
Counting the votes
Counting the votes is an interesting part, because it requires implementing the show’s complete set of rules in the summarising step. But first things first: checking for accidental NAs.
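One simple way to run such a check in base R, shown on a made-up data frame with two deliberately missing values:

```r
# Example data with two accidental NAs (invented for illustration)
results <- data.frame(
  place  = c(1L, 2L, NA),
  artist = c("Beatles", NA, "Kinks"),
  song   = c("HereComesSun", "SunnyAfternoon", "SunnyAfternoon"),
  stringsAsFactors = FALSE
)

# Number of NAs per column
na_per_column <- colSums(is.na(results))
print(na_per_column)

# Rows that contain at least one NA, for manual inspection
incomplete_rows <- results[!complete.cases(results), ]
print(incomplete_rows)
```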
If a song is listed in a Top-10, it gets a score corresponding to its place. Because this relationship isn’t linear, I define a small dictionary in which every place is mapped to a score: place_dict holds the place in a Top-10 and score_dict holds the corresponding score. Afterwards, I filter for all songs that were voted into the same place and assign the score to the still-empty score column of the data.frame.
To generate the final Top-100 chart position, I group by each unique combination of artist and song and sum the score for each combination. In the end, I order the result by score first, then by how many times the song was mentioned, and finally alphabetically by artist and song.
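A sketch of the grouping and ordering with dplyr (version 1.0 or later for the .groups argument). The votes and scores are invented, and I assume that a higher score and more mentions both rank higher.

```r
library(dplyr)

# Made-up scored votes for illustration
results <- data.frame(
  place  = c(1L, 5L, 5L, 3L),
  artist = c("Beatles", "Kinks", "Kinks", "Ramones"),
  song   = c("HereComesSun", "SunnyAfternoon", "SunnyAfternoon", "RockawayBeach"),
  score  = c(12, 6, 6, 8),
  stringsAsFactors = FALSE
)

chart <- results %>%
  group_by(artist, song) %>%
  summarise(
    score         = sum(score),
    mentioned     = n(),
    average_place = mean(place),
    .groups       = "drop"
  ) %>%
  # Score first, then number of mentions, then alphabetical order
  arrange(desc(score), desc(mentioned), artist, song) %>%
  mutate(place = row_number())
print(chart)
```

In this example the Kinks and the Beatles tie at 12 points, and the Kinks rank first because they were mentioned twice.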
Results
Finally, I can present my web-scraping results for the best summer songs of all time. As I was listening to the show today, I realized I still have some problems to fix. For example, the same song is sometimes spelled differently and is hence treated as different songs. Also, bands and artists are sometimes named with an article and sometimes without. Further, the article makes a difference when ordering alphabetically by artist. I think I should exclude every “the”, no matter what.
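One possible way to make such names comparable is sketched below. normalise_title() is a hypothetical helper that drops every standalone “the” and all whitespace. This is only one idea and not necessarily the approach used for the final chart.

```r
# Hypothetical helper: remove standalone "the" (any case) and all whitespace
normalise_title <- function(x) {
  x <- gsub("\\bthe\\b", "", x, ignore.case = TRUE, perl = TRUE)
  gsub("\\s+", "", x)
}

normalised <- normalise_title(
  c("The Beatles", "Here Comes the Sun", "In The Summertime")
)
print(normalised)
```

Because of the word boundaries, words that merely start with “the” (such as “Theme”) are left untouched. Differences in punctuation or spelling would still need separate handling.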
Nevertheless, I successfully generated the #1 title. It is interesting that its average place across all Top-10s is only around 6, so the sheer number of mentions (26 times) led to such a high position. It will be interesting to consider different scoring systems, for example ordering by average place for every title mentioned 5 times or more. I definitely have several questions to answer and methods to try in order to improve the predictions. But for now, have fun web scraping and listening to some summer songs in the official Spotify playlist.
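The alternative ranking by average place could be sketched like this, using four rows taken from the result table in this post. The threshold of 5 mentions is the one suggested above. All four rows pass it, so the filter only matters on the full data.

```r
# Four rows copied from the result table in this post
chart <- data.frame(
  artist        = c("Lovin'Spoonful", "Beatles", "Kinks",
                    "EllaFitzgerald&LouisArmstrong"),
  mentioned     = c(27L, 17L, 14L, 9L),
  average_place = c(5.740741, 4.058823, 4.428571, 3.888889),
  stringsAsFactors = FALSE
)

min_mentions <- 5

# Keep songs with enough mentions, then rank by best (lowest) average place
eligible   <- chart[chart$mentioned >= min_mentions, ]
by_average <- eligible[order(eligible$average_place), ]
print(by_average)
```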
UPDATE 2019-06-30: Updated the summer song top list after new insights, described in the follow-up post.
place | artist | song | score | mentioned | average place |
---|---|---|---|---|---|
1 | Lovin'Spoonful | SummerInCity | 153 | 27 | 5.740741 |
2 | Beatles | HereComesSun | 128 | 17 | 4.058823 |
3 | MungoJerry | InSummertime | 120 | 19 | 5.157895 |
4 | Kinks | SunnyAfternoon | 99 | 14 | 4.428571 |
5 | Ramones | RockawayBeach | 97 | 18 | 5.666667 |
6 | EddieCochran | SummertimeBlues | 94 | 15 | 5.000000 |
7 | Undertones | HereComesSummer | 86 | 12 | 4.416667 |
8 | Weezer | IslandInSun | 77 | 15 | 6.133333 |
9 | DonHenley | BoysOfSummer | 71 | 12 | 5.416667 |
10 | EllaFitzgerald&LouisArmstrong | Summertime | 69 | 9 | 3.888889 |
11 | JanisJoplin | Summertime | 66 | 9 | 4.444444 |
12 | BeachBoys | GoodVibrations | 62 | 10 | 5.300000 |
13 | DieFantastischenVier | TagAmMeer | 59 | 11 | 6.181818 |
14 | BryanAdams | SummerOf'69 | 44 | 10 | 7.100000 |
15 | DJJazzyJeff&FreshPrince | Summertime | 44 | 7 | 6.000000 |
16 | Beatsteaks | Summer | 42 | 9 | 8.000000 |
17 | ViolentFemmes | BlisterInSun | 41 | 7 | 6.714286 |
18 | Caribou | Sun | 40 | 7 | 6.857143 |
19 | LaidBack | SunshineReggae | 39 | 8 | 7.625000 |
20 | BillWithers | LovelyDay | 39 | 7 | 6.285714 |
21 | LeeHazlewood&NancySinatra | SummerWine | 38 | 10 | 7.600000 |
22 | BobMarley&Wailers | SunIsShining | 38 | 10 | 8.100000 |
23 | LanaDelRey | SummertimeSadness | 37 | 9 | 8.222222 |
24 | MarthaReeves&Vandellas | DancingInStreet | 34 | 6 | 10.333333 |
25 | Bananarama | CruelSummer | 34 | 6 | 7.833333 |