Worksheet 1#

Girls in Data Science Camp#

In this worksheet, we will use a data set obtained from the Spotify Web API that contains the top 50 tracks from 2023 (more details can be found here).

https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_RGB_Green.png

I have reduced the data set to include the following variables:

  • artist_name: the artist name

  • track_name: the title of the track

  • album_release_date: The date when the track was released

  • genres: A list of genres associated with the track’s artist(s)

  • danceability: A measure from 0.0 to 1.0 indicating how suitable a track is for dancing based on a combination of musical elements

  • energy: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity

  • loudness: The overall loudness of a track in decibels (dB)

  • key: The key the track is in. Integers map to pitches using standard Pitch Class notation.

  • tempo: The overall estimated tempo of a track in beats per minute (BPM)

  • duration_ms: The length of the track in milliseconds

  • time_signature: An estimated overall time signature of a track

  • popularity: A score between 0 and 100, with 100 being the most popular

I also included a new variable called pop which is yes if the song falls into any type of pop genre and no otherwise.

# Load libraries

library(tidyverse)
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.5
 forcats   1.0.0      stringr   1.5.1
 ggplot2   3.5.1      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.1
 purrr     1.0.2     
── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read in the data

spotify <- read_csv("data/spotify_top_50_2023.csv")
Rows: 50 Columns: 12
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist_name, track_name, genres
dbl  (8): danceability, energy, loudness, key, tempo, duration_ms, time_sign...
date (1): album_release_date
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create 'pop' column 

spotify <- spotify |> 
                mutate(pop = if_else(str_detect(genres, "pop"), "yes", "no"))
head(spotify)
A tibble: 6 × 13
artist_name | track_name | album_release_date | genres | danceability | energy | loudness | key | tempo | duration_ms | time_signature | popularity | pop
<chr> | <chr> | <date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr>
Miley Cyrus | Flowers | 2023-08-18 | ['pop'] | 0.706 | 0.691 | -4.775 | 0 | 118.048 | 200600 | 4 | 94 | yes
SZA | Kill Bill | 2022-12-08 | ['pop', 'r&b', 'rap'] | 0.644 | 0.735 | -5.747 | 8 | 88.980 | 153947 | 4 | 86 | yes
Harry Styles | As It Was | 2022-05-20 | ['pop'] | 0.520 | 0.731 | -5.338 | 6 | 173.930 | 167303 | 4 | 95 | yes
Jung Kook | Seven (feat. Latto) (Explicit Ver.) | 2023-11-03 | ['k-pop'] | 0.790 | 0.831 | -4.185 | 11 | 124.987 | 183551 | 4 | 90 | yes
Eslabon Armado | Ella Baila Sola | 2023-04-28 | ['corrido', 'corridos tumbados', 'sad sierreno', 'sierreno'] | 0.668 | 0.758 | -5.176 | 5 | 147.989 | 165671 | 3 | 86 | no
Taylor Swift | Cruel Summer | 2019-08-23 | ['pop'] | 0.552 | 0.702 | -5.707 | 9 | 169.994 | 178427 | 4 | 99 | yes
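
The pop column relies on str_detect matching "pop" anywhere inside the genres string, so sub-genres such as k-pop or bedroom pop also count as pop. The sketch below illustrates this on four genre strings copied from the worksheet's tables; the vector name genre_strings is only an illustrative choice.

```r
library(tidyverse)

# Four genre strings copied from tables shown in this worksheet
genre_strings <- c(
  "['k-pop']",
  "['bedroom pop']",
  "['hip hop', 'rap']",
  "['piano rock', 'pop']"
)

# str_detect() returns TRUE wherever the pattern appears as a substring
is_pop <- str_detect(genre_strings, "pop")
is_pop

# The same yes/no labels that the worksheet's mutate() call would produce
if_else(is_pop, "yes", "no")
```

Note that the match is case-sensitive, so a genre written as "Pop" would not be detected by this pattern.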

Exercise 1: Wrangling#

1.1#

Is the spotify data tidy? Why or why not?

Answer: The data set is mostly tidy: each row corresponds to a single observation and each column corresponds to a single variable. However, not every cell holds a single value. For example, the genres column sometimes lists multiple genres per song (e.g., the song “Kill Bill” is classified under three genres: pop, r&b and rap).
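
One possible way to give every cell a single value is to split genres so that each genre sits on its own row. This is only a sketch, not part of the worksheet's solution: it uses three rows copied from the head(spotify) output, the names genres_demo and genres_long are illustrative, and it assumes the genre strings always follow the ['a', 'b'] pattern seen above.

```r
library(tidyverse)

# Three rows copied from the head(spotify) output in this worksheet
genres_demo <- tibble(
  track_name = c("Flowers", "Kill Bill", "As It Was"),
  genres = c("['pop']", "['pop', 'r&b', 'rap']", "['pop']")
)

# Remove the brackets and quotes, then split on ", " so that
# each genre of a track ends up on its own row
genres_long <- genres_demo |>
  mutate(genres = str_remove_all(genres, "[\\[\\]']")) |>
  separate_longer_delim(genres, delim = ", ")

genres_long
```

After this step a track with several genres appears on several rows, so per-track summaries (such as average popularity) are better computed on the original one-row-per-track data.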

1.2#

What are the dimensions of this data set (i.e., the number of rows and columns)?

dim(spotify) 
  1. 50
  2. 13

1.3#

Without examining the entire data set, which artist and track is in the 35th row of the data set? Note that your code should return only the required variables (artist_name and track_name).

spotify |> 
    slice(35) |>
    select(artist_name, track_name)
A tibble: 1 × 2
artist_name | track_name
<chr> | <chr>
Doja Cat | Paint The Town Red

1.4#

Create a subset of the data that only includes tracks with a popularity score over 90. Assign this to an object called pop_90. How many songs have a popularity score over 90?

pop_90 <- spotify |> 
            filter(popularity > 90)

pop_90
A tibble: 20 × 13
artist_name | track_name | album_release_date | genres | danceability | energy | loudness | key | tempo | duration_ms | time_signature | popularity | pop
<chr> | <chr> | <date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr>
Miley Cyrus | Flowers | 2023-08-18 | ['pop'] | 0.706 | 0.691 | -4.775 | 0 | 118.048 | 200600 | 4 | 94 | yes
Harry Styles | As It Was | 2022-05-20 | ['pop'] | 0.520 | 0.731 | -5.338 | 6 | 173.930 | 167303 | 4 | 95 | yes
Taylor Swift | Cruel Summer | 2019-08-23 | ['pop'] | 0.552 | 0.702 | -5.707 | 9 | 169.994 | 178427 | 4 | 99 | yes
Metro Boomin | Creepin' (with The Weeknd & 21 Savage) | 2022-12-02 | ['rap'] | 0.715 | 0.620 | -6.005 | 1 | 97.950 | 221520 | 4 | 91 | no
Taylor Swift | Anti-Hero | 2022-10-21 | ['pop'] | 0.637 | 0.643 | -6.571 | 4 | 97.008 | 200690 | 4 | 92 | yes
Arctic Monkeys | I Wanna Be Yours | 2013-09-09 | ['garage rock', 'modern rock', 'permanent wave', 'rock', 'sheffield indie'] | 0.464 | 0.417 | -9.345 | 0 | 67.528 | 183956 | 4 | 96 | no
David Guetta | I'm Good (Blue) | 2022-08-26 | ['big room', 'dance pop', 'edm', 'pop', 'pop dance'] | 0.561 | 0.965 | -3.673 | 7 | 128.040 | 175238 | 4 | 92 | yes
David Kushner | Daylight | 2023-04-14 | ['gen z singer-songwriter', 'singer-songwriter pop'] | 0.508 | 0.430 | -9.475 | 2 | 130.090 | 212954 | 4 | 94 | yes
The Weeknd | Starboy | 2016-11-25 | ['canadian contemporary r&b', 'canadian pop', 'pop'] | 0.679 | 0.587 | -7.015 | 7 | 186.003 | 230453 | 4 | 95 | yes
OneRepublic | I Ain't Worried | 2022-05-13 | ['piano rock', 'pop'] | 0.704 | 0.797 | -5.927 | 0 | 139.994 | 148486 | 4 | 92 | yes
Myke Towers | LALA | 2023-03-23 | ['reggaeton', 'trap latino', 'urbano latino'] | 0.708 | 0.737 | -4.045 | 1 | 91.986 | 197920 | 4 | 93 | no
Doja Cat | Paint The Town Red | 2023-09-20 | ['dance pop', 'pop'] | 0.864 | 0.556 | -7.683 | 2 | 99.974 | 230480 | 4 | 93 | yes
The Neighbourhood | Sweater Weather | 2013-04-19 | ['modern alternative rock', 'modern rock', 'pop'] | 0.612 | 0.807 | -2.810 | 10 | 124.053 | 240400 | 4 | 93 | yes
SZA | Snooze | 2022-12-09 | ['pop', 'r&b', 'rap'] | 0.559 | 0.551 | -7.231 | 5 | 143.008 | 201800 | 4 | 93 | yes
The Weeknd | Blinding Lights | 2020-03-20 | ['canadian contemporary r&b', 'canadian pop', 'pop'] | 0.514 | 0.730 | -5.934 | 1 | 171.005 | 200040 | 4 | 93 | yes
Jimin | Like Crazy | 2023-03-24 | ['k-pop'] | 0.629 | 0.733 | -5.445 | 7 | 120.001 | 212241 | 4 | 93 | yes
Eminem | Mockingbird | 2004-11-12 | ['detroit hip hop', 'hip hop', 'rap'] | 0.637 | 0.678 | -3.798 | 0 | 84.039 | 250760 | 4 | 91 | no
Olivia Rodrigo | vampire | 2023-09-08 | ['pop'] | 0.511 | 0.532 | -5.745 | 5 | 138.005 | 219724 | 4 | 95 | yes
Tyler, The Creator | See You Again (feat. Kali Uchis) | 2017-07-21 | ['hip hop', 'rap'] | 0.558 | 0.559 | -9.222 | 6 | 78.558 | 180387 | 4 | 92 | no
d4vd | Romantic Homicide | 2022-07-20 | ['bedroom pop'] | 0.571 | 0.544 | -10.613 | 6 | 132.052 | 132631 | 4 | 91 | yes
nrow(pop_90) 
20

20 tracks have a popularity score over 90.

1.5#

Now, I want to look at pop songs that have a danceability score over 0.7. Create a subset of the spotify data set to achieve this task, ordered from highest to lowest danceability.

spotify |>
    filter(danceability > 0.7 & pop == "yes") |> 
    arrange(desc(danceability))
A tibble: 13 × 13
artist_name | track_name | album_release_date | genres | danceability | energy | loudness | key | tempo | duration_ms | time_signature | popularity | pop
<chr> | <chr> | <date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr>
Ozuna | Hey Mor | 2022-10-07 | ['puerto rican pop', 'reggaeton', 'trap latino', 'urbano latino'] | 0.901 | 0.589 | -6.713 | 1 | 98.002 | 196600 | 4 | 87 | yes
Doja Cat | Paint The Town Red | 2023-09-20 | ['dance pop', 'pop'] | 0.864 | 0.556 | -7.683 | 2 | 99.974 | 230480 | 4 | 93 | yes
Feid | CLASSY 101 | 2023-03-31 | ['colombian pop', 'pop reggaeton', 'reggaeton', 'reggaeton colombiano', 'trap latino', 'urbano latino'] | 0.859 | 0.658 | -4.790 | 11 | 100.065 | 195987 | 4 | 87 | yes
Manuel Turizo | La Bachata | 2023-03-17 | ['colombian pop', 'latin pop', 'reggaeton', 'reggaeton colombiano', 'trap latino', 'urbano latino'] | 0.835 | 0.679 | -5.329 | 7 | 124.980 | 162638 | 4 | 85 | yes
NewJeans | OMG | 2023-01-02 | ['k-pop', 'k-pop girl group'] | 0.804 | 0.771 | -4.067 | 9 | 126.956 | 212253 | 4 | 87 | yes
Rema | Calm Down (with Selena Gomez) | 2023-04-27 | ['afrobeats', 'nigerian pop'] | 0.799 | 0.802 | -5.196 | 11 | 107.008 | 239318 | 4 | 90 | yes
Jung Kook | Seven (feat. Latto) (Explicit Ver.) | 2023-11-03 | ['k-pop'] | 0.790 | 0.831 | -4.185 | 11 | 124.987 | 183551 | 4 | 90 | yes
FIFTY FIFTY | Cupid - Twin Ver. | 2023-09-22 | ['k-pop girl group'] | 0.783 | 0.592 | -8.332 | 11 | 120.018 | 174253 | 4 | 75 | yes
Bizarrap | Shakira: Bzrp Music Sessions, Vol. 53 | 2023-01-11 | ['argentine hip hop', 'pop venezolano', 'trap argentino', 'trap latino', 'urbano latino'] | 0.778 | 0.632 | -5.600 | 2 | 122.104 | 218289 | 4 | 85 | yes
Taylor Swift | Blank Space | 2014-10-27 | ['pop'] | 0.753 | 0.678 | -5.421 | 5 | 96.006 | 231827 | 4 | 78 | yes
Sam Smith | Unholy (feat. Kim Petras) | 2023-01-27 | ['pop', 'uk pop'] | 0.712 | 0.463 | -7.399 | 2 | 131.199 | 156943 | 4 | 84 | yes
Miley Cyrus | Flowers | 2023-08-18 | ['pop'] | 0.706 | 0.691 | -4.775 | 0 | 118.048 | 200600 | 4 | 94 | yes
OneRepublic | I Ain't Worried | 2022-05-13 | ['piano rock', 'pop'] | 0.704 | 0.797 | -5.927 | 0 | 139.994 | 148486 | 4 | 92 | yes

1.6#

What is the average danceability score for Taylor Swift songs?

# Filter for only Taylor Swift songs
spotify |>
    filter(artist_name == "Taylor Swift") 
A tibble: 3 × 13
artist_name | track_name | album_release_date | genres | danceability | energy | loudness | key | tempo | duration_ms | time_signature | popularity | pop
<chr> | <chr> | <date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr>
Taylor Swift | Cruel Summer | 2019-08-23 | ['pop'] | 0.552 | 0.702 | -5.707 | 9 | 169.994 | 178427 | 4 | 99 | yes
Taylor Swift | Anti-Hero | 2022-10-21 | ['pop'] | 0.637 | 0.643 | -6.571 | 4 | 97.008 | 200690 | 4 | 92 | yes
Taylor Swift | Blank Space | 2014-10-27 | ['pop'] | 0.753 | 0.678 | -5.421 | 5 | 96.006 | 231827 | 4 | 78 | yes
# Compute the average danceability score using `summarize`
spotify |>
    filter(artist_name == "Taylor Swift") |>
    summarize(avg_dance_TS = mean(danceability))
A tibble: 1 × 1
avg_dance_TS
<dbl>
0.6473333

1.7#

Are Taylor Swift’s songs more danceable than The Weeknd’s? Hint: use group_by.

spotify |>
     group_by(artist_name) |>
     summarize(avg_dance = mean(danceability)) |> 
     filter(artist_name == "Taylor Swift" | artist_name == "The Weeknd")
A tibble: 2 × 2
artist_name | avg_dance
<chr> | <dbl>
Taylor Swift | 0.6473333
The Weeknd | 0.5885000

1.8#

Are pop songs more popular than other genres? Compare the average popularity scores between pop songs vs. other genres.

spotify |>
     group_by(pop) |>
     summarize(avg_popularity = mean(popularity))
A tibble: 2 × 2
pop | avg_popularity
<chr> | <dbl>
no | 87.75000
yes | 88.26471

Exercise 2: Data Visualization#

2.1#

Using a histogram, visualize the distribution of popularity scores for Spotify’s top 50 tracks from 2023. Describe what you see.

ggplot(spotify, aes(x = popularity)) +
    geom_histogram(bins=25, fill = "cornflowerblue", color = "black") + 
    theme(text = element_text(size = 26)) + # increase text size
    labs(x='Popularity Score', y='Count', title='Popularity Score')
[Figure: histogram of popularity scores (x-axis: Popularity Score, y-axis: Count)]

The popularity scores range from about 72 to 99, with two main points of concentration around 85 and 95.

2.2#

Are pop songs more popular than other genres? You answered this in question 1.8 using summary statistics, but now use histograms to visualize the popularity distributions for the two groups on the same graph. Do you notice anything different?

Hint: use facet_grid

ggplot(spotify, aes(x = popularity, fill = pop)) +
    geom_histogram(bins=25, color = "black") +
    facet_grid(rows = vars(pop)) +
    theme(text = element_text(size = 26)) +
    labs(x='Popularity Score', y='Count', title='Histogram of Popularity Score')
[Figure: histograms of popularity score for non-pop (no) and pop (yes) songs, one panel per group]

While the centers of the two distributions are very similar (both around 88), the popularity scores for pop songs show more variability and a wider spread than those for songs that are not classified as pop.
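
The difference in spread can also be checked numerically by adding a standard deviation and range to the grouped summary from question 1.8. The sketch below is one way to do this; summarize_spread is a hypothetical helper name, and popularity_demo holds only nine tracks copied from the tables printed above, so its numbers illustrate the approach rather than the result for the full data set.

```r
library(tidyverse)

# Nine tracks copied from tables printed in this worksheet (not the full data)
popularity_demo <- tibble(
  track_name = c("Flowers", "Kill Bill", "As It Was",
                 "Seven (feat. Latto) (Explicit Ver.)", "Ella Baila Sola",
                 "Cruel Summer", "Creepin' (with The Weeknd & 21 Savage)",
                 "Mockingbird", "Cupid - Twin Ver."),
  popularity = c(94, 86, 95, 90, 86, 99, 91, 91, 75),
  pop = c("yes", "yes", "yes", "yes", "no", "yes", "no", "no", "yes")
)

# Hypothetical helper: center and spread of popularity for each pop group
summarize_spread <- function(data) {
  data |>
    group_by(pop) |>
    summarize(
      n = n(),
      avg_popularity = mean(popularity),
      sd_popularity = sd(popularity),
      min_popularity = min(popularity),
      max_popularity = max(popularity)
    )
}

spread_demo <- summarize_spread(popularity_demo)
spread_demo
```

With the worksheet's data loaded, summarize_spread(spotify) would give the same summary for all 50 tracks.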

2.3#

Create a barplot comparing the counts of pop songs vs. non-pop songs.

ggplot(spotify, aes(x = pop)) +
    geom_bar(fill = "cornflowerblue", color = "black") +
    theme(text = element_text(size = 26)) + 
    labs(x = 'Pop', y = 'Count', title = 'Barplot of Pop vs. Non-Pop Songs') 
  
[Figure: barplot of the counts of pop vs. non-pop songs]

2.4#

Is there a relationship between how loud a song is in decibels and its popularity? Visualize the relationship between loudness and popularity with a scatterplot, plotting loudness on the \(y\)-axis and popularity on the \(x\)-axis. What do you notice?

options(repr.plot.width = 10, repr.plot.height = 8)

ggplot(spotify, aes(x = popularity, y = loudness)) + 
    geom_point(size=4) +
    theme(text = element_text(size = 26)) + 
    labs(x='Popularity Score', y='Loudness (dB)',title='Scatterplot of Loudness vs. Popularity') # rename axes and add title
[Figure: scatterplot of loudness (dB) vs. popularity score]

It seems like there is a relatively weak positive relationship between loudness and popularity score.

2.5 (Challenge)#

Find the song that has the highest popularity score, but a relatively moderate loudness. Highlight the point on the graph and label it.

Hint: Use the annotate function to highlight a point

spotify |> 
    arrange(desc(popularity)) |>
    slice(1) |>
    select(artist_name, track_name, popularity, loudness)
A tibble: 1 × 4
artist_name | track_name | popularity | loudness
<chr> | <chr> | <dbl> | <dbl>
Taylor Swift | Cruel Summer | 99 | -5.707
options(repr.plot.width = 10, repr.plot.height = 8)

ggplot(spotify, aes(x = popularity, y = loudness)) + 
    geom_point(size=4) +
    annotate("point", x = 99, y = -5.707, color = 'violet', size = 4) + 
    annotate("text", x = 98, y = -5.4, label = "Cruel Summer", color = "violet", size = 5) + # add a single text label to the plot
    theme(text = element_text(size = 26)) + 
    labs(x='Popularity Score', y='Loudness (dB)',title='Scatterplot of Loudness vs. Popularity') # rename axes and add title
[Figure: scatterplot of loudness (dB) vs. popularity score, with the point for "Cruel Summer" highlighted in violet and labeled]

2.6 (Challenge)#

List the artists who have more than one track in the top 50. For each artist, show the number of tracks and their average popularity score.

popular_artists <- spotify |>
    group_by(artist_name) |> 
    summarize(track_count = n(), avg_popularity = mean(popularity)) |>
    filter(track_count > 1)

popular_artists
A tibble: 6 × 3
artist_name | track_count | avg_popularity
<chr> | <int> | <dbl>
Bad Bunny | 2 | 86.50000
Bizarrap | 2 | 86.50000
SZA | 2 | 89.50000
Taylor Swift | 3 | 89.66667
The Weeknd | 4 | 89.50000
d4vd | 2 | 90.50000