Advanced Data Manipulation with Pandas¶

In this notebook, we will explore various advanced data manipulation techniques using the pandas library. We will start by importing a dataframe and then perform operations such as merging dataframes, splitting columns, duplicating and modifying columns, and applying functions.

Let's begin by importing the necessary libraries and loading the dataframe.

In [ ]:
# import tayor_swift_spotify.csv with pandas
import pandas as pd

# read csv file
df = pd.read_csv('taylor_swift_spotify.csv', index_col=0)
In [ ]:
df.head()
Out[ ]:
name album release_date track_number id uri acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 Welcome To New York (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 1 4WUepByoeqcedHoYhSNHRt spotify:track:4WUepByoeqcedHoYhSNHRt 0.009420 0.757 0.610 0.000037 0.3670 -4.840 0.0327 116.998 0.685 72 212600
1 Blank Space (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 2 0108kcWLnn2HlH2kedi1gn spotify:track:0108kcWLnn2HlH2kedi1gn 0.088500 0.733 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 73 231833
2 Style (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 3 3Vpk1hfMAQme8VJ0SNRSkd spotify:track:3Vpk1hfMAQme8VJ0SNRSkd 0.000421 0.511 0.822 0.019700 0.0899 -4.785 0.0397 94.868 0.305 74 231000
3 Out Of The Woods (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 4 1OcSfkeCg9hRC2sFKB4IMJ spotify:track:1OcSfkeCg9hRC2sFKB4IMJ 0.000537 0.545 0.885 0.000056 0.3850 -5.968 0.0447 92.021 0.206 73 235800
4 All You Had To Do Was Stay (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 5 2k0ZEeAqzvYMcx9Qt5aClQ spotify:track:2k0ZEeAqzvYMcx9Qt5aClQ 0.000656 0.588 0.721 0.000000 0.1310 -5.579 0.0317 96.997 0.520 72 193289
In [ ]:
df.tail()
Out[ ]:
name album release_date track_number id uri acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
525 Our Song Taylor Swift 2006-10-24 11 15DeqWWQB4dcEWzJg15VrN spotify:track:15DeqWWQB4dcEWzJg15VrN 0.11100 0.668 0.672 0.000000 0.3290 -4.931 0.0303 89.011 0.539 74 201106
526 I'm Only Me When I'm With You Taylor Swift 2006-10-24 12 0JIdBrXGSJXS72zjF9ss9u spotify:track:0JIdBrXGSJXS72zjF9ss9u 0.00452 0.563 0.934 0.000807 0.1030 -3.629 0.0646 143.964 0.518 59 213053
527 Invisible Taylor Swift 2006-10-24 13 5OOd01o2YS1QFwdpVLds3r spotify:track:5OOd01o2YS1QFwdpVLds3r 0.63700 0.612 0.394 0.000000 0.1470 -5.723 0.0243 96.001 0.233 56 203226
528 A Perfectly Good Heart Taylor Swift 2006-10-24 14 1spLfUJxtyVyiKKTegQ2r4 spotify:track:1spLfUJxtyVyiKKTegQ2r4 0.00349 0.483 0.751 0.000000 0.1280 -5.726 0.0365 156.092 0.268 54 220146
529 Teardrops on My Guitar - Pop Version Taylor Swift 2006-10-24 15 4pJi1rVt9GNegU9kywjg4z spotify:track:4pJi1rVt9GNegU9kywjg4z 0.04020 0.459 0.753 0.000000 0.0863 -3.827 0.0537 199.997 0.483 55 179066
In [ ]:
# how many columns are there?
print('there are', df.shape[1], 'columns')

# the columns are?
print('the columns are:', df.columns.values)
there are 17 columns
the columns are: ['name' 'album' 'release_date' 'track_number' 'id' 'uri' 'acousticness'
 'danceability' 'energy' 'instrumentalness' 'liveness' 'loudness'
 'speechiness' 'tempo' 'valence' 'popularity' 'duration_ms']

Split the dataframe¶

Let's split the df into two dataframes.

One dataframe will contain metadata about the songs and the other will contain the audio characteristics of the songs.

In [ ]:
# let's create a new dataframe that has only metadata, so the first 5 columns
df_metadata = df.iloc[:, :5]

df_metadata.head()
Out[ ]:
name album release_date track_number id
0 Welcome To New York (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 1 4WUepByoeqcedHoYhSNHRt
1 Blank Space (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 2 0108kcWLnn2HlH2kedi1gn
2 Style (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 3 3Vpk1hfMAQme8VJ0SNRSkd
3 Out Of The Woods (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 4 1OcSfkeCg9hRC2sFKB4IMJ
4 All You Had To Do Was Stay (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 5 2k0ZEeAqzvYMcx9Qt5aClQ
In [ ]:
# let's create a new dataframe that has only song characteristics, so the last 11 columns
df_characteristics = df.iloc[:, 6:]
df_characteristics.head()
Out[ ]:
acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 0.009420 0.757 0.610 0.000037 0.3670 -4.840 0.0327 116.998 0.685 72 212600
1 0.088500 0.733 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 73 231833
2 0.000421 0.511 0.822 0.019700 0.0899 -4.785 0.0397 94.868 0.305 74 231000
3 0.000537 0.545 0.885 0.000056 0.3850 -5.968 0.0447 92.021 0.206 73 235800
4 0.000656 0.588 0.721 0.000000 0.1310 -5.579 0.0317 96.997 0.520 72 193289
In [ ]:
# you can also do it like this and creating more sub-df in one line
df_metadata, df_characteristics = df.iloc[:, :5], df.iloc[:, 6:]

Metadata df¶

let's first see if we can divide the df with respect to album. Are there any repeated entries that correspond to different versions of the same album?

In [ ]:
# print the unique values of the column 'album'
for album in df_metadata['album'].unique():
    print(album)
1989 (Taylor's Version) [Deluxe]
1989 (Taylor's Version)
Speak Now (Taylor's Version)
Midnights (The Til Dawn Edition)
Midnights (3am Edition)
Midnights
Red (Taylor's Version)
Fearless (Taylor's Version)
evermore (deluxe version)
evermore
folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]
folklore (deluxe version)
folklore
Lover
reputation
reputation Stadium Tour Surprise Song Playlist
1989 (Deluxe Edition)
1989
Red (Deluxe Edition)
Red
Speak Now World Tour Live
Speak Now (Deluxe Edition)
Speak Now
Fearless Platinum Edition
Fearless
Live From Clear Channel Stripped 2008
Taylor Swift

As you can see several albums are repeated since they have been published in different versions. What can we deal with this?

It might be reasonable in this case to keep only the last version of each album that was published. How do we do this?

In [ ]:
# sort the dataframe by the column 'album' and year
df_metadata.sort_values(by=['album', 'release_date'], inplace=True)
In [ ]:
df_metadata
Out[ ]:
name album release_date track_number id
374 Welcome To New York 1989 2014-10-27 1 6qnM0XXPZOINWA778uNqQ9
375 Blank Space 1989 2014-10-27 2 1p80LdxRV74UKvL8gnD7ky
376 Style 1989 2014-10-27 3 4lIxdJw6W3Fg4vUIYCB0S5
377 Out Of The Woods 1989 2014-10-27 4 5OndtwLGA9O6XHFcGm2H7r
378 All You Had To Do Was Stay 1989 2014-10-27 5 0dAb8TY433dl3ZfXYCLE19
... ... ... ... ... ...
350 Wonderland reputation Stadium Tour Surprise Song Playlist 2017-11-09 42 0eTCaVOYLpJlSkbEi2jay4
351 White Horse reputation Stadium Tour Surprise Song Playlist 2017-11-09 43 4RMfSYDFkcz0bKoK8fHDgR
352 I'm Only Me When I'm With You reputation Stadium Tour Surprise Song Playlist 2017-11-09 44 6Py8IKZqDyU8aQVrjilbJw
353 Starlight reputation Stadium Tour Surprise Song Playlist 2017-11-09 45 7txpzmA69zyk50K3ZyZSzC
354 I Know Places reputation Stadium Tour Surprise Song Playlist 2017-11-09 46 3qGLcmO0XnnrcpAQGZwTvA

530 rows × 5 columns

In [ ]:
# print the albums now
for album in df_metadata['album'].unique():
    print(album)
1989
1989 (Deluxe Edition)
1989 (Taylor's Version)
1989 (Taylor's Version) [Deluxe]
Fearless
Fearless (Taylor's Version)
Fearless Platinum Edition
Live From Clear Channel Stripped 2008
Lover
Midnights
Midnights (3am Edition)
Midnights (The Til Dawn Edition)
Red
Red (Deluxe Edition)
Red (Taylor's Version)
Speak Now
Speak Now (Deluxe Edition)
Speak Now (Taylor's Version)
Speak Now World Tour Live
Taylor Swift
evermore
evermore (deluxe version)
folklore
folklore (deluxe version)
folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]
reputation
reputation Stadium Tour Surprise Song Playlist
In [ ]:
# now let's find the unique values of the column albums
# let's write a code that read the list that you have just printed, read the first word of the list and save the last album that starts with that word. 
# Then, the algorith restart from the first album that does not start with that word, and so on.

albums = []

first_word = df_metadata['album'].unique()[0].split()[0]
for i, album in enumerate(df_metadata['album'].unique()):
    # if the album does not start with the first word of the album name save the previous album
    if not album.startswith(first_word):
        albums.append(df_metadata['album'].unique()[i-1])
        # update the first word
        first_word = album.split()[0]    
In [ ]:
albums
Out[ ]:
["1989 (Taylor's Version) [Deluxe]",
 'Fearless Platinum Edition',
 'Live From Clear Channel Stripped 2008',
 'Lover',
 'Midnights (The Til Dawn Edition)',
 "Red (Taylor's Version)",
 'Speak Now World Tour Live',
 'Taylor Swift',
 'evermore (deluxe version)',
 'folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]']
In [ ]:
# now let's creat a sub-df that contains songs only from the albums that we have just found
df_metadata_unique_albums = df_metadata[df_metadata['album'].isin(albums)]
In [ ]:
df
Out[ ]:
name album release_date track_number id uri acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms
0 Welcome To New York (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 1 4WUepByoeqcedHoYhSNHRt spotify:track:4WUepByoeqcedHoYhSNHRt 0.009420 0.757 0.610 0.000037 0.3670 -4.840 0.0327 116.998 0.685 72 212600
1 Blank Space (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 2 0108kcWLnn2HlH2kedi1gn spotify:track:0108kcWLnn2HlH2kedi1gn 0.088500 0.733 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 73 231833
2 Style (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 3 3Vpk1hfMAQme8VJ0SNRSkd spotify:track:3Vpk1hfMAQme8VJ0SNRSkd 0.000421 0.511 0.822 0.019700 0.0899 -4.785 0.0397 94.868 0.305 74 231000
3 Out Of The Woods (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 4 1OcSfkeCg9hRC2sFKB4IMJ spotify:track:1OcSfkeCg9hRC2sFKB4IMJ 0.000537 0.545 0.885 0.000056 0.3850 -5.968 0.0447 92.021 0.206 73 235800
4 All You Had To Do Was Stay (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 5 2k0ZEeAqzvYMcx9Qt5aClQ spotify:track:2k0ZEeAqzvYMcx9Qt5aClQ 0.000656 0.588 0.721 0.000000 0.1310 -5.579 0.0317 96.997 0.520 72 193289
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
525 Our Song Taylor Swift 2006-10-24 11 15DeqWWQB4dcEWzJg15VrN spotify:track:15DeqWWQB4dcEWzJg15VrN 0.111000 0.668 0.672 0.000000 0.3290 -4.931 0.0303 89.011 0.539 74 201106
526 I'm Only Me When I'm With You Taylor Swift 2006-10-24 12 0JIdBrXGSJXS72zjF9ss9u spotify:track:0JIdBrXGSJXS72zjF9ss9u 0.004520 0.563 0.934 0.000807 0.1030 -3.629 0.0646 143.964 0.518 59 213053
527 Invisible Taylor Swift 2006-10-24 13 5OOd01o2YS1QFwdpVLds3r spotify:track:5OOd01o2YS1QFwdpVLds3r 0.637000 0.612 0.394 0.000000 0.1470 -5.723 0.0243 96.001 0.233 56 203226
528 A Perfectly Good Heart Taylor Swift 2006-10-24 14 1spLfUJxtyVyiKKTegQ2r4 spotify:track:1spLfUJxtyVyiKKTegQ2r4 0.003490 0.483 0.751 0.000000 0.1280 -5.726 0.0365 156.092 0.268 54 220146
529 Teardrops on My Guitar - Pop Version Taylor Swift 2006-10-24 15 4pJi1rVt9GNegU9kywjg4z spotify:track:4pJi1rVt9GNegU9kywjg4z 0.040200 0.459 0.753 0.000000 0.0863 -3.827 0.0537 199.997 0.483 55 179066

530 rows × 17 columns

In [ ]:
df_metadata_unique_albums.head(25)
Out[ ]:
name album release_date track_number id
0 Welcome To New York (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 1 4WUepByoeqcedHoYhSNHRt
1 Blank Space (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 2 0108kcWLnn2HlH2kedi1gn
2 Style (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 3 3Vpk1hfMAQme8VJ0SNRSkd
3 Out Of The Woods (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 4 1OcSfkeCg9hRC2sFKB4IMJ
4 All You Had To Do Was Stay (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 5 2k0ZEeAqzvYMcx9Qt5aClQ
5 Shake It Off (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 6 50yNTF0Od55qnHLxYsA5Pw
6 I Wish You Would (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 7 3FxJDucHWdw6caWTKO5b23
7 Bad Blood (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 8 7oZONwFiFIErZcXAtTu7FY
8 Wildest Dreams (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 9 27exgla7YBw9DUNNcTIpjy
9 How You Get The Girl (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 10 733OhaXQIHY7BKtY3vnSkn
10 This Love (Taylor’s Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 11 4WBEj8TeGtRPNJdOmT3WJW
11 I Know Places (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 12 1ULabO0PEsdJekqVH6G10G
12 Clean (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 13 0lKUBmEyQfzsQHozyeXzES
13 Wonderland (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 14 6HRsJu8vcnzYDN4t0570FY
14 You Are In Love (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 15 0TyGh27YQ5LknmiDhCzJiT
15 New Romantics (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 16 5M787RexsAiVYjQusM98CV
16 "Slut!" (Taylor's Version) (From The Vault) 1989 (Taylor's Version) [Deluxe] 2023-10-27 17 0CD7DzeCsuPJygddqlUVYa
17 Say Don't Go (Taylor's Version) (From The Vault) 1989 (Taylor's Version) [Deluxe] 2023-10-27 18 3CCRVu4F91Qp2mnGjmWBrf
18 Now That We Don't Talk (Taylor's Version) (Fro... 1989 (Taylor's Version) [Deluxe] 2023-10-27 19 5QUIK7ZtziW8kGWo8RqopF
19 Suburban Legends (Taylor's Version) (From The ... 1989 (Taylor's Version) [Deluxe] 2023-10-27 20 6M9ppdfFjR1AbpUl3Y8DcV
20 Is It Over Now? (Taylor's Version) (From The V... 1989 (Taylor's Version) [Deluxe] 2023-10-27 21 6IG3sQ8s9nfk6TUlVzRhbN
21 Bad Blood (feat. Kendrick Lamar) (Taylor's Ver... 1989 (Taylor's Version) [Deluxe] 2023-10-27 22 6qAcApH8obo8eqatCKUHd9
475 Jump Then Fall Fearless Platinum Edition 2008-11-11 1 08gavXombT6KR0af88i9tA
476 Untouchable Fearless Platinum Edition 2008-11-11 2 2IZ00ed83ygPIiacYScWUE
477 Forever & Always - Piano Version Fearless Platinum Edition 2008-11-11 3 46HGgtwmmuEB8mvDCyjyAc
In [ ]:
# check if the albums are correct
for album in df_metadata_unique_albums['album'].unique():
    print(album)
1989 (Taylor's Version) [Deluxe]
Fearless Platinum Edition
Live From Clear Channel Stripped 2008
Lover
Midnights (The Til Dawn Edition)
Red (Taylor's Version)
Speak Now World Tour Live
Taylor Swift
evermore (deluxe version)
folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]

Excercise¶

  1. Create a new dataframe that contains only the first song of each album.
  2. Create a new dataframe that contains only the last version of songs if songs are repeated.

Modify columns, and apply functions¶

Let's work now on the second dataframe that contains the audio characteristics of the songs. Suppose that you want to define a new characteristic that depends on the values of other characteristics. For example, you want to define a new characteristic called "popness" that is the sum of the squares of "acousticness" and "danceability" under square root.

In [ ]:
# Let's define energy as the sum of the squares of "acousticness" and "danceability" under square root.
# create a new column called 'energy' that contains the energy of each song, use apply() and lambda function
df_characteristics['popness'] = df_characteristics.apply(lambda row: (row['acousticness']**2 + row['danceability']**2)**0.5, axis=1)

df_characteristics.head()
Out[ ]:
acousticness danceability instrumentalness liveness loudness speechiness tempo valence popness Energy energy
0 0.009420 0.757 0.000037 0.3670 -4.840 0.0327 116.998 0.685 0.757059 0.610 0.610
1 0.088500 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 0.738323 0.733 0.733
2 0.000421 0.511 0.019700 0.0899 -4.785 0.0397 94.868 0.305 0.511000 0.822 0.822
3 0.000537 0.545 0.000056 0.3850 -5.968 0.0447 92.021 0.206 0.545000 0.885 0.885
4 0.000656 0.588 0.000000 0.1310 -5.579 0.0317 96.997 0.520 0.588000 0.721 0.721
In [ ]:
# another way to do it is to use the .apply() method without lambda function
def popness(row):
    '''
    This function returns the energy of a song
    
    Parameters:
    -----------
    row: pandas.Series
        a row of a pandas dataframe
        
        
    Returns:
    -------
    float
    the energy of a song
                
    '''
    return (row['acousticness']**2 + row['danceability']**2)**0.5


df_characteristics['popness'] = df_characteristics.apply(popness, axis=1)
In [ ]:
df_characteristics.head()
Out[ ]:
acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence popularity duration_ms popness
0 0.009420 0.757 0.610 0.000037 0.3670 -4.840 0.0327 116.998 0.685 72 212600 0.757059
1 0.088500 0.733 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 73 231833 0.738323
2 0.000421 0.511 0.822 0.019700 0.0899 -4.785 0.0397 94.868 0.305 74 231000 0.511000
3 0.000537 0.545 0.885 0.000056 0.3850 -5.968 0.0447 92.021 0.206 73 235800 0.545000
4 0.000656 0.588 0.721 0.000000 0.1310 -5.579 0.0317 96.997 0.520 72 193289 0.588000
In [ ]:
# can you move the column 'energy' to the last column of the dataframe?
cols = list(df_characteristics.columns.values)
print(cols)

cols.insert(12, cols.pop(cols.index('energy')))
print(cols)

df_characteristics = df_characteristics[cols]
['acousticness', 'danceability', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'popularity', 'duration_ms', 'popness', 'energy']
['acousticness', 'danceability', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'popularity', 'duration_ms', 'popness', 'energy']

cols is a list, df[list] returns a dataframe with only the columns in the list. You can rearrange the columns in the dataframe by reordering the list, or obtain a subset of the columns by removing some of the elements in the list.

In [ ]:
df_characteristics.head()
Out[ ]:
acousticness danceability instrumentalness liveness loudness speechiness tempo valence popularity duration_ms popness energy
0 0.009420 0.757 0.000037 0.3670 -4.840 0.0327 116.998 0.685 72 212600 0.757059 0.610
1 0.088500 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 73 231833 0.738323 0.733
2 0.000421 0.511 0.019700 0.0899 -4.785 0.0397 94.868 0.305 74 231000 0.511000 0.822
3 0.000537 0.545 0.000056 0.3850 -5.968 0.0447 92.021 0.206 73 235800 0.545000 0.885
4 0.000656 0.588 0.000000 0.1310 -5.579 0.0317 96.997 0.520 72 193289 0.588000 0.721
In [ ]:
# from this let's drop popularity and duration_ms
df_characteristics.drop(['popularity', 'duration_ms'], axis=1, inplace=True)
df_characteristics.head()
Out[ ]:
acousticness danceability instrumentalness liveness loudness speechiness tempo valence popness energy
0 0.009420 0.757 0.000037 0.3670 -4.840 0.0327 116.998 0.685 0.757059 0.610
1 0.088500 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 0.738323 0.733
2 0.000421 0.511 0.019700 0.0899 -4.785 0.0397 94.868 0.305 0.511000 0.822
3 0.000537 0.545 0.000056 0.3850 -5.968 0.0447 92.021 0.206 0.545000 0.885
4 0.000656 0.588 0.000000 0.1310 -5.579 0.0317 96.997 0.520 0.588000 0.721
In [ ]:
df_characteristics['energy'] = df_characteristics['Energy']

Merging dataframes¶

Let's now merge the two dataframes into one. We have worked on df_metadata removing rows while on df_characteristics removing columns. Can we still merge the two dataframes?

Let's try!

In [ ]:
# merge df_metadata and df_characteristics
new_df = pd.merge(df_metadata_unique_albums, df_characteristics, left_index=True, right_index=True)
new_df
Out[ ]:
name album release_date track_number id acousticness danceability instrumentalness liveness loudness speechiness tempo valence popness Energy energy
0 Welcome To New York (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 1 4WUepByoeqcedHoYhSNHRt 0.009420 0.757 0.000037 0.3670 -4.840 0.0327 116.998 0.685 0.757059 0.610 0.610
1 Blank Space (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 2 0108kcWLnn2HlH2kedi1gn 0.088500 0.733 0.000000 0.1680 -5.376 0.0670 96.057 0.701 0.738323 0.733 0.733
2 Style (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 3 3Vpk1hfMAQme8VJ0SNRSkd 0.000421 0.511 0.019700 0.0899 -4.785 0.0397 94.868 0.305 0.511000 0.822 0.822
3 Out Of The Woods (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 4 1OcSfkeCg9hRC2sFKB4IMJ 0.000537 0.545 0.000056 0.3850 -5.968 0.0447 92.021 0.206 0.545000 0.885 0.885
4 All You Had To Do Was Stay (Taylor's Version) 1989 (Taylor's Version) [Deluxe] 2023-10-27 5 2k0ZEeAqzvYMcx9Qt5aClQ 0.000656 0.588 0.000000 0.1310 -5.579 0.0317 96.997 0.520 0.588000 0.721 0.721
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
238 epiphany - the long pond studio sessions folklore: the long pond studio sessions (from ... 2020-11-25 13 4yyKppX85ZkmlGOC2toBZA 0.916000 0.524 0.000392 0.0680 -13.004 0.0274 93.870 0.113 1.055288 0.200 0.200
239 betty - the long pond studio sessions folklore: the long pond studio sessions (from ... 2020-11-25 14 1mh5dObNoOvCfHmXnhvDEK 0.841000 0.538 0.000000 0.1190 -9.500 0.0299 95.680 0.518 0.998361 0.316 0.316
240 peace - the long pond studio sessions folklore: the long pond studio sessions (from ... 2020-11-25 15 4YHHz3VIXgNDGdoL9xC1tx 0.930000 0.619 0.000076 0.0734 -11.472 0.0647 149.897 0.513 1.117167 0.344 0.344
241 hoax - the long pond studio sessions folklore: the long pond studio sessions (from ... 2020-11-25 16 0QTBsdoJhamY3Aid3fOsOd 0.962000 0.682 0.000000 0.1800 -13.083 0.0394 118.832 0.483 1.179223 0.155 0.155
242 the lakes - the long pond studio sessions folklore: the long pond studio sessions (from ... 2020-11-25 17 6mqyTq948oap5AkyTsd6XF 0.794000 0.606 0.000000 0.0888 -11.767 0.0468 89.844 0.112 0.998835 0.221 0.221

202 rows × 16 columns

How is this possible? Lets's check the shapes!

In [ ]:
# Print the shape of the three dataframes
print('df shape:', df.shape)
print('df_metadata unique albums shape:', df_metadata_unique_albums.shape)
print('df_characteristics shape:', df_characteristics.shape)

print('new_df shape:', new_df.shape)
df shape: (530, 17)
df_metadata unique albums shape: (202, 5)
df_characteristics shape: (530, 11)
new_df shape: (202, 16)
In [ ]:
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202 entries, 0 to 242
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              202 non-null    object 
 1   album             202 non-null    object 
 2   release_date      202 non-null    object 
 3   track_number      202 non-null    int64  
 4   id                202 non-null    object 
 5   acousticness      202 non-null    float64
 6   danceability      202 non-null    float64
 7   instrumentalness  202 non-null    float64
 8   liveness          202 non-null    float64
 9   loudness          202 non-null    float64
 10  speechiness       202 non-null    float64
 11  tempo             202 non-null    float64
 12  valence           202 non-null    float64
 13  popness           202 non-null    float64
 14  energy            202 non-null    float64
dtypes: float64(10), int64(1), object(4)
memory usage: 25.2+ KB
In [ ]:
df_metadata_unique_albums.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202 entries, 0 to 242
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          202 non-null    object
 1   album         202 non-null    object
 2   release_date  202 non-null    object
 3   track_number  202 non-null    int64 
 4   id            202 non-null    object
dtypes: int64(1), object(4)
memory usage: 17.6+ KB
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 530 entries, 0 to 529
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              530 non-null    object 
 1   album             530 non-null    object 
 2   release_date      530 non-null    object 
 3   track_number      530 non-null    int64  
 4   id                530 non-null    object 
 5   uri               530 non-null    object 
 6   acousticness      530 non-null    float64
 7   danceability      530 non-null    float64
 8   energy            530 non-null    float64
 9   instrumentalness  530 non-null    float64
 10  liveness          530 non-null    float64
 11  loudness          530 non-null    float64
 12  speechiness       530 non-null    float64
 13  tempo             530 non-null    float64
 14  valence           530 non-null    float64
 15  popularity        530 non-null    int64  
 16  duration_ms       530 non-null    int64  
dtypes: float64(9), int64(3), object(5)
memory usage: 90.7+ KB
In [ ]:
# how do I check if I merged the dataframes correctly?

# for every row in new_df, check if the values of the columns  are the same as the values in df

# columns in common between new_df and df
common_columns = new_df.columns.intersection(df.columns)
common_columns
Out[ ]:
Index(['name', 'album', 'release_date', 'track_number', 'id', 'acousticness',
       'danceability', 'instrumentalness', 'liveness', 'loudness',
       'speechiness', 'tempo', 'valence', 'energy'],
      dtype='object')
In [ ]:
for i in new_df.index.values:
    for col in common_columns:

        if new_df[col][i] != df[col][i]:
            print('row', i, 'column', col, 'is different') 
            print('new_df:', new_df[col][i])     
            print('df:', df[col][i])  

It did not print anything, so it means that the rows are the same and we have matched the indexes correctly!