Importing Libraries

We import pandas and NumPy, as the two libraries will help us clean, edit, and prepare our scraped data. We also import the warnings module. This notebook focuses on data cleaning and data wrangling. For data exploration, you can view the notebook here.

You can find and download the data here.

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore') # ignore any warnings from any code cell

Reading the data

data = pd.read_csv('./audible_uncleaned.csv')
data.shape # Gives us a tuple of (rows, columns)
(87489, 8)
data.head(3)
name author narrator time releasedate language stars price
0 Geronimo Stilton #11 & #12 Writtenby:GeronimoStilton Narratedby:BillLobely 2 hrs and 20 mins 04-08-08 English 5 out of 5 stars34 ratings 468.00
1 The Burning Maze Writtenby:RickRiordan Narratedby:RobbieDaymond 13 hrs and 8 mins 01-05-18 English 4.5 out of 5 stars41 ratings 820.00
2 The Deep End Writtenby:JeffKinney Narratedby:DanRussell 2 hrs and 3 mins 06-11-20 English 4.5 out of 5 stars38 ratings 410.00

Cleaning the data

While we have our scraped data, it isn't yet in a clean, readable form, nor do the columns have the datatypes we would prefer before diving into analysis and data visualization. Let's start by cleaning our data.

Checking for duplicates

We use the .sum() method because data.duplicated() returns a pandas Series of boolean values; summing them up shows how many duplicate rows there are.

data.duplicated().sum()
0
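There were no duplicates here, but had there been any, drop_duplicates() would remove them. A minimal sketch on a hypothetical toy frame (not the dataset's values):

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values)
df = pd.DataFrame({'name': ['A', 'B', 'B'], 'price': [100, 200, 200]})
print(df.duplicated().sum())   # 1 duplicate row found
df = df.drop_duplicates()
print(df.shape)                # only unique rows remain: (2, 2)
```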

Cleaning the author and narrator columns

Removing the Writtenby: and Narratedby: prefixes, as those values are redundant

  • Select the pandas Series using data.author
  • Then use the .str accessor to work with the string values
  • And finally, run the replace method with the value to be replaced and the value to replace it with
data.author = data.author.str.replace('Writtenby:', '')
data.narrator = data.narrator.str.replace('Narratedby:', '')
data.sample(3)
name author narrator time releasedate language stars price
10867 The Trust Manifesto DamianBradfield DamianBradfield,RichardHughes,KristinAtherton 6 hrs and 54 mins 10-10-19 English Not rated yet 888.00
34136 How to Be a Conscious Eater SophieEgan SophieEgan 7 hrs and 15 mins 27-04-21 English Not rated yet 562.00
6549 Els silencis de la boca de la mina [The Silenc... AndreuSotorra JoanMora 4 hrs and 3 mins 02-01-19 catalan Not rated yet 537.00
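The prefix-stripping can be sketched on a toy Series. Passing regex=False is an explicit choice not in the original code; it makes clear we mean a plain-string replacement (and matches the default in pandas 2.0+):

```python
import pandas as pd

s = pd.Series(['Writtenby:GeronimoStilton', 'Writtenby:RickRiordan'])
# Plain-string replacement; regex=False avoids the pattern being treated as a regex
print(s.str.replace('Writtenby:', '', regex=False).tolist())
# ['GeronimoStilton', 'RickRiordan']
```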

Cleaning the stars columns

The stars column has three distinct formats of values:

  1. 5 out of 5 stars34 ratings
  2. 4.5 out of 5 stars38 ratings
  3. Not rated yet

  • We handle the 3rd format efficiently using the pandas Series replace method, replacing the values with NaN from the NumPy library. Note: This doesn't use the .str accessor, hence we use inplace=True
  • As for the 1st and 2nd formats, we use a regex to split the string into two columns, stars and ratings.

    Note: I used Regex101 as a guide to build the regular expression on the string.

data.stars.replace('Not rated yet', np.nan, inplace=True)
data[['unknown', 'stars', 'ratings']] = data.stars.str.split(r'(\d*\.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)', n=1, expand=True, regex=True)
data.drop('unknown', axis=1, inplace=True) # Dropping the unknown column as it only consists of spaces and nan values
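To see what the capture group actually matches, here is the same pattern (with the dot escaped) run through Python's re module on one sample value; re.split keeps the captured group, mirroring the three columns str.split produces:

```python
import re

pattern = r'(\d*\.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)'
# The group captures "4.5 out of 5 stars"; the remainder becomes the ratings part
parts = re.split(pattern, '4.5 out of 5 stars41 ratings', maxsplit=1)
print(parts)  # ['', '4.5 out of 5 stars', '41 ratings']
```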

Changing Datatypes

Changing the price column to a float value

The price column has three distinct formats:

  1. Free
  2. 3-digit value. Eg: 839.00
  3. 4-digit value. Eg: 1,230.00
  • We replace the 1st one with 0
  • And for the 3rd one, we remove the , as it prevents us from changing the datatype to float
data.price = data.price.str.replace(',', '')
data.price.replace('Free', 0, inplace=True)
data.price = data.price.astype('float64')
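The three price steps can be traced on a toy Series (hypothetical values):

```python
import pandas as pd

price = pd.Series(['Free', '839.00', '1,230.00'])
price = price.str.replace(',', '', regex=False)  # '1,230.00' -> '1230.00'
price = price.replace('Free', '0')               # 'Free' -> '0'
print(price.astype('float64').tolist())          # [0.0, 839.0, 1230.0]
```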

Changing the ratings column to a float value

The ratings column has one format:

  1. 23 ratings
  • We split the data at the space and drop the column which doesn't hold the number, then change the datatype to float.

    Note: Ratings are generally whole numbers, but since the column has null values it can't be changed to int, hence the choice of float

data[['ratings', 'unknown']] = data.ratings.str.split(' ', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.ratings = data.ratings.str.replace(',', '').astype('float64')

Changing the stars column to a float value

The stars column has one format:

  1. 5 out of 5 stars
  • We split the data at out and drop the extra column, as all the values are rated out of 5 stars, then change the datatype to float.
data[['stars', 'unknown']] = data.stars.str.split('out', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.stars = data.stars.astype('float64')

Changing the releasedate to datetime object

data.releasedate = pd.to_datetime(data.releasedate)
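One caveat: dates like 04-08-08 are ambiguous, and without hints pd.to_datetime guesses the format (month-first here). If you know the source's convention, passing an explicit format is safer; whether day-first actually applies to this dataset is an assumption you would need to verify:

```python
import pandas as pd

dates = pd.Series(['04-08-08', '01-05-18'])
# Default parse is month-first: '04-08-08' -> 2008-04-08
print(pd.to_datetime(dates).tolist())
# Explicit day-first format (an assumption to check): '04-08-08' -> 2008-08-04
print(pd.to_datetime(dates, format='%d-%m-%y').tolist())
```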

Changing the time column to an integer value of minutes

The time column has 4 distinct formats:

  1. 1 hr
  2. 1 min
  3. 3 hrs and 40 mins
  4. Less than 1 minute
  • For the 4th format, we approximate it to 1 min
  • We then replace hrs with hr and mins with min, as that helps us generalize the values
data.time = data.time.str.replace('Less than 1 minute', '1 min')
data.time = data.time.str.replace('mins', 'min')
data.time = data.time.str.replace('hrs', 'hr')
data.head(3)
name author narrator time releasedate language stars price ratings
0 Geronimo Stilton #11 & #12 GeronimoStilton BillLobely 2 hr and 20 min 2008-04-08 English 5.0 468.0 34.0
1 The Burning Maze RickRiordan RobbieDaymond 13 hr and 8 min 2018-01-05 English 4.5 820.0 41.0
2 The Deep End JeffKinney DanRussell 2 hr and 3 min 2020-06-11 English 4.5 410.0 38.0

Here, we split the time twice: once to separate the hours and again to separate the minutes. This again uses regex, as that makes working with strings extremely easy and handy.

Note: It's not necessary to split it twice; it can be done in one pass by reusing mins, however splitting twice makes it easier to read and understand.

data[['unknown', 'hour', 'mins']] = data['time'].str.split(r'(\d+ hr)', expand=True, regex=True)
data.drop(['unknown', 'mins'], axis =1, inplace=True)
data[['hr', 'minutes', 'unknown']] = data['time'].str.split(r'(\d+ min)', expand=True, regex=True)
data.drop(['unknown', 'hr'], axis =1, inplace=True)
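As the note above says, this could be done in a single pass. A sketch of one alternative (not the notebook's approach) using str.extract with named, optional groups:

```python
import pandas as pd

time = pd.Series(['2 hr and 20 min', '1 hr', '5 min'])
# Both groups are optional, so rows missing hours or minutes yield NaN in that column
parts = time.str.extract(r'(?:(?P<hour>\d+) hr)?(?:.*?(?P<minutes>\d+) min)?')
print(parts)
```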

Note: We see that we have None values, not np.nan values, after splitting the string. We use .applymap() to map a lambda function and then sum the results to see the count of None values (in newer pandas, .applymap() is deprecated in favour of DataFrame.map()). An easier way to find what values you have in a pandas Series is to run df.name_of_column.unique() to see all unique values.
To count the number of unique values in a pandas Series, run df.name_of_column.nunique().

data.applymap(lambda x: x is None).sum()
name               0
author             0
narrator           0
time               0
releasedate        0
language           0
stars              0
price              0
ratings            0
hour           13406
minutes         1343
dtype: int64

We fill None values the same way as NaN values, i.e. with fillna()

data.hour.fillna(value='0 hr', inplace=True)
data.minutes.fillna(value='0 min', inplace=True)
data.applymap(lambda x: x is None).sum()
name           0
author         0
narrator       0
time           0
releasedate    0
language       0
stars          0
price          0
ratings        0
hour           0
minutes        0
dtype: int64
data.head(3)
name author narrator time releasedate language stars price ratings hour minutes
0 Geronimo Stilton #11 & #12 GeronimoStilton BillLobely 2 hr and 20 min 2008-04-08 English 5.0 468.0 34.0 2 hr 20 min
1 The Burning Maze RickRiordan RobbieDaymond 13 hr and 8 min 2018-01-05 English 4.5 820.0 41.0 13 hr 8 min
2 The Deep End JeffKinney DanRussell 2 hr and 3 min 2020-06-11 English 4.5 410.0 38.0 2 hr 3 min

Now that we don't have any null values, we remove the string associated with the numbers: hr from the hour column and min from the minutes column.
To convert the entire time from hours & minutes --> minutes, we need to multiply the hours by 60 and then add the minutes.

  • After removing the string part, we convert the value to an integer and use the .multiply() method to multiply the pandas Series.

  • Finally, we add the minutes and hour columns, replacing the time column, while dropping the hour and minutes columns.

data.hour = data.hour.str.replace('hr', '').astype(int).multiply(60)
data.minutes = data.minutes.str.replace('min', '').astype(int)
data['time'] = data.minutes + data.hour
data.drop(['hour', 'minutes'], axis=1, inplace=True)
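The arithmetic above can be checked on a toy frame; 2 hr and 20 min should come out as 140 minutes:

```python
import pandas as pd

df = pd.DataFrame({'hour': ['2 hr', '13 hr'], 'minutes': ['20 min', '8 min']})
df.hour = df.hour.str.replace('hr', '', regex=False).astype(int).multiply(60)
df.minutes = df.minutes.str.replace('min', '', regex=False).astype(int)
print((df.minutes + df.hour).tolist())  # [140, 788]
```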
data.head(3)
name author narrator time releasedate language stars price ratings
0 Geronimo Stilton #11 & #12 GeronimoStilton BillLobely 140 2008-04-08 English 5.0 468.0 34.0
1 The Burning Maze RickRiordan RobbieDaymond 788 2018-01-05 English 4.5 820.0 41.0
2 The Deep End JeffKinney DanRussell 123 2020-06-11 English 4.5 410.0 38.0

Replacing the NaN values

We are almost done with cleaning our data; however, we still have NaN values. While keeping NaN values could portray how the data is realistically represented, we would eventually need to replace them with some value when analysing or visualizing the data.

In the majority of cases, I've seen NaN values being replaced by the mean or median, which in this case wouldn't be of much help, as these audiobooks don't actually have a rating or star. It's preferable to replace the value with zero.

data.isna().sum()
name               0
author             0
narrator           0
time               0
releasedate        0
language           0
stars          72417
price              0
ratings        72417
dtype: int64

We see an equal number of missing values on the stars and ratings columns, just as we expected. To replace them with 0, we can simply write:

data.fillna(0, inplace=True)
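A more targeted alternative (a sketch, not what the notebook does) fills only the two columns known to contain NaN, so a stray missing value elsewhere wouldn't silently become 0:

```python
import pandas as pd
import numpy as np

# Toy frame with NaN only in stars and ratings (hypothetical values)
df = pd.DataFrame({'name': ['A', 'B'],
                   'stars': [5.0, np.nan],
                   'ratings': [34.0, np.nan]})
df[['stars', 'ratings']] = df[['stars', 'ratings']].fillna(0)
print(df.isna().sum().sum())  # 0
```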

We can see that we have no missing values now! Great! To check, we can simply run sample() and it will show us.
Note: sample() selects a row randomly; you can pass in a value to get that many randomly selected rows.

data.isna().sum()
name           0
author         0
narrator       0
time           0
releasedate    0
language       0
stars          0
price          0
ratings        0
dtype: int64
data.sample(3)
name author narrator time releasedate language stars price ratings
40951 Шейх Мансур АнатолийВиноградов ВсеволодКузнецов 61 2020-12-21 russian 0.0 99.0 0.0
20151 Life KeithRichards StephanRemmler 400 2010-04-11 german 0.0 367.0 0.0
51989 The Ship of Silence AlbertR.Wetjen JeffHarding 52 2022-02-18 English 0.0 53.0 0.0

Finally, we can view the entire information of the cleaned DataFrame using .info()

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   name         87489 non-null  object        
 1   author       87489 non-null  object        
 2   narrator     87489 non-null  object        
 3   time         87489 non-null  int64         
 4   releasedate  87489 non-null  datetime64[ns]
 5   language     87489 non-null  object        
 6   stars        87489 non-null  float64       
 7   price        87489 non-null  float64       
 8   ratings      87489 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 6.0+ MB

Saving the cleaned dataset

We have cleaned our dataset; now to save it, we use the .to_csv() method, providing the name of the CSV file while setting the index to False.

Important: Not setting the index to False would generate an extra column in the CSV with row numbers, i.e. 0-87488.

data.to_csv('audible_cleaned.csv', index=False)