Audible Data Cleaning
After scraping the data from Audible.in, the notebook focuses on cleaning the data prior to moving into Exploratory Data Analysis.
- Importing Libraries
- Reading the data
- Cleaning the data
- Changing Datatypes
- Replacing the NaN values
- Saving the cleaned dataset
Importing Libraries
We import Pandas and NumPy, as the two libraries help us clean, edit and prepare our scraped data. We also import the warnings module. This notebook focuses more on data cleaning and data wrangling. For data exploration you can view the notebook here.
You can find and download the data here.
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore') # ignore any warnings from any code cell
data = pd.read_csv('./audible_uncleaned.csv')
data.shape # Gives us a tuple of (rows, columns)
data.head(3)
data.duplicated().sum()
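If the check above reports any duplicate rows, a minimal sketch for dropping them (not part of the original cleaning) would be:
data = data.drop_duplicates().reset_index(drop=True)  # keep the first occurrence, renumber the rows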
Cleaning the author and narrator columns
Removing the Writtenby: and Narratedby: prefixes, as those values are redundant.
- Selecting the Pandas Series by using data.author
- Then we use the .str accessor to get all the string values
- And finally, we run the replace method with the value_to_be_replaced and the value_to_be_replaced_with
data.author = data.author.str.replace('Writtenby:', '')
data.narrator = data.narrator.str.replace('Narratedby:', '')
data.sample(3)
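A quick sanity check (a minimal sketch, not in the original notebook) confirms the prefixes are gone:
print(data.author.str.contains('Writtenby:').any())     # expect False
print(data.narrator.str.contains('Narratedby:').any())  # expect False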
Cleaning the stars column
The stars column has three distinct formats of values:
- 5 out of 5 stars34 ratings
- 4.5 out of 5 stars38 ratings
- Not rated yet
- We handle the 3rd one efficiently using the Pandas Series replace method to replace it with NaN values from the NumPy library. Note: This doesn't use the .str accessor and hence we use inplace=True
- As for the 1st and the 2nd one, we use regex to split the string into two columns, stars and ratings. Note: I used Regex101 as a guide to write the regular expression for the string.
data.stars.replace('Not rated yet', np.nan, inplace=True)
data[['unknown', 'stars', 'ratings']] = data.stars.str.split(r'(\d+\.?\d*\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)', n=1, expand=True, regex=True)
data.drop('unknown', axis=1, inplace=True) # Dropping the unknown column as it only consists of empty strings and NaN values
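As a quick sanity check (a minimal sketch using Python's re module, not part of the original notebook), we can see how the pattern splits a sample value into the pieces we want:
import re
pattern = r'(\d+\.?\d*\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)'
print(re.split(pattern, '4.5 out of 5 stars38 ratings', maxsplit=1))
# ['', '4.5 out of 5 stars', '38 ratings']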
data.price = data.price.str.replace(',', '')  # strip thousands separators before casting
data.price.replace('Free', 0, inplace=True)   # treat free audiobooks as price 0
data.price = data.price.astype('float64')
Changing the ratings column to a float value
The ratings column has one format of value:
- 23 ratings
- We split the data at the -space- and drop the column which doesn't have the value, and then change the datatype to float. Note: Ratings are generally whole numbers, but since the column has null values it couldn't be changed to int, hence the choice of float.
data[['ratings', 'unknown']] = data.ratings.str.split(' ', n=1, expand=True)  # keep the number, drop the word 'ratings'
data.drop('unknown', axis=1, inplace=True)
data.ratings = data.ratings.str.replace(',', '').astype('float64')  # strip thousands separators before casting
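As an aside, recent pandas versions also provide a nullable integer dtype that can hold whole numbers alongside missing values; a minimal sketch of that alternative (not used in this notebook):
data.ratings.astype('Int64')  # hypothetical alternative: whole numbers with missing values shown as <NA>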
data[['stars', 'unknown']] = data.stars.str.split('out', n=1, expand=True)  # keep the number before 'out of 5 stars'
data.drop('unknown', axis=1, inplace=True)
data.stars = data.stars.astype('float64')
data.releasedate = pd.to_datetime(data.releasedate)  # parse the release dates into datetime64
data.time = data.time.str.replace('Less than 1 minute', '1 min')  # normalise the shortest duration
data.time = data.time.str.replace('mins', 'min')  # singular units keep the patterns below simple
data.time = data.time.str.replace('hrs', 'hr')
data.head(3)
Here, we split the time twice, once to separate the hour and again to separate the minutes. This again uses regex, as that makes working with strings extremely easy and handy.
Note: It's not necessary to split it twice and it can be done once by reusing mins, however splitting twice makes it easier to read and understand. A one-pass alternative is sketched below.
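For reference, a one-pass version (a minimal sketch using str.extract, assuming durations look like '2 hr and 3 min', '1 hr', or '45 min') could pull out both parts with a single pattern:
parts = data.time.str.extract(r'(?:(\d+) hr)?\s*(?:and\s*)?(?:(\d+) min)?')  # two capture groups -> two columns
parts = parts.fillna(0).astype(int)  # a missing part means zero hours or zero minutes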
data[['unknown', 'hour', 'mins']] = data['time'].str.split(r'(\d+ hr)', expand=True, regex=True)
data.drop(['unknown', 'mins'], axis=1, inplace=True)  # keep only the captured '<n> hr' part
data[['hr', 'minutes', 'unknown']] = data['time'].str.split(r'(\d+ min)', expand=True, regex=True)
data.drop(['unknown', 'hr'], axis=1, inplace=True)  # keep only the captured '<n> min' part
Note: We see that we have None values and not np.nan values after splitting the string. We use .applymap() to map a lambda function and then sum the results to see the count of None values. An easier way to find what values you have in a Pandas Series is to run df.name_of_column.unique() to see all unique values.
To count the number of unique values in a Pandas Series, run df.name_of_column.nunique().
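For example (a minimal sketch on the columns created above):
data.hour.unique()   # every distinct value, e.g. '2 hr', None, ...
data.hour.nunique()  # number of distinct values, missing ones excluded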
data.applymap(lambda x: x is None).sum()  # count of None values per column
We fill None values the same way as we fill NaN values, i.e. with fillna()
data.hour.fillna(value='0 hr', inplace=True)
data.minutes.fillna(value='0 min', inplace=True)
data.applymap(lambda x: x is None).sum()
data.head(3)
Now that we don't have any null values, we remove the string associated with the numbers, hr from the hour column and min from the minutes column.
To convert the entire time from hours & minutes --> minutes, we need to multiply the hours by 60 and then add the minutes.
- After removing the string object, we turn the value into an integer and use the .multiply() method to multiply the Pandas Series.
- Finally, we add the minutes and hour columns, replacing the time column, while dropping the hour and minutes columns.
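For example, a 2 hr and 30 min audiobook becomes 2 × 60 + 30 = 150 minutes.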
data.hour = data.hour.str.replace('hr', '').astype(int).multiply(60)  # hours -> minutes
data.minutes = data.minutes.str.replace('min', '').astype(int)
data['time'] = data.minutes + data.hour  # total length in minutes
data.drop(['hour', 'minutes'], axis=1, inplace=True)
data.head(3)
Replacing the NaN values
We are almost done with cleaning our data, however we still have NaN values. While keeping the NaN values could portray realistically how the data is represented, we would eventually need to replace them with some value when analysing or visualizing the data.
In the majority of cases, I've seen NaN values being replaced by the mean or the median, which in this case wouldn't be of much help, as these audiobooks don't actually have a rating or star. It's much preferable to replace the value with zero.
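For contrast, a mean- or median-based fill (a minimal sketch of the approach we decided against) would look like this:
data.stars.fillna(data.stars.mean())        # hypothetical: average stars stand in for 'not rated'
data.ratings.fillna(data.ratings.median())  # hypothetical: median ratings count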
data.isna().sum()
We see an equal number of missing values on the stars and ratings columns, just as we expected. To replace them with 0, we can simply write:
data.fillna(0, inplace=True)
We can see that we have no missing values now! Great! To check, we can simply run sample() and it would show us.
Note: sample() selects a row at random; you can pass in a value to get that many randomly selected rows.
data.isna().sum()
data.sample(3)
Finally, we can view the entire information of the cleaned DataFrame using .info()
data.info()
Saving the cleaned dataset
data.to_csv('audible_cleaned.csv', index=False)
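To double-check the exported file (a minimal sketch), we can read it back and inspect it:
pd.read_csv('audible_cleaned.csv').info()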