Audible Data Cleaning
After scraping the data from Audible.in, the notebook focuses on cleaning the data prior to moving into Exploratory Data Analysis.
- Importing Libraries
- Reading the data
- Cleaning the data
- Changing Datatypes
- Replacing the NaN values
- Saving the cleaned dataset
Importing Libraries
We import Pandas and NumPy, as these two libraries help us clean, edit, and prepare our scraped data. We also import the warnings module. This notebook focuses on data cleaning and data wrangling; for data exploration, you can view the notebook here.
You can find and download the data here.
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore') # ignore any warnings from any code cell
Reading the data
data = pd.read_csv('./audible_uncleaned.csv') # load the scraped dataset into a DataFrame
data.shape # gives us a tuple of (rows, columns)
data.head(3)
data.duplicated().sum() # number of fully duplicated rows
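If this count were non-zero, we could drop the duplicates before going further. A minimal sketch, not needed when the count is 0:
data = data.drop_duplicates().reset_index(drop=True) # keep the first occurrence of each duplicated row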
Cleaning the author and narrator columns
Removing the Writtenby: and Narratedby: prefixes, as those values are redundant.
- Selecting the Pandas Series by using data.author
- Then we use the .str accessor to get to the string values
- And finally, we run the replace method with the value_to_be_replaced and the value_to_be_replaced_with
data.author = data.author.str.replace('Writtenby:', '')
data.narrator = data.narrator.str.replace('Narratedby:', '')
data.sample(3)
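If the scrape also left stray leading or trailing whitespace around the names (an assumption; inspect a few rows first), a small follow-up sketch:
data.author = data.author.str.strip() # trim surrounding spaces from author names
data.narrator = data.narrator.str.strip() # and from narrator names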
Cleaning the stars column
The stars column has three distinct formats of values:
- 5 out of 5 stars34 ratings
- 4.5 out of 5 stars38 ratings
- Not rated yet
- We handle the 3rd format efficiently using the Pandas Series replace method, replacing the value with NaN from the numpy library. Note: this doesn't use the .str accessor, and hence we use inplace=True
- As for the 1st and the 2nd formats, we use regex to split the string into the two columns stars and ratings. Note: I used Regex101 as a guide while writing the regular expression.
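To see what the capture group actually matches before applying it to the whole column, here is a quick standalone check with Python's re module (illustrative only; it is the same pattern used below):
import re
pattern = r'(\d*.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)'
re.split(pattern, '4.5 out of 5 stars38 ratings')
# ['', '4.5 out of 5 stars', '38 ratings'] -> unknown, stars, ratings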
data.stars.replace('Not rated yet', np.nan, inplace=True)
data[['unknown','stars', 'ratings']] = data.stars.str.split(r'(\d*.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)', n=1, expand=True, regex=True) # n is keyword-only in newer pandas versions
data.drop('unknown', axis=1, inplace=True) # Dropping the unknown column as it only consists of spaces and nan values
Cleaning the price column
The price column stores numbers as strings, with commas as thousands separators and free titles listed as Free. We strip the commas, map Free to 0, and convert the column to float.
data.price = data.price.str.replace(',', '') # remove thousands separators
data.price.replace('Free', 0, inplace=True) # free audiobooks become 0
data.price = data.price.astype('float64')
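Note: if other non-numeric strings remained in the column (an assumption; the astype call above would raise on them), a more forgiving sketch is pandas' to_numeric with coercion:
data.price = pd.to_numeric(data.price, errors='coerce') # unparseable leftovers become NaN instead of raising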
Changing the ratings column to a float value
The ratings column has one format of value:
23 ratings
- We split the data at the space, drop the column which doesn't hold the value, and then change the datatype to float. Note: ratings are generally whole numbers, but since the column has null values it couldn't be changed to int, hence the choice of float
data[['ratings', 'unknown']] = data.ratings.str.split(' ', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.ratings = data.ratings.str.replace(',', '').astype('float64')
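As an aside, if you did want integer ratings despite the missing values, newer pandas versions offer the nullable Int64 dtype. A sketch; this notebook sticks with float64:
data.ratings.astype('Int64') # capital-I Int64 tolerates missing values, storing them as pd.NA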
Similarly, for the stars column, we split at out, keep the numeric part, and convert it to float.
data[['stars', 'unknown']] = data.stars.str.split('out', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.stars = data.stars.astype('float64')
We also convert the releasedate column to a proper datetime object:
data.releasedate = pd.to_datetime(data.releasedate)
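Note: day-month order can be ambiguous when parsing dates. If the scraped strings are day-first (an assumption worth verifying on a few rows), a hedged alternative is:
data.releasedate = pd.to_datetime(data.releasedate, dayfirst=True) # assumes DD-MM-YY style strings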
Cleaning the time column
The time column mixes values such as Less than 1 minute, hrs and mins. We first normalize the units:
data.time = data.time.str.replace('Less than 1 minute', '1 min')
data.time = data.time.str.replace('mins', 'min') # unify plural and singular minutes
data.time = data.time.str.replace('hrs', 'hr') # unify plural and singular hours
data.head(3)
Here, we split the time twice, once to separate the hours and again to separate the minutes. This again uses regex, as that makes working with strings extremely easy and handy.
Note: it's not necessary to split twice and it can be done once by reusing mins, however splitting twice makes the code easier to read and understand.
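For reference, here is a one-pass sketch using str.extract, assuming the text between the hour and minute parts contains no digits; the notebook keeps the two-step split below:
data.time.str.extract(r'(?:(\d+) hr)?(?:\D*(\d+) min)?') # column 0 = hours, column 1 = minutes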
data[['unknown', 'hour', 'mins']] = data['time'].str.split(r'(\d+ hr)', expand=True, regex=True) # capture the hour part
data.drop(['unknown', 'mins'], axis=1, inplace=True)
data[['hr', 'minutes', 'unknown']] = data['time'].str.split(r'(\d+ min)', expand=True, regex=True) # capture the minute part
data.drop(['unknown', 'hr'], axis=1, inplace=True)
Note: we see that we have None values and not np.nan values after splitting the string. We use .applymap() to apply the lambda function and then sum the results to count the None values. An easier way to find what values a Pandas Series holds is to run df.name_of_column.unique() to see all the unique values. To count the number of unique values in a Pandas Series, run df.name_of_column.nunique().
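For example, on the hour column we just created (illustrative; the exact output depends on the data):
data.hour.unique() # all distinct values, including None
data.hour.nunique() # number of distinct non-null values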
data.applymap(lambda x: x is None).sum() # True for every None cell, summed per column
We fill the None values the same way as NaN values, i.e. with fillna()
data.hour.fillna(value='0 hr', inplace=True)
data.minutes.fillna(value='0 min', inplace=True)
data.applymap(lambda x: x is None).sum()
data.head(3)
Now that we don't have any null values, we remove the string associated with the numbers: hr from the hour column and min from the minutes column.
To convert the entire time from hours & minutes --> minutes, we need to multiply the hours by 60 and then add the minutes.
- After removing the string object, we turn the value into an integer and use the .multiply() method to multiply the Pandas Series.
- Finally, we add the minutes and hour columns, replacing the time column, while dropping the hour and minutes columns.
data.hour = data.hour.str.replace('hr', '').astype(int).multiply(60) # convert hours to minutes
data.minutes = data.minutes.str.replace('min', '').astype(int)
data['time'] = data.minutes + data.hour # total length in minutes
data.drop(['hour', 'minutes'], axis=1, inplace=True)
data.head(3)
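As a quick sanity check of the arithmetic on a toy row (illustrative only), 2 hr 3 min should become 2 * 60 + 3 = 123 minutes:
toy = pd.DataFrame({'hour': ['2 hr'], 'minutes': ['3 min']})
toy.hour = toy.hour.str.replace('hr', '').astype(int).multiply(60)
toy.minutes = toy.minutes.str.replace('min', '').astype(int)
(toy.hour + toy.minutes).iloc[0] # 123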
Replacing the NaN values
We are almost done with cleaning our data; however, we still have NaN values. While keeping NaN values could portray the data more realistically, we would eventually need to replace them with some value when analysing or visualizing the data.
In the majority of cases, I've seen NaN values replaced by the mean or the median, which wouldn't be of much help here, as these audiobooks don't actually have a rating or stars. It's much preferable to replace the values with zero.
data.isna().sum()
We see an equal number of missing values on the stars and ratings columns, just as we expected. To replace them with 0 we can simply write:
data.fillna(0, inplace=True)
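Note: data.fillna(0) fills every column. A more targeted sketch, given that only the stars and ratings columns hold NaN here, would be:
data[['stars', 'ratings']] = data[['stars', 'ratings']].fillna(0) # fill only the rating-related columns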
We can see that we have no missing values now! Great! To check, we can simply run sample() and it will show us.
Note: sample() selects a row at random; you can pass in a value to get that many randomly selected rows.
data.isna().sum()
data.sample(3)
Finally, we can view the full summary of the cleaned DataFrame using .info()
data.info()
Saving the cleaned dataset
data.to_csv('audible_cleaned.csv', index=False) # write the cleaned data without the index column