Audible Data Cleaning
After scraping the data from Audible.in, the notebook focuses on cleaning the data prior to moving into Exploratory Data Analysis.
- Importing Libraries
- Reading the data
- Cleaning the data
- Changing Datatypes
- Replacing the NaN values
- Saving the cleaned dataset
Importing Libraries
We import Pandas and NumPy, as these two libraries help us clean, edit, and prepare our scraped data. We also import the warnings module. This notebook focuses on data cleaning and data wrangling; for data exploration, you can view the notebook here.
You can find and download the data here.
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore') # ignore any warnings from any code cell
Reading the data
data = pd.read_csv('./audible_uncleaned.csv') # load the scraped dataset into a DataFrame
data.shape # gives us a tuple of (rows, columns)
data.head(3)
data.duplicated().sum() # number of fully duplicated rows
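If this count were non-zero, we could drop the duplicates before going further. A minimal sketch, not needed when the count is 0:
data = data.drop_duplicates().reset_index(drop=True) # keep the first occurrence of each duplicated row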
Cleaning the author and narrator columns
Removing the Writtenby: and Narratedby: prefixes, as those values are redundant.
- Selecting the Pandas Series by using data.author
- Then we use the .str accessor to get to the string values
- And finally, we run the replace method with the value_to_be_replaced and the value_to_be_replaced_with
data.author = data.author.str.replace('Writtenby:', '')
data.narrator = data.narrator.str.replace('Narratedby:', '')
data.sample(3)
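If the scrape also left stray leading or trailing whitespace around the names (an assumption; inspect a few rows first), a small follow-up sketch:
data.author = data.author.str.strip() # trim surrounding spaces from author names
data.narrator = data.narrator.str.strip() # and from narrator names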
Cleaning the stars column
The stars column has three distinct formats of values:
- 5 out of 5 stars34 ratings
- 4.5 out of 5 stars38 ratings
- Not rated yet
- We handle the 3rd format efficiently using the Pandas Series replace method, replacing the value with NaN from the numpy library. Note: this doesn't use the .str accessor, and hence we use inplace=True
- As for the 1st and the 2nd formats, we use regex to split the string into the two columns stars and ratings. Note: I used Regex101 as a guide while writing the regular expression.
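To see what the capture group actually matches before applying it to the whole column, here is a quick standalone check with Python's re module (illustrative only; it is the same pattern used below):
import re
pattern = r'(\d*.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)'
re.split(pattern, '4.5 out of 5 stars38 ratings')
# ['', '4.5 out of 5 stars', '38 ratings'] -> unknown, stars, ratings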
data.stars.replace('Not rated yet', np.nan, inplace=True)
data[['unknown','stars', 'ratings']] = data.stars.str.split(r'(\d*.?\d\s[a-z]+\s[a-z]+\s\d+\s[a-z]+)', n=1, expand=True, regex=True) # n is keyword-only in newer pandas versions
data.drop('unknown', axis=1, inplace=True) # Dropping the unknown column as it only consists of spaces and nan values
Cleaning the price column
The price column stores numbers as strings, with commas as thousands separators and free titles listed as Free. We strip the commas, map Free to 0, and convert the column to float.
data.price = data.price.str.replace(',', '') # remove thousands separators
data.price.replace('Free', 0, inplace=True) # free audiobooks become 0
data.price = data.price.astype('float64')
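Note: if other non-numeric strings remained in the column (an assumption; the astype call above would raise on them), a more forgiving sketch is pandas' to_numeric with coercion:
data.price = pd.to_numeric(data.price, errors='coerce') # unparseable leftovers become NaN instead of raising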
Changing the ratings column to a float value
The ratings column has one format of value:
23 ratings
- We split the data at the space, drop the column which doesn't hold the value, and then change the datatype to float. Note: ratings are generally whole numbers, but since the column has null values it couldn't be changed to int, hence the choice of float
data[['ratings', 'unknown']] = data.ratings.str.split(' ', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.ratings = data.ratings.str.replace(',', '').astype('float64')
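As an aside, if you did want integer ratings despite the missing values, newer pandas versions offer the nullable Int64 dtype. A sketch; this notebook sticks with float64:
data.ratings.astype('Int64') # capital-I Int64 tolerates missing values, storing them as pd.NA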
Similarly, for the stars column, we split at out, keep the numeric part, and convert it to float.
data[['stars', 'unknown']] = data.stars.str.split('out', n=1, expand=True)
data.drop('unknown', axis=1, inplace=True)
data.stars = data.stars.astype('float64')
We also convert the releasedate column to a proper datetime object:
data.releasedate = pd.to_datetime(data.releasedate)
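Note: day-month order can be ambiguous when parsing dates. If the scraped strings are day-first (an assumption worth verifying on a few rows), a hedged alternative is:
data.releasedate = pd.to_datetime(data.releasedate, dayfirst=True) # assumes DD-MM-YY style strings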
Cleaning the time column
The time column mixes values such as Less than 1 minute, hrs and mins. We first normalize the units:
data.time = data.time.str.replace('Less than 1 minute', '1 min')
data.time = data.time.str.replace('mins', 'min') # unify plural and singular minutes
data.time = data.time.str.replace('hrs', 'hr') # unify plural and singular hours
data.head(3)
Here, we split the time twice, once to separate the hours and again to separate the minutes. This again uses regex, as that makes working with strings extremely easy and handy.
Note: it's not necessary to split twice and it can be done once by reusing mins, however splitting twice makes the code easier to read and understand.
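For reference, here is a one-pass sketch using str.extract, assuming the text between the hour and minute parts contains no digits; the notebook keeps the two-step split below:
data.time.str.extract(r'(?:(\d+) hr)?(?:\D*(\d+) min)?') # column 0 = hours, column 1 = minutes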
data[['unknown', 'hour', 'mins']] = data['time'].str.split(r'(\d+ hr)', expand=True, regex=True) # capture the hour part
data.drop(['unknown', 'mins'], axis=1, inplace=True)
data[['hr', 'minutes', 'unknown']] = data['time'].str.split(r'(\d+ min)', expand=True, regex=True) # capture the minute part
data.drop(['unknown', 'hr'], axis=1, inplace=True)
Note: we see that we have None values and not np.nan values after splitting the string. We use .applymap() to apply the lambda function and then sum the results to count the None values. An easier way to find what values a Pandas Series holds is to run df.name_of_column.unique() to see all the unique values. To count the number of unique values in a Pandas Series, run df.name_of_column.nunique().
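For example, on the hour column we just created (illustrative; the exact output depends on the data):
data.hour.unique() # all distinct values, including None
data.hour.nunique() # number of distinct non-null values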
data.applymap(lambda x: x is None).sum() # True for every None cell, summed per column
We fill the None values the same way as NaN values, i.e. with fillna()
data.hour.fillna(value='0 hr', inplace=True)
data.minutes.fillna(value='0 min', inplace=True)
data.applymap(lambda x: x is None).sum()
data.head(3)
Now that we don't have any null values, we remove the string associated with the numbers: hr from the hour column and min from the minutes column.
To convert the entire time from hours & minutes --> minutes, we need to multiply the hours by 60 and then add the minutes.
- After removing the string object, we turn the value into an integer and use the .multiply() method to multiply the Pandas Series.
- Finally, we add the minutes and hour columns, replacing the time column, while dropping the hour and minutes columns.
data.hour = data.hour.str.replace('hr', '').astype(int).multiply(60) # convert hours to minutes
data.minutes = data.minutes.str.replace('min', '').astype(int)
data['time'] = data.minutes + data.hour # total length in minutes
data.drop(['hour', 'minutes'], axis=1, inplace=True)
data.head(3)
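As a quick sanity check of the arithmetic on a toy row (illustrative only), 2 hr 3 min should become 2 * 60 + 3 = 123 minutes:
toy = pd.DataFrame({'hour': ['2 hr'], 'minutes': ['3 min']})
toy.hour = toy.hour.str.replace('hr', '').astype(int).multiply(60)
toy.minutes = toy.minutes.str.replace('min', '').astype(int)
(toy.hour + toy.minutes).iloc[0] # 123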
Replacing the NaN values
We are almost done with cleaning our data; however, we still have NaN values. While keeping NaN values could portray the data more realistically, we would eventually need to replace them with some value when analysing or visualizing the data.
In the majority of cases, I've seen NaN values replaced by the mean or the median, which wouldn't be of much help here, as these audiobooks don't actually have a rating or stars. It's much preferable to replace the values with zero.
data.isna().sum()
We see an equal number of missing values on the stars and ratings columns, just as we expected. To replace them with 0 we can simply write:
data.fillna(0, inplace=True)
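Note: data.fillna(0) fills every column. A more targeted sketch, given that only the stars and ratings columns hold NaN here, would be:
data[['stars', 'ratings']] = data[['stars', 'ratings']].fillna(0) # fill only the rating-related columns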
We can see that we have no missing values now! Great! To check, we can simply run sample() and it will show us.
Note: sample() selects a row at random; you can pass in a value to get that many randomly selected rows.
data.isna().sum()
data.sample(3)
Finally, we can view the full summary of the cleaned DataFrame using .info()
data.info()
Saving the cleaned dataset
data.to_csv('audible_cleaned.csv', index=False) # write the cleaned data without the index column