On the previous post, I have collected the drugs/herbs product claiming to treat cancer sold in Shopee by web scraping. The data is a set of raw data and we cannot interpret anything yet without data analysis. However, data cleaning has to be performed beforehand. Why?
Raw or primary data has missing value, noise, incorrect or inconsistent format, and unnecessary information that we have to deal with to present a reliable observation, analysis, and visualisation. Data cleaning will create a standardised and organised data in uniform format. There is no absolute manner on how to clean data because it will be dependent on how our raw data looks like, what do we want to observe, how we want to present the data, or what issues that we want to study.
Reminder: from our previous session, we save our data frame as herbcancerlist
.
Let’s get started!
Importing Pandas Library
The main library that I would like to use to clean this dataset is Pandas which stands for Python Data Analysis Library. Pandas is a user-friendly library to import and analyse data. In prior to working with Pandas, we have to import it in our IDE.
# Importing module import pandas as pd
Data Preview
We can preview how our data frame looks like at a glance so we can decide which part of the data that we have to wrangle. By dataframe.info()
, we can get the summary of our data frame. We can observe how our data looks like by using other built-in function in Pandas by dataframe.head()
to see the first five rows and dataframe.tail()
to check the last five rows of our data.
herbcancerlist.info() herbcancerlist.head() herbcancerlist.tail()
As we can see from the dataframe.info()
, we get the following information:
Non-null count
shows the number of data that is not missing. On the table, we can see that we have3000 non-null data
which means our data is complete. However, we have to remember that when we scraped the data from Shopee, we set our code to execute no name/ no price/ no value/ no rate/ no city/ no spec if the code can’t find the element we are looking for. So, this information doesn’t actually means that we collected all the information from the target elements.Dtype
tells us the type of each column. It can be something like float64 for float data type or int64 for integer data type. Our data hasobject
dtype which means we have mixed types. We have to adjust the dtype so that they can be visualised and analysed properly in corrected and uniformed format.
From dataframe.head()
and dataframe.tail()
, we notice that:
- In price column: we have currency (Rp) and “.“.
- in sold column: number of product that sold more than a thousand is written in RB (Indonesian language; means thousand) and in decimal format.
Data cleaning
Based on our observation above, the details that we have to clean this data are:
- Drop rows with no name/ no price/ no value/ no rate/ no city/ no spec
- In price column, remove the currency (Rp) and delete the ‘.’
- In sold column, remove the RB/ RB+ and multiply the ones with decimal format by 1000
- Converts data type of each column into the correct data type
# Drop rows with missing value herbcancerlist = herbcancerlist[herbcancerlist.name != 'No name'] # remove row with no name herbcancerlist = herbcancerlist[herbcancerlist.price != 'No price'] # remove row with no price herbcancerlist = herbcancerlist[herbcancerlist.sold != 'No value'] # remove row with no number of sold product herbcancerlist = herbcancerproducts[herbcancerproducts.rate != 'No rate'] # remove row with no rate herbcancerlist = herbcancerlist[herbcancerlist.city != 'No city'] # remove row with no city herbcancerlist = herbcancerlist[herbcancerproducts.specification != 'No spec'] # remove row with no spec herbcancerlist = herbcancerlist[~herbcancerlist.price.str.contains(' - ') == True] # remove rows with range price herbcancerlist = herbcancerlist.reset_index(drop=True) # reset the index and drop the old index # Remove currency in price column herbcancerlist['price'] = herbcancerlist['price'].str.replace('Rp','').str.replace('.','') # Remove the RB or RB+ with multiplication by 1000 in sold column herbcancerlist['sold'] = (herbcancerlist.sold.str.replace(",", ".") .str.replace("RB", "*1000") .str.replace("[^0-9.*]",'') .apply(eval) ) # Converts data type of each column into the correct data type herbcancerlist['price'] = pd.to_numeric(herbcancerlist['price']) # into int herbcancerlist['sold'] = pd.to_numeric(hherbcancerlist['sold']) # into int herbcancerlist['rate'] = pd.to_numeric(herbcancerlist['rate']) # into float
Export Dataset
After conducting data cleaning, our dataset is ready to be exported for further exploration in data visualisation and analysis which I will also post on this blog.
Feel free to give some input and feedback in the comment section below.
Thanks for reading!