Drugs/Herbs Sold in Shopee Claiming to Treat Cancer: Data Cleaning

On the previous post, I have collected the drugs/herbs product claiming to treat cancer sold in Shopee by web scraping. The data is a set of raw data and we cannot interpret anything yet without data analysis. However, data cleaning has to be performed beforehand. Why?

Raw or primary data has missing value, noise, incorrect or inconsistent format, and unnecessary information that we have to deal with to present a reliable observation, analysis, and visualisation. Data cleaning will create a standardised and organised data in uniform format. There is no absolute manner on how to clean data because it will be dependent on how our raw data looks like, what do we want to observe, how we want to present the data, or what issues that we want to study.

Reminder: from our previous session, we save our data frame as herbcancerlist.

Let’s get started!

Importing Pandas Library

The main library that I would like to use to clean this dataset is Pandas which stands for Python Data Analysis Library. Pandas is a user-friendly library to import and analyse data. In prior to working with Pandas, we have to import it in our IDE.

# Importing module
import pandas as pd

Data Preview

We can preview how our data frame looks like at a glance so we can decide which part of the data that we have to wrangle. By dataframe.info(), we can get the summary of our data frame. We can observe how our data looks like by using other built-in function in Pandas by dataframe.head() to see the first five rows and dataframe.tail() to check the last five rows of our data.

herbcancerlist.info()

herbcancerlist.head()

herbcancerlist.tail()

As we can see from the dataframe.info(), we get the following information:

Non-null count shows the number of data that is not missing. On the table, we can see that we have 3000 non-null data which means our data is complete. However, we have to remember that when we scraped the data from Shopee, we set our code to execute no name/ no price/ no value/ no rate/ no city/ no spec if the code can’t find the element we are looking for. So, this information doesn’t actually means that we collected all the information from the target elements.
Dtype tells us the type of each column. It can be something like float64 for float data type or int64 for integer data type. Our data has object dtype which means we have mixed types. We have to adjust the dtype so that they can be visualised and analysed properly in corrected and uniformed format.

From dataframe.head() and dataframe.tail(), we notice that:

In price column: we have currency (Rp) and “.“.
in sold column: number of product that sold more than a thousand is written in RB (Indonesian language; means thousand) and in decimal format.

Data cleaning

Based on our observation above, the details that we have to clean this data are:

Drop rows with no name/ no price/ no value/ no rate/ no city/ no spec
In price column, remove the currency (Rp) and delete the ‘.’
In sold column, remove the RB/ RB+ and multiply the ones with decimal format by 1000
Converts data type of each column into the correct data type

# Drop rows with missing value
herbcancerlist = herbcancerlist[herbcancerlist.name != 'No name']  # remove row with no name
herbcancerlist = herbcancerlist[herbcancerlist.price != 'No price']    # remove row with no price
herbcancerlist = herbcancerlist[herbcancerlist.sold != 'No value'] # remove row with no number of sold product
herbcancerlist = herbcancerproducts[herbcancerproducts.rate != 'No rate']  # remove row with no rate
herbcancerlist = herbcancerlist[herbcancerlist.city != 'No city']  # remove row with no city
herbcancerlist = herbcancerlist[herbcancerproducts.specification != 'No spec'] # remove row with no spec
herbcancerlist = herbcancerlist[~herbcancerlist.price.str.contains(' - ') == True] # remove rows with range price
herbcancerlist = herbcancerlist.reset_index(drop=True)  # reset the index and drop the old index

# Remove currency in price column
herbcancerlist['price'] = herbcancerlist['price'].str.replace('Rp','').str.replace('.','')

# Remove the RB or RB+ with multiplication by 1000 in sold column
herbcancerlist['sold'] = (herbcancerlist.sold.str.replace(",", ".")
                  .str.replace("RB", "*1000")
                  .str.replace("[^0-9.*]",'')
                  .apply(eval)
              )

# Converts data type of each column into the correct data type
herbcancerlist['price'] = pd.to_numeric(herbcancerlist['price']) # into int
herbcancerlist['sold'] = pd.to_numeric(hherbcancerlist['sold']) # into int
herbcancerlist['rate'] = pd.to_numeric(herbcancerlist['rate']) # into float

Export Dataset

After conducting data cleaning, our dataset is ready to be exported for further exploration in data visualisation and analysis which I will also post on this blog.

Feel free to give some input and feedback in the comment section below.

Thanks for reading!