Previously, we collected and prepared data on products sold on Shopee that claim to treat cancer. Now we can explore the data set to answer several questions, such as:
- How are the prices distributed?
- Which products are selling the most?
- Which products are the most expensive? How well do they sell?
- Is there any correlation between price and sales?
- What are the most common words that show up in the product names?
- What are the most common words that show up in the product specifications?
Reminder: the data frame is stored as herbcancerproducts.
Import Libraries
We have to import the following libraries:
    import pandas as pd
    import seaborn as sns
    from matplotlib import pyplot as plt
    import numpy as np
    %matplotlib inline
    import six
    from PIL import Image  # converting images into arrays
    from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
    from os import path
Preparation
Before visualising the data, we tidy some attributes so that they look less chaotic in the graphs. The name column of herbcancerproducts contains very long strings, so we make a copy of the data frame with shortened values in that column. The copied data frame is named herbcancerproducts_copy.
    # Copy the herbcancerproducts data frame so the 'name' column can be shortened
    herbcancerproducts_copy = herbcancerproducts.copy()
    # Shorten the 'name' column to its first 30 characters, followed by '...'
    herbcancerproducts_copy['name'] = herbcancerproducts_copy['name'].apply(lambda x: x[:30] + '...')
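Note that the lambda above appends '...' to every name, including names shorter than 30 characters. A minimal sketch of a variant that adds the ellipsis only when a name is actually cut is shown below. The helper name shorten and its limit parameter are hypothetical and not part of the original notebook.

```python
def shorten(name, limit=30):
    """Return name cut to `limit` characters, adding '...' only if it was cut."""
    if len(name) <= limit:
        return name
    return name[:limit] + '...'


# Hypothetical product names, for illustration only
print(shorten('Obat Herbal'))
print(shorten('Obat Herbal Kanker Payudara Tumor Servik Original'))
```

It could be applied in the same way as the lambda, with herbcancerproducts_copy['name'].apply(shorten).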
Price distribution
The price distribution gives us a general overview of the pricing strategies for drugs/herbs claiming to treat cancer.
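    # Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement
    sns.distplot(herbcancerproducts_copy['price'], color='orange')
    plt.ticklabel_format(style='plain', axis='both')
    plt.title('Price Distribution')
    plt.xlabel('Price')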
Output:
We can see that most of the products are sold for less than IDR 500,000.
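To back the visual impression with a number, the share of products below a price threshold can be computed directly. The sketch below uses a small made-up series instead of the scraped data; the prices and the threshold variable are illustrative assumptions.

```python
import pandas as pd

# Made-up prices in IDR, standing in for herbcancerproducts_copy['price']
prices = pd.Series([75000, 120000, 250000, 480000, 2000000], name='price')

threshold = 500000
# The comparison gives a boolean Series; its mean is the fraction of True values
share_below = (prices < threshold).mean()

print(f'Share of products below IDR {threshold:,}: {share_below:.0%}')
print(prices.describe())  # count, mean, std, min, quartiles and max
```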
The Most Expensive Products
Having looked at the spread of product prices, we want to explore them in more detail. We start by observing which products are the most expensive.
    # Note: head() returns the first 5 rows by default; use head(10) for a full top 10
    top_10_price = herbcancerproducts_copy.sort_values(by='price', ascending=False).head()
    # Creating Bar Plot
    top_10_price.plot(x='name', y='price', kind="barh", color='orange', fontsize=10)
    plt.ticklabel_format(style='plain', useOffset=False, axis='x')
    plt.xlabel('Price (IDR)')
    plt.ylabel('Name')
Output:
The bar plot shows that one product stands out from the others, with a price of almost IDR 3,000,000. Several products are sold at the same price of IDR 2,000,000.
Most Selling Products
A high price does not necessarily mean low sales, and vice versa. Besides price, sales are another thing to investigate.
    # Most selling products
    import dataframe_image as dfi
    top_selling = herbcancerproducts_copy.sort_values('sold', ascending=False)
    # .copy() avoids a possible SettingWithCopyWarning when 'name' is modified below
    top_selling = top_selling[:40].copy()
    top_selling['name'] = top_selling['name'].apply(lambda x: x[:50])
    df_styled = top_selling.drop(['city', 'specification', 'price'], axis=1).reset_index(drop=True).style.background_gradient()
    df_styled.set_properties(**{'text-align': 'left'})
    pd.set_option('colheader_justify', 'center')
    dfi.export(df_styled, "mytable.png")
Output:
As we can see from the table above, 28 products have reached ten thousand sales. The issue with counting sales by scraping Shopee is that the website displays at most ten thousand; anything beyond that is shown only as 10K+. This means the sales figures for products that have sold more than ten thousand units are not precise.
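If the raw sales labels are still available, the cap can be kept explicit instead of being treated as an exact number. The following is a sketch only: it assumes a label is either a plain integer or a number with a K suffix and an optional trailing +, as in the 10K+ case described above. The function name parse_sold and the sample labels are hypothetical.

```python
def parse_sold(label):
    """Parse a displayed sales label into (count, is_lower_bound).

    Assumed formats (not verified against Shopee): '532', '1.2K', '10K+'.
    """
    text = label.strip().upper().replace(',', '')
    is_lower_bound = text.endswith('+')  # '10K+' means "at least 10,000"
    text = text.rstrip('+')
    if text.endswith('K'):
        count = int(round(float(text[:-1]) * 1000))
    else:
        count = int(text)
    return count, is_lower_bound


for label in ['532', '1.2K', '10K+']:
    print(label, parse_sold(label))
```

Keeping the is_lower_bound flag makes it possible to exclude capped products, or report them separately, in any later sales analysis.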
I tried grouping all of the products by name. This shows that some products are sold under the same name but as separate listings (bought through different clicks), so grouping them narrows down the number of products.
    # Group products by name and sum up the number of products sold
    # ('sold' is selected before summing so that non-numeric columns are not summed)
    name_sorted = herbcancerproducts_copy.groupby('name')['sold'].sum().to_frame()
    name_sorted
Output:
After grouping the products by name and summing the number of products sold, the total narrows down to 1666 products from the 3000 products that we scraped. Since the sales figures are accumulated during grouping, we can now see the top ten products with the highest sales.
    # Top 10 most selling products
    top_sold = name_sorted.sort_values('sold', ascending=False)
    top_sold = top_sold[:10]
    # Visualizing top 10 most selling products
    top_sold.plot(kind='barh', color='orange')
    plt.title('Top 10 Most Selling Products')
    plt.xlabel('Amount of Product Sold')
    plt.ylabel('Product Name')
Acep Herbal and Prostero sold the most drugs/herbs claiming to treat cancer on Shopee, with more than thirty thousand products sold.
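The group-and-sum step used in this section can be illustrated on toy data. The listings below are made up and only show the mechanics: rows that share a name collapse into one row whose sold value is the sum over its listings.

```python
import pandas as pd

# Made-up listings: the same product name appears in separate listings
listings = pd.DataFrame({
    'name': ['Product A', 'Product B', 'Product A', 'Product C', 'Product B'],
    'sold': [10000, 250, 10000, 40, 750],
})

# One row per name, with the sales of all its listings added together
totals = listings.groupby('name')['sold'].sum().sort_values(ascending=False)
print(totals)
```

Because capped listings enter the sum as 10,000, totals such as the one for Product A here should be read as lower bounds rather than exact figures.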
Correlation of Price and Number of Product Sold
Next, we want to see whether price and sales are correlated with each other. We can check this by creating a scatter plot.
    data = pd.concat([herbcancerproducts_copy['price'], herbcancerproducts['sold']], axis=1)
    data.plot.scatter(x='price', y='sold', ylim=(0, 12000), color='orange')
    plt.ticklabel_format(style='plain')
    plt.title('Correlation Between Products Price and Sales')
    plt.xlabel('Price (IDR)')
    plt.ylabel('Number of Products Sold')
Output:
Overall, the scatter plot shows that lower-priced products have sold more units than the expensive ones.
Even though a scatter plot can be used to visualise the relationship between two variables, further analysis is needed, for example by using .corr(). Also, a kind reminder: correlation does not mean causation.
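As a minimal sketch of that follow-up, pandas' DataFrame.corr can compute both the Pearson and the Spearman coefficient. The values below are made up and merely stand in for the price and sold columns; they are not results from the scraped data.

```python
import pandas as pd

# Made-up values standing in for the 'price' and 'sold' columns
df = pd.DataFrame({
    'price': [10000, 50000, 100000, 500000, 2000000],
    'sold': [9000, 7000, 3000, 500, 10],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson = df.corr(method='pearson').loc['price', 'sold']
spearman = df.corr(method='spearman').loc['price', 'sold']

print(f'Pearson:  {pearson:.2f}')
print(f'Spearman: {spearman:.2f}')
```

Since sales are capped at ten thousand and the prices are heavily skewed, a rank-based coefficient such as Spearman may be the more informative of the two for this data set.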
The most common words that show up in the names of the products
By using a word cloud, we can see which keywords buyers most often use to search for the drugs/products that they would like to buy.
    text_name = ' '.join(i for i in herbcancerproducts.name)
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(stopwords=stopwords, background_color='white', width=800, height=400).generate(text_name)
    plt.figure(figsize=(16, 9))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
Output:
‘Obat Herbal’, ‘Obat Kanker’, ‘Kanker Payudara’, ‘Kanker Tumor’, and ‘Kanker Servik’ (English: ‘Herbal Drug’, ‘Cancer Drug’, ‘Breast Cancer’, ‘Tumor Cancer’, and ‘Cervical Cancer’) are the most frequent phrases used on Shopee to search for cancer treatments.
From this analysis, we find that breast cancer and cervical cancer are the cancer types for which customers most commonly look up treatments in e-commerce.
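A word cloud conveys frequency only through font size, so exact counts can be a useful complement. The sketch below counts single words and two-word phrases with collections.Counter; the three product names are made up for illustration and are not taken from the scraped data.

```python
from collections import Counter

# Made-up product names, for illustration only
names = [
    'Obat Herbal Kanker Payudara',
    'Obat Kanker Servik Herbal',
    'Obat Herbal Tumor',
]

word_counts = Counter()
bigram_counts = Counter()
for name in names:
    words = name.lower().split()
    word_counts.update(words)
    # Pair each word with the next one to count two-word phrases
    bigram_counts.update(' '.join(pair) for pair in zip(words, words[1:]))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```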
The most common words that show up in the product specifications
The product specification provides several fields where sellers can write detailed information about the products, such as the registration number from the National Food and Drug Agency and the expiry date.
    text_spec = ' '.join(i for i in herbcancerproducts.specification)
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(stopwords=stopwords, background_color='white', width=800, height=400).generate(text_spec)
    plt.figure(figsize=(16, 9))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
Output:
The word cloud of the specifications shows that most of the products are registered and have a registration number from the National Food and Drug Agency. It also lets us know that most products are distributed from Jakarta.
Drawbacks
The problems with analysing data obtained by scraping Shopee are:
- Shopee caps the displayed number of units sold per product at ten thousand, so the results of the sales analysis are not precise.
- The same product can be sold through different listings (different clicks) and at various prices, which makes it quite challenging to investigate product pricing strategies on Shopee.
- By looking at the specification column, it is possible to check whether the herbs/drugs are registered with the National Food and Drug Agency. However, sellers do not always write that information in the column provided by Shopee. Some of them put the information in the description column or even in the product name column. Therefore, it is really challenging to explore whether buyers are actually concerned about herb/drug registration and other information.
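One way to partly work around the last point is to search several text columns at once for a registration mention. The sketch below is an assumption-heavy illustration: it assumes sellers refer to the agency by its Indonesian acronym, BPOM, and it uses made-up rows for the name and specification columns. A match only indicates that the acronym is mentioned, not that the registration is valid.

```python
import pandas as pd

# Made-up rows standing in for the 'name' and 'specification' columns
products = pd.DataFrame({
    'name': ['Herbal A BPOM', 'Herbal B', 'Herbal C'],
    'specification': ['Stok 10', 'No. BPOM tercantum', 'Stok 5'],
})

# Assumption: registration is mentioned with the acronym 'BPOM' (any letter case)
text_columns = ['name', 'specification']
mentions = pd.DataFrame({
    col: products[col].str.contains('bpom', case=False, na=False)
    for col in text_columns
})
# A product is flagged if any of its text columns mentions the acronym
products['mentions_bpom'] = mentions.any(axis=1)

print(products[['name', 'mentions_bpom']])
```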
Further investigation
From this visualisation and analysis, more research could be performed, such as:
- Price and sales analysis of cancer drugs currently in clinical use
- A price comparison between drugs/herbs claiming to treat cancer sold in e-commerce and cancer drugs in clinical use
- Total cost of standard cancer treatment
By understanding the points above, we could try to understand why people tend to buy drug/herb treatments through e-commerce rather than only following the clinical treatment advised by doctors, other than because of their beliefs.
Featured image by Lukas Blazek on Unsplash.