Previously, we collected and prepared data on products sold on Shopee that claim to treat cancer. Now we can explore the data set to answer several questions, such as:

  • How are the prices distributed?
  • Which products sell the most?
  • Which products are the most expensive? How well do they sell?
  • Is there any correlation between price and sales?
  • What are the most common words in the product names?
  • What are the most common words in the product specifications?

Reminder: the data frame is stored as herbcancerproducts.

Import Libraries

We have to import the following libraries:

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
import six

from PIL import Image #converting images into arrays
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from os import path

Preparation

Prior to data visualisation, some data attributes are prepared so that they look less cluttered in the graphs. The name column of herbcancerproducts contains very long text, so we make a copy of the data frame with shortened values in that column. The copied data frame is called herbcancerproducts_copy.

# Copy the herbcancerproducts data frame so that 'name' can be shortened without touching the original
herbcancerproducts_copy = herbcancerproducts.copy()

# Keep the first 30 characters of 'name' and append '...'
herbcancerproducts_copy['name'] = herbcancerproducts_copy['name'].apply(lambda x: x[:30] + '...')
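
The line above appends '...' to every name, including names that are already shorter than 30 characters. If that is not wanted, a small helper along the following lines could be used instead. This is only a sketch: shorten is a hypothetical function name introduced here, and the sample names are made up.

```python
def shorten(text, limit=30, suffix='...'):
    """Return text unchanged if it fits, otherwise cut it and add a suffix."""
    if len(text) <= limit:
        return text
    return text[:limit] + suffix

# Made-up product names for illustration only
short_name = shorten('Herbal Tea')
long_name = shorten('A very long product name that keeps going and going')

print(short_name)
print(long_name)
```

It could be applied in the same way as the lambda, for example with herbcancerproducts_copy['name'].apply(shorten). Note that changing the shortening rule can also change how names group together later on.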

Price distribution

The price distribution gives us a general overview of the pricing strategies for drugs/herbs claiming to treat cancer.

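# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(herbcancerproducts_copy['price'], kde=True, stat='density', color='orange')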
plt.ticklabel_format(style='plain',axis='both')
plt.title('Price Distribution')
plt.xlabel('Price')

Output:

We can see that most of the products are priced below IDR 500,000.
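
To put a number on that visual impression, a minimal sketch like the following could be used. The prices are made-up values standing in for herbcancerproducts_copy['price'], and the IDR 500,000 threshold is simply the one read off the plot.

```python
import pandas as pd

# Made-up prices in IDR, standing in for herbcancerproducts_copy['price']
prices = pd.Series([35_000, 80_000, 120_000, 150_000, 250_000,
                    400_000, 450_000, 900_000, 2_000_000, 2_950_000],
                   name='price')

# Median and upper quantiles show where the bulk of the prices sit
summary = prices.quantile([0.5, 0.75, 0.9])

# Fraction of products priced below IDR 500,000
share_below_500k = (prices < 500_000).mean()

print(summary)
print(f"Share below IDR 500,000: {share_below_500k:.0%}")
```

With the real data, only the prices series needs to be replaced by the price column.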

The Most Expensive Products

Having looked at the spread of product prices, we can explore them in more detail. We will start by observing which products are the most expensive.

# head() returns only 5 rows by default, so ask for 10 explicitly
top_10_price = herbcancerproducts_copy.sort_values(by='price', ascending=False).head(10)

# Creating Bar Plot
top_10_price.plot(x='name',y='price', kind="barh", color='orange',fontsize=10)
plt.ticklabel_format(style='plain', useOffset=False, axis='x')
plt.xlabel('Price (IDR)')
plt.ylabel('Name')

Output:

The bar plot shows that one product stands out from the others with a price of almost IDR 3,000,000. Several products are sold at the same price of IDR 2,000,000.

Most Selling Products

A high price does not necessarily mean lower sales, and vice versa. Besides price, sales are therefore worth investigating.

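# Most selling products
import dataframe_image as dfi  # third-party package used to export a styled table as an image

# Take the 40 listings with the highest sales; .copy() avoids a possible SettingWithCopyWarning below
top_selling = herbcancerproducts_copy.sort_values('sold', ascending=False).head(40).copy()
top_selling['name'] = top_selling['name'].apply(lambda x: x[:50])

# Drop the columns that are not needed in the table and shade the numeric cells
df_styled = top_selling.drop(['city','specification','price'], axis=1).reset_index(drop=True).style.background_gradient()
df_styled.set_properties(**{'text-align': 'left'})
pd.set_option('colheader_justify', 'center')

# Save the styled table as an image
dfi.export(df_styled,"mytable.png")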

Output:

As we can see from the table above, 28 products have reached ten thousand sales. The issue with counting the number of products sold by web scraping Shopee is that the maximum number of sales shown on the website is ten thousand; anything above that is displayed only as 10K+. This means that the sales figures for products that have sold more than ten thousand units are not precise.
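
One way to keep this limitation visible during analysis is to flag the rows that sit at the cap. The sketch below assumes that "10K+" was stored as the number 10000 during data preparation; the sample rows and the sold_is_capped column name are made up for illustration.

```python
import pandas as pd

# Made-up sample; the assumption is that "10K+" was stored as 10000
sample = pd.DataFrame({
    'name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'sold': [10000, 3200, 10000, 45],
})

SALES_CAP = 10000  # highest value shown before the site switches to "10K+"

# True where the real sales figure may be higher than the stored one
sample['sold_is_capped'] = sample['sold'] >= SALES_CAP

n_capped = int(sample['sold_is_capped'].sum())
print(f"{n_capped} of {len(sample)} products are at the cap")
```

Rows flagged this way are lower bounds rather than exact counts, which matters for any sum or ranking built on the sold column.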

I grouped all of the products by name, which shows that some products are sold under the same name but through separate listings. Grouping them narrows down the number of products.

# Group products by name and sum the number of products sold
# (selecting 'sold' before summing avoids aggregating the text columns)
name_sorted = herbcancerproducts_copy.groupby('name')[['sold']].sum()
name_sorted

Output:

After grouping by name and summing the number of products sold, the 3000 products that we scraped narrow down to 1666 products. Since the sales are accumulated per name during grouping, we can now see the top ten products with the highest sales.

# Top 10 most selling products
top_sold = name_sorted.sort_values('sold', ascending=False)
top_sold = top_sold[:10]

# Visualizing top 10 most selling products
top_sold.plot(kind='barh', color='orange')

plt.title('Top 10 Most Selling Products')
plt.xlabel('Amount of Product Sold')
plt.ylabel('Product Name')

Acep Herbal and Prostero have sold the most drugs/herbs claiming to treat cancer on Shopee, with more than thirty thousand products sold.
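
Because several listings can share one name, it may also help to keep the number of listings and the price range while grouping. The following is only a sketch on a made-up data frame. The price and sold columns follow the data set, while n_listings, total_sold, min_price and max_price are names introduced here for illustration.

```python
import pandas as pd

# Made-up listings; 'Herb X' and 'Herb Y' are placeholder product names
listings = pd.DataFrame({
    'name':  ['Herb X', 'Herb X', 'Herb Y', 'Herb X', 'Herb Y'],
    'price': [100_000, 120_000, 50_000, 95_000, 50_000],
    'sold':  [500, 10000, 30, 200, 70],
})

# Named aggregation: one output column per (source column, function) pair
per_name = listings.groupby('name').agg(
    n_listings=('sold', 'size'),
    total_sold=('sold', 'sum'),
    min_price=('price', 'min'),
    max_price=('price', 'max'),
)

print(per_name.sort_values('total_sold', ascending=False))
```

A large gap between min_price and max_price for one name is a hint that the same product is being sold at very different prices.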

Correlation of Price and Number of Product Sold

Next, we want to see whether price and sales are correlated with each other. We can check this by creating a scatter plot.

# Take both columns from the same data frame instead of mixing the copy and the original
data = herbcancerproducts_copy[['price', 'sold']]
data.plot.scatter(x='price', y='sold', ylim=(0,12000), color='orange')

plt.ticklabel_format(style='plain')

plt.title('Correlation Between Products Price and Sales')
plt.xlabel('Price (IDR)')
plt.ylabel('Number of Products Sold')

Output:

Overall, the scatter plot shows that lower-priced products have sold more units than the expensive ones.

A scatter plot can hint at the relationship between two variables, but it should be backed up with a numeric measure such as .corr(). Also, a kind reminder: correlation does not imply causation.
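
As a rough illustration of the .corr() step, the sketch below computes both the Pearson and the Spearman coefficient. The numbers are made up and only mimic the shape of the data (low prices with capped sales, high prices with few sales); they are not results from the scraped data set.

```python
import pandas as pd

# Made-up values standing in for the 'price' and 'sold' columns
data = pd.DataFrame({
    'price': [50_000, 75_000, 120_000, 300_000, 900_000, 2_000_000],
    'sold':  [10000, 10000, 4200, 800, 150, 12],
})

# Pearson measures linear association between the two columns
pearson = data.corr(method='pearson').loc['price', 'sold']

# Spearman works on ranks, so it is less sensitive to outliers and skewed values
spearman = data.corr(method='spearman').loc['price', 'sold']

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Since prices are strongly skewed and sales are capped at ten thousand, the rank-based Spearman coefficient is probably the more informative of the two here.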

The most common words in product names

By using a word cloud, we can see which keywords buyers most often use to search for the drugs/products that they would like to buy.

# Join all product names into one string; dropna() skips missing names
text_name = ' '.join(herbcancerproducts['name'].dropna())
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color='white', width=800, height=400).generate(text_name)
plt.figure(figsize=(16,9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Output:

‘Obat Herbal’, ‘Obat Kanker’, ‘Kanker Payudara’, ‘Kanker Tumor’, and ‘Kanker Servik’ (English: ‘Herbal Drug’, ‘Cancer Drug’, ‘Breast Cancer’, ‘Tumor Cancer’, and ‘Cervical Cancer’) are the most frequent terms used on Shopee to search for cancer treatments.

From this analysis, we find that breast cancer and cervical cancer are the cancer types for which customers most often look for treatment on e-commerce platforms.
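
A word cloud shows frequency only through font size, so it can help to count the terms directly as well. The sketch below counts two-word phrases with collections.Counter. The three product names are made up, and bigrams is a hypothetical helper introduced here.

```python
from collections import Counter

# Made-up product names standing in for herbcancerproducts['name']
names = [
    'Obat Herbal Kanker Payudara',
    'Obat Kanker Servik Herbal',
    'Obat Herbal Kanker Payudara Ampuh',
]

def bigrams(text):
    """Return the list of adjacent two-word phrases in text, lower-cased."""
    words = text.lower().split()
    return [' '.join(pair) for pair in zip(words, words[1:])]

# Count every two-word phrase across all names
counts = Counter(bg for name in names for bg in bigrams(name))

for phrase, n in counts.most_common(3):
    print(phrase, n)
```

Two-word phrases are counted here because the terms read off the word cloud, such as ‘Obat Herbal’, are pairs of words.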

The most common words in product specifications

The product specification provides several fields where sellers can write detailed information about the products, such as the registration number from the National Food and Drug Agency and the expiry date.

# Join all specifications into one string; dropna() skips products without a specification
text_spec = ' '.join(herbcancerproducts['specification'].dropna())
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color='white', width=800, height=400).generate(text_spec)
plt.figure(figsize=(16,9))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Output:

The word cloud of the specification column shows that most of the products are registered and have a registration number from the National Food and Drug Agency. It also tells us that most products are distributed from Jakarta.

Drawbacks

The problems with analysing data obtained by web scraping Shopee are:

  • Shopee caps the displayed number of units sold per product at ten thousand, so the results of the sales analysis are not precise.
  • The same product can be sold through separate listings and at various prices, which makes it quite challenging to investigate product pricing strategies on Shopee.
  • By looking at the specification column, it is possible to check whether the herbs/drugs are registered with the National Food and Drug Agency. However, sellers do not always write that information in the column provided by Shopee. Some of them put it in the description column or even in the product name. Therefore, it is really challenging to explore whether buyers actually care about herb/drug registration and other information.
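
For the last point, one possible workaround is to search several text columns at once for a registration keyword. The sketch below is only an illustration under stated assumptions: the keyword 'bpom' (the Indonesian abbreviation of the National Food and Drug Agency) is an assumed marker, the sample rows are made up, and a keyword match does not prove that a product is actually registered.

```python
import pandas as pd

# Made-up rows; the column names follow the data set
products = pd.DataFrame({
    'name': ['Herbal A BPOM', 'Herbal B', 'Herbal C'],
    'specification': ['Stock 10', 'BPOM number available', None],
})

KEYWORD = 'bpom'  # assumed marker for a registration mention
columns_to_search = ['name', 'specification']

# One boolean column per searched field; missing text is treated as empty
mentions = pd.DataFrame({
    col: products[col].fillna('').str.contains(KEYWORD, case=False, regex=False)
    for col in columns_to_search
})

# A product counts as mentioning registration if any searched field matches
products['mentions_registration'] = mentions.any(axis=1)

print(products[['name', 'mentions_registration']])
```

If a description column were scraped as well, it could be added to columns_to_search in the same way.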

Further investigation

From this visualisation and analysis, more research could be performed, such as:

  • Price and sales analysis of cancer drugs currently in clinical use
  • A price comparison between drugs/herbs claiming to treat cancer on e-commerce platforms and cancer drugs in clinical use
  • Total cost of standard cancer treatment

By understanding the points above, we could try to understand why people tend to buy drugs/herbs on e-commerce platforms rather than just following the clinical treatment advised by doctors, for reasons other than their beliefs.

Featured image by Lukas Blazek on Unsplash.
