Across many cultural beliefs, alternative herbal treatments have been believed to cure almost anything, including cancer. It is true that many laboratory experiments on cells and animals have been performed to support claims about the efficacy and safety of these "natural" remedies. Unfortunately, such results do not mean the remedies are automatically applicable to human use (source: link).
The high expectation of relying only on "natural" treatments to cure cancer gives patients false hope, which can lead to more critical situations, such as delaying or even abandoning standard cancer therapy. As a consequence, patients lose their chance of recovery and may shorten their life expectancy.
It is very easy to find and buy drugs or herbs sold with marketing (over)claims. Imagine: how could a single capsule treat any kind of cancer, diabetes, and high cholesterol all at once? Online shopping platforms have been soaring lately, and this makes such products even more accessible. Even cancer treatment covered for free by national health insurance has not put this business out of operation.
Thus, I was interested in finding out how many products with cancer claims are sold on Shopee, one of the online shopping platforms in Indonesia. I wanted to look at the price range, rating, number of units sold, the city the drugs ship from, and the product specifications written by the seller, such as side effects and the registration claim/number from the national food and drug agency (BPOM). I gathered these data by web scraping with several Python libraries.
Brief description
First, we'll go to the Shopee Indonesia website and search for 'obat kanker' (English: cancer drugs). This returns pages of search results with a maximum of sixty products per page. Since we would like more details about each product, we have to open each product's own page.
So, we are going to collect the links of every single product (60 products on each of 50 pages, or 3,000 products in total). Then we'll visit each product page individually to get its information and description.
So, our program needs to (a minimal skeleton of this flow is sketched after the list):
- Load the Shopee Indonesia homepage (shopee.co.id)
- Get the search keywords from the command line arguments
- Retrieve the search results pages
- Open a browser tab for each result
- Get the list of links to each product page
- Capture the information needed from every product
- Save the information in a data frame
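Here is a rough, illustrative skeleton of that flow. The function names are placeholders of my own; their real implementations appear step by step in the rest of this post.

import sys

def collect_product_links(keyword, pages=50):
    """Open every search results page for `keyword` and return the product URLs found."""
    return []  # implemented later with Selenium + BeautifulSoup

def scrape_product(url):
    """Visit a single product page and return its name, price, rating, city, and specification."""
    return {}  # implemented later

if __name__ == '__main__':
    keyword = sys.argv[1] if len(sys.argv) > 1 else 'obat kanker'
    links = collect_product_links(keyword)
    records = [scrape_product(url) for url in links]
    print(len(records), 'products scraped for keyword:', keyword)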
Website Preview
URL
The base URL of the website that we would like to scrape is https://shopee.co.id. We also need to know the URL of the search results page, which is https://shopee.co.id/search?keyword=obat%20kanker&page={page}.
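As a quick illustration (using Python's standard urllib, which is not part of the scraping code itself), the keyword is URL-encoded and the page number is passed as a zero-indexed query parameter:

from urllib.parse import quote

keyword = 'obat kanker'
for page in range(0, 3):
    # 'obat kanker' becomes 'obat%20kanker'; page numbering starts at 0
    print('https://shopee.co.id/search?keyword={}&page={}'.format(quote(keyword), page))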
Element Inspection
Using Inspect Element in the browser, we can look at the HTML structure of the page and identify which parts of the page we would like to scrape.
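For example, inspecting a search results page shows that every product card sits in a div element with a specific class. The class name below is the one used later in this post, and the HTML fragment is a made-up miniature of that structure, just to show how the class is targeted with BeautifulSoup; Shopee can change its class names at any time, so they always need to be re-checked with Inspect Element.

from bs4 import BeautifulSoup

# a tiny, made-up fragment that mimics the structure we see with Inspect Element
html = '''
<div class="col-xs-2-4 shopee-search-item-result__item">
  <a href="/produk-contoh-i.123.456">Produk contoh</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all('div', class_='col-xs-2-4 shopee-search-item-result__item')
print(len(cards), 'product card(s) found')  # -> 1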
Pre-requisites
In this tutorial, I use Visual Studio Code as the IDE on macOS.
Make sure to install the libraries below in a virtual environment:
requests
- to download the page
BeautifulSoup
- to extract the search result links from the downloaded HTML
- to parse the HTML data
Selenium
- acts like a human user interacting with the page, letting Python control the browser by programmatically clicking links and filling in login information
- more advanced than Requests and BeautifulSoup
- recommended when we would like to scrape a web page that depends on JavaScript code that updates the page
Chrome driver
- to control Chrome (installation guide here)
Time
- time.sleep adds some delay between requests so we don't overload the site with too many requests at once
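Once they are installed (for example with pip inside the virtual environment), an optional sanity check that everything is importable looks like this:

# print the installed version of each package used in this tutorial
import requests, bs4, selenium, pandas
for pkg in (requests, bs4, selenium, pandas):
    print(pkg.__name__, getattr(pkg, '__version__', 'unknown'))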
Import Libraries
First, we have to import the packages below:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
import time
import pandas as pd
Manipulate the Chrome Options class
Next, we customize the properties of the Chrome driver to match our preferences.
# create an object for Chrome options
chrome_options = Options()

# customize the Chrome display
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1365,6572.610")
chrome_options.add_argument('--disable-infobars')
start-maximized
- to open Chrome in maximized mode
no-sandbox
- to deactivate all processes that are normally sandboxed
headless
- to run the Chrome browser in a headless environment, i.e. to browse with no user interface or visible Graphical User Interface (GUI)
- since no window is displayed in headless mode, we also won't see a Chrome window with the message "Chrome is being controlled by Selenium" on top
disable-notifications
- to disable web notifications and the push APIs
window-size
- to set the size of the invisible browser window
disable-infobars
- to stop Chrome from showing "Chrome is being controlled by automated software"
Starting the Browser
In Selenium, WebDriver is used to control the web browser. Since we have already installed the Selenium package and imported the Chrome classes, we can start the browser with the code below:
path = '/Applications/chromedriver'  # path where the chromedriver is saved
webdriver_service = Service(path)
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
Retrieve product links
We would like to obtain the product links from fifty pages of search results, with sixty products per page. Using a loop, we go through every product element on each results page, extract the link of each product, and save it in a list named product_links.
baseurl = 'https://shopee.co.id'
product_links = []

for page in range(0, 50):
    search_link = 'https://shopee.co.id/search?keyword=obat%20kanker&page={}'.format(page)
    driver.get(search_link)
    WebDriverWait(driver, 80).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result"))
    )
    driver.execute_script("""
        var scroll = document.body.scrollHeight / 10;
        var i = 0;
        function scrollit(i) {
            window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
            i++;
            if (i < 10) {
                setTimeout(scrollit, 500, i);
            }
        }
        scrollit(i);
    """)
    sleep(5)

    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(html, "html.parser")

    product_list = soup.find_all('div', class_='col-xs-2-4 shopee-search-item-result__item')
    for item in product_list:
        for link in item.find_all('a', href=True):
            product_links.append(baseurl + link['href'])
Because the search results load dynamically as we scroll down the page, we need to use driver.execute_script to run a JavaScript scroller while scraping the products. The script above scrolls by a tenth of the page's height, pauses for 500 milliseconds, and then continues.
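The same effect can also be achieved from Python instead of injecting a JavaScript timer; a sketch (assuming driver is the WebDriver created earlier and sleep is the function imported above) would be:

height = driver.execute_script("return document.body.scrollHeight")
for _ in range(10):
    # scroll down a tenth of the page, then give the lazy-loaded products time to render
    driver.execute_script("window.scrollBy(0, arguments[0]);", height / 10)
    sleep(0.5)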
I set an explicit wait with WebDriverWait (80 seconds here) to halt program execution until the condition we pass, locating all of the search result elements, resolves.
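Because TimeoutException is already imported, the wait can optionally be wrapped in a small helper so that a page that never finishes loading is skipped instead of stopping the whole run. This is a refinement of my own, not part of the code above:

def wait_for_results(driver, timeout=80):
    """Return True if the search results grid shows up within `timeout` seconds."""
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result"))
        )
        return True
    except TimeoutException:
        return False

Inside the page loop, a page where wait_for_results(driver) returns False can then simply be skipped with continue.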
Get the product information
Next, we iterate over the product links obtained above, open each product's page, and gather the individual product information, such as the name, price, rating, specification, and the city the product ships from.
herbcancerlist = []

for link in product_links:
    driver.get(link)
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.CLASS_NAME, "_2VZg1J"))
    )
    driver.execute_script("""
        var scroll = document.body.scrollHeight / 10;
        var i = 0;
        function scrollit(i) {
            window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
            i++;
            if (i < 10) {
                setTimeout(scrollit, 500, i);
            }
        }
        scrollit(i);
    """)
    sleep(10)

    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(html, "html.parser")

    try:
        name = soup.find('div', class_='_2rQP1z').get_text()
        price = soup.find('div', class_='_2Shl1j').get_text()
        sold = soup.find('div', class_='HmRxgn').get_text()
        rate = soup.find('div', class_='_3y5XOB _14izon').get_text()
        city = soup.find('span', class_='_2fJrvA').get_text()
        specification = soup.find('div', class_='_2jz573').get_text()
    except:
        name = 'No name'
        price = 'No price'
        sold = 'No value'
        rate = 'No rate'
        city = 'No city'
        specification = 'No spec'

    herbcancer = {
        'name': name,
        'price': price,
        'sold': sold,
        'rate': rate,
        'city': city,
        'specification': specification
    }
    herbcancerlist.append(herbcancer)
    print('Saving: ', herbcancer['name'])
I use a try-except block so that when the code cannot find the elements we are looking for, the fields are recorded as 'No name', 'No price', 'No value', 'No rate', 'No city', and 'No spec' instead.
Finally, we turn all of the information we have collected in herbcancerlist into a dataframe:
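A possible refinement, different from the all-or-nothing try-except above, is to look up each field separately so that one missing element does not blank out all of the others. The helper name here is mine; the class names are the same ones used in the loop:

def get_text_or(soup, tag, css_class, default):
    """Return the text of the first matching element, or a default value if it is missing."""
    element = soup.find(tag, class_=css_class)
    return element.get_text() if element else default

# examples, assuming `soup` is the parsed product page from the loop above
name = get_text_or(soup, 'div', '_2rQP1z', 'No name')
price = get_text_or(soup, 'div', '_2Shl1j', 'No price')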
df = pd.DataFrame(herbcancerlist)
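Optionally, the dataframe can also be written to disk and the headless browser closed at this point; the CSV file name below is just illustrative:

df.to_csv('shopee_obat_kanker.csv', index=False)  # keep a copy for the follow-up analysis
print(df.head())   # preview the first five rows
driver.quit()      # close the headless Chrome session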
Here is a sample of the first five rows of the 3,000 records I have gathered:
Following this, I would like to wrangle, analyze, and visualize the collected data, which I will also post on this blog.
Source
- Sweigart, Al. 2019. Automate the Boring Stuff with Python. No Starch Press, Incorporated.
- Beverloo, Peter. “List of Chromium Command Line Switches.” Peter Beverloo, 28 Aug 2022, https://peter.sh/experiments/chromium-command-line-switches/.