Table of Contents
- 1 Our objective
- 2 What you’ll need
- 3 Project setup
- 4 Writing Selenium scraping logic
- 5 Wrapping up
- 6 Advanced web scraping with Python: Selenium
- 7 Conclusion
In this article, we’re going to talk about how to perform web scraping with Python using Selenium.
Web scraping, also called web data extraction, is the technique of harvesting data from a web page by leveraging the patterns in the page’s underlying code. It can be used to collect unstructured information from websites and store it in a structured format for processing.
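To make "leveraging the patterns in the page’s underlying code" concrete, here is a minimal sketch that uses only Python’s standard library, not Selenium. It assumes a made-up HTML snippet (SAMPLE_HTML) in which every heading sits inside an h3 tag, and the names HeadingCollector and extract_headings are hypothetical helpers for this illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h3> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []    # structured result: one string per heading
        self._in_h3 = False   # True while the parser is inside an <h3>
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.headings.append("".join(self._buffer).strip())
            self._in_h3 = False


def extract_headings(html):
    """Return the list of <h3> texts found in the given HTML string."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Hypothetical sample markup, made up for this illustration.
SAMPLE_HTML = """
<div class="post"><h3><span>First post</span></h3><p>Body text</p></div>
<div class="post"><h3><span>Second post</span></h3><p>More text</p></div>
"""

if __name__ == "__main__":
    print(extract_headings(SAMPLE_HTML))
```

The repeated h3 pattern is what turns loosely structured markup into a clean list of strings. A static parser like this only sees the HTML it is given, which is one reason a browser-driving tool is handy for pages that build their content with JavaScript.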
There are several tools you can use to make the process of web data extraction easy and efficient. For example, Selenium is a portable framework that allows you to automate the functionalities of web browsers using a wide range of programming languages.
Although it’s primarily used for testing web applications automatically, it can also be used for extracting online data.
Our objective
In this web scraping tutorial, we want to use Selenium to navigate to Reddit’s homepage, use the search box to perform a search for a term, and scrape the headings of the results.
What you’ll need
- Web browser
- Python development environment
Ready? Let’s get going…
Project setup
In this web scraping project, we’ll need to install the Python bindings for Selenium and the WebDriver for the browser we want to automate.
Let’s use pip (package installer for Python) to install Selenium in our development environment:
pip install selenium
Selenium requires a driver to imitate the actions of a real user as closely as possible. Since every browser comes with its own unique ways of setting up browser sessions, you’ll need to set up a browser-specific driver for interfacing with Selenium.
So, for your preferred browser, you’ll need to download its supported driver and place it in a folder located on your system’s path.
For this Selenium tutorial, we’ll use the Chrome driver.
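Before going further, it can help to confirm that the driver executable is actually reachable. The following is a small sketch, assuming the executable is named chromedriver; find_driver is a hypothetical helper built on the standard library’s shutil.which, which searches the folders on the system path.

```python
import shutil


def find_driver(executable="chromedriver"):
    """Return the full path of the driver if it is on the system PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_driver()
    if location is None:
        print("chromedriver was not found on PATH")
    else:
        print("chromedriver found at", location)
```

If the check prints that the driver was not found, either move the executable into a folder on the path or pass its full location explicitly when creating the WebDriver.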
Writing Selenium scraping logic
Let’s now write the logic for scraping web data with Python and Selenium. These are the steps we’ll follow.
1. Importing required modules
Let’s import the modules we’ll use in this project. Importing module for launching or initializing a browser:
from selenium import webdriver
Importing module for emulating keyboard actions:
from selenium.webdriver.common.keys import Keys
Importing module for searching for items using the specified parameters:
from selenium.webdriver.common.by import By
Importing module for waiting for a web page to load:
from selenium.webdriver.support.ui import WebDriverWait
Importing module that issues instructions to wait for the expected conditions to be present before the rest of the code is executed:
from selenium.webdriver.support import expected_conditions as EC
2. Initializing the WebDriver
Selenium provides the WebDriver API, which defines the interface for imitating a real user’s actions on a web browser. As mentioned earlier, every browser has its own implementation of the WebDriver, called a driver.
Here is how to create an instance of the Chrome WebDriver, which will let us use all its useful features:
PATH = r"C:\Program Files (x86)\chromedriver.exe" #raw string keeps the backslashes literal
driver = webdriver.Chrome(PATH)
Note that we specified the path where the Chrome WebDriver is installed on our Windows machine.
The code above launches Chrome in headful mode, that is, like a normal browser with a visible window. A message at the top of the browser states that automated software is controlling it.
We’ll illustrate how to launch a headless browser later in this article.
3. Navigating to the web page
Next, let’s use the driver.get method to navigate to the web page we want to scrape.
Here is the code:
driver.get("https://www.reddit.com/")
4. Locating the search box
The WebDriver provides a wide range of find_element(s)_by_* methods to locate a single element or multiple elements on a web page. You can use tag names, CSS selectors, XPath, IDs, class names, and others to select elements.
If we examine the Reddit homepage using the inspector tool in the Chrome web browser, we notice that the search box has a name attribute of q.
So, we can use the find_element_by_name method to locate the target element.
Here is the code:
search = driver.find_element_by_name("q")
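A note for readers on newer releases: recent Selenium 4 versions no longer ship the find_element_by_* helpers and expect find_element with a By strategy instead. The sketch below shows the equivalent lookup under that assumption; locate_search_box is a hypothetical wrapper, and the fallback By class exists only so the snippet can be dry-run without Selenium installed.

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:
    # Fallback so the sketch can be dry-run without Selenium installed.
    # The string matches the value of Selenium's By.NAME.
    class By:
        NAME = "name"


def locate_search_box(driver):
    """Locate the search box by its name attribute using the By-based API."""
    return driver.find_element(By.NAME, "q")
```

With a real driver, `search = locate_search_box(driver)` returns the same element as the call above. The other helpers map the same way, for example By.CSS_SELECTOR and By.TAG_NAME for the lookups used later in this tutorial.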
5. Entering the search term
Let’s use the send_keys method to specify the term we want to search for in the input field. Then, we’ll use Keys.RETURN to submit the term.
This is similar to using the keyboard to perform a search.
Here is the code:
search.send_keys("scraping")
search.send_keys(Keys.RETURN)
6. Locating the search results
Most modern websites use AJAX techniques to load their content. So, when a browser loads the page, not all elements may be present immediately. When elements load at different times, locating them for scraping becomes difficult.
Fortunately, Selenium WebDriver provides waits to solve this issue. With waits, you can add a bit of slack between actions, ensuring an element is present in the DOM before you try to locate it.
For this tutorial, we’ll use an explicit wait, which makes the WebDriver wait until the element we want to locate is present on the page before executing the rest of the code.
We’ll accomplish this using a combination of the WebDriverWait class and the expected_conditions module (imported as EC).
In this case, we’ll instruct Selenium to wait up to 20 seconds for an element with the rpBJOHq2PR60pnwJlUyP0 class to be present on the page. If that element is not located within that duration, a TimeoutException will be thrown.
Note that if we examine the search results, we notice that all the posts are enclosed in an element with that class.
Here is the code:
search_results = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
)
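To show the idea behind an explicit wait, here is a simplified stand-in written in plain Python. It is not Selenium’s implementation, only a sketch of the polling pattern: wait_until is a hypothetical helper that keeps calling a condition until it returns something truthy or the timeout runs out.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

In the tutorial’s code, the expected condition plays the role of `condition`, and the element it finds is what the wait hands back. Polling like this is usually preferable to a fixed sleep, because the code continues as soon as the element appears instead of always waiting the full duration.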
7. Scraping the posts’ headings
Next, let’s scrape the headings of the posts on the search results page.
Note that each post heading is wrapped in an h3 tag with the _eYtD2XCVieq6emjKBH3m class. Further, the text of each heading is enclosed in a span tag.
So, let’s start by selecting all the posts’ headings and storing them in a list:
posts = search_results.find_elements_by_css_selector("h3._eYtD2XCVieq6emjKBH3m")
Then, let’s go over each heading and output its content:
for post in posts:
    header = post.find_element_by_tag_name("span")
    print(header.text)
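Printing is fine for a demo, but scraped data is usually stored in a structured format. As one possible approach, the sketch below writes headings to a CSV file with the standard library; save_headings and the column names are hypothetical choices for this illustration, not part of the tutorial’s script.

```python
import csv


def save_headings(headings, path):
    """Write one heading per row to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["position", "heading"])
        for position, heading in enumerate(headings, start=1):
            writer.writerow([position, heading])
```

In the loop above, you could collect each `header.text` into a list and then call `save_headings(that_list, "headings.csv")`. The csv module takes care of quoting, so headings that contain commas or quotation marks are still written correctly.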
8. Quitting the browser
Finally, let’s quit the Chrome browser instance:
driver.quit()
Wrapping up
Here is the entire code for using Python and Selenium to scrape the content of the Reddit site and output the results:
#import modules
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#initialize webdriver
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
#navigate to web page
driver.get("https://www.reddit.com/")
#locate search box
search = driver.find_element_by_name("q")
#enter search term
search.send_keys("scraping")
search.send_keys(Keys.RETURN)
try:
    #locate search results
    search_results = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
    )
    #scrape posts' headings
    posts = search_results.find_elements_by_css_selector("h3._eYtD2XCVieq6emjKBH3m")
    for post in posts:
        header = post.find_element_by_tag_name("span")
        print(header.text)
finally:
    #quit browser
    driver.quit()
If we run the code above, here is the output we get (for brevity, we’ve truncated the results):
Be a miserable old homophobe? Enjoy scraping out your can.
Entitled kid fights with his tow truck driver for scraping his already destroyed front end of his car. Mom is no help.
[Jenkins] This is what the NFL gets for not scraping Dan Snyder off its shoe by now. "The good bits": Description of the soft porn videos of cheerleaders: outtakes of nipples, crotches inadvertently exposed and intentionally kept, for the enjoyment of the juvy little pervs in Washington's upper mgmt
and scrape that pickle off
Scraping power armor should yield more junk than it does.
Not sure what this is but oh man do I want a crack at scraping that bad boy off. Context in comments
Advanced web scraping with Python: Selenium
a. Adding proxies
Web scraping can sometimes be difficult because of the strict policies instituted by websites. With a proxy server, you can mask your real IP address and bypass access restrictions, enabling you to harvest online data quickly and efficiently.
Here is how you can add proxy settings in Selenium:
from selenium import webdriver
#other imports here
PROXY = "127.63.13.19:3184" #HOST:PORT or IP:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome = webdriver.Chrome(options=chrome_options)
chrome.get("https://www.reddit.com/")
#more code here
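If you switch between several proxies, it may be convenient to build the flag in one place. The following sketch assumes the same --proxy-server flag shown above; proxy_argument and chrome_with_proxy are hypothetical helper names, and Selenium is imported lazily so the first helper can be checked on its own.

```python
def proxy_argument(host, port):
    """Build the Chrome command-line flag for a proxy at host:port."""
    if not 0 < int(port) < 65536:
        raise ValueError("port must be between 1 and 65535")
    return "--proxy-server=%s:%d" % (host, int(port))


def chrome_with_proxy(host, port):
    """Start Chrome through the given proxy (requires Selenium and a driver)."""
    # Imported lazily so proxy_argument stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    return webdriver.Chrome(options=options)
```

For example, `chrome_with_proxy("127.63.13.19", 3184)` would start a session equivalent to the snippet above. Validating the port up front turns a typo into an immediate error rather than a browser that silently fails to connect.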
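b. Scrolling the page
You can scroll a page by running JavaScript in the browser with the execute_script method.
Here is an example:
scroll_page_down = "window.scrollTo(0, document.body.scrollHeight);"
driver.execute_script(scroll_page_down)
The script uses document.body.scrollHeight to get the entire height of the page body and scrolls to that position, which is the bottom of the page.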
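On pages that keep loading content as you scroll, a single scroll may not reach the real bottom. One common pattern, sketched below under that assumption, is to scroll repeatedly until the page height stops growing. scroll_to_bottom is a hypothetical helper; it relies only on the execute_script method used above.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the bottom
        last_height = new_height
    return rounds
```

The max_rounds cap matters on feeds that never end, where the height would otherwise keep growing indefinitely. The pause value is a guess and may need tuning for slower pages.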
c. Using headless browser
Selenium allows you to run a browser in headless mode, that is, without displaying the graphical user interface. This is useful in production environments, such as servers that have no display, and it avoids the overhead of rendering a visible window.
For example, here is how to run Chrome in a headless mode:
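from selenium import webdriver
from selenium.webdriver.chrome.options import Options
#other imports here
CHROMEDRIVER_PATH = r"C:\Program Files (x86)\chromedriver.exe" #same driver location as earlier
options = Options()
options.add_argument("--headless") #run Chrome without a visible window
driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get("https://www.reddit.com/")
#more code here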
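A headless browser has no real window, so pages can render at a small default size and lay out differently from what you saw in headful mode. A sketch of one way to handle that is to pass an explicit window size along with the headless flag. The helper names and the 1920x1080 default below are assumptions for illustration; --headless and --window-size are Chrome command-line flags.

```python
def headless_arguments(width=1920, height=1080):
    """Return Chrome flags for a headless session with a fixed window size."""
    return ["--headless", "--window-size=%d,%d" % (width, height)]


def headless_chrome():
    """Start headless Chrome (requires Selenium and a driver on the path)."""
    # Imported lazily so headless_arguments stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Keeping the flags in one function makes it easy to toggle headless mode while debugging: run headful until the selectors work, then switch to the headless flags for unattended runs.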
Conclusion
This article has just scratched the surface of what is possible when using Selenium for web scraping in Python. If you intend to delve deeper into the subject, you can check the Selenium with Python documentation.
Note that this article is provided for demonstration purposes only.