In this article, we’ll look at how to perform web scraping in Python using Selenium.

Web scraping, also called web data extraction, refers to the technique of harvesting data from a web page through leveraging the patterns in the page’s underlying code. It can be used to collect unstructured information from websites for processing and storage in a structured format.
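
To make the idea of “leveraging the patterns in the page’s underlying code” concrete, here is a minimal sketch that uses only Python’s standard html.parser module rather than Selenium. The sample markup and the HeadingCollector class are made up for illustration; the sketch simply assumes that every item of interest sits inside an h3 tag.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._inside_h3 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._inside_h3 = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears between <h3> and </h3>
        if self._inside_h3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._inside_h3:
            self.headings.append("".join(self._buffer).strip())
            self._inside_h3 = False


# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<div class="post"><h3><span>First post</span></h3><p>Body text</p></div>
<div class="post"><h3><span>Second post</span></h3><p>More text</p></div>
"""

collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
```

The repeated h3 pattern is what turns unstructured markup into a structured list of headings.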

There are several tools you can use to make the process of web data extraction easy and efficient. For example, Selenium is a portable framework that allows you to automate the functionalities of web browsers using a wide range of programming languages.

Although it’s primarily used for automated testing of web applications, it can also be used to extract online data.

Our objective

In this web scraping tutorial, we want to use Selenium to navigate to Reddit’s homepage, use the search box to perform a search for a term, and scrape the headings of the results.

Reddit utilizes JavaScript for dynamically rendering content, so it’s a good way of demonstrating how to perform web scraping for advanced websites.

What you’ll need

  • Web browser
  • Python development environment
  • Selenium

Ready? Let’s get going…

Project setup

In this web scraping project, we’ll need to install Python bindings for Selenium and the associated WebDriver for the browser we want to automate tasks on.

Let’s use pip (package installer for Python) to install Selenium in our development environment:

pip install selenium

Selenium requires a driver to imitate the actions of a real user as closely as possible. Since every browser comes with its own unique ways of setting up browser sessions, you’ll need to set up a browser-specific driver for interfacing with Selenium.

So, for your preferred browser, you’ll need to download its supported driver and place it in a folder located on your system’s path.

For this Selenium tutorial, we’ll use the Chrome driver.
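
Before going further, it can help to confirm that the driver really is on the system’s path. The sketch below uses the standard library’s shutil.which for that; the find_driver helper is a name made up for this example, and it assumes the executable is called chromedriver.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable, or None if it is not on the path."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print("chromedriver was not found on the system path")
else:
    print(f"chromedriver found at {driver_path}")
```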

Writing Selenium scraping logic

Let’s now write the logic for scraping web data with Python and Selenium. These are the steps we’ll follow.

1. Importing required modules

Let’s import the modules we’ll use in this project. Importing module for launching or initializing a browser:

from selenium import webdriver

Importing module for emulating keyboard actions:

from selenium.webdriver.common.keys import Keys

Importing module for searching for items using the specified parameters:

from selenium.webdriver.common.by import By

Importing module for waiting for a web page to load:

from selenium.webdriver.support.ui import WebDriverWait

Importing module that issues instructions to wait for the expected conditions to be present before the rest of the code is executed:

from selenium.webdriver.support import expected_conditions as EC

2. Initializing the WebDriver

Selenium provides the WebDriver API, which defines the interface for imitating a real user’s actions on a web browser. As mentioned earlier, every browser has its own implementation of the WebDriver, called a driver.

Here is how to create an instance of the Chrome WebDriver, which will let us use all its useful features:

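from selenium.webdriver.chrome.service import Service
PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string keeps the backslashes literal
driver = webdriver.Chrome(service=Service(PATH))

Note that we specified the path where the Chrome WebDriver is installed on our Windows machine. In Selenium 4, this path is passed through a Service object rather than directly to webdriver.Chrome.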

The above code launches Chrome in headful mode, that is, with a visible window like a normal browser. A message will appear at the top of the browser stating that automated software is controlling it.

We’ll illustrate how to launch a headless browser later in this article.

3. Navigating to the web page

Next, let’s use the driver.get method to navigate to the web page whose data we want to scrape.

Here is the code:

driver.get("https://www.reddit.com/")

4. Locating the search box

The WebDriver provides the find_element and find_elements methods to locate a single element or multiple elements on a web page. Each takes a locator strategy from the By class, such as tag name, CSS selector, XPath, ID, or class name, together with the value to look for. (Older tutorials use the find_element_by_* helpers, which were removed in Selenium 4.)

If we examine the Reddit homepage using the inspector tool on the Chrome web browser, we notice that the search box has a name attribute of q.

So, we can use the find_element method with By.NAME to locate the target element.

Here is the code:

search = driver.find_element(By.NAME, "q")
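
A page’s markup can change, so a locator that works today may stop matching later. As a minimal sketch, the hypothetical find_first helper below tries several locators in order and returns the first match. It assumes a Selenium 4 driver, whose find_elements(by, value) method returns an empty list when nothing matches. The second locator is an invented fallback, not something taken from Reddit’s markup.

```python
def find_first(driver, locators):
    """Return the first element matched by any (by, value) pair, or None."""
    for by, value in locators:
        # find_elements returns an empty list when nothing matches,
        # so no exception handling is needed here.
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    return None


# "name" and "css selector" are the string values of By.NAME and By.CSS_SELECTOR.
# The second entry is a hypothetical fallback used only for illustration.
SEARCH_BOX_LOCATORS = [
    ("name", "q"),
    ("css selector", "input[type='search']"),
]
```

With a live session, this would be called as search = find_first(driver, SEARCH_BOX_LOCATORS), followed by a check that the result is not None.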

5. Entering the search term

Let’s use the send_keys method to type the term we want to search for into the input field. Then, we’ll send Keys.RETURN to submit it.

This is similar to using the keyboard for performing a search.

Here is the code:

search.send_keys("scraping")
search.send_keys(Keys.RETURN)

6. Locating the search results

Most modern websites use AJAX techniques to load their content. So, when a browser loads the page, not all of the elements may be present immediately. Because elements load at different times, locating them for scraping can be difficult.

Fortunately, Selenium WebDriver provides the waits feature to allow us to solve this issue. With waits, you can add a bit of slack between actions, ensuring an element is present in the DOM before you can locate it.

For this tutorial, we’ll use an explicit wait, which makes the WebDriver wait until the element we want to locate is present on the page before proceeding with the rest of the code.

We’ll accomplish this using a combination of the WebDriverWait class and the expected_conditions module.

In this case, we’ll instruct Selenium to wait up to 20 seconds for an element with the rpBJOHq2PR60pnwJlUyP0 class to be present on the page. If no such element is located within that time, a TimeoutException is thrown.

Note that if we examine the search results, we notice that all the posts are enclosed in an element with the rpBJOHq2PR60pnwJlUyP0 class.

Here is the code:

search_results = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
)
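
To see what an explicit wait does conceptually, here is a small polling loop in plain Python. It is a sketch of the idea only, not Selenium’s implementation: the wait_until function and the results_loaded stand-in are invented for this example, and it raises Python’s built-in TimeoutError rather than Selenium’s TimeoutException.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for content that only becomes available on the third check
attempts = {"count": 0}


def results_loaded():
    attempts["count"] += 1
    return "results" if attempts["count"] >= 3 else None


print(wait_until(results_loaded, timeout=5, poll_interval=0.01))
```

The key point is that the caller resumes as soon as the condition holds, instead of always sleeping for the full timeout.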

7. Scraping the posts’ headings

Next, let’s scrape the headings of the posts on the search results page.

Note that each post heading sits in an h3 tag with the _eYtD2XCVieq6emjKBH3m class, and the heading text itself is enclosed in a span tag inside it.

So, let’s start by selecting all the posts’ headings and storing them in a list:

posts = search_results.find_elements(By.CSS_SELECTOR, "h3._eYtD2XCVieq6emjKBH3m")

Then, let’s go over each heading and output its content:

for post in posts:
    header = post.find_element(By.TAG_NAME, "span")
    print(header.text)

8. Quitting the browser

Finally, let’s quit the Chrome browser instance:

driver.quit()
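
If an error occurs before driver.quit() is reached, the browser window is left open. One way to guard against that, shown here as a sketch, is a small context manager that always calls quit(). The browser_session name is made up for this example, and the helper works with any object that has a quit() method.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(driver):
    """Yield the driver and always call quit() afterwards, even if an error occurs."""
    try:
        yield driver
    finally:
        driver.quit()


# Intended usage with a real Selenium driver (not executed here):
# with browser_session(webdriver.Chrome()) as driver:
#     driver.get("https://www.reddit.com/")
```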

Wrapping up

Here is the entire code for using Python and Selenium to scrape the content of the Reddit site and output the results:

# import modules
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# initialize webdriver (raw string keeps the Windows backslashes literal)
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH))
# navigate to web page
driver.get("https://www.reddit.com/")
# locate search box
search = driver.find_element(By.NAME, "q")
# enter search term
search.send_keys("scraping")
search.send_keys(Keys.RETURN)

try:
    # locate search results
    search_results = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
    )
    # scrape posts' headings
    posts = search_results.find_elements(By.CSS_SELECTOR, "h3._eYtD2XCVieq6emjKBH3m")
    for post in posts:
        header = post.find_element(By.TAG_NAME, "span")
        print(header.text)
finally:
    # quit browser
    driver.quit()

If we run the code above, here is the output we get (for brevity, we’ve truncated the results):

Be a miserable old homophobe? Enjoy scraping out your can. 
Entitled kid fights with his tow truck driver for scraping his already destroyed front end of his car. Mom is no help. 
[Jenkins] This is what the NFL gets for not scraping Dan Snyder off its shoe by now. "The good bits": Description of the soft porn videos of cheerleaders: outtakes of nipples, crotches inadvertently exposed and intentionally kept, for the enjoyment of the juvy little pervs in Washington's upper mgmt 
and scrape that pickle off 
Scraping power armor should yield more junk than it does. 
Not sure what this is but oh man do I want a crack at scraping that bad boy off. Context in comments

It worked!
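
Printing is fine for a demo, but scraped data is usually stored in a structured format. As a minimal sketch, the hypothetical headings_to_csv helper below writes headings to CSV using Python’s standard csv module. The sample list reuses two of the headings printed above, and the column names are an arbitrary choice for this example.

```python
import csv
import io


def headings_to_csv(headings, out):
    """Write one row per heading, numbered from 1, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["position", "heading"])
    for position, heading in enumerate(headings, start=1):
        writer.writerow([position, heading.strip()])


# Two of the headings printed above, used as sample input
sample = [
    "and scrape that pickle off",
    "Scraping power armor should yield more junk than it does. ",
]
buffer = io.StringIO()
headings_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to a real file instead of an in-memory buffer, pass a file opened with newline="" as the csv module’s documentation recommends.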

Advanced web scraping with Python: Selenium

Selenium comes with several options for performing advanced web scraping with ease. For example, let’s see how you can set it up to use proxies, execute JavaScript, and use a headless browser version.

a. Adding proxies

Web scraping can sometimes be difficult because of the strict policies instituted by websites. With a proxy server, you can mask your real IP address and bypass access restrictions, enabling you to harvest online data quickly and efficiently.

You can use a proxy service, such as Zenscrape’s residential or datacenter proxies, to make the most of your data extraction process.

Here is how you can add proxy settings in Selenium:

from selenium import webdriver 
#other imports here 
PROXY = "127.63.13.19:3184" #HOST:PORT or IP:PORT 
chrome_options = webdriver.ChromeOptions() 
chrome_options.add_argument('--proxy-server=%s' % PROXY) 
chrome = webdriver.Chrome(options=chrome_options) 
chrome.get("https://www.reddit.com/") 
#more code here
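
When several proxies are available, the flag can be built by a small helper and the proxies used in turn. The sketch below is one possible approach under stated assumptions: proxy_argument and PROXY_POOL are names invented for this example, and the second address is a placeholder rather than a working proxy.

```python
from itertools import cycle


def proxy_argument(host, port):
    """Build the --proxy-server flag for Chrome in HOST:PORT form."""
    if not host:
        raise ValueError("host must not be empty")
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"--proxy-server={host}:{port}"


# Hypothetical pool of proxies; replace with addresses from your provider
PROXY_POOL = [("127.63.13.19", 3184), ("127.63.13.20", 3184)]
rotation = cycle(PROXY_POOL)

first_flag = proxy_argument(*next(rotation))
second_flag = proxy_argument(*next(rotation))
print(first_flag)
print(second_flag)
```

Each flag can then be passed to chrome_options.add_argument when a new browser session is created.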

b. Executing JavaScript

Sometimes you may need to execute JavaScript on the target web page. For example, if the entire page is not loaded from the start, you may need to scroll down to grab HTML from the rest of the page.

You can do this with the execute_script method, which runs the JavaScript code passed to it as a string.

Here is an example:

scroll_page_down = "window.scrollTo(0, document.body.scrollHeight);" 
driver.execute_script(scroll_page_down)

Note that scrollTo(x_coordinates, y_coordinates) is a JavaScript method that lets you scroll the page to the stipulated coordinates. In this case, we used document.body.scrollHeight to get the entire height of the body element.
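
On pages that keep loading content as you scroll, a single scroll is often not enough. As a sketch, the hypothetical scroll_to_bottom helper below repeats the scroll until the page height stops growing. It assumes a driver whose execute_script method returns the value of a JavaScript return statement, as Selenium’s does; the pause and max_rounds defaults are arbitrary.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached
            return round_number
        last_height = new_height
    return max_rounds
```

The max_rounds limit keeps the loop from running forever on pages that never stop loading new content.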

c. Using a headless browser

Selenium allows you to run a browser in headless mode, that is, without displaying the graphical user interface. This is useful in production environments, such as servers that have no display.

For example, here is how to run Chrome in a headless mode:

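from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
#other imports here
CHROMEDRIVER_PATH = r"C:\Program Files (x86)\chromedriver.exe"
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=options)
driver.get("https://www.reddit.com/")
#more code here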

Conclusion

Web data extraction using Selenium can be a handy skill in your Python toolbox, particularly when you want to scrape information from dynamic websites and JavaScript-heavy pages.

This article has just scratched the surface of what is possible when using Selenium in Python web scraping. If you intend to delve deeper into the subject, you can check the Selenium with Python documentation.

Note that this article is provided for demonstration purposes only.

Happy scraping!