In this article, we’re going to look at how to build a Python web scraper using Selenium.
Web scraping, also called web data extraction, refers to the technique of harvesting data from a web page by leveraging the patterns in the page’s underlying code. It can be used to collect unstructured information from websites for processing and storage in a structured format.
There are several tools you can use to make the process of web data extraction easy and efficient. For example, Selenium is a portable framework that allows you to automate the functionalities of web browsers using a wide range of programming languages.
While it’s primarily used for automating tests of web applications, it can also be used for extracting online data.
Our objective
In this web scraping tutorial, we want to use Selenium to navigate to Reddit’s homepage, use the search box to perform a search for a term and scrape the headings of the results.
Reddit uses JavaScript to render content dynamically, so it’s a good example for demonstrating how to scrape modern, JavaScript-heavy websites.
What you’ll need
- Web browser
- Python development environment
- Selenium
Ready? Let’s get going…
Project setup
In this web scraping project, we’ll need to install Python bindings for Selenium and the associated WebDriver for the browser we want to automate tasks on.
Let’s use pip (package installer for Python) to install Selenium in our development environment:
pip install selenium
Selenium requires a driver to imitate the actions of a real user as closely as possible. Since every browser comes with its own unique ways of setting up browser sessions, you’ll need to set up a browser-specific driver for interfacing with Selenium.
So, for your preferred browser, you’ll need to download its supported driver and place it in a folder located on your system’s path.
For this Selenium tutorial, we’ll use the Chrome driver.
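Once the driver is in a folder on your system’s path, a quick sanity check from the terminal confirms it’s discoverable (this assumes the executable is named chromedriver):
chromedriver --version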
Writing Selenium scraping logic
Let’s now write the logic for scraping web data with Python and Selenium. These are the steps we’ll follow.
1. Importing required modules
Let’s import the modules we’ll use in this project. We start with the module for launching or initializing a browser:
from selenium import webdriver
Next, the module for emulating keyboard actions:
from selenium.webdriver.common.keys import Keys
Now the module for searching for items using the specified parameters:
from selenium.webdriver.common.by import By
Then the module for waiting for a web page to load:
from selenium.webdriver.support.ui import WebDriverWait
Finally, the module that issues instructions to wait for the expected conditions to be present before the rest of the code is executed:
from selenium.webdriver.support import expected_conditions as EC
2. Initializing the WebDriver
Selenium provides the WebDriver API, which defines the interface for imitating a real user’s actions on a web browser. As mentioned earlier, every browser has its own unique implementation of the WebDriver, called a driver.
Here is how to create an instance of the Chrome WebDriver, which will let us use all its useful features:
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
Note that we specified the path where the Chrome WebDriver is installed on our Windows machine, using a raw string so the backslashes aren’t treated as escape sequences.
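If you’re running Selenium 4 or later, note that passing the driver path directly to webdriver.Chrome is deprecated; the path goes through a Service object instead:
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(PATH))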
The above code will launch Chrome in headful mode; that is, with a visible window, just like a normal browser. A message will appear at the top of the browser window stating that automated software is controlling its behavior.
We’ll illustrate how to launch a headless browser later in this article.
3. Navigating to the web page
Next, let’s use the driver.get method to navigate to the web page whose data we want to scrape.
Here is the code:
driver.get("https://www.reddit.com/")
4. Locating the search box
The WebDriver provides a wide range of find_element(s)_by_* methods to locate a single element or multiple elements on a web page. You can use tag names, CSS selectors, XPath, IDs, class names, and others to select elements.
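For illustration, here is what a few of these lookup strategies might look like (the selectors below are hypothetical placeholders, not taken from Reddit’s markup):
element = driver.find_element_by_id("header")
element = driver.find_element_by_css_selector("div.result")
elements = driver.find_elements_by_tag_name("a")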
If we examine the Reddit homepage using the inspector tool in the Chrome web browser, we notice that the search box has a name attribute of q. So, we can use the find_element_by_name method to locate the target element.
Here is the code:
search = driver.find_element_by_name("q")
5. Entering the search term
Let’s use the send_keys method to specify the term we want to search for in the input field. Then, we’ll use Keys.RETURN to submit the term.
This is similar to using the keyboard for performing a search.
Here is the code:
search.send_keys("scraping")
search.send_keys(Keys.RETURN)
6. Locating the search results
Most modern websites use AJAX techniques to load their content. Hence, when a browser loads the page, not all of the elements may be present immediately or visible to the user. Because elements load at different intervals, locating them for scraping purposes can be difficult.
Fortunately, Selenium WebDriver provides a wait feature that lets us solve this issue. With waits, you can add a bit of slack between actions, ensuring an element is present in the DOM before you try to locate it.
For this tutorial, we’ll use an explicit wait that makes the WebDriver wait for the element we want to locate to be present on the page before proceeding with the rest of the code execution.
We’ll accomplish this using a combination of the WebDriverWait class and the expected_conditions (EC) helpers.
Note that if we examine the search results, we notice that all the posts are enclosed in an element with the rpBJOHq2PR60pnwJlUyP0 class.
In this case, we’ll instruct Selenium to wait 20 seconds for an element with the rpBJOHq2PR60pnwJlUyP0 class to be present on the page. If that element is not located within that duration, a TimeoutException will be thrown.
Here is the code:
search_results = WebDriverWait(driver, 20).until(
EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
)
7. Scraping the posts’ headings
Next, let’s scrape the headings of the posts on the search results page.
Note that each post heading is wrapped in an h3 tag with a _eYtD2XCVieq6emjKBH3m class. Further, each heading’s text is enclosed in a span tag.
So, let’s start by selecting all the posts’ headings and storing them in a list:
posts = search_results.find_elements_by_css_selector("h3._eYtD2XCVieq6emjKBH3m")
Then, let’s go over each heading and output their content:
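for post in posts:
    # print the text content of each heading element
    print(post.text)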
8. Quitting the browser
Finally, let’s quit the Chrome browser instance:
driver.quit()
Wrapping up
Here is the entire code for using Python and Selenium to scrape the content of the Reddit site and output the results:
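from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

driver.get("https://www.reddit.com/")

search = driver.find_element_by_name("q")
search.send_keys("scraping")
search.send_keys(Keys.RETURN)

search_results = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "rpBJOHq2PR60pnwJlUyP0"))
)

posts = search_results.find_elements_by_css_selector("h3._eYtD2XCVieq6emjKBH3m")
for post in posts:
    print(post.text)

driver.quit()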
If we run the code above, here is the output we get (for brevity, we’ve truncated the results):
Be a miserable old homophobe? Enjoy scraping out your can.
Entitled kid fights with his tow truck driver for scraping his already destroyed front end of his car. Mom is no help.
[Jenkins] This is what the NFL gets for not scraping Dan Snyder off its shoe by now. "The good bits": Description of the soft porn videos of cheerleaders: outtakes of nipples, crotches inadvertently exposed and intentionally kept, for the enjoyment of the juvy little pervs in Washington's upper mgmt
and scrape that pickle off
Scraping power armor should yield more junk than it does.
Not sure what this is but oh man do I want a crack at scraping that bad boy off. Context in comments
It worked!
Advanced web scraping with Python: Selenium
Selenium comes with several options for performing advanced web scraping with ease. For example, let’s see how you can set it up to use proxies, execute JavaScript, and use a headless browser version.
a. Adding proxies
Web scraping can sometimes be difficult because of the strict policies instituted by websites. With a proxy server, you can mask your real IP address and bypass access restrictions, enabling you to harvest online data quickly and efficiently.
You can use a powerful proxy service, such as Zenscrape’s residential proxies or datacenter proxies, to make the most of your data extraction process.
Here is how you can add proxy settings in Selenium:
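A minimal sketch using Chrome options; the proxy address below is a placeholder you’d replace with one from your proxy provider:
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://203.0.113.10:8080")  # placeholder proxy address
driver = webdriver.Chrome(PATH, options=options)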
b. Executing JavaScript
Sometimes you may need to execute JavaScript on the target web page. For example, if the entire page is not loaded from the start, you may need to scroll down to grab HTML from the rest of the page.
You can do this by using the execute_script method, which allows you to pass any JavaScript code as its parameter.
Here is an example:
scroll_page_down = "window.scrollTo(0, document.body.scrollHeight);"
driver.execute_script(scroll_page_down)
Note that scrollTo(x_coordinates, y_coordinates) is a JavaScript method that lets you scroll the page to the stipulated coordinates. In this case, we used document.body.scrollHeight to get the entire height of the body element.
c. Using a headless browser
Selenium allows you to run a browser in headless mode, without displaying its graphical user interface. This is useful on servers and in production environments, where rendering a visible browser window is unnecessary or impossible.
For example, here is how to run Chrome in a headless mode:
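A minimal sketch, reusing the PATH variable from earlier:
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(PATH, options=options)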
Conclusion
In conclusion, web data extraction using Selenium can be a handy skill in your Python toolbox, particularly when you want to scrape information from dynamic websites and JavaScript-heavy pages.
This article has just scratched the surface of what is possible when using Selenium in Python web scraping. If you intend to delve deeper into the subject, you can check the Selenium with Python documentation here.
Please note that this article is provided for demonstration purposes only.
Frequently Asked Questions
Q: What is a Python web scraper?
A: A Python web scraper is a program or script that automates the process of extracting data from websites. It uses the Python programming language and various libraries, such as BeautifulSoup and Selenium, to parse HTML and interact with web pages.
Q: How does Selenium help in web scraping with Python?
A: Selenium is a powerful tool for web scraping with Python because it allows you to automate interactions with web pages, such as clicking buttons, filling out forms, and scrolling. This makes it easier to scrape dynamic websites that use JavaScript to load content.
Q: What are the benefits of using Selenium for web scraping?
A: Using Selenium for web scraping allows you to scrape websites that would be difficult or impossible to scrape using traditional methods. It also allows you to scrape websites that require authentication or have complex interactions.
Q: Can I use Selenium to scrape websites that require login credentials?
A: Yes, Selenium can be used to scrape websites that require login credentials. You can use Selenium to automate the login process, then scrape the data once you are logged in.
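For illustration, here is a minimal sketch of that flow; the URL and field names are hypothetical and will differ from site to site:
driver.get("https://example.com/login")  # hypothetical login page
driver.find_element_by_name("username").send_keys("my_username")  # hypothetical field name
password_field = driver.find_element_by_name("password")  # hypothetical field name
password_field.send_keys("my_password")
password_field.send_keys(Keys.RETURN)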