Christoph Leitner · Published: November 15, 2019 · 5 minute read

The requirements for a web scraper depend heavily on the scale of the scraping project. Synchronous scrapers are quick to build and easy to handle, but asynchronous web scraping is more efficient and less time-consuming. In this article, we are going to compare both methods.

Synchronous web scraping

In this article, we’ll take a look at the difference between synchronous and
asynchronous web scraping. What do these terms mean? In essence, when doing synchronous web scraping of multiple sites, we’ll process one site at a time, moving on to process the next site only after the previous one has finished processing.

Here is an example of synchronous web scraping in Python. In the file urls.txt, I’ve collected the URLs of a hundred of the most visited websites on the internet, one URL per line. The Python script goes to each website, finds all the a elements (links) on the page, and counts them. It then sorts the results and writes the numbers to a file.

The popular requests library is used for making the network requests. For extracting data from the page, we use BeautifulSoup, a widely used Python library for parsing websites and pulling information out of them.

from bs4 import BeautifulSoup
import requests
from requests.exceptions import RequestException

# Fetch the page at the given URL and return the number of <a> (link) elements,
# or None if the request fails or returns a non-200 status.
def get_link_count(url):
    try:
        print("Processing {}".format(url))
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            print("Request to {} returned {}".format(url, response.status_code))
        else:
            parsed = BeautifulSoup(response.content, features="html.parser")
            return len(parsed.find_all("a"))
    except RequestException as e:
        print("Request to {} raised exception: {}".format(url, e))

if __name__ == "__main__":
    results = []

    with open("urls.txt") as urls_file:
        for line in urls_file:
            result = get_link_count(line.strip())
            if result is not None:
                results.append(result)

    results.sort(reverse=True)
    with open("sync_results.txt", "w") as results_file:
        results_file.write("\n".join(str(n) for n in results))

I’ve set a relatively low timeout of five seconds for the network request so that we don’t spend too much time waiting for servers that never respond. Even so, on my quad-core MacBook Pro, the code above takes about three minutes to run for a hundred URLs. You may have noticed a big inefficiency (or two) with this approach!
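As a side note, requests also lets you split the timeout into separate connect and read phases by passing a tuple instead of a single number. A quick sketch (the URL is just a placeholder):

import requests

# timeout=5 covers both connecting and reading; passing a (connect, read)
# tuple lets you treat the two phases differently.
response = requests.get("https://www.python.org", timeout=(3.05, 5))
print(response.status_code)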

As mentioned earlier, for each site, the script waits for the processing of that site to finish before moving on to the next one. Now, what is likely to make up the majority of the time taken to process one site? The network request! Most of our time is spent waiting for the server to give us the content of a particular site, and during that time, the computer is not doing any useful work. The time spent waiting for the website could be used to send other network requests and to process the contents of the websites already received. This is where asynchronous web scraping comes in.
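A rough way to see this on your own machine is to time the network request and the parsing separately; in a quick sketch like the one below (the URL is just a placeholder), the request typically dominates by a wide margin:

import time

from bs4 import BeautifulSoup
import requests

url = "https://www.python.org"  # placeholder URL

# Time the network request.
start = time.perf_counter()
response = requests.get(url, timeout=5)
network_time = time.perf_counter() - start

# Time the parsing and link counting.
start = time.perf_counter()
link_count = len(BeautifulSoup(response.content, features="html.parser").find_all("a"))
parse_time = time.perf_counter() - start

print("network: {:.2f}s, parsing: {:.2f}s, links: {}".format(network_time, parse_time, link_count))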

Async to the rescue

If you’re just processing one site, you’ll not benefit much from asynchronous web scraping. However, as soon as you have multiple sites to process, you may want to consider the async approach. Generally, when we talk about asynchronous programming, we talk about doing things outside the linear main flow of the program. If you’re familiar with JavaScript, you’ve probably encountered asynchronous constructs. For example, we may set up an event handler and give it a callback, then move on in the code… and the callback is not run at the time we set up the event handler, but rather at some later time, e.g. as a consequence of the user initiating a specific event. Event handlers, network request callbacks, and functions to be run at a later time or at intervals are all examples of asynchronous programming.
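Python has a rough analogue of this deferred-callback pattern in the standard library; here is a purely illustrative sketch using threading.Timer:

import threading

def on_timer():
    # Runs later, outside the linear main flow of the program.
    print("callback fired")

# Schedule the callback for roughly two seconds from now, then carry on.
timer = threading.Timer(2.0, on_timer)
timer.start()

print("main flow continues immediately")
timer.join()  # only here so the script does not exit before the callback runs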

The JavaScript runtime in the browser is single-threaded and uses an event loop to implement asynchronicity. There are libraries and constructs available in Python for doing similar things. However, Python lets us address another inefficiency with the synchronous program above.
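For instance, Python’s built-in asyncio module gives you a single-threaded event loop much like the browser’s. A minimal sketch with simulated waits instead of real network calls:

import asyncio

async def fake_fetch(name, delay):
    # asyncio.sleep stands in for waiting on a network response.
    await asyncio.sleep(delay)
    return "{} finished after {}s".format(name, delay)

async def main():
    # The event loop interleaves both coroutines while they wait, on a single thread.
    results = await asyncio.gather(fake_fetch("site-a", 2), fake_fetch("site-b", 1))
    print(results)

asyncio.run(main())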

I mentioned running it on a quad-core MacBook… but at the moment, only one core is doing any work. If we go to the trouble of making things work asynchronously, we might as well let the program use multiple threads. Note that because of CPython’s global interpreter lock, threads mainly help with I/O-bound work such as waiting for network responses; to use more than one core at a time for heavy per-site processing, you would reach for a pool of worker processes instead. This particular program does not use a lot of CPU power, so that distinction is unlikely to matter much here. Python has a module, concurrent.futures, which lets us conveniently initiate many network requests at a time and process the websites as the responses come back from the servers.
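If your per-site processing really is CPU-heavy, concurrent.futures also offers ProcessPoolExecutor, which gives each worker its own interpreter and therefore its own core. A minimal sketch with a toy CPU-bound function standing in for the real processing:

from concurrent.futures import ProcessPoolExecutor

def count_vowels(text):
    # Stand-in for CPU-heavy per-site work.
    return sum(text.count(vowel) for vowel in "aeiou")

if __name__ == "__main__":
    documents = ["lorem ipsum dolor sit amet " * 100000, "the quick brown fox " * 100000]
    with ProcessPoolExecutor() as executor:
        # Each document is processed in a separate worker process.
        print(list(executor.map(count_vowels, documents)))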

The title image illustrates the difference between synchronous and asynchronous web scraping: in the synchronous scenario, we do one thing at a time, waiting for each network request to return before processing the site and moving on to the next one. In the asynchronous scenario, we do things in parallel, and we keep the computer busy with useful work while we’re waiting for network requests to return.

Here is the above example again, but now changed to utilize asynchronous
programming and multiple threads.

from concurrent.futures import as_completed, ThreadPoolExecutor

from bs4 import BeautifulSoup
import requests
from requests.exceptions import RequestException

MAX_THREADS = 20

# Fetch the page at the given URL and return the number of <a> (link) elements,
# or None if the request fails or returns a non-200 status.
def get_link_count(url):
    try:
        print("Processing {}".format(url))
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            print("Request to {} returned {}".format(url, response.status_code))
        else:
            parsed = BeautifulSoup(response.content, features="html.parser")
            return len(parsed.find_all("a"))
    except RequestException as e:
        print("Request to {} raised exception: {}".format(url, e))

if __name__ == "__main__":
    results = []

    with open("urls.txt") as urls_file:
        with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
            futures = (
                executor.submit(get_link_count, line.strip()) for line in urls_file
            )

            for future in as_completed(futures):
                result = future.result()
                if result is not None:
                    results.append(result)

    results.sort(reverse=True)
    with open("async_results.txt", "w") as results_file:
        results_file.write("\n".join(str(n) for n in results))

Notice that the get_link_count function is exactly the same! The only
difference is how we invoke it. We don’t just loop through the URLs, processing
each one sequentially. Rather, using the concurrent.futures module and its
ThreadPoolExecutor class, we schedule an asynchronous call to the processing
function for each URL. Each asynchronous function call is encapsulated in a
construct called a future. Naturally, at some point we need to collect the
results and write them to the file—we need to synchronize the results of the
asynchronous calls. This is done using the as_completed(futures) call. As per the Python documentation, this “returns an iterator over the Future instances … that yields futures as they complete”. A full description of how to use the concurrent.futures module is outside the scope of this article, but as you can see above, the module allows us to make our program asynchronous and concurrent with a minimum of effort.
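To make the “as they complete” part concrete, here is a tiny standalone sketch with toy tasks that just sleep: the futures are submitted in one order but yielded back in the order they finish.

import time
from concurrent.futures import as_completed, ThreadPoolExecutor

def slow_task(seconds):
    time.sleep(seconds)
    return seconds

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submitted in the order 3, 1, 2...
    futures = [executor.submit(slow_task, s) for s in (3, 1, 2)]
    # ...but yielded in completion order: 1, 2, 3.
    for future in as_completed(futures):
        print("task that slept for {}s is done".format(future.result()))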

In this example, the same web scraping task runs in 25 seconds on my computer, a large improvement compared to three minutes for the synchronous version! Even on a single-core processor, the asynchronous version of the program should perform better, as it doesn’t have to wait for each network request to finish before proceeding with the next one.

Also read: 6 tips for advanced Python web scraping

Conclusions

For your own purposes, you may want to experiment with the number of threads used. Depending on the processing done and memory used for each site, as well as the number of sites to scrape, you may find a point where increasing the number of threads makes performance worse. Also, the BeautifulSoup library can use different parsers with different characteristics; you may see improved performance with the lxml parser, for example. As mentioned, if you’re only scraping one site, asynchronous web scraping may not bring many benefits. Otherwise, I recommend giving the technique a try!
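Switching the parser is a one-word change in the BeautifulSoup call; a small sketch assuming the third-party lxml package is installed (the URL is just a placeholder):

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.python.org", timeout=5)

# "lxml" requires the lxml package (pip install lxml); BeautifulSoup raises
# FeatureNotFound if the requested parser is not available.
parsed = BeautifulSoup(response.content, features="lxml")
print(len(parsed.find_all("a")))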

If web scraping is something you need to do, you may want to take a look at the tools provided by Zenscrape. Zenscrape offers a visual scraping tool that requires no coding experience, along with a web scraping API that handles difficult aspects like rendering JavaScript on the scraped website. As more and more websites behave like applications rather than documents, web scraping gets harder. Zenscrape exists to solve these problems so you don’t have to!

Disclaimer:
All articles are for learning purposes only and demonstrate different techniques of developing web scrapers. We do not take responsibility for how our code snippets are used and cannot be held liable for any detrimental usage.