This tutorial is a step-by-step guide to scraping Amazon product information with Node.js and Puppeteer, including code snippets that you can run in your local environment. They are meant to help you build your own production-ready Amazon scraper.
Use cases
- Price monitoring: E-commerce is a very competitive industry, which makes a smart and dynamic pricing strategy indispensable. Monitoring Amazon prices enables you to adapt and optimize your own pricing automatically.
- More information: Amazon does provide a product API. However, product pages contain a lot more information than can be obtained via the API.
- Review information: Scraping reviews from Amazon enables you to analyze customer satisfaction with specific products.
Getting started
We have chosen Node.js and Puppeteer for this tutorial because Puppeteer lets us access the page content through a headless browser.
Using a headless browser has the big advantage that we can access content rendered by JavaScript frameworks such as Vue.js or React.
Amazon product pages can also be parsed without a headless browser. However, we decided to use Puppeteer because the snippets can then be reused for pages that do require a browser.
Installing Puppeteer
We are going to install Puppeteer from the command shell, using npm. Run the following command from inside the directory where you are going to place all the other code.
npm install puppeteer --save
The Code
Copy the following code into a file and save it as “amazonscraper.js”.
const puppeteer = require('puppeteer');

puppeteer.launch({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--window-size=1920,1080',
    // No inner quotes: the arguments are not parsed by a shell, so quotes would become part of the value
    '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36'
  ]
}).then(async browser => {
  try {
    const page = await browser.newPage();
    await page.goto('https://www.amazon.com/Apple-iPhone-XR-Fully-Unlocked/dp/B07P6Y7954');
    await page.waitForSelector('body');

    // This callback runs inside the browser; a selector that matches nothing makes the lookup throw
    const productInfo = await page.evaluate(() => {
      /* Get product title */
      const title = document.body.querySelector('#productTitle').innerText;

      /* Get review count (digits only) */
      const reviewCount = document.body.querySelector('#acrCustomerReviewText').innerText;
      const formattedReviewCount = reviewCount.replace(/[^0-9]/g, '').trim();

      /* Get and format rating: the class name carries the digits, e.g. "45" for 4.5 stars */
      const ratingClass = document.body.querySelector('.a-icon.a-icon-star').getAttribute('class');
      const ratingDigits = ratingClass.replace(/[^0-9]/g, '').trim();
      const parsedRating = parseInt(ratingDigits, 10) / 10;

      /* Get availability (digits only) */
      const availability = document.body.querySelector('#availability').innerText;
      const formattedAvailability = availability.replace(/[^0-9]/g, '').trim();

      /* Get list price */
      const listPrice = document.body.querySelector('.priceBlockStrikePriceString').innerText;

      /* Get price */
      const price = document.body.querySelector('#priceblock_ourprice').innerText;

      /* Get product description */
      const description = document.body.querySelector('#renewedProgramDescriptionAtf').innerText;

      /* Get product features */
      const features = document.body.querySelectorAll('#feature-bullets ul li');
      const formattedFeatures = [];
      features.forEach((feature) => {
        formattedFeatures.push(feature.innerText);
      });

      /* Get comparable items */
      const comparableItems = document.body.querySelectorAll('#HLCXComparisonTable .comparison_table_image_row .a-link-normal');
      const formattedComparableItems = [];
      comparableItems.forEach((item) => {
        formattedComparableItems.push('https://amazon.com' + item.getAttribute('href'));
      });

      return {
        title: title,
        rating: parsedRating,
        reviewCount: formattedReviewCount,
        listPrice: listPrice,
        price: price,
        availability: formattedAvailability,
        description: description,
        features: formattedFeatures,
        comparableItems: formattedComparableItems
      };
    });

    console.log(productInfo);
  } finally {
    // Close the browser even if scraping failed, so the process can exit
    await browser.close();
  }
}).catch(function(error) {
  console.error(error);
});
Run the code by executing the following command in your command shell:
node amazonscraper.js
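The script assumes that every selector matches. On a product page without, say, a list price, the lookup returns null and reading its text throws. A minimal sketch of a more defensive lookup follows; the name textOrDefault is hypothetical, and to use it in the script it would have to be defined inside the page.evaluate callback, because that callback runs in the browser rather than in Node.js.

```javascript
// Hypothetical helper: returns the trimmed text of the first element matching
// `selector` under `root`, or `fallback` when nothing matches.
function textOrDefault(root, selector, fallback = null) {
  const element = root.querySelector(selector);
  if (!element || typeof element.innerText !== 'string') {
    return fallback;
  }
  return element.innerText.trim();
}

// Inside page.evaluate this could be called as, for example:
//   const listPrice = textOrDefault(document.body, '.priceBlockStrikePriceString');
```

With this approach a missing field shows up as null in the result instead of aborting the whole run.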
The Output
{ title: 'Apple iPhone XR vollständig entsperrt (erneuert), schwarz',
  rating: 4.5,
  reviewCount: '509',
  listPrice: '749,99 $',
  price: '549,00 $',
  availability: '5',
  description:
   'Dies ist ein Amazon-Renewed-Produkt und kommt mit einer 90 tägigen Garantie von Amazon.\n\nProfessionell geprüft und getestet, um wie neu auszusehen und zu funktionieren. Das Produkt ist nicht von Apple zertifiziert, aber wurde vom Verkäufer geprüft und getestet. Die Verpackung und das entsprechende Zubehör (exklusive Kopfhörer) können generisch sein. Drahtlose Produkte haben eine 90 tägige Mindestgarantie, die von Amazon gestellt wird. Weitere Informationen',
  features:
   [ 'Seven-layer color process. The beautiful finishes of the back glass are achieved using an advanced process that allows for deep, rich colors.',
     'Aerospace-grade aluminum bands. A special Apple‑designed alloy is precision‑machined to create structural bands and anodized to complement the color of the back glass.',
     'Wireless charging. The glass back allows iPhone XR to charge easily and wirelessly.',
     'Intelligent A12 Bionic. This is the smartest, most powerful chip in a smartphone, with our next-generation Neural Engine',
     '12MP rear Camera, ƒ/1. 8, wide-angle lens, portrait mode with depth Control, 2X faster sensor for smart HDR across your photos, 4K video up to 60 fps' ],
  comparableItems:
   [ 'https://amazon.com/dp/B077578W38/ref=psdc_2407748011_t1_B07P6Y7954',
     'https://amazon.com/dp/B07756QYST/ref=psdc_2407748011_t2_B07P6Y7954',
     'https://amazon.com/dp/B07K97BQDF/ref=psdc_2407748011_t3_B07P6Y7954',
     'https://amazon.com/dp/B07XSS3Z9J/ref=psdc_2407748011_t4_B07P6Y7954',
     'https://amazon.com/dp/B07TNNRQMS/ref=psdc_2407748011_t5_B07P6Y7954' ] }
The scraper script above needs only minor adjustments to save the scraped content to a file or a database. Let’s continue by looking at the challenges that come with scraping larger numbers of Amazon product pages.
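As one possible version of the file-saving adjustment mentioned above, the sketch below appends each scraped product to a JSON Lines file, one JSON object per line. The function names, the scrapedAt field, and the file name in the usage note are assumptions made for illustration, not part of the original script.

```javascript
const fs = require('fs');

// Appends one product as a single JSON line to `filePath`, adding a
// timestamp so that repeated runs can be told apart.
function appendProduct(filePath, product) {
  const record = { scrapedAt: new Date().toISOString(), ...product };
  fs.appendFileSync(filePath, JSON.stringify(record) + '\n', 'utf8');
  return record;
}

// Reads every stored record back into an array of objects.
function readProducts(filePath) {
  return fs
    .readFileSync(filePath, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}
```

In the scraper, a call such as appendProduct('products.jsonl', productInfo) could replace or accompany the console.log line. An append-only file keeps a price history without needing a database, which may be enough for small monitoring jobs.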
How to scrape Amazon product information at a larger scale
Amazon tries to prevent excessive scraping and uses CAPTCHAs as an anti-scraping measure. To keep your script up and running, you can do the following:
Go asynchronous
Using a Puppeteer cluster lets you scrape Amazon product pages concurrently and can drastically increase throughput. However, remember to limit the number of concurrent requests to a level that will not harm the web server of the site you are scraping.
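To illustrate the idea of capping concurrency, here is a small hand-rolled sketch. It is a stand-in for what a cluster library does for you, not the API of any particular package, and the name mapWithConcurrency is hypothetical. The worker is any async function; in a real scraper it would open a page and scrape one URL.

```javascript
// Runs `worker(item, index)` for every item with at most `limit` calls in
// flight at once. Results keep the order of `items`; a failed item is
// recorded as { error } instead of aborting the whole batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each lane repeatedly takes the next unprocessed item until none are left.
  async function runLane() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      try {
        results[index] = { value: await worker(items[index], index) };
      } catch (error) {
        results[index] = { error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A call such as mapWithConcurrency(urls, 2, scrapeOne), where scrapeOne is your own scraping function, would never have more than two pages loading at the same time.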
Rotating the IP address
The part of your digital footprint that is most often used to identify and flag you is your IP address. Rotating your IP address every few requests is the most important countermeasure against being blocked. You can use datacenter or residential proxies for this.
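A rough sketch of rotating proxies every few requests could look like this. The helper names are hypothetical and the addresses are placeholders from a documentation-only IP range, so you would substitute the proxies from your own provider. The sketch relies on Chromium's --proxy-server flag, which applies to the whole browser, so switching proxies this way means launching a new browser with the new arguments.

```javascript
// Returns a function that yields the same proxy for `requestsPerProxy`
// consecutive calls, then moves on to the next one, wrapping around.
function createProxyRotator(proxies, requestsPerProxy) {
  if (proxies.length === 0 || requestsPerProxy < 1) {
    throw new Error('need at least one proxy and requestsPerProxy >= 1');
  }
  let served = 0;
  return function nextProxy() {
    const index = Math.floor(served / requestsPerProxy) % proxies.length;
    served += 1;
    return proxies[index];
  };
}

// Builds launch arguments that route the browser's traffic through `proxy`.
function launchArgsFor(proxy) {
  return ['--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${proxy}`];
}

// Placeholder addresses, not real proxies.
const nextProxy = createProxyRotator(['203.0.113.10:8080', '203.0.113.11:8080'], 3);

// With Puppeteer, the arguments would be passed at launch, for example:
//   const browser = await puppeteer.launch({ args: launchArgsFor(nextProxy()) });
```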
Rotating the user agent
Another important part of your digital footprint is the user agent that is sent along with each of your requests. The code snippet above uses a single, fixed desktop Chrome user agent. However, you should rotate it every few requests. A list of commonly used user agents can be found here: http://www.networkinghowtos.com/howto/common-user-agent-list/
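One simple way to rotate the user agent is to pick a random entry from a pool while avoiding the one used last. In the sketch below the pool is only an example (its first entry is the string from the script above, the others are illustrative), and pickUserAgent is a hypothetical name. Puppeteer's page.setUserAgent can then apply the chosen value to a page.

```javascript
// Example pool; replace these with current, real browser user agents.
const USER_AGENTS = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0',
];

// Picks a user agent that differs from `previous` whenever the pool has more
// than one entry. `random` is injectable so the choice can be made
// deterministic, for example in tests.
function pickUserAgent(pool, previous = null, random = Math.random) {
  const candidates = pool.length > 1 ? pool.filter((ua) => ua !== previous) : pool;
  return candidates[Math.floor(random() * candidates.length)];
}

// With Puppeteer, before navigating:
//   lastUserAgent = pickUserAgent(USER_AGENTS, lastUserAgent);
//   await page.setUserAgent(lastUserAgent);
```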
Retrying failed requests
No matter how hard you try, some requests will fail. Catch them by checking both the page content and the status code of the response, then retry with a different IP address.
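The retry logic described above can be sketched roughly as follows. The function names are hypothetical, and the CAPTCHA marker strings are assumptions that may not match what Amazon actually serves, so check them against a real blocked response. The attempt callback receives the attempt number, which leaves room for switching to a different proxy on each try.

```javascript
// Heuristic block detection: a non-200 status or a known marker in the HTML.
// The marker strings are assumptions and should be verified.
function looksBlocked(status, html) {
  const markers = ['captcha', 'enter the characters you see below'];
  const lower = html.toLowerCase();
  return status !== 200 || markers.some((marker) => lower.includes(marker));
}

// Calls `attempt(n)` for n = 1, 2, ... until it resolves to an unblocked
// { status, html } result or `maxAttempts` is reached.
async function fetchWithRetry(attempt, maxAttempts = 3) {
  let lastProblem = null;
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      const result = await attempt(n);
      if (!looksBlocked(result.status, result.html)) {
        return { ...result, attempts: n };
      }
      lastProblem = new Error(`attempt ${n} looked blocked (status ${result.status})`);
    } catch (error) {
      lastProblem = error; // navigation errors and timeouts also count as failures
    }
  }
  throw lastProblem || new Error('maxAttempts must be at least 1');
}

// With Puppeteer, `attempt` could be built from real calls like these:
//   const response = await page.goto(url);
//   return { status: response.status(), html: await page.content() };
```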
Disclaimer:
All articles are for learning purposes only and are meant to demonstrate different techniques for developing web scrapers. We take no responsibility for how our code snippets are used and cannot be held liable for any detrimental usage.