Web Scraping using Puppeteer and Node

Last updated 86 days ago by Hrithik Jha


Introduction

Puppeteer is a Node library that provides an API for controlling Google Chrome and Chromium. It can be used to scrape almost anything in a Chrome (or Chromium) window, including the Chrome Developer Tools. Today, we’ll be scraping lobste.rs.

Why Scrape Data?

Scraping can be used for many different purposes. It helps collect data for machine learning or data visualization, and it can also be used to automate processes. A good example would be scraping ticket prices and show timings from theaters nearby. With that data, even when it comes from several different websites, you can sort by price or show timing and decide which show to book.

Getting Started

For starters, you’ll need a text editor (Pro Tip: MS Word with Consolas) and a system with Node installed. You can download Node from the official Node.js website, or install it with brew install node on a Mac or sudo apt-get install nodejs npm on Debian-based Linux.

To install Puppeteer, simply run:

npm install puppeteer

Puppeteer runs on Node and can be used from any existing Node program. Just as one might use Requests or Selenium (other scraping and automation libraries) alongside Flask or Django in Python, we can use Puppeteer in programs written for Node.

Into The Code

The first thing we do is import the Puppeteer library so it can be used in the program.

const puppeteer = require('puppeteer');

The scraping logic is written inside the appropriately named function, scrape(). You will also notice the async keyword. It is needed because most of the Puppeteer API is asynchronous, and await can only be used inside an async function.
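As a rough illustration of the shape such a function can take, here is a minimal sketch. It is not the article's actual scrape() body: the launcher parameter, the url argument, and the choice to return the page title are all assumptions made for this example. Passing the launcher in (rather than calling puppeteer.launch() directly) is only a convenience so the function can be tried with a stand-in object.

```javascript
// Hypothetical sketch of an async scraping function.
// `launcher` is assumed to be the object returned by require('puppeteer'),
// or anything else exposing a compatible launch() method.
async function scrape(launcher, url) {
  // launch() resolves to a Browser instance.
  const browser = await launcher.launch();
  try {
    // Open a new tab and navigate to the target page.
    const page = await browser.newPage();
    await page.goto(url);
    // page.title() resolves to the text of the page's <title> element.
    return await page.title();
  } finally {
    // Close the browser even if navigation fails.
    await browser.close();
  }
}
```

With Puppeteer installed, it could be called as scrape(require('puppeteer'), 'https://lobste.rs').then(console.log). Each await pauses the function until the corresponding promise settles, which is why the surrounding function has to be declared async.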
