Amit Merchant


Using Bright Data's Scraping Browser to Scrape Web Pages

There comes a time when you need to scrape a website for data. For instance, you might want to collect the list of all the products a store offers, or gather all the blog posts a site has published. Or you might simply want to automate the scraping process to make your life easier.

What is web scraping?

But what do you mean by scraping web pages anyway?

Well, scraping web pages is the process of extracting data from a website by traversing through its HTML structure. This is usually done by using a web scraping tool or a web scraping library.

We can accomplish this by using something called a headless browser. A headless browser is a web browser that can be used without a graphical user interface. It can be controlled programmatically and is used to automate web page interaction.

So, what we do, essentially, is spin up a headless browser and then use it to navigate through the website’s HTML structure and extract the data we need.

Using a headless browser

There are a lot of headless browsers out there. But Puppeteer is the most popular one. It’s a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.

You can install Puppeteer in an existing JavaScript project with npm using the following command.

npm i puppeteer

Once installed, a simple script to scrape a website title would look like so.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');
  const title = await page.title();

  console.log(title);

  await browser.close();
})();

As you can tell, we're using the puppeteer package to launch a headless browser, navigating to the website we want to scrape, and then extracting its title.
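To extract more than the title, you can query elements on the page. Here's a minimal sketch that collects every link URL on a page using Puppeteer's `page.$$eval`; the `scrapeLinks` helper name and the example URL are my own, not from the original post.

```javascript
// A sketch of extracting multiple elements, assuming the puppeteer
// package is installed. scrapeLinks is a hypothetical helper name.
async function scrapeLinks(url) {
  // Loaded inside the function so this file can be required
  // even in environments where puppeteer isn't installed.
  const puppeteer = require('puppeteer');

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // $$eval runs the callback in the page context over all matching elements
    return await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
  } finally {
    await browser.close();
  }
}

// Usage: scrapeLinks('https://example.com').then((links) => console.log(links));
```

The same pattern works for any CSS selector — swap `'a'` for `'.product'` or `'article h2'` depending on what you're after.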

Now, this works fine for most websites. But if the website you want to scrape is protected by anti-bot measures and CAPTCHAs, you might not be able to scrape it using Puppeteer alone.

In such cases, you can use a scraping browser like Bright Data’s scraping browser.

Bright Data’s scraping browser

It’s a browser that’s built specifically for web scraping at scale. It’s a Chromium-based browser instance that’s built to handle anti-bot measures, fingerprints, CAPTCHAs, and more. It even bypasses the toughest website blocks.

You can also instantiate unlimited instances of the scraping browser and run them in parallel to scrape multiple websites at the same time.

This scraping browser is a "headful" browser, as opposed to the headless mode Puppeteer uses by default.

While headless browsers are often used with proxies for data scraping, they are easily detected by bot-protection software, which makes data scraping difficult at scale.

A headful browser, on the other hand, is far better suited to scaling data-scraping projects and bypassing blocks.

Using the scraping browser

To start using Bright Data’s scraping browser, you first need to create an account on Bright Data.

Once you’ve created an account, navigate to https://brightdata.com/cp/zones and click on Get Started in the Scraping Browser section.

Get Started

This will take you to the Scraping Browser page where you can create a new scraping browser instance like so.

Create Scraping Browser

This will take you to a page from which you can get the Host, Username, and Password that you can use to instantiate the scraping browser.

Browser Credentials

Once done, you can start using these credentials to instantiate the scraping browser using Puppeteer.

In this case, we'll use the puppeteer-core package, a version of Puppeteer that doesn't download Chromium by default. Instead, it provides a way to launch Chromium from a specific path or connect to a remote browser endpoint.

Note: Apart from Puppeteer, this scraping browser is also compatible with Playwright and Selenium. But we'll stick with Puppeteer for this example.

npm i puppeteer-core

And here’s how you can grab an entire website’s HTML using Bright Data’s scraping browser.

const puppeteer = require('puppeteer-core');

// should look like 'brd-customer-<ACCOUNTID>-zone:<PASSWORD>'
const auth = 'USERNAME:PASSWORD';

async function scrape() {
  let browser;
  try {
    // Connect to the remote scraping browser over WebSocket
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@brd.superproxy.io:9222`,
    });

    const page = await browser.newPage();
    // Remote sessions can be slow; allow up to 2 minutes for navigation
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto('https://example.com');

    // Grab the full rendered HTML of the page
    const html = await page.evaluate(() => {
      return document.documentElement.outerHTML;
    });

    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) {
  scrape();
}

Save it as scrape.js and run it like so.

node scrape.js

As you can tell, we’re connecting to Bright Data’s scraping browser using the puppeteer.connect method by passing the browserWSEndpoint option. This option is the URL of the scraping browser which you can find in Bright Data’s dashboard.

You need to replace USERNAME and PASSWORD with your own credentials, and brd.superproxy.io with the Host provided in the dashboard.
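The WebSocket endpoint is just the credentials joined into a URL. As a sketch of that composition (the `buildEndpoint` helper is my own, not part of any Bright Data SDK):

```javascript
// Sketch: compose the scraping browser's WebSocket endpoint from credentials.
// buildEndpoint is a hypothetical helper; the default host and port are
// the ones used in the example above.
function buildEndpoint(username, password, host = 'brd.superproxy.io', port = 9222) {
  return `wss://${username}:${password}@${host}:${port}`;
}

// e.g. buildEndpoint('brd-customer-ACCOUNTID-zone', 'secret')
//      → 'wss://brd-customer-ACCOUNTID-zone:secret@brd.superproxy.io:9222'
```

Keeping the credentials out of the source — say, in environment variables — is a good idea if you plan to commit the script anywhere.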

The best part is that Bright Data provides a $5 free trial if you want to try out their scraping browser. You can sign up here: Bright Data free trial

Using Bright Data’s scraping browser is ethical

It’s important to note here that when using Bright Data’s scraping browser, you don’t need to worry about the complaince since they maintains a compliance system that is based on deterrence, prevention, and enforcement. So, be sure that it’s fairly ethical to use it.

Wrapping up

So, this is how you can use Bright Data’s scraping browser to scrape web pages. As you can tell, it’s pretty straightforward and easy to use. You can also use it to scrape multiple websites at the same time by instantiating multiple scraping browser instances.
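The fan-out described above can be sketched with `Promise.all`. Here, `scrapeAll` and the per-URL worker are hypothetical names of my own; the worker could be the earlier `scrape` function, adapted to accept a URL.

```javascript
// Sketch: scrape several sites concurrently, assuming each call to
// scrapeOne(url) connects to its own scraping browser instance.
async function scrapeAll(urls, scrapeOne) {
  // Promise.all runs the workers concurrently, preserves input order,
  // and rejects as soon as any single scrape fails
  return Promise.all(urls.map((url) => scrapeOne(url)));
}

// Usage (with a worker like the earlier scrape function, parameterized by URL):
// const results = await scrapeAll(['https://example.com', 'https://example.org'], scrape);
```

If one flaky site shouldn't sink the whole batch, `Promise.allSettled` is the drop-in alternative.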

You can get a limited-time offer of up to 50% off!


👋 Hi there! I'm Amit. I write articles about all things web development. You can become a sponsor on my blog to help me continue my writing journey and get your brand in front of thousands of eyes.
