
Using Puppeteer for Emoji Scraping, Headless Chrome, and Front‑End Automation Testing

The article demonstrates how to use Puppeteer—a Node.js API built on the Chrome DevTools Protocol—to run headless Chrome for tasks such as scraping Google emoji images, generating screenshots or PDFs, and automating front‑end tests by launching a browser, navigating pages, handling cookies, simulating user input, capturing responses, and saving results.

vivo Internet Technology

Puppeteer is a Node.js package released by the Chrome development team in 2017 that provides a set of APIs to control Chrome. It can be used as a headless Chrome browser (or with a UI) for web page data scraping, screenshot or PDF generation, front‑end automation testing (simulating input/click/keyboard actions), and performance analysis.

In a recent project we needed to download Google emoji images from emojipedia.org/google/. By locating the ul.emoji-grid element and extracting the src attribute of each img, we can automate the download with Puppeteer. The following script demonstrates the complete process:

const puppeteer = require('puppeteer')
const request = require('request')
const fs = require('fs')

async function getEmojiImage (url) {
  // Launch the browser; launch() returns a Promise that resolves to a Browser
  const browser = await puppeteer.launch()
  // Open a new page (tab)
  const page = await browser.newPage()
  // Navigate the page to the target URL
  await page.goto(url, {
    waitUntil: 'networkidle2'
  })
  // Wait until the emoji images are present in the DOM
  await page.waitForSelector('.emoji-grid img')
  // The callback passed to page.evaluate runs in the browser context, so it can use the DOM
  const emojis = await page.evaluate(() => {
    const grid = document.getElementsByClassName('emoji-grid')[0]
    const imgs = grid.getElementsByTagName('img')
    // Return the src of at most the first 97 emoji images as an array
    return Array.from(imgs).slice(0, 97).map(img => img.getAttribute('src'))
  })
  // Metadata that will be written to the JSON file
  const json = []
  for (let i = 0; i < emojis.length; i++) {
    const name = emojis[i].slice(emojis[i].lastIndexOf('/') + 1)
    // Stream the emoji image into a local file and log once the write has finished
    request(emojis[i])
      .pipe(fs.createWriteStream('./' + (i < 10 ? '0' + i : i) + name))
      .on('finish', () => console.log(`${name}----emoji written successfully`))
    json.push({
      name,
      url: `./a/a/${name}` // replace with your own URL
    })
  }
  // Write the JSON file
  fs.writeFile('./google-emoji.json', JSON.stringify(json), err => {
    if (err) console.error(err)
  })
  // Close the headless browser
  await browser.close()
}

getEmojiImage('https://emojipedia.org/google/')
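
The file-naming and manifest logic in the script can be separated from the browser code so that it can be checked without a network connection. The following is a minimal sketch, not part of the original script: buildEmojiManifest and its publicPrefix parameter are hypothetical names, the example.com URLs are placeholders, and the helper assumes the image URLs are absolute.

```javascript
// Hypothetical helper: derives a local file name and a manifest entry for
// each image URL, mirroring the naming scheme used in the script above.
function buildEmojiManifest (urls, publicPrefix = './a/a/') {
  return urls.map((url, i) => {
    // Keep only the last path segment; the query string is ignored
    const name = new URL(url).pathname.split('/').pop()
    // Zero-pad the index so that files sort in page order
    const file = String(i).padStart(2, '0') + name
    return { name, file, url: publicPrefix + name }
  })
}

// Placeholder URLs, used only to show the shape of the result
const sample = buildEmojiManifest([
  'https://example.com/img/grinning-face.png?v=2',
  'https://example.com/img/waving-hand.png'
])
console.log(sample)
```

Parsing with URL instead of slicing at the last slash keeps query strings out of the file names, which is a design choice of this sketch rather than something the original script does.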

Before diving deeper into Puppeteer, it is useful to understand Headless Chrome.

Headless Chrome was introduced in Chrome 59 and allows Chrome to run without a graphical UI, exposing all modern web platform features via the command line. Common command‑line usages include opening a page, printing the DOM, generating a PDF, or taking a screenshot.

chrome --headless --disable-gpu --remote-debugging-port=8080 https://vivo.com.cn

On macOS it is convenient to create an alias for the Chrome binary:

alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"

Examples of typical Headless Chrome commands:

chrome --headless --disable-gpu --dump-dom https://vivo.com.cn
chrome --headless --disable-gpu --print-to-pdf https://vivo.com.cn
chrome --headless --disable-gpu --screenshot https://vivo.com.cn
chrome --headless --disable-gpu --screenshot --window-size=1280,1696 https://vivo.com.cn
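
The same commands can be assembled and run from Node.js. The sketch below is an illustration only: headlessArgs and runHeadless are hypothetical helper names, and chromePath must be supplied by the caller because the binary location differs per platform.

```javascript
// Hypothetical helper that assembles the switches shown above.
function headlessArgs (action, url, { windowSize } = {}) {
  const switches = {
    dom: '--dump-dom',
    pdf: '--print-to-pdf',
    screenshot: '--screenshot'
  }
  const flag = switches[action]
  if (typeof flag !== 'string') {
    throw new Error(`Unknown action: ${action}`)
  }
  const args = ['--headless', '--disable-gpu', flag]
  if (windowSize) {
    args.push(`--window-size=${windowSize.width},${windowSize.height}`)
  }
  args.push(url)
  return args
}

// Runs Chrome with the given arguments and resolves with its stdout.
// chromePath is an assumption: pass the binary location for your platform.
function runHeadless (chromePath, args) {
  // Required lazily so that building arguments never touches child_process
  const { execFile } = require('child_process')
  return new Promise((resolve, reject) => {
    execFile(chromePath, args, (err, stdout) => (err ? reject(err) : resolve(stdout)))
  })
}

console.log(headlessArgs('screenshot', 'https://vivo.com.cn', {
  windowSize: { width: 1280, height: 1696 }
}).join(' '))
```

For example, runHeadless(chromePath, headlessArgs('dom', 'https://vivo.com.cn')) would resolve with the serialized DOM that --dump-dom prints to stdout.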

Puppeteer builds on the Chrome DevTools Protocol (CDP) to provide a high‑level Node API. A Browser instance represents a Chrome process; a BrowserContext groups pages; a Page corresponds to a single tab. The API hierarchy is illustrated in the official diagram (omitted here).

Key internal managers used by Page include:

FrameManager: navigation, clicks, typing, waiting for loads.

NetworkManager: request interception, cache control.

EmulationManager: viewport emulation.
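
These managers are internal and are normally reached through public Page methods. As a hedged illustration, the sketch below uses page.setViewport for viewport emulation and page.setRequestInterception for request interception; prepareLightweightPage and blockedTypes are hypothetical names, and the exact mapping of public methods to managers may differ between Puppeteer versions.

```javascript
// Sketch: expects an existing Puppeteer Page, e.g. from browser.newPage().
async function prepareLightweightPage (page, blockedTypes = ['image', 'font']) {
  // Viewport emulation
  await page.setViewport({ width: 375, height: 667 })
  // Once interception is enabled, every request must be continued or aborted
  await page.setRequestInterception(true)
  page.on('request', request => {
    if (blockedTypes.includes(request.resourceType())) {
      request.abort()
    } else {
      request.continue()
    }
  })
}
```

Blocking images and fonts is only one possible policy; the same handler could be used to rewrite headers or to stub API responses.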

Below is a minimal example of launching a browser with a remote‑debugging port and printing the WebSocket endpoint:

const browser = await puppeteer.launch({
    // --remote-debugging-port=3333 opens a debugging port; visit http://127.0.0.1:3333/ in a browser to inspect it
    args: ['--remote-debugging-port=3333']
})
console.log(browser.wsEndpoint())

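The endpoint has the following form (this sample was captured with a randomly assigned port; with the flag above the port would be 3333):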

ws://127.0.0.1:57546/devtools/browser/5d6ee624-6b5e-4b8c-b284-5e4800eac853
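
An endpoint of this form can be inspected and handed to puppeteer.connect to attach to a browser that is already running. The sketch below is a minimal illustration: describeEndpoint and attach are hypothetical helper names, and the puppeteer module is passed in as a parameter rather than required inside the block.

```javascript
// Hypothetical helper: splits a DevTools WebSocket endpoint into its parts.
function describeEndpoint (endpoint) {
  const { hostname, port, pathname } = new URL(endpoint)
  return {
    host: hostname,
    port: Number(port),
    // The last path segment identifies the browser instance
    browserId: pathname.split('/').pop()
  }
}

// Attaches to a running browser instead of launching a new one.
// `puppeteer` is the module returned by require('puppeteer').
function attach (puppeteer, endpoint) {
  return puppeteer.connect({ browserWSEndpoint: endpoint })
}

console.log(describeEndpoint(
  'ws://127.0.0.1:57546/devtools/browser/5d6ee624-6b5e-4b8c-b284-5e4800eac853'
))
```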

CDP messages are JSON objects with an incremental id, a method, and optional params. For example:

{"id":46,"method":"CSS.getMatchedStylesForNode","params":{"nodeId":5}}
{"id":47,"method":"CSS.getComputedStyleForNode","params":{"nodeId":5}}

These messages can be used to perform low‑level actions such as compiling a script in a given execution context:

{"id":190,"method":"Runtime.compileScript","params":{"expression":"alert()","sourceURL":"","persistScript":false,"executionContextId":3}}
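
To make the message shape concrete, the sketch below builds messages with incremental ids and then shows how a raw command could be sent through a Puppeteer CDP session. createCdpMessageFactory and evaluateViaCdp are hypothetical names; the sketch assumes a Page object from browser.newPage() and uses Runtime.evaluate, which runs an expression rather than only compiling it.

```javascript
// Hypothetical factory that numbers outgoing CDP messages incrementally.
function createCdpMessageFactory (startId = 1) {
  let nextId = startId
  return (method, params) => {
    const message = { id: nextId++, method }
    if (params !== undefined) message.params = params
    return JSON.stringify(message)
  }
}

const toMessage = createCdpMessageFactory(46)
console.log(toMessage('CSS.getMatchedStylesForNode', { nodeId: 5 }))
console.log(toMessage('CSS.getComputedStyleForNode', { nodeId: 5 }))

// With Puppeteer the ids are managed internally. Given a Page, a CDP session
// can send a raw command and return the evaluated value.
async function evaluateViaCdp (page, expression) {
  const client = await page.target().createCDPSession()
  const { result } = await client.send('Runtime.evaluate', {
    expression,
    returnByValue: true
  })
  return result.value
}
```

Calling evaluateViaCdp(page, '1 + 1') against a live page would be expected to resolve with the evaluated value.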

Using Puppeteer, we can automate a typical front‑end testing workflow. The steps are:

STEP 1 – Create a Browser instance

const browser = await puppeteer.launch({
    devtools: true, // open the DevTools panel automatically
    headless: false, // run with a visible UI
    defaultViewport: { width: 1000, height: 1200 }, // default viewport size
    ignoreHTTPSErrors: true // ignore HTTPS errors
})

STEP 2 – Create a Page and navigate

const page = await browser.newPage()
await page.goto(url, { waitUntil: 'networkidle0' })

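When the application requires authentication, cookies can be set directly on the page. Set them before calling page.goto (or reload afterwards) so that the page's requests carry them: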

const cookies = [{
    name: 'token',
    value: 'system tokens', // your system token
    domain: 'domain' // the domain the cookie belongs to
}]
await page.setCookie(...cookies)
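
One possible ordering sets the cookie before navigation so that the first request already carries it. The sketch below is an assumption-laden illustration: openAuthenticated is a hypothetical helper, the cookie name token follows the snippet above, and the cookie domain is derived from the target URL.

```javascript
// Sketch: expects a Puppeteer Browser, e.g. from puppeteer.launch().
async function openAuthenticated (browser, url, token) {
  const page = await browser.newPage()
  // Because the domain is given explicitly, the cookie can be set
  // before the page has navigated anywhere.
  await page.setCookie({
    name: 'token',
    value: token,
    domain: new URL(url).hostname
  })
  await page.goto(url, { waitUntil: 'networkidle0' })
  return page
}
```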

STEP 3 – Simulate input and click actions

await page.type('.el-form-item:nth-child(1) input', '132', { delay: 20 })
await page.click('.el-form-item:nth-child(2) .el-form-item__content label:nth-child(1)')
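
When a form has several fields, the type and click calls can be driven from a list of steps. The following sketch is illustrative only: fillForm and the step format are hypothetical, and waiting for each selector first is an added precaution against acting before the element is rendered.

```javascript
// Sketch: expects a Puppeteer Page. Each step has a selector and, for text
// inputs, a text property; steps without text are treated as clicks.
async function fillForm (page, steps, typingDelay = 20) {
  for (const step of steps) {
    // Wait for the element so the action does not race the page's rendering
    await page.waitForSelector(step.selector)
    if (step.text !== undefined) {
      await page.type(step.selector, step.text, { delay: typingDelay })
    } else {
      await page.click(step.selector)
    }
  }
}

// The two actions from STEP 3 expressed as steps
const formSteps = [
  { selector: '.el-form-item:nth-child(1) input', text: '132' },
  { selector: '.el-form-item:nth-child(2) .el-form-item__content label:nth-child(1)' }
]
```

With a live page, await fillForm(page, formSteps) would perform the same input and click as the two calls above.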

STEP 4 – Listen for API responses

page.on('response', response => {
    const req = response.request()
    console.log(`Request URL: ${req.url()}, method: ${req.method()}, response status: ${response.status()}`)
    response.text()
        .then(result => console.log(`Response body: ${result}`))
        .catch(err => console.error(`Could not read response body: ${err.message}`))
})
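
Logging every response also includes scripts, stylesheets, and images. A test usually cares only about API calls, so one option is to keep just XHR and fetch responses and collect them for later assertions. The sketch below is a hedged example: collectApiResponses and the sink array are hypothetical names.

```javascript
// Sketch: expects a Puppeteer Page. Collects API responses into `sink`.
function collectApiResponses (page, sink = []) {
  page.on('response', async response => {
    const request = response.request()
    // Skip documents, scripts, images and other non-API resources
    if (!['xhr', 'fetch'].includes(request.resourceType())) return
    const entry = {
      url: request.url(),
      method: request.method(),
      status: response.status(),
      body: null
    }
    try {
      entry.body = await response.text()
    } catch (err) {
      // Some responses (redirects, for instance) have no readable body
    }
    sink.push(entry)
  })
  return sink
}
```

The listener should be registered before the actions that trigger the requests, otherwise early responses are missed.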

STEP 5 – Capture a screenshot

// Use the URL hash as the file name to avoid overwriting earlier screenshots
const testName = decodeURIComponent(url.split('#/')[1]).replace(/\//g, '-')
await page.screenshot({ path: `${testName}.png`, fullPage: true })
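
The file-name derivation yields the string "undefined" when the URL has no #/ part. A slightly more defensive variant is sketched below; screenshotName and its fallback parameter are hypothetical, and replacing characters that are invalid in file names on some platforms is an added assumption.

```javascript
// Hypothetical helper: derives a screenshot file name from a hash-routed URL.
function screenshotName (url, fallback = 'index') {
  const hashPath = url.split('#/')[1]
  // No hash route: fall back to a fixed name instead of "undefined"
  if (!hashPath) return `${fallback}.png`
  // Replace path separators and characters that are invalid in file names
  return decodeURIComponent(hashPath).replace(/[\/\\:*?"<>|]/g, '-') + '.png'
}

console.log(screenshotName('https://example.com/#/user/list'))
```

The result can be passed as the path option of page.screenshot in place of the template string used above.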

STEP 6 – Close the browser

await browser.close()
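
If any step throws, a browser.close() placed at the end of the script is never reached and the Chrome process keeps running. One way to guard against this is a try/finally block, sketched below. runCheck is a hypothetical name, the puppeteer module is passed in as a parameter, and returning the page title stands in for whatever the test actually verifies.

```javascript
// Sketch: `puppeteer` is the module returned by require('puppeteer').
async function runCheck (puppeteer, url) {
  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle0' })
    // Placeholder for the real interactions and assertions
    return await page.title()
  } finally {
    // Runs on success and on failure, so the browser is always closed
    await browser.close()
  }
}
```

A typical call would be runCheck(require('puppeteer'), url).then(console.log), with url pointing at the page under test.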

Running the script against a back‑office form yields two typical outcomes: a successful validation with API data logged, or a validation failure that results in a screenshot of the error state. The article also suggests further extensions such as simulating production‑environment checks or scheduling periodic data‑scraping jobs.

