Using Puppeteer for Emoji Scraping, Headless Chrome, and Front‑End Automation Testing
The article demonstrates how to use Puppeteer—a Node.js API built on the Chrome DevTools Protocol—to run headless Chrome for tasks such as scraping Google emoji images, generating screenshots or PDFs, and automating front‑end tests by launching a browser, navigating pages, handling cookies, simulating user input, capturing responses, and saving results.
Puppeteer is a Node.js package released by the Chrome development team in 2017 that provides a set of APIs to control Chrome. It can be used as a headless Chrome browser (or with a UI) for web page data scraping, screenshot or PDF generation, front‑end automation testing (simulating input/click/keyboard actions), and performance analysis.
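As a minimal sketch of the screenshot and PDF use cases mentioned above, the following renders one page to both formats. The names outputName and capture are hypothetical helpers introduced here for illustration, and the A4 page format is an arbitrary choice; the script assumes the puppeteer package is installed.

```javascript
const path = require('path')

// Hypothetical helper: derive an output file name from a page URL,
// e.g. "https://vivo.com.cn/" with "png" becomes "vivo-com-cn.png".
function outputName (url, ext) {
  const { hostname } = new URL(url)
  return `${hostname.replace(/\./g, '-')}.${ext}`
}

// Sketch: render one page to a PNG screenshot and a PDF file.
// puppeteer is required lazily so the helper above works without it.
async function capture (url, outDir = '.') {
  const puppeteer = require('puppeteer')
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    await page.screenshot({
      path: path.join(outDir, outputName(url, 'png')),
      fullPage: true
    })
    // PDF generation is run here in the default headless mode.
    await page.pdf({
      path: path.join(outDir, outputName(url, 'pdf')),
      format: 'A4'
    })
  } finally {
    // Always release the Chrome process, even if a step above fails.
    await browser.close()
  }
}
```

Calling capture('https://vivo.com.cn') would write both files to the current directory; the try/finally keeps a failed navigation from leaving a stray Chrome process behind.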
In a recent project we needed to download Google emoji images from emojipedia.org/google/. By locating the ul.emoji-grid element and extracting the src attribute of each img, we can automate the download with Puppeteer. The following script demonstrates the complete process:
const puppeteer = require('puppeteer')
const request = require('request')
const fs = require('fs')

async function getEmojiImage (url) {
  // Launch a headless browser (puppeteer.launch() returns a Promise)
  const browser = await puppeteer.launch()
  // Open a new page (tab)
  const page = await browser.newPage()
  // Navigate the page to the target URL
  await page.goto(url, {
    waitUntil: 'networkidle2'
  })
  // Wait 3000 ms for the page to finish loading
  // (page.waitFor is no longer available in current Puppeteer releases, so a plain timer is used)
  await new Promise(resolve => setTimeout(resolve, 3000))
  // The callback passed to page.evaluate runs inside the page, so it can use the DOM
  const emojis = await page.evaluate(() => {
    const grid = document.getElementsByClassName('emoji-grid')[0]
    const imgs = grid.getElementsByTagName('img')
    const urls = []
    // Take at most the first 97 images
    for (let i = 0; i < Math.min(97, imgs.length); i++) {
      urls.push(imgs[i].getAttribute('src'))
    }
    // Return the array of emoji image URLs
    return urls
  })
  // Metadata that will be saved as JSON
  const json = []
  for (let i = 0; i < emojis.length; i++) {
    const name = emojis[i].slice(emojis[i].lastIndexOf('/') + 1)
    // Stream each emoji image into a local file and log once the write finishes
    request(emojis[i])
      .pipe(fs.createWriteStream('./' + (i < 10 ? '0' + i : i) + name))
      .on('finish', () => console.log(`${name}----emoji written successfully`))
    json.push({
      name,
      url: `./a/a/${name}` // your own URL
    })
  }
  // Write the JSON file
  fs.writeFile('./google-emoji.json', JSON.stringify(json), err => {
    if (err) console.error(err)
  })
  // Close the headless browser
  await browser.close()
}

getEmojiImage('https://emojipedia.org/google/')

Before diving deeper into Puppeteer, it is useful to understand Headless Chrome.
Headless Chrome was introduced in Chrome 59 and allows Chrome to run without a graphical UI, exposing all modern web platform features via the command line. Common command‑line usages include opening a page, printing the DOM, generating a PDF, or taking a screenshot.
chrome --headless --disable-gpu --remote-debugging-port=8080 https://vivo.com.cn

On macOS it is convenient to create an alias for the Chrome binary:

alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"

Examples of typical Headless Chrome commands:

chrome --headless --disable-gpu --dump-dom https://vivo.com.cn
chrome --headless --disable-gpu --print-to-pdf https://vivo.com.cn
chrome --headless --disable-gpu --screenshot https://vivo.com.cn
chrome --headless --disable-gpu --screenshot --window-size=1280,1696 https://vivo.com.cn

Puppeteer builds on the Chrome DevTools Protocol (CDP) to provide a high‑level Node API. A Browser instance represents a Chrome process; a BrowserContext groups pages; a Page corresponds to a single tab. The API hierarchy is illustrated in the official diagram (omitted here).
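The Browser, BrowserContext, and Page relationship can be inspected from a script. The sketch below walks every context of a freshly launched browser and lists the pages inside it. describeHierarchy and inspectBrowser are hypothetical names used only for this illustration, and the about:blank navigation is just a placeholder page.

```javascript
// Hypothetical helper: summarize the Browser -> BrowserContext -> Page tree
// as plain data so it can be logged or compared.
function describeHierarchy (contexts) {
  return contexts.map((ctx, index) => ({
    context: index,
    pages: ctx.pageUrls.length,
    urls: ctx.pageUrls
  }))
}

// Sketch: launch a browser, open one tab, and print what each context holds.
// Assumes the puppeteer package is installed; it is required lazily.
async function inspectBrowser () {
  const puppeteer = require('puppeteer')
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto('about:blank')
    const contexts = []
    for (const ctx of browser.browserContexts()) {
      const pages = await ctx.pages()
      contexts.push({ pageUrls: pages.map(p => p.url()) })
    }
    // A page created with browser.newPage() belongs to the default context.
    console.log(page.browserContext() === browser.defaultBrowserContext())
    console.log(JSON.stringify(describeHierarchy(contexts), null, 2))
  } finally {
    await browser.close()
  }
}
```

Keeping the summary step separate from the Puppeteer calls makes it easy to check the shape of the data without starting Chrome.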
Key internal managers used by Page include:
FrameManager: navigation, clicks, typing, waiting for loads.
NetworkManager: request interception, cache control.
EmulationManager: viewport emulation.
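These managers are internal, but their features surface through public Page methods. The sketch below touches one method for each area: viewport emulation, cache control, request interception, and navigation. Which manager backs which method is an assumption based on the list above, and shouldBlock, BLOCKED_TYPES, and openLean are hypothetical names introduced for this example.

```javascript
// Hypothetical policy: resource types to skip when only the HTML matters.
const BLOCKED_TYPES = new Set(['image', 'font', 'media'])

function shouldBlock (resourceType) {
  return BLOCKED_TYPES.has(resourceType)
}

// Sketch: open a page with a fixed viewport, no cache, and filtered requests.
// Assumes the puppeteer package is installed; it is required lazily.
async function openLean (url) {
  const puppeteer = require('puppeteer')
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // Viewport emulation
    await page.setViewport({ width: 1280, height: 800 })
    // Cache control
    await page.setCacheEnabled(false)
    // Request interception: every request must be aborted or continued
    await page.setRequestInterception(true)
    page.on('request', req => {
      if (shouldBlock(req.resourceType())) req.abort()
      else req.continue()
    })
    // Navigation and waiting for the load to settle
    await page.goto(url, { waitUntil: 'networkidle2' })
    return { browser, page }
  } catch (err) {
    await browser.close()
    throw err
  }
}
```

The caller receives both browser and page and is responsible for calling browser.close() once it has finished with the page.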
Below is a minimal example of launching a browser with a remote‑debugging port and printing the WebSocket endpoint:
const browser = await puppeteer.launch({
  // --remote-debugging-port=3333 opens a debugging port; visit http://127.0.0.1:3333/ in a browser to inspect it
  args: ['--remote-debugging-port=3333']
})
console.log(browser.wsEndpoint())

The endpoint looks like:
ws://127.0.0.1:57546/devtools/browser/5d6ee624-6b5e-4b8c-b284-5e4800eac853

CDP messages are JSON objects with an incremental id, a method, and optional params. For example:
{"id":46,"method":"CSS.getMatchedStylesForNode","params":{"nodeId":5}}
{"id":47,"method":"CSS.getComputedStyleForNode","params":{"nodeId":5}}

These messages can be used to perform low‑level actions such as compiling a script:
{"id":190,"method":"Runtime.compileScript","params":{"expression":"alert()","sourceURL":"","persistScript":false,"executionContextId":3}}

Using Puppeteer, we can automate a typical front‑end testing workflow. The steps are:
STEP 1 – Create a Browser instance
const browser = await puppeteer.launch({
  devtools: true, // open the DevTools panel automatically
  headless: false, // run with a visible UI
  defaultViewport: { width: 1000, height: 1200 }, // default viewport size
  ignoreHTTPSErrors: true // ignore HTTPS errors
})

STEP 2 – Create a Page and navigate
const page = await browser.newPage()
await page.goto(url, { waitUntil: 'networkidle0' })

When the application requires authentication, cookies can be set directly:
const cookies = [{
  name: 'token',
  value: 'system tokens', // your system token
  domain: 'domain' // the domain the cookie belongs to
}]
await page.setCookie(...cookies)

STEP 3 – Simulate input and click actions
await page.type('.el-form-item:nth-child(1) input', '132', { delay: 20 })
await page.click('.el-form-item:nth-child(2) .el-form-item__content label:nth-child(1)')

STEP 4 – Listen for API responses
page.on('response', response => {
  const req = response.request()
  console.log(`Response URL: ${req.url()}, method: ${req.method()}, status: ${response.status()}`)
  response.text().then(result => console.log(`Response body: ${result}`))
})

STEP 5 – Capture a screenshot
// Use the URL hash as the file name so that screenshots do not overwrite each other
const testName = decodeURIComponent(url.split('#/')[1]).replace(/\//g, '-')
await page.screenshot({ path: `${testName}.png`, fullPage: true })

STEP 6 – Close the browser
await browser.close()

Running the script against a back‑office form yields two typical outcomes: a successful validation with API data logged, or a validation failure that results in a screenshot of the error state. The article also suggests further extensions such as simulating production‑environment checks or scheduling periodic data‑scraping jobs.
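The six steps can be assembled into a single function, roughly as sketched below. The selectors and the typed value come from the steps above; screenshotName and runFormTest are hypothetical names, the 'index' fallback for URLs without a "#/" route is an assumption, and cookies are set before navigation here so that the first request already carries them.

```javascript
// Derive a screenshot name from the URL's hash route, e.g. "#/form/basic".
// Falls back to a fixed name when the URL has no "#/" part (an assumption;
// the step-by-step snippet does not handle that case).
function screenshotName (url, fallback = 'index') {
  const route = url.split('#/')[1]
  if (!route) return fallback
  return decodeURIComponent(route).replace(/\//g, '-')
}

// Sketch of the whole workflow. Assumes the puppeteer package is installed;
// it is required lazily so screenshotName can be used on its own.
async function runFormTest (url, cookies = []) {
  const puppeteer = require('puppeteer')
  // STEP 1: create the browser
  const browser = await puppeteer.launch({
    defaultViewport: { width: 1000, height: 1200 }
  })
  try {
    // STEP 2: create a page, set cookies (each needs a domain or url), navigate
    const page = await browser.newPage()
    if (cookies.length > 0) await page.setCookie(...cookies)
    // STEP 4 is registered early so that no response is missed
    page.on('response', response => {
      const req = response.request()
      console.log(`${req.method()} ${req.url()} -> ${response.status()}`)
    })
    await page.goto(url, { waitUntil: 'networkidle0' })
    // STEP 3: simulate input and a click
    await page.type('.el-form-item:nth-child(1) input', '132', { delay: 20 })
    await page.click('.el-form-item:nth-child(2) .el-form-item__content label:nth-child(1)')
    // STEP 5: capture the resulting state
    await page.screenshot({ path: `${screenshotName(url)}.png`, fullPage: true })
  } finally {
    // STEP 6: close the browser even if a step above throws
    await browser.close()
  }
}
```

Registering the response listener before page.goto is a deliberate choice: responses that arrive during the initial load would otherwise go unlogged.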
vivo Internet Technology