Backend Development · 15 min read

Building a Node.js Web Crawler for Indeed Job Listings with MongoDB

This article details how to build a Node.js web crawler for Indeed job listings, covering entry page selection, HTML parsing with Cheerio, request handling, MongoDB task storage, and a modular architecture that extracts city, category, search, brief, and detail data for a searchable job engine.


In this tutorial, the author walks through building a Node.js web crawler that harvests job data from the Indeed website and turns it into a simple job search engine.

The first step is to choose an entry page. Because Indeed limits the number of results on its standard list pages, the crawler uses the "Browse Jobs" page, which provides links for all regions and job categories.

The following code shows how the "Browse Jobs" page is parsed to collect city and category URLs and insert them into a MongoDB collection:

<code>// Assumes module-level requires: url (as URL), cheerio, iconv-lite (as iconv),
// md5, and a MongoDB task collection exposed at global.com.task.
start: async (page) => {
  const host = URL.parse(page.url).hostname;
  const tasks = [];
  try {
    // page.con is the raw response buffer; decode it before parsing.
    const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
    // Collect region links from the "Browse Jobs" page.
    $('#states > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
    });
    // Collect job-category links the same way.
    $('#categories > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
    });
    // ordered: false lets the insert continue past duplicate-_id errors,
    // so re-crawling the entry page does not abort the whole batch.
    const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
    res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
    return 1;
  } catch (err) {
    console.error(`${host}-start parse ${page.url} ${err}`);
    return 0;
  }
}
</code>

The crawler’s architecture stores each page to be processed as a document in MongoDB with fields such as _id, url, type, host, and done. The type field determines which parsing function will handle the page, enabling a recursive crawl that eventually reaches the brief and detail pages for each job posting.
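The type-driven dispatch can be sketched as a small lookup table. The task shape below follows the MongoDB documents the article describes; the handler bodies are placeholders for illustration, not the real parsers:

```javascript
// Sketch of the type-based dispatch described above. The task shape
// (_id, url, type, host, done) mirrors the stored MongoDB documents;
// the handler bodies are placeholders, not the actual parsing code.
const handlers = {
  city: (task) => `parse city page ${task.url}`,
  category: (task) => `parse category page ${task.url}`,
  search: (task) => `parse search results ${task.url}`,
  brief: (task) => `parse brief data ${task.url}`,
  detail: (task) => `parse job detail ${task.url}`,
};

// Pick the parsing function for a pending task; unknown types are skipped.
function dispatch(task) {
  const handler = handlers[task.type];
  return handler ? handler(task) : null;
}
```

Because each handler queues new task documents of other types, repeatedly draining tasks where done is 0 and dispatching them walks the site from entry page down to individual postings.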

Network requests are wrapped in a Promise‑based helper that sets a common User‑Agent, a 30‑second timeout, and encoding: null so the raw buffer is returned regardless of the page’s character set:

<code>// The (now-deprecated but still functional) request library is configured
// with a browser User-Agent, a 30-second timeout, and encoding: null so the
// body arrives as a raw Buffer regardless of the page's charset.
const req = require('request');
const request = req.defaults({
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
  },
  timeout: 30000,
  encoding: null
});
// Resolves with the response buffer on HTTP 200, otherwise with the status
// code (600 as a catch-all for network errors), so callers never need try/catch.
const fetch = (url) => new Promise((resolve) => {
  console.log(`down ${url} started`);
  request(encodeURI(url), (err, res, body) => {
    if (res && res.statusCode === 200) {
      console.log(`down ${url} 200`);
      resolve(body);
    } else {
      console.error(`down ${url} ${res && res.statusCode} ${err}`);
      resolve(res && res.statusCode ? res.statusCode : 600);
    }
  });
});
</code>

Several asynchronous parsing functions are defined (e.g., city, category, search, suggest, jobs, brief, detail). Each function extracts relevant links, creates new task documents, and inserts them into the appropriate collections. For example, the search function parses a job list page, extracts job keys, builds URLs for suggestion, brief, and detail data, and queues them for further processing.
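The key-to-task step of the search handler can be sketched as a pure function. Note the URL patterns here (/viewjob?jk=… and /rpc/jobdescs?jks=…) are illustrative assumptions, not confirmed Indeed endpoints:

```javascript
// Hedged sketch of how a search handler might turn scraped job keys into
// follow-up task documents. The URL patterns below are assumptions for
// illustration, not Indeed's confirmed API.
function buildJobTasks(host, jobKeys) {
  // One detail task per job key.
  const tasks = jobKeys.map((jk) => ({
    type: 'detail',
    host,
    url: `https://${host}/viewjob?jk=${jk}`,
    done: 0,
  }));
  // A single brief task can fetch summary data for the whole batch of keys.
  if (jobKeys.length) {
    tasks.push({
      type: 'brief',
      host,
      url: `https://${host}/rpc/jobdescs?jks=${jobKeys.join(',')}`,
      done: 0,
    });
  }
  return tasks;
}
```

The real handler would additionally hash each URL into _id and insert the batch with insertMany, as in the start handler shown earlier.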

After the crawler has populated MongoDB with structured job data, the final step is to index the data into Elasticsearch. A schema is created based on the collected fields, and a scheduled job pushes new job documents into the ES index. The author notes that the large content field from job details is omitted from the index to save memory.
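The indexing step can be sketched as building a bulk request body that drops the heavy content field before it reaches Elasticsearch. The index name 'jobs' is an assumption for illustration:

```javascript
// Sketch of preparing an Elasticsearch bulk body from crawled job documents,
// omitting the large `content` field to save memory, as the article describes.
// The index name 'jobs' is an assumed placeholder.
function toBulkBody(docs, index = 'jobs') {
  const body = [];
  for (const doc of docs) {
    const { content, ...fields } = doc; // drop the heavy detail content
    body.push({ index: { _index: index, _id: doc._id } });
    body.push(fields);
  }
  return body;
}
```

The resulting array alternates action and document lines, which is the shape the Elasticsearch bulk API expects; a scheduled job could pass it to the official client's bulk method to push newly crawled documents into the index.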

Overall, the article demonstrates a complete end‑to‑end pipeline: starting from a target website, crawling and parsing HTML with Cheerio, storing intermediate tasks in MongoDB, and finally indexing the cleaned data for fast search.

Backend · mongodb · nodejs · web crawler · indeed · job-scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
