DIY Technical News Acquisition: Framework, Practices, and Code Samples
This article explains why personalized tech-news gathering is valuable, proposes a DIY framework for controlling sources, collection, filtering, reading experience, and iteration, and demonstrates three concrete Node.js scraping examples (HTML pages, API data, and WeChat public accounts), plus extended thoughts on building a simple product.
Technical professionals often need a low‑cost, personalized way to obtain up‑to‑date technical information. Traditional channels such as RSS feeds, newsletters, or community platforms provide a "one‑size‑fits‑all" experience that rarely matches individual interests, leading to high acquisition costs.
The proposed solution combines tools and manual effort: select reliable sources, write custom collection and filtering scripts, tailor the reading and interaction experience, and continuously iterate based on feedback. This DIY approach gives full control and a sense of ownership over the information pipeline.
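The framework above can be sketched as a small source registry plus a dispatcher. All names and fields here are illustrative assumptions, not from the article: each source declares how it is collected and how aggressively it is filtered, and the concrete scrapers correspond to the three sections that follow.

```javascript
// Illustrative source registry (names and fields are assumptions for illustration):
// each entry pairs a source with a collection strategy and a freshness threshold.
const sources = [
  { name: 'AliTech', collect: 'html', maxAgeDays: 7 },
  { name: 'Juejin', collect: 'api', maxAgeDays: 7 },
  { name: 'WeChat', collect: 'browser', maxAgeDays: 7 },
];

// Dispatch each source to the collection strategy it declares.
function collectorFor(source) {
  const strategies = {
    html: `fetch + cheerio for ${source.name}`,
    api: `POST to JSON/GraphQL endpoint for ${source.name}`,
    browser: `puppeteer automation for ${source.name}`,
  };
  const strategy = strategies[source.collect];
  if (!strategy) throw new Error(`unknown collection type: ${source.collect}`);
  return strategy;
}
```

Centralizing sources in one data structure makes the "iterate based on feedback" step cheap: adding, removing, or re-tuning a source is a one-line change.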
1. Scraping HTML pages
Using Node.js with request-promise and cheerio, the article shows how to fetch recent articles from Alibaba Cloud's "AliTech" column, extract titles, links, and briefs, and compute the age of each article to keep only those published within the last seven days. Sample code:
const rp = require('request-promise');
const cheerio = require('cheerio');

const targetURL = 'https://example.com';
// transform parses the response body into a cheerio object for jQuery-style queries
const options = { uri: targetURL, transform: (body) => cheerio.load(body) };

async function getArticles() {
  const $ = await rp(options);
  const elements = $('.yq-new-item h3 a');
  const result = [];
  elements.each((i, el) => {
    const $el = $(el);
    const linkObj = {};
    linkObj.title = $el.text();
    linkObj.link = `https://yq.aliyun.com${$el.attr('href')}`;
    // extract brief and compute deltaDay …
    result.push(linkObj);
  });
  console.log(result);
}

getArticles();

The script also demonstrates how to follow each article link to retrieve its publication date.
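The seven-day filter mentioned above can be factored into small pure helpers, which keeps the date math testable and reusable across scrapers. The function and field names here are illustrative assumptions, not taken from the article's code:

```javascript
// Hypothetical helpers (names are illustrative) factoring out the seven-day filter.

// Age of an article in days relative to `now` (milliseconds since epoch).
function deltaDays(publishedAt, now = Date.now()) {
  return (now - new Date(publishedAt).getTime()) / (24 * 60 * 60 * 1000);
}

// Keep only articles younger than `maxDays`; assumes each item carries a
// `publishedAt` field parseable by the Date constructor.
function keepRecent(articles, maxDays = 7, now = Date.now()) {
  return articles.filter((a) => deltaDays(a.publishedAt, now) < maxDays);
}
```

Passing `now` explicitly (instead of calling `Date.now()` inside) makes the cutoff deterministic in tests and consistent across a single scraping run.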
2. Scraping data from APIs
For platforms like Juejin, the article shows how to call the public GraphQL endpoint https://web-api.juejin.im/query with different category IDs to obtain recent posts. The Node.js code builds a map of category names to IDs, sends POST requests with request-promise, filters out items older than seven days, and writes the aggregated result to result2.json:
const rp = require('request-promise');
const fs = require('fs');

// Juejin category names mapped to their IDs
// ('推荐' = Recommended, '后端' = Back end, '前端' = Front end)
const categoryIDMap = {
  '推荐': '',
  '后端': '5562b419e4b00c57d9b94ae2',
  '前端': '5562b415e4b00c57d9b94ac8',
  // … other categories
};

const allResults = [];

// generateOptions builds the POST request options for the GraphQL endpoint;
// its body is elided in the original article.
async function getArtInOneCategory(categoryID, categoryName) {
  const options = generateOptions(categoryID);
  const res = await rp(options);
  const items = res.data.articleFeed.items.edges;
  return items
    .filter((item) => {
      const deltaDay = (Date.now() - new Date(item.node.updatedAt)) / (24 * 60 * 60 * 1000);
      return deltaDay < 7;
    })
    .map((item) => ({
      title: item.node.title,
      link: item.node.originalUrl,
      likeCount: item.node.likeCount,
      category: categoryName,
      deltaDay: ((Date.now() - new Date(item.node.updatedAt)) / (24 * 60 * 60 * 1000)).toFixed(1),
    }));
}

function getAllArticles() {
  const promises = Object.entries(categoryIDMap).map(([name, id]) =>
    getArtInOneCategory(id, name).then((res) => allResults.push(...res))
  );
  Promise.all(promises).then(() => {
    fs.writeFileSync('./result2.json', JSON.stringify(allResults));
  });
}

getAllArticles();

3. Scraping WeChat public accounts
Because WeChat pages are rendered server‑side and protected by anti‑scraping measures, the article recommends using puppeteer (headless Chrome) to automate the browser, search for a public account via Sogou, filter results to the past week, and extract article information. A minimal launch example:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: false });
// perform navigation, search, filter, and data extraction here

It also warns about captcha challenges and the need to limit request frequency.
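The frequency-limiting advice can be captured in a small politeness throttle that runs scraping tasks one at a time with a fixed pause between them. This is a generic sketch, not code from the article; the names and delay value are illustrative:

```javascript
// Minimal politeness throttle (illustrative): run async tasks sequentially,
// pausing between requests to reduce the chance of triggering anti-scraping checks.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runThrottled(tasks, delayMs) {
  const results = [];
  for (const task of tasks) {
    results.push(await task()); // each task is a () => Promise
    await sleep(delayMs); // fixed gap between consecutive requests
  }
  return results;
}
```

For puppeteer specifically, the same gap would go between successive `page.goto` calls; randomizing the delay slightly is a common further refinement.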
Extended thinking
After collecting data, one can store it in a backend service, expose APIs for front‑end consumption, build a simple web or native app, and even add feedback loops to evaluate source quality. The article suggests visual design guidelines for a pleasant reading experience.
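The feedback loop mentioned above can start very simply: log which collected articles were actually read, then rank sources by read-through rate when deciding what to keep. The log shape and function below are assumptions for illustration, not from the article:

```javascript
// Hypothetical feedback-loop sketch: rank sources by how often their
// articles were actually read. readLog entries look like
// { source: 'Juejin', read: true } — this shape is an assumption.
function scoreSources(readLog) {
  const stats = {};
  for (const { source, read } of readLog) {
    const s = (stats[source] ??= { total: 0, read: 0 });
    s.total += 1;
    if (read) s.read += 1;
  }
  return Object.entries(stats)
    .map(([source, s]) => ({ source, readRate: s.read / s.total }))
    .sort((a, b) => b.readRate - a.readRate); // best sources first
}
```

Sources that consistently rank low become candidates for removal in the next iteration of the pipeline.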
Conclusion
The piece analyses common news‑acquisition methods, presents a DIY framework, provides three concrete scraping implementations with full code, and discusses how the approach can evolve into a lightweight product or system, highlighting the educational value of building such pipelines.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.