DIY Technical News Acquisition: Framework, Practices, and Code Samples
This article explains why personalized tech-news gathering is valuable, proposes a DIY framework for controlling sources, collection, filtering, reading experience, and iteration, and demonstrates three concrete Node.js scraping examples (HTML pages, API data, and WeChat public accounts), plus extended thoughts on building a simple product.
Technical professionals often need a low‑cost, personalized way to obtain up‑to‑date technical information. Traditional channels such as RSS feeds, newsletters, or community platforms provide a "one‑size‑fits‑all" experience that rarely matches individual interests, leading to high acquisition costs.
The proposed solution combines tools and manual effort: select reliable sources, write custom collection and filtering scripts, tailor the reading and interaction experience, and continuously iterate based on feedback. This DIY approach gives full control and a sense of ownership over the information pipeline.
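The framework above can be sketched as a small source registry plus a dispatcher. All names and fields here are illustrative assumptions, not from the article: each source declares how it is collected and how aggressively it is filtered, and the concrete scrapers correspond to the three sections that follow.

```javascript
// Illustrative source registry (names and fields are assumptions for illustration):
// each entry pairs a source with a collection strategy and a freshness threshold.
const sources = [
  { name: 'AliTech', collect: 'html', maxAgeDays: 7 },
  { name: 'Juejin', collect: 'api', maxAgeDays: 7 },
  { name: 'WeChat', collect: 'browser', maxAgeDays: 7 },
];

// Dispatch each source to the collection strategy it declares.
function collectorFor(source) {
  const strategies = {
    html: `fetch + cheerio for ${source.name}`,
    api: `POST to JSON/GraphQL endpoint for ${source.name}`,
    browser: `puppeteer automation for ${source.name}`,
  };
  const strategy = strategies[source.collect];
  if (!strategy) throw new Error(`unknown collection type: ${source.collect}`);
  return strategy;
}
```

Centralizing sources in one data structure makes the "iterate based on feedback" step cheap: adding, removing, or re-tuning a source is a one-line change.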
1. Scraping HTML pages
Using Node.js with request-promise and cheerio, the article shows how to fetch recent articles from Alibaba Cloud's "AliTech" column, extract titles, links, and briefs, and compute the age of each article to keep only those published within the last seven days. Sample code:
const rp = require('request-promise');
const cheerio = require('cheerio');

const targetURL = 'https://example.com';
// transform parses the response body into a cheerio object for jQuery-style queries
const options = { uri: targetURL, transform: (body) => cheerio.load(body) };

async function getArticles() {
  const $ = await rp(options);
  const elements = $('.yq-new-item h3 a');
  const result = [];
  elements.each((i, el) => {
    const $el = $(el);
    const linkObj = {};
    linkObj.title = $el.text();
    linkObj.link = `https://yq.aliyun.com${$el.attr('href')}`;
    // extract brief and compute deltaDay …
    result.push(linkObj);
  });
  console.log(result);
}

getArticles();

The script also demonstrates how to follow each article link to retrieve its publication date.
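The seven-day filter mentioned above can be factored into small pure helpers, which keeps the date math testable and reusable across scrapers. The function and field names here are illustrative assumptions, not taken from the article's code:

```javascript
// Hypothetical helpers (names are illustrative) factoring out the seven-day filter.

// Age of an article in days relative to `now` (milliseconds since epoch).
function deltaDays(publishedAt, now = Date.now()) {
  return (now - new Date(publishedAt).getTime()) / (24 * 60 * 60 * 1000);
}

// Keep only articles younger than `maxDays`; assumes each item carries a
// `publishedAt` field parseable by the Date constructor.
function keepRecent(articles, maxDays = 7, now = Date.now()) {
  return articles.filter((a) => deltaDays(a.publishedAt, now) < maxDays);
}
```

Passing `now` explicitly (instead of calling `Date.now()` inside) makes the cutoff deterministic in tests and consistent across a single scraping run.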
2. Scraping data from APIs
For platforms like Juejin, the article shows how to call the public GraphQL endpoint https://web-api.juejin.im/query with different category IDs to obtain recent posts. The Node.js code builds a map of category names to IDs, sends POST requests with request-promise, filters out items older than seven days, and writes the aggregated result to result2.json:
const rp = require('request-promise');
const fs = require('fs');

// Juejin category names mapped to their IDs
// ('推荐' = Recommended, '后端' = Back end, '前端' = Front end)
const categoryIDMap = {
  '推荐': '',
  '后端': '5562b419e4b00c57d9b94ae2',
  '前端': '5562b415e4b00c57d9b94ac8',
  // … other categories
};

const allResults = [];

// generateOptions builds the POST request options for the GraphQL endpoint;
// its body is elided in the original article.
async function getArtInOneCategory(categoryID, categoryName) {
  const options = generateOptions(categoryID);
  const res = await rp(options);
  const items = res.data.articleFeed.items.edges;
  return items
    .filter((item) => {
      const deltaDay = (Date.now() - new Date(item.node.updatedAt)) / (24 * 60 * 60 * 1000);
      return deltaDay < 7;
    })
    .map((item) => ({
      title: item.node.title,
      link: item.node.originalUrl,
      likeCount: item.node.likeCount,
      category: categoryName,
      deltaDay: ((Date.now() - new Date(item.node.updatedAt)) / (24 * 60 * 60 * 1000)).toFixed(1),
    }));
}

function getAllArticles() {
  const promises = Object.entries(categoryIDMap).map(([name, id]) =>
    getArtInOneCategory(id, name).then((res) => allResults.push(...res))
  );
  Promise.all(promises).then(() => {
    fs.writeFileSync('./result2.json', JSON.stringify(allResults));
  });
}

getAllArticles();

3. Scraping WeChat public accounts
Because WeChat pages are rendered server‑side and protected by anti‑scraping measures, the article recommends using puppeteer (headless Chrome) to automate the browser, search for a public account via Sogou, filter results to the past week, and extract article information. A minimal launch example:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: false });
// perform navigation, search, filter, and data extraction here

It also warns about captcha challenges and the need to limit request frequency.
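The frequency-limiting advice can be captured in a small politeness throttle that runs scraping tasks one at a time with a fixed pause between them. This is a generic sketch, not code from the article; the names and delay value are illustrative:

```javascript
// Minimal politeness throttle (illustrative): run async tasks sequentially,
// pausing between requests to reduce the chance of triggering anti-scraping checks.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runThrottled(tasks, delayMs) {
  const results = [];
  for (const task of tasks) {
    results.push(await task()); // each task is a () => Promise
    await sleep(delayMs); // fixed gap between consecutive requests
  }
  return results;
}
```

For puppeteer specifically, the same gap would go between successive `page.goto` calls; randomizing the delay slightly is a common further refinement.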
Extended thinking
After collecting data, one can store it in a backend service, expose APIs for front‑end consumption, build a simple web or native app, and even add feedback loops to evaluate source quality. The article suggests visual design guidelines for a pleasant reading experience.
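The feedback loop mentioned above can start very simply: log which collected articles were actually read, then rank sources by read-through rate when deciding what to keep. The log shape and function below are assumptions for illustration, not from the article:

```javascript
// Hypothetical feedback-loop sketch: rank sources by how often their
// articles were actually read. readLog entries look like
// { source: 'Juejin', read: true } — this shape is an assumption.
function scoreSources(readLog) {
  const stats = {};
  for (const { source, read } of readLog) {
    const s = (stats[source] ??= { total: 0, read: 0 });
    s.total += 1;
    if (read) s.read += 1;
  }
  return Object.entries(stats)
    .map(([source, s]) => ({ source, readRate: s.read / s.total }))
    .sort((a, b) => b.readRate - a.readRate); // best sources first
}
```

Sources that consistently rank low become candidates for removal in the next iteration of the pipeline.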
Conclusion
The piece analyses common news‑acquisition methods, presents a DIY framework, provides three concrete scraping implementations with full code, and discusses how the approach can evolve into a lightweight product or system, highlighting the educational value of building such pipelines.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.