News Page Identification Using Machine Learning: Feature Engineering, Model Selection, and Evaluation

To distinguish news pages from other web page types, this study formulates the task as a binary classification problem, extracts 19 engineered features from each page's HTML and URL, evaluates logistic regression (LR) and SVM models with 5‑fold cross‑validation, and reaches an overall precision, recall, and F1‑score of about 0.91 using LR trained with Newton's method.

UC Tech Team
Background – Web crawlers collect HTML pages from many international sites, including homepages, forums, news, lists, videos, downloads, galleries, etc. Simple heuristics based on word count and image size often misclassify novel, gallery, or video pages as news pages.

Goal – Treat news‑page identification as a binary classification problem, building a model that labels pages as news (positive) or non‑news (negative) with an F1‑score above 90%.

Metrics

Precision: proportion of correctly predicted positive samples among all predicted positives.

Recall: proportion of actual positives correctly identified.

F1‑score: harmonic mean of precision and recall. The more general F‑beta score (for example F0.5 or F2) shifts the emphasis toward precision or recall, respectively.
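The three metrics above can be computed directly from the confusion counts. The following is a minimal sketch, assuming labels are encoded as 1 (news) and 0 (non‑news); the function name `precision_recall_fbeta` is illustrative and does not come from the original project.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Return (precision, recall, F-beta) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # beta > 1 weights recall more heavily, beta < 1 weights precision more.
    b2 = beta * beta
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta
```

With `beta=1.0` the third value is the ordinary F1‑score; `beta=0.5` and `beta=2.0` give F0.5 and F2.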

Identification Algorithm Flow

Step Details

5.1 Data Selection – Randomly sample diverse HTML pages from the crawl, ensuring coverage of homepages, forums, downloads, galleries, news, lists, videos, etc. The final dataset contains about 1,000 pages.

5.2 Data Cleaning – Remove malformed HTML, strip useless tags (script, style, comments), and keep only clean, parsable content.
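As a rough illustration of the tag‑stripping part of this step, the sketch below removes comments and `<script>`/`<style>` blocks with regular expressions from the Python standard library. The function name `clean_html` is hypothetical, and the article does not say which tooling the team used. Regular expressions are fragile on badly broken markup, so detecting and discarding malformed pages would need a real HTML parser and is not covered here.

```python
import re

# Matches HTML comments, including multi-line ones.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches a whole <script>...</script> or <style>...</style> block.
_BLOCK_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)


def clean_html(html):
    """Strip comments and script/style blocks, keeping the remaining markup."""
    html = _COMMENT_RE.sub("", html)
    html = _BLOCK_RE.sub("", html)
    return html.strip()
```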

5.3 Data Labeling – Manually label pages as news (positive) or non‑news (negative). Approximately 40% of the pages are news pages, giving a positive‑to‑negative ratio of roughly 4:6.

5.4 Feature Engineering – Iteratively select effective features, ending with 19 key attributes (see table below).

| Index | Feature Name | Meaning | Remark |
| --- | --- | --- | --- |
| 1 | is_exist_author | Whether the page contains an author | Discriminative power: 16.5% contain |
| 2 | is_exist_title | Whether the page contains a title | Discriminative power: 41.7% contain |
| 3 | is_exist_publish_time | Whether the page contains a publish time | Discriminative power: 19.4% contain |
| 4 | is_include_date_URL | Whether the URL includes a date | Discriminative power: 10.3% contain |
| 5 | is_include_news_URL | Whether the URL includes the word "news" | Discriminative power: 11.2% contain |
| 6 | is_include_forum_URL | Whether the URL includes "forum" or "bbs" | Discriminative power: 12.6% contain |
| 7 | is_include_music_URL | Whether the URL includes music indicators (mp3, music, etc.) | Discriminative power: 6.1% contain |
| 8 | is_include_download_URL | Whether the URL includes "download" | Discriminative power: 5.8% contain |
| 9 | is_include_media_URL | Whether the URL includes media indicators (video, mp4, movie, etc.) | Discriminative power: 19.2% contain |
| 10 | is_include_image_set_URL | Whether the URL includes gallery/novel indicators | Discriminative power: 2.6% contain |
| 11 | is_include_torrent_URL | Whether the URL includes torrent indicators | Discriminative power: 1% contain |
| 12 | content_media_count | Number of media tags (`<img>`, `<video>`, `<audio>`) in the content | |
| 13 | page_media_count | Number of media tags on the whole page | |
| 14 | content_text_count | Number of text tags (`<p>`, `<i>`, `<b>`, `<u>`) in the content | |
| 15 | page_text_count | Number of text tags on the whole page | |
| 16 | content_link_count | Number of hyperlink tags (`<a>`) in the content | |
| 17 | page_link_count | Number of hyperlink tags on the whole page | |
| 18 | content_text_percent | Proportion of textual content in the page that belongs to the main content | |
| 19 | content_html_percent | Proportion of HTML markup belonging to the main content | |
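To make a few of these features concrete, the sketch below computes four of the URL flags and the three whole‑page tag counts using only the Python standard library. It is an illustration under stated assumptions, not the team's extractor:

- The helper names (`url_features`, `page_tag_features`, `TagCounter`) are hypothetical.
- The `DATE_IN_URL` pattern is a guess at what "includes a date" means.
- The `content_*` features are omitted because they depend on a main‑content extraction step that the article does not describe.

```python
import re
from html.parser import HTMLParser

MEDIA_TAGS = {"img", "video", "audio"}
TEXT_TAGS = {"p", "i", "b", "u"}
LINK_TAGS = {"a"}

# Assumed date shape: YYYY[sep]MM[sep]DD, e.g. 2019/03/15 or 20190315.
DATE_IN_URL = re.compile(
    r"(19|20)\d{2}[/\-_]?(0[1-9]|1[0-2])[/\-_]?(0[1-9]|[12]\d|3[01])"
)


class TagCounter(HTMLParser):
    """Counts media, text, and hyperlink start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.media = 0
        self.text = 0
        self.links = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in MEDIA_TAGS:
            self.media += 1
        elif tag in TEXT_TAGS:
            self.text += 1
        elif tag in LINK_TAGS:
            self.links += 1


def url_features(url):
    """Binary URL flags (features 4, 5, 6, and 8 in the table)."""
    u = url.lower()
    return {
        "is_include_date_URL": int(bool(DATE_IN_URL.search(u))),
        "is_include_news_URL": int("news" in u),
        "is_include_forum_URL": int("forum" in u or "bbs" in u),
        "is_include_download_URL": int("download" in u),
    }


def page_tag_features(html):
    """Whole-page tag counts (features 13, 15, and 17 in the table)."""
    counter = TagCounter()
    counter.feed(html)
    return {
        "page_media_count": counter.media,
        "page_text_count": counter.text,
        "page_link_count": counter.links,
    }
```

Plain substring checks such as `"news" in u` are deliberately crude; a production extractor would probably match on path segments to avoid accidental hits inside longer words.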

5.5 Model Selection – Binary classification algorithms considered: SVM, Logistic Regression (LR), Decision Trees, Random Forest, GBDT, Naïve Bayes, Neural Networks. The continuous features were judged to need discretization before they could be used with the tree‑based models, so the final candidates were narrowed to LR and SVM.

5.5.1 Logistic Regression (LR) – Tested with two solvers: Newton's method (with L2 regularization) and coordinate descent (with L1 regularization). Because the dataset is modest in size, these two solvers were sufficient.
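For reference, the standard textbook form of L2‑regularized logistic regression and its Newton update is written out below. The notation is assumed here, not taken from the article, which states neither its exact objective nor its regularization strength: X is the feature matrix with rows x_i, y_i ∈ {0, 1} are the labels, w is the weight vector, and λ is the regularization strength.

```latex
\begin{aligned}
p_i &= \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}}, \qquad y_i \in \{0, 1\} \\
J(w) &= -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
        + \frac{\lambda}{2} \lVert w \rVert_2^2 \\
\nabla J(w) &= X^\top (p - y) + \lambda w \\
\nabla^2 J(w) &= X^\top S X + \lambda I, \qquad S = \operatorname{diag}\big(p_i (1 - p_i)\big) \\
w &\leftarrow w - \left( \nabla^2 J(w) \right)^{-1} \nabla J(w)
\end{aligned}
```

With an L1 penalty, λ‖w‖₁ replaces the squared‑norm term. That objective is not differentiable wherever a weight is exactly zero, so the Newton step above does not apply directly, which is one reason L1 is commonly paired with coordinate‑wise solvers.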

5.5.2 Support Vector Machine (SVM) – Used an RBF kernel with L2 regularization. A linear SVM would produce a linear decision boundary much like LR's, so the RBF kernel was kept for its potential non‑linear benefits.
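For readers unfamiliar with the RBF kernel, the sketch below shows the similarity function it computes between two feature vectors. The `gamma` values in the test calls are arbitrary, since the article does not report its kernel parameters.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

Because the kernel depends only on squared Euclidean distance, features with large numeric ranges (such as tag counts) contribute far more than 0/1 flags unless the inputs are rescaled. The article does not say whether any scaling was applied.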

5.6 Cross‑Validation – 5‑fold cross‑validation was performed. Results:

LR – Newton method (L2)

| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.92 | 0.96 | 0.94 | 62% |
| News | 0.89 | 0.78 | 0.83 | 38% |
| Total | 0.91 | 0.91 | 0.91 | 100% |

LR – Coordinate descent (L1)

| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.87 | 0.96 | 0.91 | 64.5% |
| News | 0.90 | 0.75 | 0.82 | 35.5% |
| Total | 0.88 | 0.88 | 0.88 | 100% |

SVM – RBF kernel (L2)

| Label | Precision | Recall | F1‑score | Sample Ratio |
| --- | --- | --- | --- | --- |
| Non‑news | 0.78 | 1.00 | 0.87 | 63% |
| News | 1.00 | 0.16 | 0.28 | 37% |
| Total | 0.83 | 0.78 | 0.72 | 100% |
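The 5‑fold procedure from step 5.6 can be sketched as follows. This is a plain, non‑stratified split written with the standard library only. The function name `k_fold_indices` and the fixed `seed` are illustrative assumptions, and the article does not say whether its folds were stratified by label.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each page lands in the test fold exactly once. A model is fitted on every `train_idx`, scored on the matching `test_idx`, and the per‑fold metrics are then combined.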

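Conclusion – Extracting 19 effective features from HTML and applying LR with Newton's method gives the best performance of the three configurations, with an overall precision, recall, and F1‑score of 0.91 under 5‑fold cross‑validation. This meets the project's 90% target at the overall level, although the news class on its own reaches a lower F1‑score of 0.83, held back mainly by its recall of 0.78.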
