
Web Crawling and Anti‑Crawling Techniques: Principles, Implementation, and Countermeasures

This article explains the technical principles and implementation steps of web crawlers, introduces common crawling frameworks, provides a Python example for extracting app store rankings, and then details various anti‑crawling methods such as CSS offset, image camouflage, custom fonts, dynamic rendering, captchas, request signing, and honeypots, followed by counter‑strategies for each.

The article begins with an overview of the big‑data era and the importance of web crawlers for automatically acquiring web page information, highlighting both the benefits of data collection and the need to understand anti‑crawling measures.

1. Crawling Principles and Implementation

1.1 Definition of Crawlers – Crawlers are programs that automatically fetch web pages according to predefined rules. They are divided into general crawlers (e.g., search‑engine bots) and focused crawlers (e.g., ticket‑booking bots). The basic workflow includes seed URL selection, URL queue management, DNS resolution, page downloading, URL extraction, and iterative crawling.
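The workflow above can be sketched as a minimal breadth-first crawl loop. This is an illustrative sketch, not the article's own code: the `fetch` callable is injected (in practice it would be something like `requests.get(url).text`), and the link-extraction regex is deliberately simplistic.

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    """Breadth-first crawl: pop a URL, download it, extract new links, repeat."""
    queue = deque(seed_urls)   # URL queue seeded with the start pages
    seen = set(seed_urls)      # avoid downloading the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)      # injected downloader; None means the fetch failed
        if html is None:
            continue
        pages[url] = html
        # Extract outgoing absolute links and enqueue the unseen ones
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawler would add DNS caching, politeness delays, and robots.txt checks around the same loop; the queue-plus-seen-set core stays the same.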

1.2 Crawling Frameworks – Common Python frameworks such as Scrapy and Pyspider are compared; Scrapy offers powerful command‑line control, while Pyspider provides a visual interface.

1.3 Simple Crawling Example – A concrete example shows how to scrape an app‑store ranking page that lacks anti‑crawling protection. The source code used is:

import json
import re

import requests
from requests.exceptions import RequestException

# Fetch the page source
def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# Extract the target fields with a regex and yield them as dicts
def parse_one_page(html):
    # The three groups capture the app icon URL, the app name,
    # and the app category from each list item of the ranking page
    pattern = re.compile(
        '<li>.*?data-src="(.*?)"'    # icon URL
        '.*?class="det.*?>(.*?)<'    # app name
        '.*?<p>(.*?)</p>.*?</li>',   # category
        re.S)
    items = re.findall(pattern, html)
    for j, item in enumerate(items[:-1], start=1):
        yield {'index': str(j), 'name': item[1], 'class': item[2]}

# Append each result to a txt file, one JSON object per line
def write_to_file(content):
    with open(r'test.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

The script fetches the page, extracts app names and categories via regular expressions, and writes the results to a text file.

2. Anti‑Crawling Techniques

Anti‑crawling aims to limit automated access that could overload servers or leak data. Common methods include:

CSS offset – rearranging characters in the HTML and using CSS positioning to display the correct order only in the browser.

Image camouflage – replacing text with images, requiring OCR to recover the content.

Custom fonts – encoding characters with custom font files that browsers render but crawlers cannot interpret.

Dynamic rendering – generating content via JavaScript on the client side, making it invisible in the raw HTML.

CAPTCHA – presenting visual or interactive challenges to verify human users.

Request signature verification – adding cryptographic signatures to API requests.

Honeypot links – hidden elements that only bots would request, allowing detection of automated crawlers.

Each technique is illustrated with screenshots and brief explanations.

2.1 CSS Offset Example – By analyzing the CSS `left` offsets of overlapping `<b>` tags, the displayed price can be reconstructed (e.g., 467).

2.2 Image Camouflage Example – Text is replaced by clear images; OCR can be used to recover the text.

2.3 Custom Font Example – Characters are encoded as numeric entities and rendered via a custom WOFF font; extracting the font file and mapping resolves the data.
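Once the WOFF file has been dumped and inspected (e.g., with fontTools), decoding reduces to replacing each numeric character entity with the character its glyph actually draws. A minimal sketch, assuming a hypothetical glyph table recovered from one particular font file:

```python
import re

# Hypothetical glyph-code → real-character table, built by hand after
# inspecting the site's WOFF font (the codes and digits below are made up)
GLYPH_MAP = {0xE624: '1', 0xE9C7: '4', 0xEA16: '9'}

def decode_font_text(html_text):
    """Replace every &#xNNNN; entity with the character its glyph renders."""
    def repl(match):
        code = int(match.group(1), 16)
        return GLYPH_MAP.get(code, match.group(0))  # leave unknown codes as-is
    return re.sub(r'&#x([0-9a-fA-F]+);', repl, html_text)
```

Sites often rotate the font file, so the mapping must be rebuilt (or re-derived automatically from glyph outlines) whenever the font changes.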

2.4 Dynamic Rendering Example – Client‑side rendering hides data in AJAX responses; tools like Selenium or direct AJAX inspection are needed.
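When the data arrives via AJAX, it is often simpler to request the JSON endpoint directly (found in the browser's Network tab) than to drive a full browser with Selenium. A sketch with a hypothetical payload shaped like a ranking response:

```python
import json

# A hypothetical XHR payload of the kind a dynamically rendered
# ranking page might load (field names are assumptions)
ajax_body = '''
{"code": 0, "data": [
    {"name": "AppOne", "rank": 1},
    {"name": "AppTwo", "rank": 2}
]}
'''

def parse_rank_response(body):
    """Decode the AJAX JSON and pull out (rank, name) pairs."""
    payload = json.loads(body)
    return [(item['rank'], item['name']) for item in payload['data']]
```

Selenium (or Playwright) remains the fallback when the endpoint is signed or obfuscated and cannot be called directly.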

2.5 CAPTCHA Example – Slider captchas are solved by training a YOLOv5 object‑detection model. The workflow includes data collection, labeling with labelImg, converting the XML annotations to YOLO format, and training the model.

# Convert labelImg XML annotations to YOLO txt format
size = root.find('size')
img_w, img_h = int(size.find('width').text), int(size.find('height').text)
for member in root.findall('object'):
    class_id = class_text.index(member[0].text)
    xmin = int(member[4][0].text)
    ymin = int(member[4][1].text)
    xmax = int(member[4][2].text)
    ymax = int(member[4][3].text)
    # Normalize the box to YOLO's (center_x, center_y, width, height) in [0, 1]
    center_x = (xmin + xmax) / 2 / img_w
    center_y = (ymin + ymax) / 2 / img_h
    box_w = (xmax - xmin) / img_w
    box_h = (ymax - ymin) / img_h
    file_txt.write(f"{class_id} {center_x} {center_y} {box_w} {box_h}\n")
file_txt.close()

Training arguments for YOLOv5 are also listed (e.g., epochs, batch size, image size).

2.6 Request Signature Verification – Servers require a signed `analysis` parameter in AJAX calls; cracking it requires reverse‑engineering the signing algorithm in the site's JavaScript.
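A typical signing scheme (a hypothetical sketch, not the scheme any particular site uses) canonicalizes the request parameters, appends a timestamp, and signs the result with HMAC so the server can recompute and compare:

```python
import hashlib
import hmac

SECRET = b'demo-secret'  # hypothetical key, normally hidden in the page's JS

def sign_request(params, timestamp):
    """Build a signed parameter for an AJAX call."""
    # Sort keys so client and server hash the same byte sequence
    msg = '&'.join(f'{k}={params[k]}' for k in sorted(params))
    msg += f'&t={timestamp}'
    return hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()

def verify(params, timestamp, signature):
    """Server side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_request(params, timestamp), signature)
```

Because the key and the canonicalization rules live in client-side JavaScript, a crawler can always reproduce them once it has reverse-engineered that code; obfuscation only raises the cost.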

2.7 Honeypot – Hidden `<a>` elements inflate the set of URLs a crawler sees; any client that requests them reveals itself as non‑human traffic.
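A crawler can sidestep simple honeypots by skipping anchors a real browser would never display, such as those styled `display:none` or carrying the `hidden` attribute. A minimal sketch using the standard-library HTML parser (real honeypots may hide links via external CSS, which this does not catch):

```python
from html.parser import HTMLParser

class VisibleLinkParser(HTMLParser):
    """Collect hrefs, skipping anchors hidden via inline style or `hidden`."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        style = (attrs.get('style') or '').replace(' ', '').lower()
        if 'display:none' in style or 'visibility:hidden' in style or 'hidden' in attrs:
            return  # honeypot candidate: a browser would not render this link
        if 'href' in attrs:
            self.links.append(attrs['href'])
```

Rendering the page in a headless browser and collecting only clickable links is the more robust (and more expensive) variant of the same idea.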

3. Anti‑Anti‑Crawling Techniques

To bypass the above defenses, the article describes counter‑measures such as analyzing CSS offsets to reconstruct obscured numbers, extracting custom fonts to decode characters, using Selenium for dynamic pages, and employing OCR or object‑detection models for captchas. Sample code for CSS‑offset reconstruction is provided:

import re

import requests
from parsel import Selector

if __name__ == '__main__':
    url = 'http://www.porters.vip/confusion/flight.html'
    resp = requests.get(url)
    sel = Selector(resp.text)
    em = sel.css('em.rel').extract()
    for element in em[:1]:
        element = Selector(element)
        element_b = element.css('b').extract()
        b1 = Selector(element_b.pop(0))
        base_price = b1.css('i::text').extract()
        print('price before CSS offset:', base_price)
        alternate_price = []
        for eb in element_b:
            eb = Selector(eb)
            style = eb.css('b::attr("style")').get()
            position = ''.join(re.findall('left:(.*)px', style))
            value = eb.css('b::text').get()
            alternate_price.append({'position': position, 'value': value})
        print('CSS offset values:', alternate_price)
        for al in alternate_price:
            position = int(al.get('position'))
            value = al.get('value')
            # Each digit occupies 16px, so the offset maps to a digit index
            index = int(position / 16)
            base_price[index] = value
        print('price after CSS offset:', base_price)

The article concludes that crawling should respect robots.txt and legal constraints, while anti‑crawling aims to protect server stability and data confidentiality.

Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
