Web Crawling, Anti‑Crawling, and Anti‑Anti‑Crawling Techniques: Principles, Frameworks, and Code Examples
The article explains web‑crawling basics with Python and Scrapy examples, then surveys common anti‑crawling defenses such as CSS offsets, image camouflage, custom fonts, dynamic rendering, captchas, request signatures, and honeypots. It closes with anti‑anti‑crawling countermeasures, including CSS‑offset reversal, font decoding, headless‑browser rendering, and YOLOv5‑based captcha cracking, while stressing legal compliance.
In the era of big data, web crawlers are widely used to automatically retrieve web page information. This article introduces the technical principles and implementations of crawlers, as well as various anti‑crawling and anti‑anti‑crawling techniques.
1. Crawler fundamentals
This section covers the definition of a crawler, its classification (general vs. focused), and its basic architecture: seed URLs, a URL queue, DNS resolution, a downloader, and URL extraction, repeated in a loop. In a typical workflow the crawler starts from one or more seed URLs and keeps adding newly discovered URLs that meet its criteria to the queue until a stop condition is met.
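The workflow above can be sketched as a short breadth-first loop. This is a minimal illustration, not the article's implementation: the `fetch` and `is_qualified` hooks, the `max_pages` stop condition, and the in-memory `SITE` dictionary are all assumptions made so the example runs without network access, and DNS resolution is left to whatever `fetch` does.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Naive href extractor; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.I)


def crawl(seed_urls, fetch, is_qualified, max_pages=100):
    """Breadth-first crawl that returns {url: html} for every page fetched.

    fetch(url) -> str or None and is_qualified(url) -> bool are supplied by
    the caller (hypothetical hooks, not part of the article).
    """
    queue = deque(seed_urls)      # URL queue, seeded with the start URLs
    seen = set(seed_urls)         # avoids queuing the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:    # stop condition
        url = queue.popleft()
        html = fetch(url)                      # downloader step
        if html is None:
            continue
        pages[url] = html
        for href in LINK_RE.findall(html):     # URL extraction step
            link = urljoin(url, href)
            if link not in seen and is_qualified(link):
                seen.add(link)
                queue.append(link)
    return pages


# In-memory stand-in for a website (hypothetical data).
SITE = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/b">B</a> <a href="http://other.org/">x</a>',
    'http://example.com/b': '<a href="/">home</a>',
}

result = crawl(
    ['http://example.com/'],
    fetch=SITE.get,
    is_qualified=lambda u: urlparse(u).netloc == 'example.com',
)
print(sorted(result))
```

Passing the downloader in as a function keeps the loop itself independent of any HTTP library, so the same sketch could be driven by `requests` or by a test double like the one shown.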
Common frameworks: Nutch (search‑engine oriented), Pyspider, and Scrapy (Python‑based; distributed crawling is usually added through extensions such as scrapy‑redis). Pyspider offers a visual UI, while Scrapy is more powerful.
1.3 Simple crawler example (Python)
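import json
import re

import requests
from requests.exceptions import RequestException


# Fetch the page source
def get_one_page(url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


# Extract the target fields with a regular expression and yield them as dicts.
# NOTE: the HTML tags inside this pattern were lost when the article was
# published; only the fragments below survive, so adapt the pattern to the
# target page. item[1] and item[2] need at least three capture groups.
def parse_one_page(html):
    pattern = re.compile(
        r'.*?det.*?>(.*?)'
        r'.*?p.*?'
        r'(.*?)'
        r'.*?',
        re.S)
    items = re.findall(pattern, html)
    j = 1
    for item in items[:-1]:
        yield {'index': str(j),
               'name': item[1],
               'class': item[2]}
        j = j + 1


# Append each result to a text file, one JSON object per line
def write_to_file(content):
    with open(r'test.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

2. Anti‑crawling techniques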
Overview of common measures: text obfuscation, dynamic rendering, captcha verification, request signature validation, big‑data risk control, JS obfuscation, honeypot.
2.1 CSS offset – rearranges characters in HTML using CSS positioning to confuse crawlers.
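To make the idea concrete, here is a small offline sketch of how shifted elements can hide the real value. The markup, the 16px digit width, and the `visible_price` helper are hypothetical. They follow the same convention as the reversal code in section 3.1, where a negative `left` offset counts back from the right edge of the base digits.

```python
import re

# Hypothetical markup: the first <b> holds decoy digits, and each later <b>
# is shifted by `left` so that it covers one of them on screen.
HTML = (
    '<em class="rel">'
    '<b style="width:48px;left:-48px"><i>7</i><i>7</i><i>7</i></b>'
    '<b style="width:16px;left:-32px">6</b>'
    '<b style="width:16px;left:-48px">4</b>'
    '<b style="width:16px;left:-16px">5</b>'
    '</em>'
)
CHAR_WIDTH = 16  # assumed pixel width of one digit


def visible_price(html, char_width=CHAR_WIDTH):
    """Rebuild the digits a human would see after the CSS offsets apply."""
    digits = re.findall(r'<i>(\d)</i>', html)              # digits as they appear in the source
    overlays = re.findall(r'left:(-?\d+)px">(\d)</b>', html)
    for left, value in overlays:
        # A negative offset counts back from the right edge, so it maps
        # directly onto a negative list index.
        digits[int(left) // char_width] = value
    return ''.join(digits)


print(visible_price(HTML))
```

A crawler that reads only the `<i>` tags would record 777, while the browser displays the overlaid digits.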
2.2 Image camouflage – replaces text with images; a crawler can use OCR to recover the text.
2.3 Custom fonts – characters are rendered with custom fonts, making raw HTML unreadable without the font mapping.
2.4 Dynamic rendering – client‑side rendering hides data from static HTML; solutions include using developer tools, Selenium, or executing JS with execjs.
2.5 Captcha – various types (image, behavior, SMS, QR code) to block automated access.
2.6 Request signature – signed parameters added to API calls to verify legitimacy.
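One common way to build such a signature is an HMAC over the sorted request parameters. The sketch below is an assumption about how a site might do it, not a description of any particular API: the secret, the `sign` field name, and the parameter names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET_KEY = b'demo-secret'  # hypothetical secret shared by client code and server


def sign_params(params, secret=SECRET_KEY):
    """Return a copy of params with a 'sign' field (HMAC-SHA256, hex)."""
    canonical = urlencode(sorted(params.items()))   # stable parameter order
    digest = hmac.new(secret, canonical.encode('utf-8'), hashlib.sha256).hexdigest()
    return {**params, 'sign': digest}


def verify_params(signed, secret=SECRET_KEY):
    """Server side: recompute the signature and compare in constant time."""
    received = signed.get('sign', '')
    unsigned = {k: v for k, v in signed.items() if k != 'sign'}
    expected = sign_params(unsigned, secret)['sign']
    return hmac.compare_digest(received, expected)


signed = sign_params({'page': 2, 'keyword': 'phone', 'ts': 1700000000})
tampered = {**signed, 'page': 3}
print(verify_params(signed))     # True
print(verify_params(tampered))   # False
```

Including a timestamp in the signed parameters lets the server also reject replayed requests, which is why signed APIs often carry one.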
2.7 Honeypot – hidden links that only bots discover, used to differentiate bots from humans.
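From the defender's side, a honeypot can be as simple as one hidden link plus a rule that flags whoever requests it. The following is a minimal sketch under that assumption; the `HoneypotMonitor` class, the `/trap-1` path, and the client identifiers are hypothetical.

```python
class HoneypotMonitor:
    """Flags any client that requests a trap URL."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.flagged = set()

    def observe(self, client_id, path):
        """Record one request; return True if the client is flagged as a bot."""
        if path in self.trap_paths:
            self.flagged.add(client_id)
        return client_id in self.flagged


# The trap link is present in the HTML but hidden from human visitors.
PAGE = '<a href="/products">Products</a><a href="/trap-1" style="display:none">.</a>'

monitor = HoneypotMonitor({'/trap-1'})
print(monitor.observe('10.0.0.1', '/products'))   # False: ordinary navigation
print(monitor.observe('10.0.0.2', '/trap-1'))     # True: followed the hidden link
print(monitor.observe('10.0.0.2', '/products'))   # True: the client stays flagged
```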
3. Anti‑anti‑crawling (counter‑measures)
3.1 CSS offset reverse engineering – analyze CSS positions to reconstruct the true value. Example code:
import re

import requests
from parsel import Selector

if __name__ == '__main__':
    url = 'http://www.porters.vip/confusion/flight.html'
    resp = requests.get(url)
    sel = Selector(resp.text)
    em = sel.css('em.rel').extract()
    # Only the first price element is decoded here
    for element in range(0, 1):
        element = Selector(em[element])
        element_b = element.css('b').extract()
        # The first <b> holds the digits as they stand before any offset
        b1 = Selector(element_b.pop(0))
        base_price = b1.css('i::text').extract()
        print('Price before CSS offset:', base_price)

        # Each remaining <b> carries one digit and a left offset in pixels
        alternate_price = []
        for eb in element_b:
            eb = Selector(eb)
            style = eb.css('b::attr("style")').get()
            position = ''.join(re.findall(r'left:\s*(-?\d+)px', style))
            value = eb.css('b::text').get()
            alternate_price.append({'position': position, 'value': value})
        print('CSS offset values:', alternate_price)

        # Overwrite the base digit at each offset; the code assumes 16px per digit
        for al in alternate_price:
            position = int(al.get('position'))
            value = al.get('value')
            index = int(position / 16)
            base_price[index] = value
        print('Price after CSS offset:', base_price)

3.2 Custom‑font reverse – extract the font file (WOFF), map encoded characters to real glyphs, then decode the data.
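Once the mapping from encoded code points to real characters is known, the decoding step itself is a table lookup. The sketch below assumes that mapping has already been recovered from the font file. The `GLYPH_MAP` values and the sample text are hypothetical, and reading the mapping out of the WOFF file is not shown.

```python
import html

# Hypothetical mapping recovered from a font file:
# private-use code point -> the character its glyph actually draws.
GLYPH_MAP = {
    0xE001: '7',
    0xE002: '3',
    0xE003: '9',
    0xE004: '0',
    0xE005: '.',
}


def decode_obfuscated(text, glyph_map=GLYPH_MAP):
    """Replace font-encoded characters with the characters they render as."""
    text = html.unescape(text)   # turns '&#xe003;' into the raw code point
    return ''.join(glyph_map.get(ord(ch), ch) for ch in text)


raw = 'Score: &#xe003;&#xe005;&#xe001;'
print(decode_obfuscated(raw))
```

Sites that use this defense often rotate the font file, so in practice the mapping would have to be rebuilt whenever the font changes.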
3.3 Dynamic rendering reverse – use Selenium or headless browsers to obtain rendered HTML, or extract data from JS variables.
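For the JS-variable route, no browser is needed when the page embeds its data as a JSON literal in an inline script. The sketch below assumes exactly that: the variable name `window.__INITIAL_STATE__` and the sample page are hypothetical, and a value that is a JavaScript object literal rather than strict JSON would make the helper return None.

```python
import json
import re

# Hypothetical page that ships its data inside an inline script.
HTML = '''
<html><body><div id="app"></div>
<script>
  window.__INITIAL_STATE__ = {"items": [{"name": "Phone X", "price": 2999}], "total": 1};
</script>
</body></html>
'''


def extract_js_object(html, var_name):
    """Return the JSON value assigned to var_name in the page, or None."""
    match = re.search(re.escape(var_name) + r'\s*=\s*', html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (here the trailing semicolon).
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


state = extract_js_object(HTML, 'window.__INITIAL_STATE__')
print(state['items'][0]['name'], state['items'][0]['price'])
```

When the data is produced only after scripts run, this shortcut does not apply, and rendering the page with Selenium or a headless browser remains the fallback.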
3.4 Captcha cracking – example using YOLOv5 to detect the gap in a slider captcha. The workflow is data collection, manual labeling with labelImg, conversion of the labels to YOLO format, and training. Label‑conversion script (excerpt):
# Excerpt: root, class_text, image, scale and file_txt are defined earlier in the script (not shown)
for member in root.findall('object'):
    class_id = class_text.index(member[0].text)
    xmin = int(member[4][0].text)
    ymin = int(member[4][1].text)
    xmax = int(member[4][2].text)
    ymax = int(member[4][3].text)
    # round(x, 6) keeps six decimal places here; adjust as needed
    center_x = round(((xmin + xmax) / 2.0) * scale / float(image.shape[1]), 6)
    center_y = round(((ymin + ymax) / 2.0) * scale / float(image.shape[0]), 6)
    box_w = round(float(xmax - xmin) * scale / float(image.shape[1]), 6)
    box_h = round(float(ymax - ymin) * scale / float(image.shape[0]), 6)
    # One line per box: class_id center_x center_y box_w box_h
    file_txt.write(f'{class_id} {center_x} {center_y} {box_w} {box_h}\n')
file_txt.close()

Training script (argparse excerpt):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--weights', type=str, default='yolov5s.pt', help='initial weights path')
parser.add_argument('--cfg', type=str, default='./models/yolov5s.yaml', help='model.yaml path')
parser.add_argument('--data', type=str, default='data/custom.yaml', help='data.yaml path')
parser.add_argument('--hyp', type=str, default='data/hyp.scratch.yaml', help='hyperparameters path')
parser.add_argument('--epochs', type=int, default=50)
parser.add_argument('--batch-size', type=int, default=8, help='total batch size for all GPUs')
parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='[train, test] image sizes')
parser.add_argument('--rect', action='store_true', help='rectangular training')
parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training')
parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
parser.add_argument('--notest', action='store_true', help='only test final epoch')
parser.add_argument('--noautoanchor', action='store_true', help='disable autoanchor check')
parser.add_argument('--evolve', action='store_true', help='evolve hyperparameters')
parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
parser.add_argument('--cache-images', action='store_true', help='cache images for faster training')
parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
parser.add_argument('--device', default='cpu', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
parser.add_argument('--single-cls', action='store_true', help='train multi-class data as single-class')
parser.add_argument('--adam', action='store_true', help='use torch.optim.Adam() optimizer')
parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')
parser.add_argument('--workers', type=int, default=8, help='maximum number of dataloader workers')
parser.add_argument('--project', default='runs/train', help='save to project/name')
parser.add_argument('--entity', default=None, help='W&B entity')
parser.add_argument('--name', default='exp', help='save to project/name')
parser.add_argument('--exist-ok', action='store_true', help='existing project/name ok, do not increment')
parser.add_argument('--quad', action='store_true', help='quad dataloader')
parser.add_argument('--linear-lr', action='store_true', help='linear LR')
parser.add_argument('--label-smoothing', type=float, default=0.0, help='Label smoothing epsilon')
parser.add_argument('--upload_dataset', action='store_true', help='Upload dataset as W&B artifact table')
parser.add_argument('--bbox_interval', type=int, default=-1, help='Set bounding-box image logging interval for W&B')
parser.add_argument('--save_period', type=int, default=-1, help='Log model after every "save_period" epoch')
parser.add_argument('--artifact_alias', type=str, default="latest", help='version of dataset artifact to be used')
opt = parser.parse_args()

Conclusion: The article provides a concise overview of crawler technology, anti‑crawling defenses, and counter‑measures, emphasizing responsible use and compliance with robots.txt and legal regulations.
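As a small illustration of the robots.txt point, Python's standard library can evaluate the rules before a URL is fetched. The rules, the `demo-bot` user agent, and the URLs below are hypothetical; a real crawler would load the site's own file with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(robots.can_fetch('demo-bot', 'http://example.com/index.html'))   # True
print(robots.can_fetch('demo-bot', 'http://example.com/private/a'))    # False
print(robots.crawl_delay('demo-bot'))                                  # 2
```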
vivo Internet Technology