Comprehensive Guide to Python Libraries for Web Crawling, Web Development, and Asynchronous Programming
This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, asynchronous networking, browser automation, and popular web development frameworks, helping developers choose the right tools for backend projects and avoid common misconceptions when selecting a framework.
Introduction
Many people start learning Python with web crawling because abundant resources and open‑source projects are available. The crawling process can be divided into three major stages: fetching, analysis, and storage. A typical URL request involves four steps: DNS lookup, sending a request to the server, receiving the response, and browser parsing.
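The three stages can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the function names (fetch, analyze, store), the TitleParser class, and the pages table are all made up for this example, and the analysis step is reduced to extracting the page title.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Stage 1 (fetching): download the page body as text. Performs a real HTTP request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def analyze(html):
    """Stage 2 (analysis): extract the data of interest, here only the title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def store(conn, url, title):
    """Stage 3 (storage): persist the result in a hypothetical 'pages' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
    conn.commit()
```

A caller would chain the stages, for example store(conn, url, analyze(fetch(url))) with conn = sqlite3.connect("crawl.db"). Keeping the stages as separate functions means the parsing and storage logic can be exercised on saved HTML without touching the network.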
General Network Libraries
urllib, requests, grab, pycurl, urllib3, httplib2, RoboBrowser, MechanicalSoup, mechanize, socket, Unirest, hyper, PySocks, and others provide various levels of HTTP handling and low‑level socket access.
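As one point of reference among these libraries, the sketch below uses urllib from the standard library to build a GET request with query parameters and a custom User-Agent header. The helper names build_request and send, the default user-agent string, and the example URL are assumptions made for illustration; the higher-level libraries in the list wrap similar steps behind their own APIs.

```python
import urllib.parse
import urllib.request


def build_request(base_url, params, user_agent="example-crawler/0.1"):
    """Encode query parameters and attach a User-Agent header (no network access)."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url}?{query}" if query else base_url
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def send(request, timeout=10):
    """Send the request and return (status code, raw body). Performs real network I/O."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Separating request construction from sending makes it possible to inspect the final URL and headers before any traffic leaves the machine.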
Crawling Frameworks
Full‑featured frameworks such as grab, scrapy, pyspider, and cola, as well as auxiliary tools like portia, restkit, and demiurge, simplify large‑scale crawling tasks.
HTML/XML Parsers
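Libraries including lxml, cssselect, pyquery, BeautifulSoup, html5lib, feedparser, xmltodict, and untangle enable robust parsing and querying of HTML/XML content. Two related utilities round out the list: MarkupSafe escapes strings for safe inclusion in markup, and xhtml2pdf converts HTML to PDF.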
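To give a sense of what such parsing looks like, the sketch below turns a small feed-like XML snippet into a list of dictionaries. It uses xml.etree.ElementTree from the standard library as a stand-in, not any of the libraries named above, so that it runs without extra installs. The SAMPLE_FEED data and the parse_items helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped loosely like an RSS feed.
SAMPLE_FEED = """<rss><channel>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, holding its <title> and <link> text."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]
```

ElementTree expects well-formed XML; for the malformed HTML commonly found on real pages, a lenient parser such as those listed above is usually the better fit.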
Text Processing & NLP
The article lists tools for plain‑text handling (difflib, Levenshtein, fuzzywuzzy, esmre, ftfy) and for natural‑language processing (NLTK, Pattern, TextBlob, jieba, SnowNLP, loso).
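A common use of these text tools in a crawler is spotting near-duplicate records. The sketch below does this with difflib from the standard library. The helper names similarity and dedupe_titles and the 0.9 threshold are arbitrary choices for illustration, not recommendations from the article.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe_titles(titles, threshold=0.9):
    """Keep a title only if it is not too similar to any title kept so far."""
    kept = []
    for title in titles:
        if all(similarity(title, seen) < threshold for seen in kept):
            kept.append(title)
    return kept
```

Each new title is compared against every kept title, so the cost grows quadratically with the number of distinct titles. That is acceptable for small batches; larger collections would need a different approach.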
Browser Automation
Selenium, Ghost.py, Spynner, and Splinter allow automated interaction with real browsers.
Multiprocessing & Asynchronous Programming
The standard‑library modules threading, multiprocessing, concurrent.futures, and asyncio, together with celery, Twisted, Tornado, pulsar, diesel, gevent, eventlet, and Tomorrow, provide various concurrency models.
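As an illustration of the asynchronous model, the sketch below runs several simulated downloads concurrently with asyncio while capping how many are in flight at once. The fake_fetch coroutine sleeps instead of performing real I/O, and the names crawl, run, and max_concurrency are assumptions made for this example.

```python
import asyncio


async def fake_fetch(url, delay=0.01):
    """Stand-in for a network call: sleeps briefly instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls, max_concurrency=3):
    """Fetch all URLs concurrently, with at most max_concurrency running at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run(urls):
    """Synchronous entry point that starts an event loop for the crawl."""
    return asyncio.run(crawl(urls))
```

The semaphore is the piece that keeps a crawler polite: without it, gather would start every request at once.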
Queues & Cloud Computing
Queue solutions such as celery, huey, mrq, RQ, simpleq, python‑gearman, and cloud execution services like picloud and dominoup.com are covered.
Web Content Extraction & WebSocket
Libraries for extracting web content (newspaper, html2text, python‑goose, lassie) and WebSocket communication (Crossbar, AutobahnPython, WebSocket‑for‑Python) are included.
DNS Resolution & Computer Vision
dnsyo and pycares handle DNS queries, while OpenCV, SimpleCV, and mahotas serve computer‑vision needs.
Popular Web Frameworks
Django, Flask, Web2py, Tornado, and CherryPy are described with brief feature overviews and usage notes.
Framework Selection Pitfalls
The article warns against searching for the "best" framework or over‑optimizing performance for small sites, emphasizing that the right tool depends on team expertise and project requirements.