Introduction and Usage Guide for PyMuPDF (Python Bindings for MuPDF)
This article provides a comprehensive overview of PyMuPDF, the Python binding for the lightweight MuPDF library, covering its installation, core features such as page rendering, text and image extraction, PDF manipulation, and detailed code examples for common document‑processing tasks.
PyMuPDF is the official Python binding for MuPDF, a lightweight PDF, XPS and e‑book viewer library, exposing MuPDF’s rendering engine through a simple Python interface.
Key capabilities include decrypting files, accessing metadata, rendering pages as raster images (PNG) or vector graphics (SVG), searching text, extracting text and images, converting documents to HTML, SVG, PDF, CBZ and other formats, and extensive PDF‑specific functions such as creating, merging, splitting, inserting, deleting, rotating pages, handling annotations, form fields, encryption, watermarks and incremental saves.
Installation is straightforward via pip install PyMuPDF. At the time of writing, wheels are available for Windows, Linux and macOS for Python 3.6 to 3.9 (64-bit, plus 32-bit on Windows). Optional dependencies such as Pillow, fontTools and pymupdf-fonts enable additional image-saving and font-subsetting features.
Basic usage starts with import fitz, then opening a document with doc = fitz.open(filename). The Document object exposes attributes such as page_count and metadata, and methods such as get_toc() for the table of contents and load_page(pno) (or the shorthand doc[pno]) to obtain a Page object.
Typical page operations: load a page with <code>page = doc.load_page(pno) # or doc[pno]</code>, retrieve its links with links = page.get_links(), iterate over its annotations via for annot in page.annots():, and render the page to a raster image with pix = page.get_pixmap() followed by pix.save("page-%i.png" % page.number). Text extraction supports several output formats via page.get_text(opt), where opt can be "text", "blocks", "words", "html", "json", "xml", and others.
Searching for a string returns a list of rectangles: <code>areas = page.search_for("mupdf")</code> which can be used for highlighting or cross‑referencing.
PDF-specific manipulation includes Document.delete_page(), Document.insert_page(), Document.move_page(), Document.select() (keep only chosen pages), and Document.insert_pdf() (copy pages from another PDF). To join two PDFs:
<code># append doc2 to the end of doc1
doc1.insert_pdf(doc2)</code>
To build a new document from selected page ranges of an existing one:
<code>doc2 = fitz.open()  # new empty PDF
doc2.insert_pdf(doc1, to_page=9)  # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1)-10)  # last 10 pages
doc2.save("first-and-last-10.pdf")</code>
Saving is done with Document.save(). Passing incremental=True appends only the changes, which is fast but works only when saving back to the file the document was opened from. Close the document with Document.close() when finished.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.