PyMuPDF (Python bindings for MuPDF) – Introduction, Features, Installation and Usage Guide
This article provides a comprehensive overview of PyMuPDF, the Python binding for the lightweight MuPDF library, covering its purpose, supported document formats, key features such as rendering, text extraction and PDF manipulation, installation methods, and detailed code examples for common operations.
1. Introduction to PyMuPDF
PyMuPDF is the Python interface to MuPDF, a lightweight PDF, XPS, and e‑book viewer library. MuPDF offers high‑quality anti‑aliased rendering and precise text layout, and supports formats such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. The Python binding (version 1.18.17 at the time of writing) exposes MuPDF's functionality to Python.
2. Core Features
Decrypt files
Access metadata, links and bookmarks
Render pages as raster images (PNG, etc.) or vector SVG
Search text
Extract text and images
Convert documents to PDF, (X)HTML, XML, JSON, plain text and more; for PDFs, create, merge or split pages, insert/delete/rearrange pages, and modify annotations and form fields
Extract or insert images and fonts
Full support for embedded files
Reformat PDFs for duplex printing, color separation, watermarks, etc.
Comprehensive password protection handling
Command‑line utility (python -m fitz …) with encryption, decryption, optimization, sub‑document creation, document concatenation, and more
3. Installation
Install PyMuPDF via pip install PyMuPDF from PyPI wheels for Windows, Linux and macOS (Python 3.6‑3.9, 64‑bit; 32‑bit wheels are also available for Windows). Optional dependencies such as Pillow, fontTools and pymupdf‑fonts enhance functionality.
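After installing, it can be useful to confirm which of these packages are actually importable. The following is a minimal sketch using only the standard library; the import names in the mapping (fitz, PIL, fontTools, pymupdf_fonts) are assumptions about how each distribution is imported, and check_install is a hypothetical helper, not part of PyMuPDF.

```python
from importlib.util import find_spec

# Assumed import names: PyMuPDF imports as "fitz", Pillow as "PIL",
# fontTools as "fontTools", and pymupdf-fonts as "pymupdf_fonts".
PACKAGES = {
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
    "fontTools": "fontTools",
    "pymupdf-fonts": "pymupdf_fonts",
}


def check_install(packages=PACKAGES):
    """Map each distribution name to True/False without importing it."""
    return {name: find_spec(module) is not None
            for name, module in packages.items()}


if __name__ == "__main__":
    for name, present in check_install().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Because find_spec only looks the module up, the check stays cheap and has no side effects even when a package is missing.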
4. Basic Usage
Import the library:
    import fitz
Check the version:
    print(fitz.__doc__)
Open a document (from file or memory):
    doc = fitz.open('example.pdf')  # or: doc = fitz.open(stream=data, filetype='pdf')
5. Document Methods and Properties
Method/Property | Description
Document.page_count | Number of pages (int)
Document.metadata | Metadata dictionary
Document.get_toc() | Retrieve table of contents (list)
Document.load_page() | Load a specific page
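The members listed above can be combined into a small summary routine. This is a sketch under a few assumptions: summarize and describe_file are hypothetical helper names, the file path is whatever you pass in, and each table-of-contents entry is assumed to be a list whose second item is the title.

```python
def summarize(doc):
    """Collect basic facts from an already opened document object."""
    toc = doc.get_toc()  # assumed shape: [level, title, page, ...] per entry
    return {
        "pages": doc.page_count,
        "title": (doc.metadata or {}).get("title"),
        "toc_titles": [entry[1] for entry in toc],
    }


def describe_file(path):
    """Open a file with PyMuPDF, summarize it, and peek at the first page."""
    import fitz  # PyMuPDF, imported here so summarize() works without it

    doc = fitz.open(path)
    try:
        info = summarize(doc)
        first_page = doc.load_page(0)  # page numbers are 0-based
        info["first_page_number"] = first_page.number
        return info
    finally:
        doc.close()
```

Calling describe_file('example.pdf') would return a plain dictionary, which keeps the PyMuPDF objects from outliving the open document.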
6. Page Handling
Iterate over pages, load a page, and access links, annotations or widgets:
    for page in doc:
        # process each page
        links = page.get_links()
        for link in links:
            # handle link
            pass
Render a page to a raster image:
    pix = page.get_pixmap()
    pix.save('page-%i.png' % page.number)
Render a page to SVG:
    svg = page.get_svg_image()
Extract text in various formats ("text", "blocks", "words", "html", "dict", "json", "rawdict", "rawjson", "xhtml", "xml"):
    text = page.get_text('text')
Search for a string on a page:
    areas = page.search_for('mupdf')
7. PDF Operations
Modify PDFs (create, merge, split, reorder, delete pages) using methods such as Document.delete_page(), Document.copy_page(), Document.move_page(), Document.insert_page(), and Document.new_page(). Save changes with Document.save(); passing incremental=True appends only the changes to the existing file, which is faster than a full rewrite but only works when saving back to the file the document was opened from.
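One detail worth illustrating is that deleting a page shifts the numbers of all later pages. The sketch below removes several pages by working from the highest number down, so earlier deletions never invalidate later ones. drop_pages and trim_pdf are hypothetical helpers, and the file names are placeholders; only Document.delete_page(), Document.page_count and Document.save() from the text above are used.

```python
def drop_pages(doc, page_numbers):
    """Delete the given 0-based pages and return the new page count.

    Pages are removed highest first so the remaining numbers stay valid.
    Duplicates in page_numbers are ignored.
    """
    for pno in sorted(set(page_numbers), reverse=True):
        doc.delete_page(pno)
    return doc.page_count


def trim_pdf(src, dst, page_numbers):
    """Open src, drop the listed pages, and write the result to dst."""
    import fitz  # PyMuPDF, imported here so drop_pages() works without it

    doc = fitz.open(src)
    try:
        remaining = drop_pages(doc, page_numbers)
        doc.save(dst)  # full save to a new file, so no incremental=True here
        return remaining
    finally:
        doc.close()
```

For example, trim_pdf('example.pdf', 'trimmed.pdf', [0, 3]) would write a copy without the first and fourth pages.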
Combine PDFs:
    doc1.insert_pdf(doc2)  # append doc2 to doc1
Split a PDF (first 10 pages and last 10 pages example):
    doc2 = fitz.open()  # new, empty PDF
    doc2.insert_pdf(doc1, to_page=9)  # first 10 pages
    doc2.insert_pdf(doc1, from_page=len(doc1)-10)  # last 10 pages
    doc2.save('first-and-last-10.pdf')
Close a document when finished:
    doc.close()
Sohu Tech Products