
Introduction, Installation, and Usage of PyMuPDF (Python Bindings for MuPDF)

This article provides a comprehensive overview of PyMuPDF, covering its purpose as Python bindings for the lightweight MuPDF viewer, detailed installation instructions, essential dependencies, naming conventions, and extensive usage examples for opening documents, accessing pages, extracting text and images, manipulating PDFs, and saving changes.

Python Programming Learning Circle

Feb 18, 2024

PyMuPDF is the Python interface to the MuPDF library, a lightweight PDF, XPS, and e‑book viewer that supports many document formats and offers high‑quality rendering, annotation, and conversion capabilities.

Key features include decryption, metadata access, page rendering to raster (PNG) or vector (SVG), text search, extraction in various formats (text, html, json, xml), and full PDF manipulation such as creating, merging, splitting, inserting, deleting, and re‑ordering pages.

Installation: install from PyPI with pip install PyMuPDF. Prebuilt wheels are available for Windows, Linux, and macOS, including manylinux aarch64. The package has no mandatory external dependencies, though optional packages such as Pillow, fontTools, and pymupdf-fonts add functionality.
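
After running pip, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of PyMuPDF.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "PyMuPDF") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"PyMuPDF {version}" if version else "PyMuPDF is not installed")
```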

Import and naming: the library is imported with import fitz. The name comes from Fitz, the rendering engine at the core of MuPDF. Recent PyMuPDF releases also accept import pymupdf, with fitz kept as a legacy name.
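
Because the module name depends on the installed release, a script that must run against both old and new versions can try each name in turn. This is a sketch under that assumption; load_pymupdf is a hypothetical helper, and it returns None rather than raising when neither name can be imported.

```python
import importlib


def load_pymupdf():
    """Return the PyMuPDF module under whichever name is importable, else None."""
    for name in ("pymupdf", "fitz"):  # newer name first, legacy name second
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


fitz = load_pymupdf()
if fitz is not None:
    print(fitz.__doc__)  # version banner of the installed bindings
```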

Basic usage:

import fitz
print(fitz.__doc__)  # shows version information

doc = fitz.open('sample.pdf')  # open a document
page = doc.load_page(0)      # load first page (or doc[0])

# Get page count and metadata
print(doc.page_count)
print(doc.metadata)

# Render page to PNG
pix = page.get_pixmap()
pix.save('page-0.png')

# Extract text in different formats
text = page.get_text('text')
html = page.get_text('html')
json_data = page.get_text('json')

# Search for a string
areas = page.search_for('mupdf')

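# Modify PDF: insert and delete pages
doc.insert_page(-1, width=595, height=842)  # add blank A4-sized page at end
doc.delete_page(2)                         # delete third page (zero-based index 2)

# Full save to a new file; incremental=True is only valid when
# saving back to the file the document was opened from.
doc.save('output.pdf')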

doc.close()
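
The rendering step above uses the default resolution. A common variation is to render every page at a chosen DPI by passing a scaling matrix to get_pixmap(). The sketch below assumes that approach; zoom_for_dpi and render_pages are hypothetical helper names, and the output file naming is only an illustration.

```python
def zoom_for_dpi(dpi: float) -> float:
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


def render_pages(pdf_path: str, out_prefix: str, dpi: float = 144) -> int:
    """Render each page to '<out_prefix>-<n>.png' and return the page count."""
    import fitz  # imported lazily so zoom_for_dpi works without PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same scale on both axes
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            pix.save(f"{out_prefix}-{page.number}.png")
        return doc.page_count
```

For example, render_pages('sample.pdf', 'page', dpi=144) would write page-0.png, page-1.png, and so on at twice the default scale.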

Additional utilities include doc.get_toc() for table‑of‑contents extraction, page.get_links() for hyperlink retrieval, and page.annots() or page.widgets() for annotation and form field handling.
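
As an illustration of these utilities, the sketch below prints an indented outline and the external link targets of a document. It assumes the default get_toc() result, a list of [level, title, page] entries with 1-based page numbers; format_toc and print_outline_and_links are hypothetical helper names.

```python
def format_toc(toc):
    """Turn [level, title, page, ...] entries into indented display lines."""
    lines = []
    for entry in toc:
        level, title, page = entry[:3]  # ignore any extra detail fields
        indent = "  " * (level - 1)
        lines.append(f"{indent}{title} (p. {page})")
    return lines


def print_outline_and_links(pdf_path):
    """Print the table of contents, then every external link URI per page."""
    import fitz  # imported lazily so format_toc works without PyMuPDF

    with fitz.open(pdf_path) as doc:
        for line in format_toc(doc.get_toc()):
            print(line)
        for page in doc:
            for link in page.get_links():
                uri = link.get("uri")  # only present for external links
                if uri:
                    print(f"page {page.number + 1}: {uri}")
```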

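PDF‑specific operations such as Document.insert_pdf() merge documents, while Document.select() and page slicing create new PDFs from selected pages. Saving can be incremental (incremental=True), which appends changes to the original file instead of rewriting it; this mode only works when saving back to the file the document was opened from.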
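
The two operations can be sketched as small helpers: one keeps only selected pages, the other concatenates several files. The page specification format ('1-3,5', 1-based) and the names parse_page_spec, extract_pages, and merge_pdfs are assumptions made for this example, not part of PyMuPDF.

```python
def parse_page_spec(spec: str, page_count: int) -> list:
    """Turn a 1-based spec such as '1-3,5' into zero-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start - 1, end))
    if not pages:
        raise ValueError("page specification selects no pages")
    return pages


def extract_pages(src_path: str, dst_path: str, spec: str) -> None:
    """Write a new PDF containing only the pages named in spec."""
    import fitz  # imported lazily so parse_page_spec works without PyMuPDF

    with fitz.open(src_path) as doc:
        doc.select(parse_page_spec(spec, doc.page_count))
        doc.save(dst_path)  # full save, since the target is a new file


def merge_pdfs(paths, dst_path: str) -> None:
    """Append every input PDF, in order, to a new document."""
    import fitz

    with fitz.open() as merged:  # no argument creates an empty PDF
        for path in paths:
            with fitz.open(path) as src:
                merged.insert_pdf(src)
        merged.save(dst_path)
```

Both helpers write to a new file with a full save, which sidesteps the restriction on incremental saves.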


