Introduction to PyMuPDF: Features, Installation, and Usage Guide

This article provides a comprehensive overview of PyMuPDF, the Python binding for MuPDF, covering its core features, supported document formats, installation methods, and detailed code examples for opening, rendering, extracting, and manipulating PDF and other documents.

Python Programming Learning Circle

PyMuPDF Overview

PyMuPDF is the Python binding for the MuPDF library, offering a lightweight, high‑performance engine for viewing and manipulating PDF, XPS, OpenXPS, CBZ, EPUB, FictionBook 2 and several image formats.

Key Features

Decrypt files and access metadata, links, and bookmarks.

Render pages as raster images (PNG) or vector graphics (SVG).

Search for text, extract text in various formats (plain, HTML, XML, JSON, etc.) and extract images.

Convert documents to other formats such as HTML, SVG, PDF, XML, JSON, and plain text.

Full support for embedded files, password protection, annotations, and form fields.

Command‑line utilities for encryption/decryption, optimization, sub‑document creation, document concatenation, and layout‑preserving text extraction.

Installation

PyMuPDF can be installed from source or from pre-built wheels on PyPI. Wheels are available for Windows, Linux, and macOS; at the time of writing they covered 64-bit Python 3.6 to 3.9. Optional packages such as Pillow, fontTools, and pymupdf-fonts add extended functionality.

<code>pip install PyMuPDF</code>

Basic Usage

Import the library and open a document. The import name is fitz for historical reasons; recent releases can also be imported as pymupdf.

<code>import fitz
doc = fitz.open("sample.pdf")  # creates a Document object</code>

You can iterate over pages, load a specific page, or use the document as a context manager.

<code>for page in doc:
    # process each page
    pass

page = doc.load_page(0)  # or doc[0]
</code>

Page Operations

Render a page to a pixmap with pix = page.get_pixmap() and save it as PNG with pix.save("page-%i.png" % page.number).

Render to SVG: svg = page.get_svg_image().

Extract text: text = page.get_text("text"), or use other options such as "html", "xml", "json", "blocks", etc.

Search for a string: areas = page.search_for("mupdf") returns a list of rectangles.

Access links, annotations, and form fields via page.get_links(), page.annots(), and page.widgets().

PDF‑Specific Operations

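Only PDF documents can be modified (e.g., by inserting, deleting, moving, or rotating pages). Use methods such as Document.delete_page() and Document.insert_page(), then write the result with Document.save(). Passing incremental=True appends only the changes, which is fast but requires saving to the same file the document was opened from. Changes are not written until Document.save() is called; Document.close() only releases the document.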

Document Concatenation and Splitting

Combine PDFs with Document.insert_pdf() or extract subsets by creating a new empty document and inserting selected pages.

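<code># doc1 and doc2 are open PDF Document objects

# Append all pages of doc2 to the end of doc1
doc1.insert_pdf(doc2)

# Create a new PDF with the first 10 and last 10 pages of doc1
# (page numbers are 0-based; assumes doc1 has at least 20 pages)
new_doc = fitz.open()  # no argument: new, empty PDF
new_doc.insert_pdf(doc1, to_page=9)
new_doc.insert_pdf(doc1, from_page=len(doc1)-10)
new_doc.save("first-and-last-10.pdf")
</code>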

The library provides a rich API for low‑level PDF structure manipulation, metadata access, and conversion to other formats, making it suitable for a wide range of document‑processing tasks.
