
Introduction and Usage Guide for PyMuPDF (Python Bindings for MuPDF)

This article provides a comprehensive overview of PyMuPDF, the Python binding for the lightweight MuPDF library, covering its installation, core features such as page rendering, text and image extraction, PDF manipulation, and detailed code examples for common document‑processing tasks.

Python Programming Learning Circle

May 9, 2022

PyMuPDF is the official Python binding for MuPDF, a lightweight PDF, XPS and e‑book viewer library, exposing MuPDF’s rendering engine through a simple Python interface.

Key capabilities include decrypting files, accessing metadata, rendering pages as raster images (PNG) or vector graphics (SVG), searching text, extracting text and images, and converting documents to HTML, SVG, PDF, CBZ and other formats. For PDFs specifically, it also supports creating, merging and splitting documents; inserting, deleting and rotating pages; and working with annotations, form fields, encryption, watermarks and incremental saves.

Installation is straightforward: run pip install PyMuPDF. At the time of writing, prebuilt wheels are available for Windows, Linux and macOS and support Python 3.6 to 3.9 (64‑bit on all platforms, plus 32‑bit on Windows). Optional packages such as Pillow, fontTools and pymupdf‑fonts enable additional image‑saving and font‑subsetting features.

Basic usage starts with import fitz, then opening a document with doc = fitz.open(filename). The Document object exposes properties such as page_count and metadata, and methods such as get_toc() for the table of contents and load_page(pno) (or the shorthand doc[pno]) to obtain a Page object.

Page handling starts with page = doc.load_page(pno) (or doc[pno]). From a Page you can retrieve links with links = page.get_links(), iterate over annotations with for annot in page.annots():, and render the page to a raster image with pix = page.get_pixmap() followed by pix.save("page-%i.png" % page.number). Text extraction supports multiple output formats via page.get_text(opt), where opt can be "text", "blocks", "words", "html", "json", "xml", etc.

Searching for a string returns a list of rectangles: areas = page.search_for("mupdf"). The rectangles mark where the string occurs on the page and can be used for highlighting or cross‑referencing.

PDF‑specific manipulation includes methods such as Document.delete_page(), Document.insert_page(), Document.move_page(), Document.select() to keep only chosen pages, and Document.insert_pdf() to concatenate PDFs. Example to join two PDFs:

# append doc2 to the end of doc1
doc1.insert_pdf(doc2)

and to split a document:

doc2 = fitz.open()
doc2.insert_pdf(doc1, to_page=9)          # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1)-10)  # last 10 pages
doc2.save("first-and-last-10.pdf")

Saving is performed with Document.save(). Passing incremental=True writes only the changes back to the original file, which is faster for small updates. Close the document with Document.close() when finished.
