Python Script for Extracting Text from PDF Files Using PyPDF2
This article introduces a Python utility built with PyPDF2 that extracts text from PDF files, saves it as a TXT file, and provides an interactive command‑line interface with error handling, usage instructions, and code examples for easy document processing.
In the digital age, PDF files are ubiquitous, and extracting their text programmatically can save time compared to manual copying. This guide presents a Python tool based on the PyPDF2 library that reads PDF files, extracts all page text, and writes the output to a similarly named TXT file.
Background and Requirements
Common scenarios for converting PDFs to plain text include importing e‑book content into note‑taking apps, extracting report data for analysis, and preparing data for natural language processing tasks.
Feature Overview
Text Extraction: Reads each page of a PDF and extracts its text.
File Handling: Saves the extracted text to a TXT file using UTF‑8 encoding.
Error Management: Handles missing files, non‑PDF formats, and other exceptions with clear messages.
Interactive Interface: Prompts the user for a file path and allows repeated processing or graceful exit.
Technical Implementation
Dependencies
os: For file path operations.
PyPDF2: For reading PDF files and extracting text.
Installation
Install PyPDF2 via:
<code>pip install PyPDF2</code>
Core Function: pdf_to_txt(pdf_path)
Function: Extracts text from the specified PDF and saves it as a TXT file.
Logic: Verify the file exists and has a .pdf extension. Open the PDF with PdfReader and determine the number of pages. Iterate over each page, calling extract_text() and concatenating results. Write the combined text to a TXT file with the same base name. Return a boolean indicating success.
Error Handling: FileNotFoundError for missing files. ValueError for non‑PDF inputs. General Exception for other issues.
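The validation, naming, and concatenation steps described above can be sketched on their own, using only the standard library. The helper names check_pdf_path, txt_path_for, and join_pages are hypothetical and exist only for illustration; the full script performs the same steps inline inside pdf_to_txt.

```python
import os


def check_pdf_path(pdf_path):
    """Raise the exception types listed above for unusable input."""
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    if not pdf_path.lower().endswith(".pdf"):
        raise ValueError(f"Not a .pdf file: {pdf_path}")


def txt_path_for(pdf_path):
    """Swap the extension for .txt, keeping the directory and base name."""
    base, _ext = os.path.splitext(pdf_path)
    return f"{base}.txt"


def join_pages(page_texts):
    """Concatenate per-page text, adding one newline after each page.

    None entries are treated as empty strings (a defensive assumption,
    in case a page yields no text).
    """
    return "".join((text or "") + "\n" for text in page_texts)
```

Keeping these steps separate from the PDF reading makes them easy to check without a real PDF on hand.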
Entry Point: main()
Displays a welcome message and prompts the user for a PDF path (or 'q' to quit).
Calls pdf_to_txt and, on success, asks whether to process another file.
Handles user choices to continue or exit.
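The control flow of the prompt loop can be sketched in isolation. This is an illustrative variant, not the script's own main(): the function name run_prompt_loop and the parameters convert, input_fn, and output_fn are assumptions introduced here so the loop can be exercised without a keyboard or a real PDF.

```python
def run_prompt_loop(convert, input_fn=input, output_fn=print):
    """Ask for paths until the user quits; return how many files succeeded."""
    converted = 0
    while True:
        path = input_fn("\nPDF file path (or 'q' to quit): ").strip()
        if path.lower() == "q":
            break
        if not convert(path):
            # Failed conversion: report it and prompt for a path again.
            output_fn("Please check the file path and try again.")
            continue
        converted += 1
        # Keep asking until the answer is exactly y or n.
        choice = ""
        while choice not in ("y", "n"):
            choice = input_fn("Process another file? (y/n): ").strip().lower()
        if choice == "n":
            break
    output_fn("Exiting.")
    return converted
```

Passing pdf_to_txt as convert and leaving the other arguments at their defaults would give roughly the interactive behavior described above.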
Usage Instructions
Ensure Python and PyPDF2 are installed.
Save the script as pdf_to_txt.py.
Run it from the terminal with python pdf_to_txt.py and follow the prompts.
Important Notes
The extract_text() method can only recover text that is stored in the PDF as text; scanned, image‑only PDFs must first go through an OCR tool such as Tesseract.
UTF‑8 encoding is used to support multilingual content.
Existing TXT files with the same name will be overwritten.
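Two of the notes above lend themselves to small guards. The sketch below is not part of the original script; probably_needs_ocr and non_clobbering_path are hypothetical helpers. The first is only a heuristic: it flags a document when no page yielded any visible text. The second picks an unused file name instead of overwriting an existing one.

```python
import os


def probably_needs_ocr(page_texts):
    """Heuristic: True when no page produced any non-whitespace text."""
    return not any((text or "").strip() for text in page_texts)


def non_clobbering_path(txt_path):
    """Return txt_path, or a numbered variant if that file already exists."""
    if not os.path.exists(txt_path):
        return txt_path
    base, ext = os.path.splitext(txt_path)
    counter = 1
    # Try report_1.txt, report_2.txt, ... until a free name is found.
    while os.path.exists(f"{base}_{counter}{ext}"):
        counter += 1
    return f"{base}_{counter}{ext}"
```

A caller could collect each page's extract_text() result in a list, warn when probably_needs_ocr returns True, and write to non_clobbering_path(txt_path) rather than txt_path.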
Full Code
<code>import PyPDF2
import os


def pdf_to_txt(pdf_path):
    """Extract the text of every page in pdf_path into a same-named .txt file."""
    try:
        # Check file existence
        if not os.path.exists(pdf_path):
            raise FileNotFoundError("The specified PDF file was not found")
        # Check file extension
        if not pdf_path.lower().endswith('.pdf'):
            raise ValueError("The file must be in PDF format")
        file_name = os.path.splitext(pdf_path)[0]
        txt_path = f"{file_name}.txt"
        # Open PDF
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            num_pages = len(pdf_reader.pages)
            text = ""
            for page_num in range(num_pages):
                page = pdf_reader.pages[page_num]
                # Fall back to an empty string if a page yields no text
                text += (page.extract_text() or "") + "\n"
        # Write to TXT
        with open(txt_path, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)
        print(f"\nSuccessfully extracted {num_pages} page(s)!")
        print(f"Text saved to: {txt_path}")
        return True
    except FileNotFoundError as e:
        print(f"\nError: {str(e)}")
        return False
    except ValueError as e:
        print(f"\nError: {str(e)}")
        return False
    except Exception as e:
        print(f"\nAn error occurred: {str(e)}")
        return False


def main():
    print("Welcome to the PDF text extraction tool!")
    print("Enter the full path to a PDF file (or 'q' to quit)")
    while True:
        pdf_path = input("\nPDF file path: ").strip()
        if pdf_path.lower() == 'q':
            print("Program exited")
            break
        success = pdf_to_txt(pdf_path)
        if success:
            # Ask until the user gives a valid y/n answer
            while True:
                choice = input("\nProcess another file? (y/n): ").lower().strip()
                if choice in ['y', 'n']:
                    break
                print("Please enter 'y' or 'n'")
            if choice == 'n':
                print("Program exited")
                break
        else:
            print("Please check the file path and try again")


if __name__ == "__main__":
    main()
</code>
Conclusion
This simple tool demonstrates Python's practicality in document processing. By leveraging PyPDF2, users can quickly extract text from PDFs and handle the results in a user‑friendly way. For large‑scale tasks, the script can be extended to support batch processing or integrated with OCR for scanned documents.
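As one possible shape for the batch extension mentioned above, the sketch below walks a single folder and hands each PDF to a converter. The name convert_folder and its convert parameter are assumptions made for illustration; any function that takes a path and returns True or False, such as the script's pdf_to_txt, would fit.

```python
import os


def convert_folder(folder, convert):
    """Run convert(path) on every .pdf file directly inside folder.

    Returns a dict mapping each PDF path to the converter's True/False result.
    Subfolders are not searched.
    """
    results = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name.lower().endswith(".pdf") and os.path.isfile(path):
            results[path] = convert(path)
    return results
```

Calling convert_folder("docs", pdf_to_txt) would then process each PDF in a docs folder in turn and report which files succeeded.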
Python Programming Learning Circle