Automatic PDF Slide Transcription Using Deep Learning OCR
This article shows how to convert PDF slide decks into editable text automatically: each page is first converted to an image, then a deep‑learning OCR pipeline (CTPN for text detection, CRNN for text recognition) transcribes it. Python code examples are included for each step.
Many people need to turn PDF slides into editable text, but traditional tools are cumbersome. This article describes a project by Lucas Soares, a senior machine learning engineer at K1 Digital, who uses OCR to automatically transcribe PDF slides.
The workflow consists of three steps: converting each PDF page to an image, detecting and recognizing text in the images, and displaying example outputs.
First, the PDF is converted to PNG images using the pdf2image library:
from pdf2image import convert_from_path
from pdf2image.exceptions import (PDFInfoNotInstalledError,
                                  PDFPageCountError,
                                  PDFSyntaxError)

pdf_path = "path/to/file/intro_RL_Lecture1.pdf"
images = convert_from_path(pdf_path)
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")

Next, the OCR pipeline from the ocr.pytorch repository is used. The CTPN model detects text regions and the CRNN model recognizes the characters. The script processes each image, saves the annotated image, and writes the recognized text to a .txt file.
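One practical detail before running the script: the pages were saved as image0.png, image1.png, and so on, and the script orders its inputs with `sorted()`, which compares strings character by character, so image10.png comes before image2.png. Each page is processed independently, so this mostly matters if the per-page text is later joined in slide order. A minimal sketch of a natural sort key follows; `natural_key` is a hypothetical helper introduced here for illustration, not part of pdf2image or ocr.pytorch.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks so 'image10' sorts after 'image2'.

    Hypothetical helper: assumes names such as 'image0.png', 'image1.png', ...
    """
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


# Illustrative file names following the naming scheme used when saving the pages.
pages = ["image10.png", "image2.png", "image0.png", "image1.png"]

lexicographic = sorted(pages)             # plain string ordering
ordered = sorted(pages, key=natural_key)  # numeric-aware ordering

print(lexicographic)
print(ordered)
```

Passing `key=natural_key` to `sorted()` in a loop over the image files would visit the pages in slide order.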
# adapted from this source: https://github.com/courao/ocr.pytorch
# The two %-lines are IPython/Jupyter magics; drop them when running as a plain script.
%load_ext autoreload
%autoreload 2
import os
from ocr import ocr
import time
import shutil
import numpy as np
import pathlib
from PIL import Image
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pytesseract


def single_pic_proc(image_file):
    # Load the image as an RGB array and run text detection and recognition on it.
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result, image_framed


# The page images from the previous step are expected in ./input_images/.
image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'
# Start from an empty output directory.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)

for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file)  # detecting and recognizing the text
    filename = pathlib.Path(image_file).name
    output_file = os.path.join(result_dir, filename)
    txt_file = os.path.join(result_dir, pathlib.Path(image_file).stem + '.txt')
    # Save the image with the detected boxes drawn on it.
    Image.fromarray(image_framed).save(output_file)
    # Write one recognized text line per detected region.
    with open(txt_file, 'w') as txt_f:
        for key in result:
            txt_f.write(result[key][1] + '\n')

To visualize the results, OpenCV can load an output image, resize it, and display it:
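import pathlib
import cv2 as cv

output_dir = pathlib.Path("./output_images_with_boxes")
image = cv.imread(f"{output_dir}/image7.png")
# cv.resize expects (width, height). These values keep the original size;
# scale them to shrink or enlarge the displayed image.
size_reshaped = (int(image.shape[1]), int(image.shape[0]))
image = cv.resize(image, size_reshaped)
cv.imshow("image", image)
cv.waitKey(0)  # wait for a key press before closing the window
cv.destroyAllWindows()

Finally, the transcribed text can be read from the generated .txt file: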
filename = f"{output_dir}/image7.txt"
with open(filename, "r") as text:
    for line in text.readlines():
        print(line.strip("\n"))

According to the author, the resulting tool can transcribe a variety of documents, from handwritten notes to arbitrary text in photos, and it is self‑contained: the whole OCR pipeline runs from Python without relying on external software.
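Since the goal is editable slide text, the per-page .txt files can be combined into a single markdown document. The sketch below is one possible way to do that, not the author's code: `build_markdown` and `natural_key` are hypothetical helpers, the `## Slide N` heading layout is an assumption, and the demo writes made-up placeholder files to a temporary directory instead of reading real OCR output.

```python
import re
import tempfile
from pathlib import Path


def natural_key(name):
    """Numeric-aware sort key so 'image10.txt' sorts after 'image2.txt' (hypothetical helper)."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def build_markdown(txt_dir):
    """Join per-page transcripts into one markdown string, one section per slide.

    Assumes one .txt file per page, named so that natural ordering matches slide order.
    """
    txt_files = sorted(Path(txt_dir).glob("*.txt"), key=lambda p: natural_key(p.name))
    sections = []
    for number, txt_file in enumerate(txt_files, start=1):
        lines = [line.strip() for line in txt_file.read_text(encoding="utf-8").splitlines()]
        body = "\n".join(line for line in lines if line)  # drop empty lines
        sections.append(f"## Slide {number}\n\n{body}")
    return "\n\n".join(sections) + "\n"


# Demo with placeholder transcripts (made-up content, not real OCR output).
with tempfile.TemporaryDirectory() as tmp:
    placeholder_pages = [
        ("image0.txt", "page 0 text\n"),
        ("image10.txt", "page 10 text\n"),
        ("image2.txt", "page 2 text\n\n"),
    ]
    for name, text in placeholder_pages:
        Path(tmp, name).write_text(text, encoding="utf-8")
    markdown = build_markdown(tmp)

print(markdown)
```

For the run described above, calling `build_markdown("./output_images_with_boxes")` would read the transcripts written by the OCR script.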