
Deep Dive into OCR – Chapter 2: Development and Classification of OCR Technology

This article provides a comprehensive overview of OCR technology, detailing the evolution from traditional hand‑crafted methods to modern deep‑learning approaches, describing image preprocessing, text detection and recognition pipelines, summarizing classic machine‑learning algorithms, and presenting a practical OpenCV implementation with Python code.

Rare Earth Juejin Tech Community
After several months of preparation, the author is launching a new series, Deep Dive into OCR, which aims to cover OCR technology from its history and core concepts through algorithms, papers, and datasets, forming a complete tutorial.

Article Directory
Chapter 1: OCR Technology Introduction (link)
Chapter 2: OCR Technology Development and Classification (this article)

OCR Technology Development Overview

Generally, OCR can be divided into traditional methods and deep-learning methods. Traditional methods are limited by hand-crafted features and complex multi-stage pipelines, while deep-learning OCR replaces the manual steps with CNN models that automatically detect text regions and recognize characters with superior accuracy.

The author summarizes the development timeline in the following diagram:

1. Traditional OCR

Traditional OCR algorithms rely on image‑processing techniques (e.g., projection, dilation, rotation) and statistical machine‑learning to extract text from simple, high‑resolution documents with uniform backgrounds.
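For instance, the projection technique mentioned above can locate text lines in a binarized page by summing foreground pixels along each row. A minimal NumPy sketch (the synthetic page array and the `min_pixels` threshold are illustrative assumptions, not part of the original pipeline):

```python
import numpy as np

def find_text_lines(binary, min_pixels=1):
    """Locate text lines via horizontal projection of a binary image.

    binary: 2-D array where foreground (ink) pixels are 1 and background is 0.
    Returns a list of (start_row, end_row) intervals that contain text.
    """
    profile = binary.sum(axis=1)          # ink pixels per row
    rows = profile >= min_pixels          # rows that contain text
    lines, start = [], None
    for i, has_text in enumerate(rows):
        if has_text and start is None:
            start = i                     # a text line begins
        elif not has_text and start is not None:
            lines.append((start, i - 1))  # the line just ended
            start = None
    if start is not None:
        lines.append((start, len(rows) - 1))
    return lines

# Two synthetic "text lines" separated by blank rows
page = np.zeros((10, 20), dtype=int)
page[1:3, 2:18] = 1
page[6:9, 2:18] = 1
print(find_text_lines(page))  # [(1, 2), (6, 8)]
```

Vertical projection over a single line segments individual characters the same way, one axis over.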

1.1 Technical Process

The workflow includes image preprocessing (grayscale, binarization, noise removal, skew correction), layout analysis, character segmentation, recognition, layout reconstruction, post‑processing, and proofreading.

1.1.1 Image Preprocessing

(1) Binarization

Image binarization converts pixel values to 0 or 255, producing a clear black‑and‑white image that reduces data dimensionality and suppresses noise, which is crucial for OCR accuracy.
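The binarization threshold is usually chosen automatically, most commonly with Otsu's method, which picks the value maximizing between-class variance of the two resulting pixel groups. A minimal NumPy sketch of the idea (in practice `cv2.threshold` with `THRESH_OTSU` does this in one call; the toy image below is an illustrative assumption):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Bimodal toy image: dark background near 30, bright text near 220
gray = np.concatenate([np.full(500, 30), np.full(500, 220)]).astype(np.uint8)
t = otsu_threshold(gray)
binary = np.where(gray >= t, 255, 0)  # pixels become exactly 0 or 255
```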

(2) Skew Detection and Correction

Hough Transform is used to detect straight lines in the image, enabling the estimation of skew angles.

PCA‑based Method computes the principal component of foreground pixels to determine the dominant orientation.
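The PCA-based idea fits in a few lines of NumPy: collect the coordinates of foreground pixels, and the eigenvector of their covariance with the largest eigenvalue gives the dominant text orientation. A sketch on synthetic data (the 10° skew and noise level are illustrative assumptions):

```python
import numpy as np

def skew_angle_pca(points):
    """Estimate the dominant orientation (in degrees) of foreground pixels.

    points: (N, 2) array of (x, y) coordinates of foreground pixels.
    """
    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, np.argmax(eigvals)]  # direction of largest variance
    return np.degrees(np.arctan2(principal[1], principal[0]))

# A synthetic text line skewed by ~10 degrees, with slight noise
x = np.linspace(0, 100, 200)
y = np.tan(np.radians(10)) * x + np.random.default_rng(0).normal(0, 0.5, x.size)
angle = skew_angle_pca(np.column_stack([x, y]))
```

Note the eigenvector's sign is arbitrary, so the estimate may come back as the angle or its 180° complement; the Hough-based variant instead reads the skew off the dominant line angle returned by `cv2.HoughLines`.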

1.1.2 Traditional Text Detection and Recognition

Traditional OCR separates text detection (locating text regions) and recognition (classifying characters). Detection methods include salient‑feature‑based and sliding‑window approaches.

Traditional detection struggles with complex scenes such as heavily distorted or blurry text.

1.2 Traditional Machine‑Learning OCR Methods

After locating text regions and correcting skew, characters are segmented and fed into feature extraction (hand‑crafted or CNN features) followed by a classification model. Post‑processing often uses statistical language models (e.g., HMM) for error correction.

1.2.1 Feature Extraction Methods

Structural Features : contour and region descriptors (e.g., Canny, HOG, Sobel).

Geometric Distribution Features : capture shape information via projection histograms, 2‑D histograms, and grid‑based methods.

Template Matching : computes similarity between a query image and a library of character templates.
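To give a flavor of the hand-crafted features above, here is a minimal gradient-orientation histogram in NumPy, a heavily simplified cousin of HOG (real pipelines would use `cv2.HOGDescriptor` with cells, blocks, and normalization; the test image is an illustrative assumption):

```python
import numpy as np

def orientation_histogram(gray, bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]  # central differences
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)  # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A vertical edge: all gradients point horizontally, so the mass
# lands in the 0-degree bin
img = np.zeros((16, 16))
img[:, 8:] = 255
feat = orientation_histogram(img)
```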

1.2.2 Traditional Classification Methods

After feature extraction, characters are classified using various algorithms:

Support Vector Machine (SVM) : effective for small samples and high‑dimensional data.

Bayesian Classifier : predicts class probabilities using Bayes theorem.

K‑Nearest Neighbors (KNN) : simple, non‑parametric method based on majority voting of nearest samples.

Multilayer Perceptron (MLP) : feed‑forward neural network that handles non‑linear problems.

Neural Network Algorithms : either feed the raw pixel matrix directly or use extracted features as input.
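As an illustration of the classification stage, a minimal k-nearest-neighbors sketch over toy 2-D feature vectors (a real system would classify HOG or grid features with scikit-learn or OpenCV's `ml` module; the feature values below are illustrative assumptions):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_X - query, axis=1)  # Euclidean distances
    nearest = train_y[np.argsort(dists)[:k]]         # labels of k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Toy feature vectors for two character classes, labeled 0 and 1
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(train_X, train_y, np.array([0.95, 1.0]))
```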

2. Deep‑Learning OCR

With the rapid development of deep learning, OCR has shifted from hand‑crafted pipelines to end‑to‑end CNN‑based models that automatically learn visual features, greatly improving recognition performance.

2.1 Technical Pipeline

Image Preprocessing : grayscale, binarization, denoising, skew correction, normalization.

Text Detection : models such as CTPN, EAST, SegLink, TextBoxes, R2CNN, PixelLink, PSENet.

Text Recognition : models such as CRNN, Attention‑OCR.

Post‑Processing : language models, dictionaries, rules, and layout reconstruction.

2.2 Deep‑Learning Text Detection and Recognition

OCR algorithms can be two‑stage (separate detection and recognition) or end‑to‑end (single model handling both).

2.2.1 Deep‑Learning Text Detection

Detection models have evolved from regression‑based to segmentation‑based approaches, and can be categorized as top‑down or bottom‑up.

2.2.2 Deep‑Learning Text Recognition

The mainstream recognition pipeline includes image preprocessing, visual feature extraction, sequence modeling, and prediction.

Recognition methods are classified as:

CTC‑based (e.g., CRNN, Rosetta)

Attention‑based (e.g., RARE, DAN, PREN)

Transformer‑based (e.g., SRN, NRTR, Master, ABINet)

Rectification modules (e.g., RARE, ASTER, SAR)

Segmentation‑based (e.g., Text Scanner, Mask TextSpotter)

| Algorithm Category | Main Idea | Key Papers |
| --- | --- | --- |
| Traditional | Sliding window, character extraction, dynamic programming | - |
| CTC | Sequence-to-sequence alignment without explicit segmentation | CRNN, Rosetta |
| Attention | Focus on relevant regions for irregular text | RARE, DAN, PREN |
| Transformer | Self-attention based modeling | SRN, NRTR, Master, ABINet |
| Rectification | Learn text boundaries and rectify to horizontal orientation | RARE, ASTER, SAR |
| Segmentation | Detect character regions then classify | Text Scanner, Mask TextSpotter |
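The CTC row in the table above can be illustrated by its greedy best-path decoding rule: take the argmax label at each time step, collapse consecutive repeats, then drop blanks. A minimal sketch (the tiny alphabet and score matrix are illustrative assumptions):

```python
def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding: argmax per step, collapse repeats, remove blanks.

    logits: list of per-timestep score lists over the label alphabet.
    """
    path = [max(range(len(step)), key=step.__getitem__) for step in logits]
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:  # collapse repeats, skip blanks
            decoded.append(label)
        prev = label
    return decoded

# Alphabet: 0 = blank, 1 = 'a', 2 = 'b'; the best path is [1, 1, 0, 2, 2]
logits = [[0.1, 0.8, 0.1],
          [0.2, 0.7, 0.1],
          [0.9, 0.05, 0.05],
          [0.1, 0.2, 0.7],
          [0.1, 0.3, 0.6]]
print(ctc_greedy_decode(logits))  # [1, 2] -> "ab"
```

The blank label is what lets CTC emit genuine double letters: the path [1, 0, 1] decodes to [1, 1], while [1, 1] collapses to [1].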

2.3 End‑to‑End Natural‑Scene Detection and Recognition

End‑to‑end OCR models jointly learn detection and recognition, sharing CNN features and achieving smaller model size and faster inference.

Two major categories exist:

Rule‑based text (straight or slightly tilted) – e.g., FOTS, TextSpotter.

Arbitrary‑shape text (curved, distorted) – e.g., Mask TextSpotter, ABCNet, PGNet, PAN++.

3. Practical Traditional OCR with OpenCV

import cv2
import numpy as np
import imutils
import pytesseract
from PIL import Image
import os

def ShowImage(name, image):
    cv2.imshow(name, image)
    cv2.waitKey(0)  # wait for any key
    cv2.destroyAllWindows()

def order_points(pts):
    # four points
    rect = np.zeros((4, 2), dtype="float32")
    # top‑left, top‑right, bottom‑right, bottom‑left
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]
    rect[2] = pts[np.argmax(s)]
    diff = np.diff(pts, axis=1)
    rect[1] = pts[np.argmin(diff)]
    rect[3] = pts[np.argmax(diff)]
    return rect

def four_point_transform(image, pts):
    rect = order_points(pts)
    (tl, tr, br, bl) = rect
    widthA = np.sqrt(((br[0] - bl[0])**2) + ((br[1] - bl[1])**2))
    widthB = np.sqrt(((tr[0] - tl[0])**2) + ((tr[1] - tl[1])**2))
    maxWidth = max(int(widthA), int(widthB))
    heightA = np.sqrt(((tr[0] - br[0])**2) + ((tr[1] - br[1])**2))
    heightB = np.sqrt(((tl[0] - bl[0])**2) + ((tl[1] - bl[1])**2))
    maxHeight = max(int(heightA), int(heightB))
    dst = np.array([
        [0, 0],
        [maxWidth - 1, 0],
        [maxWidth - 1, maxHeight - 1],
        [0, maxHeight - 1]
    ], dtype="float32")
    M = cv2.getPerspectiveTransform(rect, dst)
    warp = cv2.warpPerspective(image, M, (maxWidth, maxHeight))
    return warp

def resize(image, width=None, height=None, inter=cv2.INTER_AREA):
    dim = None
    (h, w) = image.shape[:2]
    if width is None and height is None:
        return image
    if width is None:
        r = height / float(h)
        dim = (int(w * r), height)
    else:
        r = width / float(w)
        dim = (width, int(h * r))
    resized = cv2.resize(image, dim, interpolation=inter)
    return resized

image = cv2.imread('ocr1.png')
ratio = image.shape[0] / 500
orig = image.copy()
image = resize(image, height=500)
# preprocessing
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)
edged = cv2.Canny(gray, 75, 200)
ShowImage('edged', edged)
# contour detection
cnts = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)  # handles the OpenCV 3.x/4.x return-value difference
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:5]
screenCnt = None
for c in cnts:
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    if len(approx) == 4:  # the document outline should be a quadrilateral
        screenCnt = approx
        break
if screenCnt is None:
    raise RuntimeError("no four-point document contour found")
cv2.drawContours(image, [screenCnt], -1, (0, 0, 255), 2)
ShowImage('image', image)
warped = four_point_transform(orig, screenCnt.reshape(4, 2) * ratio)
ShowImage('warped', warped)
warped = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY)
ref = cv2.threshold(warped, 100, 255, cv2.THRESH_BINARY)[1]
ShowImage('binary', ref)
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, ref)
text = pytesseract.image_to_string(Image.open(filename))
print(text)
os.remove(filename)
ShowImage('image', ref)

Resulting OCR outputs are shown in the following images:

If you find this article helpful, please follow, like, and bookmark.

Tags: computer vision · Python · deep learning · image processing · OCR · OpenCV · traditional methods