Fundamentals and Policy Gradient Algorithms in Reinforcement Learning with Applications to Scene Text Recognition
This article introduces the basic concepts of reinforcement learning, derives model-free policy-gradient methods, including vanilla policy gradient and Actor-Critic, explains their mathematical foundations, and demonstrates their use in scene text recognition and image captioning tasks.
1. Basic Concepts
Reinforcement learning (RL) is a key branch of machine learning that mimics human trial‑and‑error learning, where an agent interacts with an environment to maximize cumulative reward. Examples include autonomous driving and Atari Breakout.
1.1 Modeling
The environment is modeled as a Markov decision process (MDP) with state space S, action space A, and transition probability P(s'|s,a). A policy π(a|s) defines the probability of taking action a in state s. The goal is to find a policy whose induced trajectory distribution maximizes the expected return.
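To make these objects concrete, here is a minimal sketch of an MDP and trajectory sampling. The two states, two actions, and all transition probabilities are invented purely for illustration:

```python
import random

random.seed(0)

# Toy MDP with two states and two actions; every probability below is
# a made-up placeholder, not taken from the article.
P = {  # P[(s, a)] -> list of (next_state, probability) pairs
    ("s0", "left"):  [("s0", 0.7), ("s1", 0.3)],
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],
    ("s1", "left"):  [("s0", 0.5), ("s1", 0.5)],
    ("s1", "right"): [("s1", 1.0)],
}

def policy(state):
    """A stochastic policy pi(a|s); here simply uniform over actions."""
    return random.choice(["left", "right"])

def step(state, action):
    """Sample the next state s' ~ P(s'|s, a)."""
    next_states, probs = zip(*P[(state, action)])
    return random.choices(next_states, weights=probs, k=1)[0]

def rollout(start="s0", horizon=5):
    """Sample a trajectory tau = (s_0, a_0, s_1, a_1, ...) under the policy."""
    trajectory, s = [], start
    for _ in range(horizon):
        a = policy(s)
        trajectory.append((s, a))
        s = step(s, a)
    return trajectory
```

A trajectory is thus just a sampled sequence of state-action pairs; the policy and the transition kernel together define its distribution.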
1.2 Optimization
Optimizing the expected return leads to the policy‑gradient formula. By applying the log‑derivative trick and Monte‑Carlo estimation, the gradient can be expressed as an expectation over sampled trajectories, which is the basis of the REINFORCE algorithm.
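The log-derivative trick mentioned above can be written out explicitly. Using p_θ(τ) for the trajectory distribution induced by policy π_θ and R(τ) for the return of trajectory τ (notation assumed here, since the source states the result in prose):

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \Big( \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) R(\tau) \right]
```

Because the environment dynamics P(s'|s,a) do not depend on θ, they drop out of ∇_θ log p_θ(τ), which is exactly why this estimator is model-free.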
1.3 Numerical Computation
Monte‑Carlo sampling is used to estimate the expectation; each iteration samples trajectories, estimates the gradient, and updates the policy parameters with learning rate α . Variance reduction techniques such as baselines are introduced to improve stability.
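The sample-estimate-update loop can be sketched on a toy two-armed bandit. The reward values, batch size, and learning rate below are assumptions chosen so the sketch converges quickly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte-Carlo policy gradient (REINFORCE) on an invented two-armed bandit.
true_rewards = np.array([0.2, 0.8])   # assumed expected reward of each action
theta = np.zeros(2)                   # softmax policy parameters
alpha = 0.1                           # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for iteration in range(500):
    probs = softmax(theta)
    actions = rng.choice(2, size=20, p=probs)    # sample a batch of actions
    rewards = true_rewards[actions] + 0.1 * rng.standard_normal(20)
    grad = np.zeros(2)
    for a, r in zip(actions, rewards):
        grad_log = -probs.copy()      # d log pi(a|theta) / d theta
        grad_log[a] += 1.0
        grad += grad_log * r          # REINFORCE: grad log-prob times return
    theta += alpha * grad / len(actions)   # averaged Monte-Carlo estimate

print(softmax(theta))   # the policy should now favour the better arm
```

Averaging the gradient over a batch of sampled actions is itself a simple variance-reduction measure; the baselines discussed next reduce variance further.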
2. Policy Gradient Algorithms
2.1 Basic Policy Gradient
The vanilla policy gradient uses a baseline (often the state‑value function V(s) ) to reduce variance without biasing the gradient estimate.
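The key property is that subtracting a baseline leaves the gradient estimate unbiased while shrinking its variance. A small numerical check, with a fixed two-action policy and invented rewards:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare single-sample policy-gradient estimates with and without a
# baseline. The policy and reward values are stand-ins for this sketch.
probs = np.array([0.5, 0.5])          # fixed policy pi(a)
rewards = np.array([1.0, 2.0])        # deterministic reward per action

def grad_samples(baseline, n=10_000):
    """Single-sample estimates of dJ/d(theta_1) for a softmax policy."""
    samples = np.empty(n)
    for i in range(n):
        a = rng.choice(2, p=probs)
        grad_log = (1.0 if a == 1 else 0.0) - probs[1]  # d log pi / d theta_1
        samples[i] = grad_log * (rewards[a] - baseline)
    return samples

no_baseline = grad_samples(baseline=0.0)
with_baseline = grad_samples(baseline=rewards.mean())  # b = average reward

# Both estimators agree in expectation; the baseline shrinks the variance.
print(no_baseline.mean(), with_baseline.mean())
print(no_baseline.var(), with_baseline.var())
```

The means match (the baseline term has zero expectation because E[∇ log π] = 0), while the variance drops, which is exactly why baselines stabilize training.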
2.2 Actor‑Critic
Actor-Critic methods combine a policy (the actor) with a learned value function (the critic) that provides the baseline. The advantage function A(s,a) = Q(s,a) - V(s) further reduces variance. Variants such as A3C and Soft Actor-Critic build on this idea.
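A minimal tabular actor-critic sketch, where the one-step TD error serves as an estimate of the advantage. The 3-state chain environment, learning rates, and discount factor are all assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented chain MDP: action 1 moves right, action 0 stays, and reaching
# the last state ends the episode with reward 1.
n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))    # actor: softmax logits per state
V = np.zeros(n_states)                     # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.2, 0.2, 0.9

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def env_step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else s
    done = s2 == n_states - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(1000):
    s, done = 0, False
    for t in range(200):                   # cap episode length
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s2, r, done = env_step(s, a)
        # The TD error estimates the advantage A(s, a) = Q(s, a) - V(s).
        td = r + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += alpha_critic * td                 # critic update
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += alpha_actor * td * grad_log   # actor update
        s = s2
        if done:
            break
```

The critic is updated toward the bootstrapped target, and the actor is pushed toward actions with positive TD error, so the policy learns to move right.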
3. Practical Applications
3.1 Scene Text Recognition (STR)
STR uses a Seq2Seq framework in which a CNN encoder extracts image features and an RNN decoder generates the character sequence. Attention mechanisms (implemented with fully connected layers and vector concatenation) align image regions with output characters, improving recognition, especially against complex backgrounds.
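The fc-plus-concatenation scoring can be sketched as additive attention between encoder features and the decoder state. All dimensions and weight matrices below are random placeholders, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Additive ("concat") attention between CNN features and the RNN state.
L, d_feat, d_hid, d_att = 6, 8, 8, 16          # assumed sizes
features = rng.standard_normal((L, d_feat))    # encoder outputs, one per region
h = rng.standard_normal(d_hid)                 # current decoder hidden state

W_f = 0.1 * rng.standard_normal((d_att, d_feat))   # fc layer on features
W_h = 0.1 * rng.standard_normal((d_att, d_hid))    # fc layer on hidden state
v = 0.1 * rng.standard_normal(d_att)               # scoring vector

# score_i = v . tanh(W_f f_i + W_h h): concatenating [f_i; h] and applying
# one fc layer is equivalent to this sum of two fc layers.
scores = np.tanh(features @ W_f.T + h @ W_h.T) @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the L image regions
context = weights @ features             # attention-weighted feature sum
```

The context vector is then fed into the decoder to predict the next character, so each output step can look at a different part of the image.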
3.1.1 Attention Mechanism
Soft attention computes weighted sums over all image features, while hard attention treats the attention weights as a categorical distribution and can be trained with RL policy gradients.
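The contrast between the two mechanisms can be shown side by side. The feature vectors and attention scores below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Soft vs. hard attention over L feature vectors (toy sizes and scores).
L, d = 4, 3
features = rng.standard_normal((L, d))
logits = np.array([0.1, 2.0, -1.0, 0.5])   # unnormalized attention scores

weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Soft attention: a deterministic weighted sum, differentiable end-to-end.
soft_context = weights @ features

# Hard attention: sample a single region from the categorical distribution.
# The sampling step is non-differentiable, so the attention parameters are
# trained with a policy gradient, treating the weights as a policy.
idx = rng.choice(L, p=weights)
hard_context = features[idx]

# REINFORCE term: gradient of log pi(idx) w.r.t. the logits, to be scaled
# by the downstream reward (e.g. recognition accuracy) during training.
grad_log = -weights.copy()
grad_log[idx] += 1.0
```

Soft attention trains with ordinary backpropagation; hard attention commits to one region per step and relies on the policy-gradient machinery from Section 2.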
3.1.2 Experimental Results
Figures show successful and failed STR examples, illustrating that RL‑based attention can focus on relevant regions but may struggle with heavily distorted text.
3.2 Image Captioning
Image captioning generates natural‑language descriptions of images using the same Seq2Seq + attention architecture. Reinforcement learning can be applied to optimize language quality metrics.
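The idea of optimizing a sequence-level metric with RL can be sketched as follows. The vocabulary, the degenerate unigram "decoder", and the toy reward are all stand-ins; real systems score metrics such as BLEU or CIDEr against reference captions and use a full Seq2Seq decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Metric-driven caption training sketch: treat the decoder as a policy,
# sample a caption, score the whole sequence, and apply REINFORCE.
vocab = ["a", "cat", "dog", "sits"]      # hypothetical tiny vocabulary
theta = np.zeros(len(vocab))             # unigram "decoder" parameters
alpha = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(caption, reference):
    """Toy sequence-level metric: fraction of tokens in the reference."""
    return sum(w in reference for w in caption) / len(caption)

reference = {"a", "cat", "sits"}
for _ in range(1000):
    probs = softmax(theta)
    caption = [vocab[i] for i in rng.choice(len(vocab), size=3, p=probs)]
    r = reward(caption, reference)        # score the complete sequence
    for w in caption:                     # REINFORCE over sampled tokens
        grad_log = -probs.copy()
        grad_log[vocab.index(w)] += 1.0
        theta += alpha * grad_log * r
```

Because the reward is computed on the full sampled sequence, this sidesteps the mismatch between token-level cross-entropy training and sequence-level evaluation metrics.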
4. Conclusion
The article reviewed reinforcement learning fundamentals, detailed model‑free policy‑gradient methods (vanilla and Actor‑Critic), and showcased their application to scene text recognition and image captioning, discussing strengths and robustness challenges.
HomeTech tech sharing