Why a Single Regex Can Crash Your Java Service: Understanding NFA Backtracking
An unexpected CPU spike in a Java service was traced to a complex URL‑validation regex whose NFA backtracking caused catastrophic performance, and the article explains the regex engine’s behavior, identifies the problematic pattern, and shows how to refactor the expression to eliminate excessive backtracking.
We observed a Java service where CPU usage rose to nearly 100% after a request triggered the
validateUrlmethod, which validates URLs with a regular expression.
A simple unit test reproduces the issue:
<code>public static void main(String[] args) {
String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$";
String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
if (bugUrl.matches(badRegex)) {
System.out.println("match!!");
} else {
System.out.println("no match!!");
}
}
</code>Running the test shows the Java process consuming about 91% CPU.
Regex Engine
Java uses an NFA (Non‑deterministic Finite Automaton) engine. NFA matches by reading the pattern character by character and backtracking when a mismatch occurs. Backtracking can become extremely costly for certain patterns.
NFA Backtracking
When the engine encounters the URL‑validation regex, it repeatedly backtracks because the pattern is greedy and the input URL contains long, complex segments (including underscores and percent signs) that the third part of the regex does not accept.
<code>^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\/])+$</code>The first part (protocol) is fine, but the second part greedily consumes the domain and the following path, causing the engine to read far beyond the intended dot separator. The third part lacks support for '_' and '%', forcing the engine to backtrack many times.
Solution
Adding the missing characters to the third part and making the second part possessive (using
++) eliminates backtracking:
<code>public static void main(String[] args) {
String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~_%\\/])+$";
String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
if (bugUrl.matches(badRegex)) {
System.out.println("match!!");
} else {
System.out.println("no match!!");
}
}
</code>After the change the program prints
match!!instantly and CPU usage stays normal.
For more robust patterns, consider using lazy quantifiers or atomic groups, and always test with a regex debugger to detect catastrophic backtracking.
Conclusion
A seemingly simple URL‑validation regex can cause severe CPU overload due to NFA backtracking. Understanding the engine’s behavior and applying possessive quantifiers or expanding the character class prevents catastrophic performance loss.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.