Understanding Compiler Front‑End: Lexical, Syntax, and Semantic Analysis with Antlr
This article introduces the fundamentals of compiler front‑end development, covering lexical analysis with finite automata, syntax analysis using context‑free grammars and parsing strategies, and semantic analysis concepts, while providing practical Antlr examples for Java code tokenization, parsing, and semantic checks.
The concept of “code visualization” was introduced earlier, and this article focuses on the prerequisite knowledge of the compiler front‑end needed for developing such visualizations.
2. Compiler – The discussion concentrates on front‑end and middle‑end theory, because the back end and target‑machine code are rarely visualized.
2.1 Compiler Workflow – Overview of compiler stages.
2.2 Compiler Front‑End
2.2.1 Lexical Analysis (Scanning)
Lexical analysis reads the source character stream and groups characters into meaningful lexemes, producing tokens of the form <type, attribute>. Its theoretical basis is the finite automaton: token patterns are written as regular expressions, which can be converted into a nondeterministic finite automaton (NFA) and then into an equivalent deterministic one (DFA).
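To make the DFA idea concrete, here is a minimal hand-written sketch of a scanner with three states. It is only an illustration of the theory and is not how an Antlr-generated lexer is implemented. The class name MiniLexer, the token type names, and the keyword subset are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MiniLexer {
    // DFA states: the scanner is in exactly one of these at any time.
    enum State { START, IN_IDENT, IN_NUMBER }

    // A token is the pair <type, attribute>; here the attribute is the lexeme.
    static final class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return "<" + type + ", " + text + ">"; }
    }

    // Hypothetical keyword subset, chosen only for this illustration.
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("public", "class", "static", "void", "int"));

    static List<Token> tokenize(String src) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder lexeme = new StringBuilder();
        State state = State.START;
        int i = 0;
        // Run one step past the end so the final lexeme is flushed.
        while (i <= src.length()) {
            char c = i < src.length() ? src.charAt(i) : '\0'; // '\0' marks end of input
            switch (state) {
                case START:
                    if (Character.isLetter(c) || c == '_') {
                        lexeme.append(c);
                        state = State.IN_IDENT;
                    } else if (Character.isDigit(c)) {
                        lexeme.append(c);
                        state = State.IN_NUMBER;
                    } else if (i < src.length() && !Character.isWhitespace(c)) {
                        // Any other visible character is a one-character token.
                        tokens.add(new Token("PUNCT", String.valueOf(c)));
                    }
                    i++;
                    break;
                case IN_IDENT:
                    if (Character.isLetterOrDigit(c) || c == '_') {
                        lexeme.append(c);
                        i++;
                    } else {
                        String word = lexeme.toString();
                        tokens.add(new Token(KEYWORDS.contains(word) ? "KEYWORD" : "IDENT", word));
                        lexeme.setLength(0);
                        state = State.START; // do not advance: re-examine c from START
                    }
                    break;
                case IN_NUMBER:
                    if (Character.isDigit(c)) {
                        lexeme.append(c);
                        i++;
                    } else {
                        tokens.add(new Token("INT", lexeme.toString()));
                        lexeme.setLength(0);
                        state = State.START; // do not advance: re-examine c from START
                    }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("int x1 = 42;")) {
            System.out.println(t);
        }
    }
}
```

The sketch deliberately omits string literals, comments, and multi-character operators; a grammar-driven tool such as Antlr handles those from declarative rules like the ones in the practice section that follows.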
Practice: Using Antlr to perform lexical analysis on Java source code.
# Lexical rule file: Java8Lexer.g4
ABSTRACT : 'abstract';
ASSERT : 'assert';
BOOLEAN : 'boolean';
... // other token definitions
StringLiteral : '"' StringCharacters? '"';
fragment StringCharacters : StringCharacter+;
fragment StringCharacter : ~["\\\r\n] | EscapeSequence;

// Sample Java source to analyze
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}

# Compile lexer rules
antlr Java8Lexer.g4
# Compile generated Java files
javac Java8Lexer.java
# Run lexer on the example
grun Java8Lexer tokens -tokens ./examples/helloworld.java

2.2.2 Syntax Analysis (Parsing)
Parsing transforms the token stream into an abstract syntax tree (AST) according to a context‑free grammar (CFG). A CFG consists of non‑terminals, terminals, production rules, and a start symbol. For example, the following two productions generate every string made of n a's followed by n b's (n ≥ 0):
S → a S b
S → ε
Parsing strategies include top‑down (e.g., recursive‑descent, LL) and bottom‑up (e.g., LR) approaches.
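As a rough illustration of the top‑down strategy, the grammar S → a S b | ε can be recognized by a recursive‑descent routine with one method per non‑terminal and one character of lookahead. This is a sketch under simplifying assumptions: the input is a plain string of characters rather than a token stream, it only accepts or rejects without building a tree, and the class name AnBnParser is made up for the example.

```java
class AnBnParser {
    private final String input;
    private int pos = 0; // index of the next unread character

    private AnBnParser(String input) {
        this.input = input;
    }

    // One method per non-terminal. Implements S -> a S b | epsilon.
    private boolean parseS() {
        // Lookahead decides the production: 'a' selects S -> a S b.
        if (pos < input.length() && input.charAt(pos) == 'a') {
            pos++;                        // match terminal 'a'
            if (!parseS()) {
                return false;             // nested S failed
            }
            if (pos < input.length() && input.charAt(pos) == 'b') {
                pos++;                    // match terminal 'b'
                return true;
            }
            return false;                 // expected 'b' but did not find it
        }
        return true;                      // S -> epsilon: consume nothing
    }

    // The whole input must be derived from the start symbol S.
    static boolean accepts(String s) {
        AnBnParser parser = new AnBnParser(s);
        return parser.parseS() && parser.pos == s.length();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "ab", "aabb", "aab", "abb", "ba"}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

Each call to parseS mirrors one application of a production, so the call stack at any moment corresponds to a path in the parse tree. A bottom‑up (LR) parser would instead shift symbols onto an explicit stack and reduce them to S once a full right‑hand side has been seen.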
Practice: Using Antlr to parse expressions with a custom grammar (PlayScript).
# Grammar file: PlayScript.g4
grammar PlayScript;
import CommonLexer; // import lexical definitions
@header { package antlrtest; }
expression : assignmentExpression | expression ',' assignmentExpression ;
assignmentExpression : additiveExpression | Identifier assignmentOperator additiveExpression ;
assignmentOperator : '=' | '*=' | '/=' | '%=' | '+=' | '-=' ;
additiveExpression : multiplicativeExpression | additiveExpression '+' multiplicativeExpression | additiveExpression '-' multiplicativeExpression ;
multiplicativeExpression : primaryExpression | multiplicativeExpression '*' primaryExpression | multiplicativeExpression '/' primaryExpression | multiplicativeExpression '%' primaryExpression ;

# Compile grammar
antlr PlayScript.g4
# Compile generated Java files
javac *.java
# Run parser and generate AST GUI
grun antlrtest.PlayScript expression -gui

2.2.3 Semantic Analysis
Semantic analysis uses the AST and symbol tables to verify that the program conforms to language semantics, performing type checking, variable binding, control‑flow checks, uniqueness checks, and access‑control verification.
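Three of these checks (uniqueness, variable binding, and type checking) can be sketched with a scoped symbol table. The example below is a simplified assumption, not javac's design: types are plain strings compared by exact equality, inner scopes may shadow outer names (which Java itself forbids for local variables), and the names ScopeChecker, declare, resolve, and checkAssign are invented for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ScopeChecker {
    // Stack of scopes, innermost first; each scope maps a name to its declared type.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();
    final List<String> errors = new ArrayList<>();

    ScopeChecker() {
        enterScope(); // global scope
    }

    void enterScope() { scopes.push(new HashMap<>()); }

    void exitScope() { scopes.pop(); }

    // Uniqueness check: a name may be declared only once within a scope.
    void declare(String name, String type) {
        Map<String, String> current = scopes.peek();
        if (current.containsKey(name)) {
            errors.add("duplicate declaration: " + name);
        } else {
            current.put(name, type);
        }
    }

    // Variable binding: search from the innermost scope outward.
    String resolve(String name) {
        for (Map<String, String> scope : scopes) {
            String type = scope.get(name);
            if (type != null) {
                return type;
            }
        }
        errors.add("undefined variable: " + name);
        return null;
    }

    // Type check for an assignment; this sketch requires an exact type match.
    void checkAssign(String name, String valueType) {
        String declared = resolve(name);
        if (declared != null && !declared.equals(valueType)) {
            errors.add("type mismatch: " + name + " is " + declared + ", got " + valueType);
        }
    }

    public static void main(String[] args) {
        ScopeChecker checker = new ScopeChecker();
        checker.declare("x", "int");
        checker.declare("x", "String");   // duplicate in the same scope
        checker.enterScope();
        checker.declare("x", "boolean");  // shadows the outer x in this sketch
        checker.checkAssign("x", "int");  // binds to the inner x: mismatch
        checker.exitScope();
        checker.checkAssign("x", "int");  // binds to the outer x: fine
        checker.checkAssign("y", "int");  // never declared
        checker.errors.forEach(System.out::println);
    }
}
```

In a real front end these methods would be called from a walk over the AST, with enterScope and exitScope triggered by block, method, and class nodes.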
Key classes in the Java compiler (javac) involved in semantic analysis include Symbol, Scope, Type, and Types (in the com.sun.tools.javac.code package) and Attr, Check, Resolve, Annotate, Flow, LambdaToMethod, TransTypes, and Lower (in com.sun.tools.javac.comp).
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.