Parsing Theory and Algorithms
Introduction to Parsing
Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar. In the context of compilers, parsing transforms a sequence of tokens into a parse tree or abstract syntax tree, which represents the syntactic structure of the program.
Parsing Approaches
Parsing algorithms fall into two broad families: top-down methods, which grow the parse tree from the start symbol toward the input, and bottom-up methods, which grow it from the input toward the start symbol. Both are examined after the mathematical preliminaries below.
Mathematical Foundations
Parse Trees and Derivations
For a context-free grammar G = (N, Σ, P, S), a parse tree is a tree where:
- The root is labeled with the start symbol S
- Each leaf node is labeled with a terminal symbol or ε
- Each interior node is labeled with a non-terminal
- If an interior node is labeled with A and its children are labeled (from left to right) with X_1, X_2, …, X_n, then A → X_1 X_2 … X_n is a production in P
Sentential Forms
A sentential form is any string α of terminals and non-terminals that can be derived from the start symbol, i.e. any α ∈ (N ∪ Σ)* with S ⇒* α.
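For example, with productions S → aSb and S → ε, the derivation S ⇒ aSb ⇒ aaSbb ⇒ aabb shows that S, aSb, aaSbb, and aabb are all sentential forms; the last one, consisting only of terminals, is a sentence of the language.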
Top-Down Parsing
Top-down parsing starts with the root symbol and attempts to build the parse tree by expanding non-terminals to match the input string.
LL(k) Parsing
LL(k) parsers use k tokens of lookahead to determine which production to apply next. The "LL" stands for "Left-to-right scanning, Leftmost derivation."
For an LL(1) grammar, we can build a parsing table M where the entry M[A, a] holds the unique production to apply when non-terminal A is on top of the stack and a is the current input symbol.
The LL(1) parsing algorithm uses a stack and an input buffer:
Initialize stack with the start symbol S
Initialize input buffer with the input string followed by $
While stack is not empty:
    If top of stack is a terminal:
        If it matches the current input symbol:
            Pop stack and advance input
        Else:
            Error
    Else (top of stack is a non-terminal A):
        If the parsing table has an entry M[A, a] = A → γ for the current input symbol a:
            Pop A and push γ (in reverse order)
        Else:
            Error
Accept if the input is exhausted (current symbol is $)
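As a concrete illustration, the following Python sketch runs this table-driven loop for the toy grammar S → aSb | ε, with the parsing table built by hand from its FIRST and FOLLOW sets (the grammar, table layout, and function name are choices made for this example, not output of any particular tool):

```python
# Table-driven LL(1) parser sketch for the toy grammar:
#   S -> a S b | epsilon
# The table is built by hand from FIRST/FOLLOW: FIRST(aSb) = {a},
# FOLLOW(S) = {b, $}, so M[S, a] = aSb and M[S, b] = M[S, $] = epsilon.
TABLE = {
    ('S', 'a'): ['a', 'S', 'b'],   # apply S -> a S b
    ('S', 'b'): [],                # apply S -> epsilon
    ('S', '$'): [],                # apply S -> epsilon
}
NONTERMINALS = {'S'}

def ll1_parse(tokens):
    """Return True if `tokens` (a string of 'a'/'b') is derivable from S."""
    stack = ['$', 'S']                     # end marker below the start symbol
    buf = list(tokens) + ['$']
    pos = 0
    while stack:
        top = stack.pop()
        cur = buf[pos]
        if top == '$':                     # stack exhausted: accept iff input is too
            return cur == '$'
        if top in NONTERMINALS:
            prod = TABLE.get((top, cur))
            if prod is None:               # empty table entry: syntax error
                return False
            stack.extend(reversed(prod))   # push right-hand side in reverse
        elif top == cur:                   # terminal on top must match the input
            pos += 1
        else:
            return False
    return False

print(ll1_parse("aabb"))   # True
print(ll1_parse("aab"))    # False
```

Pushing the end marker beneath the start symbol lets the driver detect acceptance explicitly, a common variation on the pseudocode above.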
First and Follow Sets
To build an LL(1) parsing table, we need to compute FIRST and FOLLOW sets:
- FIRST(α) = the set of terminals that can begin strings derived from α (including ε if α ⇒* ε)
- FOLLOW(A) = the set of terminals that can appear immediately after non-terminal A in some sentential form
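Computing these sets is an iteration to a fixed point. The sketch below computes FIRST sets in Python for a small illustrative grammar; the dictionary-based grammar encoding and the ε marker are assumptions made for the example, and FOLLOW can be computed by a similar iteration:

```python
# Fixed-point computation of FIRST sets for a small CFG.
# Grammar encoding (illustrative): non-terminal -> list of productions,
# each production a tuple of symbols; the empty tuple () is an epsilon rule.
GRAMMAR = {
    'E': [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],          # E' -> + T E' | epsilon
    'T': [('(', 'E', ')'), ('id',)],
}
NONTERMS = set(GRAMMAR)
EPS = 'ε'

def first_sets(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:                          # repeat until no set grows
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                all_nullable = True
                for X in prod:
                    if X in NONTERMS:
                        new = first[X] - {EPS}
                        if not new <= first[A]:
                            first[A] |= new
                            changed = True
                        if EPS not in first[X]:
                            all_nullable = False
                            break
                    else:                   # a terminal begins the derived string
                        if X not in first[A]:
                            first[A].add(X)
                            changed = True
                        all_nullable = False
                        break
                if all_nullable and EPS not in first[A]:
                    first[A].add(EPS)       # the whole right-hand side can vanish
                    changed = True
    return first

for A, s in sorted(first_sets(GRAMMAR).items()):
    print(A, sorted(s))
```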
Bottom-Up Parsing
Bottom-up parsing starts with the input string and attempts to build the parse tree by reducing substrings to non-terminals until reaching the start symbol.
LR(k) Parsing
LR(k) parsers use k tokens of lookahead to determine when to reduce. The "LR" stands for "Left-to-right scanning, Rightmost derivation in reverse."
LR parsing uses a shift-reduce algorithm with a stack and input buffer:
Initialize stack with the initial state s0
Initialize input buffer with the input string followed by $
While True:
    Let s be the state on top of the stack
    Let a be the current input symbol
    If Action[s, a] = "shift s'":
        Push a and s' onto the stack
        Advance input
    Else if Action[s, a] = "reduce A → β":
        Pop 2 * |β| entries from the stack (removing β and its states)
        Let s' be the state now on top of the stack
        Push A and Goto[s', A] onto the stack
    Else if Action[s, a] = "accept":
        Success
    Else:
        Error
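To make the driver concrete, here is a Python sketch that runs it over ACTION/GOTO tables written out by hand for the toy grammar E → E + n | n (an SLR(1) table constructed for illustration; note that this stack holds states only, so a reduction pops |β| entries rather than 2·|β|):

```python
# Shift-reduce driver sketch for the toy grammar:
#   (1) E -> E + n      (2) E -> n
# ACTION/GOTO encode a hand-built SLR(1) table: ('s', k) = shift to state k,
# ('r', A, m) = reduce by a production A -> beta with |beta| = m, ('acc',) = accept.
ACTION = {
    (0, 'n'): ('s', 2),
    (1, '+'): ('s', 3), (1, '$'): ('acc',),
    (2, '+'): ('r', 'E', 1), (2, '$'): ('r', 'E', 1),   # reduce by E -> n
    (3, 'n'): ('s', 4),
    (4, '+'): ('r', 'E', 3), (4, '$'): ('r', 'E', 3),   # reduce by E -> E + n
}
GOTO = {(0, 'E'): 1}

def lr_parse(tokens):
    """Return True if the token list (e.g. ['n', '+', 'n']) is accepted."""
    stack = [0]                        # stack of states; grammar symbols are implicit
    buf = list(tokens) + ['$']
    pos = 0
    while True:
        s, a = stack[-1], buf[pos]
        act = ACTION.get((s, a))
        if act is None:                # empty table entry: syntax error
            return False
        if act[0] == 's':              # shift: push the new state, consume input
            stack.append(act[1])
            pos += 1
        elif act[0] == 'r':            # reduce A -> beta: pop |beta| states,
            _, A, m = act              # then follow GOTO from the exposed state
            del stack[len(stack) - m:]
            stack.append(GOTO[(stack[-1], A)])
        else:                          # ('acc',)
            return True

print(lr_parse(['n', '+', 'n']))   # True
print(lr_parse(['n', '+']))        # False
```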
LR Parsing Variants
- SLR (Simple LR): Uses FOLLOW sets to decide when to reduce, resolving conflicts left by the LR(0) automaton
- LALR (Look-Ahead LR): Merges similar states of canonical LR parsers
- CLR (Canonical LR): The most powerful deterministic bottom-up parsing technique
Chart Parsing Algorithms
CYK (Cocke-Younger-Kasami) Algorithm
The CYK algorithm is a dynamic programming approach for parsing in O(n³) time, where n is the length of the input string. It requires the grammar to be in Chomsky Normal Form.
For a string w = a_1 a_2 … a_n, we fill a table T where T[i, j] is the set of non-terminals that can derive the substring a_i … a_j.
Algorithm:
Initialize table T with empty sets
For i = 1 to n:
    For each rule A → a_i:
        Add A to T[i, i]
For l = 2 to n:
    For i = 1 to n - l + 1:
        j = i + l - 1
        For k = i to j - 1:
            For each rule A → BC:
                If B in T[i, k] and C in T[k+1, j]:
                    Add A to T[i, j]
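A direct Python transcription of this table-filling loop might look like the sketch below, using an illustrative CNF grammar for the language a^n b^n (the encoding of rules as sets of unary and binary rules is an assumption of the example):

```python
# CYK recognizer sketch for a grammar in Chomsky Normal Form.
# Illustrative CNF grammar for the language a^n b^n (n >= 1):
#   S -> A X | A B,   X -> S B,   A -> a,   B -> b
UNARY = {('A', 'a'), ('B', 'b')}                               # rules A -> terminal
BINARY = {('S', 'A', 'X'), ('S', 'A', 'B'), ('X', 'S', 'B')}   # rules A -> B C

def cyk(w):
    """Return True if string w is derivable from the start symbol 'S'."""
    n = len(w)
    if n == 0:
        return False
    # T[i][j] = set of non-terminals deriving w[i..j] (0-based, inclusive)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):                 # spans of length 1
        for A, a in UNARY:
            if a == ch:
                T[i][i].add(A)
    for l in range(2, n + 1):                  # span length
        for i in range(n - l + 1):             # span start
            j = i + l - 1                      # span end
            for k in range(i, j):              # split point
                for A, B, C in BINARY:
                    if B in T[i][k] and C in T[k + 1][j]:
                        T[i][j].add(A)
    return 'S' in T[0][n - 1]

print(cyk("aabb"))   # True
print(cyk("aab"))    # False
```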
Earley's Algorithm
Earley's algorithm is a chart parsing method that handles all context-free grammars, with O(n³) time complexity in the general case and O(n²) for unambiguous grammars.
It maintains, for each input position, a set of "Earley items" of the form:
[A → α • β, i]
where A → α • β represents a production with a dot indicating how much of the right-hand side has been recognized, and i is the position in the input where recognition of this item began.
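A compact recognizer built directly on these items might look like the Python sketch below; the grammar encoding is an assumption of the example, and the sketch side-steps the well-known ε-production subtlety by assuming a grammar without ε rules:

```python
# Earley recognizer sketch. An item is a tuple (head, body, dot, origin):
# `dot` marks how much of `body` has been recognized so far and `origin`
# is the chart position where recognition of the item began.
GRAMMAR = {                       # illustrative left-recursive grammar for sums
    'S': [('E',)],
    'E': [('E', '+', 'n'), ('n',)],
}
NONTERMS = set(GRAMMAR)
START = 'S'

def earley(tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for body in GRAMMAR[START]:
        chart[0].add((START, body, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in NONTERMS:                       # PREDICT
                    for rhs in GRAMMAR[sym]:
                        item = (sym, rhs, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < n and tokens[i] == sym:          # SCAN
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:                                         # COMPLETE
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
    # accept iff a finished start-symbol item spanning the whole input exists
    return any(h == START and d == len(b) and o == 0
               for h, b, d, o in chart[n])

print(earley(['n', '+', 'n']))   # True
print(earley(['n', '+']))        # False
```

Because the grammar above is left-recursive, the same input would defeat a naive recursive-descent parser; handling such grammars without transformation is exactly the strength of Earley's algorithm.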
Time Complexity Comparison
Algorithm | Grammar Class | Time Complexity |
---|---|---|
LL(k) | LL(k) grammars | O(n) |
LR(k) | LR(k) grammars | O(n) |
CYK | CFGs in Chomsky Normal Form | O(n³) |
Earley | All CFGs | O(n³) general, O(n²) unambiguous |
Error Recovery Strategies
Panic Mode
Skip symbols in the input until a synchronizing token is found.
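For instance, a hand-written parser might recover roughly as follows (the token representation and the particular choice of synchronizing tokens are assumptions of the sketch):

```python
# Panic-mode recovery sketch: on a syntax error, discard input tokens until a
# synchronizing token is reached, then let the parser resume from that point.
SYNC_TOKENS = {';', '}', '$'}        # illustrative synchronizers

def recover(tokens, pos, message):
    """Report the error, then skip ahead to the next synchronizing token."""
    print(f"syntax error at token {pos}: {message}")
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                     # discard the offending tokens
    return pos                       # parsing resumes at the synchronizing token
```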
Phrase Level
Attempt to replace, insert, or delete symbols to correct the error.
Error Productions
Augment the grammar with rules that describe common syntactic errors.
Global Correction
Find the minimum number of edit operations to transform the input into a valid sentence.
Formal Properties of Parsing
Decidability
The problem of determining whether a string is in a context-free language is decidable. This means there exists an algorithm that always terminates and correctly answers whether a given string is in the language generated by a given CFG.
Parser Generation
The theory of parser generation studies the automatic construction of parsers from grammar descriptions. Parser generators like YACC, Bison, and ANTLR implement this theory.
Applications in Compiler Design
In compiler design, parsing is a critical phase that builds the structural representation of the program:
- Lexical Analysis produces tokens
- Parsing organizes tokens into a parse tree
- Semantic Analysis adds meaning to the parse tree
- Intermediate Code Generation translates the parse tree into a form closer to the target language
The choice of parsing algorithm affects the types of languages that can be parsed, the efficiency of the parser, and the complexity of error reporting.
Conclusion
Parsing theory provides the mathematical foundation for understanding how to analyze and process structured languages. The rich variety of parsing algorithms offers different trade-offs between expressiveness, efficiency, and implementation complexity, allowing language designers and compiler developers to choose the approach best suited to their needs.