Parsing Theory and Algorithms

Introduction to Parsing

Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar. In the context of compilers, parsing transforms a sequence of tokens into a parse tree or abstract syntax tree, which represents the syntactic structure of the program.

Mathematical Foundations

Parse Trees and Derivations

For a context-free grammar $G = (V, \Sigma, R, S)$, a parse tree is a tree where:

  • The root is labeled with the start symbol $S$
  • Each leaf node is labeled with a terminal symbol or $\varepsilon$
  • Each interior node is labeled with a non-terminal
  • If an interior node is labeled with $A$ and its children are labeled (from left to right) with $X_1, X_2, \ldots, X_n$, then $A \rightarrow X_1 X_2 \ldots X_n$ is a production in $R$ (a small data-structure sketch follows this list)
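
To make the definition concrete, a parse tree can be represented directly as a small tree data structure. The sketch below is a minimal Python illustration; the class name, field names, and the example grammar $S \rightarrow aSb \mid \varepsilon$ are assumptions made for this example, not part of any standard library.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One parse-tree node: `label` is a terminal, a non-terminal, or ε."""
    label: str
    children: List["Node"] = field(default_factory=list)

# Parse tree for the derivation S => aSb => ab under the grammar S -> aSb | ε:
tree = Node("S", [
    Node("a"),
    Node("S", [Node("ε")]),   # interior S node expanded via S -> ε
    Node("b"),
])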

Sentential Forms

A sentential form is any string of terminals and non-terminals that can be derived from the start symbol.

$$S \Rightarrow^* \alpha$$

where $\alpha \in (V \cup \Sigma)^*$. For example, with the productions $S \rightarrow aSb \mid \varepsilon$, both $aSb$ and $aabb$ are sentential forms; $aabb$, which contains only terminals, is also called a sentence.

Top-Down Parsing

Top-down parsing starts with the root symbol and attempts to build the parse tree by expanding non-terminals to match the input string.

LL(k) Parsing

LL(k) parsers use $k$ tokens of lookahead to determine which production to apply next. The "LL" stands for "Left-to-right scanning, Leftmost derivation."

For an LL(1) grammar, we can build a parsing table where:

$$\text{Table}[A, a] = \text{the production to use when non-terminal } A \text{ is on top of the stack and token } a \text{ is the next input}$$

The LL(1) parsing algorithm uses a stack and an input buffer:

Initialize stack with S (start symbol)
Initialize input buffer with input string followed by $
While stack is not empty:
    If top of stack is a terminal:
        If it matches the current input symbol:
            Pop stack and advance input
        Else:
            Error
    Else if top of stack is non-terminal A:
        Look up production A → γ in the parsing table using the current input symbol
        If an entry exists:
            Pop A and push γ (in reverse order)
        Else:
            Error
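
To make the table-driven loop concrete, here is a minimal Python sketch. The toy grammar of balanced parentheses S → ( S ) S | ε, the table encoding as a dictionary, and the function name are assumptions made for this illustration, not the output of any particular tool.

def ll1_parse(tokens, table, start="S", nonterminals=frozenset({"S"})):
    """Table-driven LL(1) parser; returns True if the input is accepted.

    table[(A, a)] gives the right-hand side to expand A with when the next
    input token is a; a missing entry means a syntax error.
    """
    input_buf = list(tokens) + ["$"]       # input string followed by $
    stack = ["$", start]                   # start symbol on top of the stack
    pos = 0
    while stack:
        top = stack.pop()
        current = input_buf[pos]
        if top not in nonterminals:        # terminal (or $): must match input
            if top == current:
                pos += 1                   # pop and advance input
            else:
                return False               # terminal mismatch
        else:                              # non-terminal: consult the table
            if (top, current) not in table:
                return False               # no applicable production
            stack.extend(reversed(table[(top, current)]))   # push γ in reverse order
    return pos == len(input_buf)

# LL(1) table for the toy grammar  S -> ( S ) S | ε
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],                        # S -> ε
    ("S", "$"): [],                        # S -> ε
}

print(ll1_parse(list("(())()"), TABLE))    # True
print(ll1_parse(list("())"), TABLE))       # False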

FIRST and FOLLOW Sets

To build an LL(1) parsing table, we need to compute FIRST and FOLLOW sets:

  • $\text{FIRST}(\alpha)$ = the set of terminals that can begin strings derived from $\alpha$
  • $\text{FOLLOW}(A)$ = the set of terminals that can immediately follow non-terminal $A$

$$\text{FIRST}(\alpha) = \{a \in \Sigma \mid \alpha \Rightarrow^* a\beta \text{ for some } \beta \in (V \cup \Sigma)^*\} \cup \{\varepsilon \mid \alpha \Rightarrow^* \varepsilon\}$$

$$\text{FOLLOW}(A) = \{a \in \Sigma \mid S \Rightarrow^* \alpha A a \beta \text{ for some } \alpha, \beta \in (V \cup \Sigma)^*\}$$
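
Both sets are computed by iterating to a fixed point. Below is a minimal Python sketch of the FIRST computation under some illustrative assumptions: the grammar is a dict mapping each non-terminal to a list of right-hand sides (tuples of symbols), the empty tuple encodes an ε-production, and the empty string "" stands for ε inside a FIRST set.

def first_sets(grammar):
    """Compute FIRST(A) for every non-terminal A by fixed-point iteration.

    A symbol is treated as a non-terminal iff it is a key of `grammar`.
    """
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                all_nullable = True               # does the whole RHS derive ε?
                for X in rhs:
                    if X in grammar:              # X is a non-terminal
                        new = first[X] - {""}
                        if not new <= first[A]:
                            first[A] |= new
                            changed = True
                        if "" not in first[X]:
                            all_nullable = False
                            break
                    else:                         # X is a terminal
                        if X not in first[A]:
                            first[A].add(X)
                            changed = True
                        all_nullable = False
                        break
                if all_nullable and "" not in first[A]:
                    first[A].add("")              # A can derive ε via this rule
                    changed = True
    return first

# Example grammar:  S -> A b | c,   A -> a | ε
G = {"S": [("A", "b"), ("c",)], "A": [("a",), ()]}
print(first_sets(G))   # FIRST(S) = {a, b, c},  FIRST(A) = {a, ε}

The FOLLOW computation is analogous: it propagates the FIRST sets of the symbols that can appear after each occurrence of a non-terminal, again iterating until nothing changes.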

Bottom-Up Parsing

Bottom-up parsing starts with the input string and attempts to build the parse tree by reducing substrings to non-terminals until reaching the start symbol.

LR(k) Parsing

LR(k) parsers use $k$ tokens of lookahead to determine when to reduce. The "LR" stands for "Left-to-right scanning, Rightmost derivation in reverse."

LR parsing uses a shift-reduce algorithm with a stack and input buffer:

Initialize stack with the initial state s0
Initialize input buffer with input string followed by $
While True:
    Let s be the state on top of the stack
    Let a be the current input symbol
    If Action[s, a] = "shift s'":
        Push a and s' onto the stack
        Advance input
    Else if Action[s, a] = "reduce A → β":
        Pop 2 * |β| symbols from the stack (removing β and its states)
        Let s' be the state now on top of the stack
        Push A and Goto[s', A] onto the stack
    Else if Action[s, a] = "accept":
        Success
    Else:
        Error
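
As a concrete illustration of the driver loop, the Python sketch below runs hand-built SLR(1) tables for the toy grammar E → E + n | n. The state numbering, table encoding, and function name are assumptions made for this example, not the output of any specific parser generator.

# SLR(1) tables for the toy grammar   (1) E -> E + n   (2) E -> n
# ACTION[(state, terminal)] is ("shift", s'), ("reduce", lhs, |β|), or ("accept",)
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "+"): ("reduce", "E", 1),
    (2, "$"): ("reduce", "E", 1),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", "E", 3),
    (4, "$"): ("reduce", "E", 3),
}
GOTO = {(0, "E"): 1}

def lr_parse(tokens):
    """Shift-reduce driver; returns True if the token list is accepted."""
    input_buf = list(tokens) + ["$"]
    stack = [0]                                   # alternating states/symbols, state on top
    pos = 0
    while True:
        state, a = stack[-1], input_buf[pos]
        action = ACTION.get((state, a))
        if action is None:
            return False                          # syntax error
        if action[0] == "shift":
            stack += [a, action[1]]               # push symbol a and state s'
            pos += 1
        elif action[0] == "reduce":
            lhs, length = action[1], action[2]
            del stack[len(stack) - 2 * length:]   # pop 2 * |β| stack entries
            stack += [lhs, GOTO[(stack[-1], lhs)]]
        else:
            return True                           # accept

print(lr_parse(["n", "+", "n"]))   # True
print(lr_parse(["n", "n"]))        # False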

LR Parsing Variants

  1. SLR (Simple LR): Builds the LR(0) automaton and uses FOLLOW sets to decide where reduce actions belong, resolving many shift-reduce and reduce-reduce conflicts
  2. LALR (Look-Ahead LR): Merges similar states of canonical LR parsers
  3. CLR (Canonical LR): The most powerful deterministic bottom-up parsing technique

Chart Parsing Algorithms

CYK (Cocke-Younger-Kasami) Algorithm

The CYK algorithm is a dynamic programming approach for parsing in $O(n^3)$ time. It requires the grammar to be in Chomsky Normal Form.

For a string $w = a_1 a_2 \ldots a_n$, we fill a table $T$ where:

$$T[i, j] = \{A \mid A \Rightarrow^* a_i \ldots a_j\}$$

Algorithm:

Initialize table T with empty sets
For i = 1 to n:
    For each rule A → a_i:
        Add A to T[i, i]
For l = 2 to n:                      (substring length)
    For i = 1 to n - l + 1:          (start position)
        j = i + l - 1                (end position)
        For k = i to j - 1:          (split point)
            For each rule A → BC:
                If B ∈ T[i, k] and C ∈ T[k+1, j]:
                    Add A to T[i, j]
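
A direct Python transcription of the loop above is sketched next. The grammar encoding (separate dicts for terminal rules A → a and binary rules A → BC) and the example CNF grammar for the language a^m b^m are assumptions of this sketch, and the table is indexed from 0 instead of 1.

def cyk(tokens, terminal_rules, binary_rules, start="S"):
    """CYK recognizer for a grammar in Chomsky Normal Form.

    terminal_rules[a]     = set of non-terminals A with a rule A -> a
    binary_rules[(B, C)]  = set of non-terminals A with a rule A -> B C
    T[i][j] collects every non-terminal deriving tokens[i..j] (inclusive).
    """
    n = len(tokens)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(tokens):                        # substrings of length 1
        T[i][i] = set(terminal_rules.get(a, set()))
    for length in range(2, n + 1):                        # longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                         # split point
                for (B, C), heads in binary_rules.items():
                    if B in T[i][k] and C in T[k + 1][j]:
                        T[i][j] |= heads
    return start in T[0][n - 1]

# CNF grammar for { a^m b^m | m >= 1 }:
#   S -> A X | A B,   X -> S B,   A -> a,   B -> b
TERMINAL = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "X"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"X"}}

print(cyk(list("aabb"), TERMINAL, BINARY))   # True
print(cyk(list("aab"), TERMINAL, BINARY))    # False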

Earley's Algorithm

Earley's algorithm is a chart parsing method that handles all context-free grammars, with time complexity $O(n^3)$ in the general case and $O(n^2)$ for unambiguous grammars.

It maintains a set of "Earley items" of the form:

$$[A \rightarrow \alpha \cdot \beta,\ i]$$

where the dot in $\alpha \cdot \beta$ indicates how much of the production's right-hand side has been recognized so far, and $i$ is the position in the input where recognition of this item began.
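
A compact recognizer built around these items is sketched below in Python. The grammar encoding is an assumption of this example, and for simplicity the sketch does not handle ε-productions, which require extra care in the completer.

from collections import namedtuple

# Earley item: production lhs -> rhs with `dot` marking progress and
# `start` the input position where recognition of this item began.
Item = namedtuple("Item", ["lhs", "rhs", "dot", "start"])

def earley_recognize(grammar, start_symbol, tokens):
    """Return True if tokens can be derived from start_symbol.

    `grammar` maps each non-terminal to a list of right-hand sides (tuples of
    symbols); a symbol is treated as a non-terminal iff it is a key of `grammar`.
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start_symbol]:
        chart[0].add(Item(start_symbol, rhs, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            item = worklist.pop()
            if item.dot < len(item.rhs):
                sym = item.rhs[item.dot]
                if sym in grammar:                        # PREDICT
                    for rhs in grammar[sym]:
                        new = Item(sym, rhs, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
                elif i < n and tokens[i] == sym:          # SCAN
                    chart[i + 1].add(Item(item.lhs, item.rhs, item.dot + 1, item.start))
            else:                                         # COMPLETE
                for parent in list(chart[item.start]):
                    if (parent.dot < len(parent.rhs)
                            and parent.rhs[parent.dot] == item.lhs):
                        new = Item(parent.lhs, parent.rhs, parent.dot + 1, parent.start)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
    return any(it.lhs == start_symbol and it.dot == len(it.rhs) and it.start == 0
               for it in chart[n])

# Left-recursive grammar E -> E + n | n, which top-down LL parsing cannot handle:
G = {"E": [("E", "+", "n"), ("n",)]}
print(earley_recognize(G, "E", ["n", "+", "n"]))   # True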

Time Complexity Comparison

Algorithm   Grammar Class     Time Complexity
LL(k)       LL(k) grammars    $O(n)$
LR(k)       LR(k) grammars    $O(n)$
CYK         Grammars in CNF   $O(n^3)$
Earley      All CFGs          $O(n^3)$ general, $O(n^2)$ unambiguous

Error Recovery Strategies

Panic Mode

Skip symbols in the input until a synchronizing token is found.
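
In a hand-written parser this often amounts to a small helper like the Python sketch below; the synchronizing set shown (statement terminators and block closers) is an illustrative assumption and varies by language.

SYNC_TOKENS = {";", "}", "EOF"}   # illustrative synchronizing set

def recover(tokens, pos, errors):
    """Panic-mode recovery: skip tokens until a synchronizing token appears.

    Returns the position of the next synchronizing token so parsing can resume
    from a well-understood point instead of emitting cascading errors.
    """
    errors.append(f"syntax error near token {pos}")
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return pos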

Phrase Level

Attempt to replace, insert, or delete symbols to correct the error.

Error Productions

Augment the grammar with rules that describe common syntactic errors.

Global Correction

Find the minimum number of edit operations to transform the input into a valid sentence.

Formal Properties of Parsing

Decidability

The problem of determining whether a string is in a context-free language is decidable. This means there exists an algorithm that always terminates and correctly answers whether a given string is in the language generated by a given CFG.

Parser Generation

The theory of parser generation studies the automatic construction of parsers from grammar descriptions. Parser generators like YACC, Bison, and ANTLR implement this theory.

Applications in Compiler Design

In compiler design, parsing is a critical phase that builds the structural representation of the program:

  1. Lexical Analysis produces tokens
  2. Parsing organizes tokens into a parse tree
  3. Semantic Analysis adds meaning to the parse tree
  4. Intermediate Code Generation translates the parse tree into a form closer to the target language

The choice of parsing algorithm affects the types of languages that can be parsed, the efficiency of the parser, and the complexity of error reporting.

Conclusion

Parsing theory provides the mathematical foundation for understanding how to analyze and process structured languages. The rich variety of parsing algorithms offers different trade-offs between expressiveness, efficiency, and implementation complexity, allowing language designers and compiler developers to choose the approach best suited to their needs.