Parsing Theory and Algorithms
Introduction to Parsing
Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar. In the context of compilers, parsing transforms a sequence of tokens into a parse tree or abstract syntax tree, which represents the syntactic structure of the program.
Parsing Approaches
Parsing algorithms fall into two broad families: top-down methods, which grow the parse tree from the start symbol toward the input, and bottom-up methods, which grow it from the input toward the start symbol. Both are examined after the mathematical preliminaries below.
Mathematical Foundations
Parse Trees and Derivations
For a context-free grammar G = (N, Σ, P, S), a parse tree is a tree where:
- The root is labeled with the start symbol S
- Each leaf node is labeled with a terminal symbol or ε
- Each interior node is labeled with a non-terminal
- If an interior node is labeled with A and its children are labeled (from left to right) with X_1, X_2, …, X_n, then A → X_1 X_2 … X_n is a production in P
Sentential Forms
A sentential form is any string α of terminals and non-terminals that can be derived from the start symbol, i.e. any α ∈ (N ∪ Σ)* with S ⇒* α.
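For example, with productions S → aSb and S → ε, the derivation S ⇒ aSb ⇒ aaSbb ⇒ aabb shows that S, aSb, aaSbb, and aabb are all sentential forms; the last one, consisting only of terminals, is a sentence of the language.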
Top-Down Parsing
Top-down parsing starts with the root symbol and attempts to build the parse tree by expanding non-terminals to match the input string.
LL(k) Parsing
LL(k) parsers use k tokens of lookahead to determine which production to apply next. The "LL" stands for "Left-to-right scanning, Leftmost derivation."
For an LL(1) grammar, we can build a parsing table M where the entry M[A, a] holds the unique production to apply when non-terminal A is on top of the stack and a is the current input symbol.
The LL(1) parsing algorithm uses a stack and an input buffer:
Initialize stack with the start symbol S
Initialize input buffer with the input string followed by $
While stack is not empty:
    If top of stack is a terminal:
        If it matches the current input symbol:
            Pop stack and advance input
        Else:
            Error
    Else (top of stack is a non-terminal A):
        If the parsing table has an entry M[A, a] = A → γ for the current input symbol a:
            Pop A and push γ (in reverse order)
        Else:
            Error
Accept if the input is exhausted (current symbol is $)
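As a concrete illustration, the following Python sketch runs this table-driven loop for the toy grammar S → aSb | ε, with the parsing table built by hand from its FIRST and FOLLOW sets (the grammar, table layout, and function name are choices made for this example, not output of any particular tool):

```python
# Table-driven LL(1) parser sketch for the toy grammar:
#   S -> a S b | epsilon
# The table is built by hand from FIRST/FOLLOW: FIRST(aSb) = {a},
# FOLLOW(S) = {b, $}, so M[S, a] = aSb and M[S, b] = M[S, $] = epsilon.
TABLE = {
    ('S', 'a'): ['a', 'S', 'b'],   # apply S -> a S b
    ('S', 'b'): [],                # apply S -> epsilon
    ('S', '$'): [],                # apply S -> epsilon
}
NONTERMINALS = {'S'}

def ll1_parse(tokens):
    """Return True if `tokens` (a string of 'a'/'b') is derivable from S."""
    stack = ['$', 'S']                     # end marker below the start symbol
    buf = list(tokens) + ['$']
    pos = 0
    while stack:
        top = stack.pop()
        cur = buf[pos]
        if top == '$':                     # stack exhausted: accept iff input is too
            return cur == '$'
        if top in NONTERMINALS:
            prod = TABLE.get((top, cur))
            if prod is None:               # empty table entry: syntax error
                return False
            stack.extend(reversed(prod))   # push right-hand side in reverse
        elif top == cur:                   # terminal on top must match the input
            pos += 1
        else:
            return False
    return False

print(ll1_parse("aabb"))   # True
print(ll1_parse("aab"))    # False
```

Pushing the end marker beneath the start symbol lets the driver detect acceptance explicitly, a common variation on the pseudocode above.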
First and Follow Sets
To build an LL(1) parsing table, we need to compute FIRST and FOLLOW sets:
- FIRST(α) = the set of terminals that can begin strings derived from α (including ε if α ⇒* ε)
- FOLLOW(A) = the set of terminals that can appear immediately after non-terminal A in some sentential form
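Computing these sets is an iteration to a fixed point. The sketch below computes FIRST sets in Python for a small illustrative grammar; the dictionary-based grammar encoding and the ε marker are assumptions made for the example, and FOLLOW can be computed by a similar iteration:

```python
# Fixed-point computation of FIRST sets for a small CFG.
# Grammar encoding (illustrative): non-terminal -> list of productions,
# each production a tuple of symbols; the empty tuple () is an epsilon rule.
GRAMMAR = {
    'E': [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],          # E' -> + T E' | epsilon
    'T': [('(', 'E', ')'), ('id',)],
}
NONTERMS = set(GRAMMAR)
EPS = 'ε'

def first_sets(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:                          # repeat until no set grows
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                all_nullable = True
                for X in prod:
                    if X in NONTERMS:
                        new = first[X] - {EPS}
                        if not new <= first[A]:
                            first[A] |= new
                            changed = True
                        if EPS not in first[X]:
                            all_nullable = False
                            break
                    else:                   # a terminal begins the derived string
                        if X not in first[A]:
                            first[A].add(X)
                            changed = True
                        all_nullable = False
                        break
                if all_nullable and EPS not in first[A]:
                    first[A].add(EPS)       # the whole right-hand side can vanish
                    changed = True
    return first

for A, s in sorted(first_sets(GRAMMAR).items()):
    print(A, sorted(s))
```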
Bottom-Up Parsing
Bottom-up parsing starts with the input string and attempts to build the parse tree by reducing substrings to non-terminals until reaching the start symbol.
LR(k) Parsing
LR(k) parsers use k tokens of lookahead to determine when to reduce. The "LR" stands for "Left-to-right scanning, Rightmost derivation in reverse."
LR parsing uses a shift-reduce algorithm with a stack and input buffer:
Initialize stack with the initial state s0
Initialize input buffer with the input string followed by $
While True:
    Let s be the state on top of the stack
    Let a be the current input symbol
    If Action[s, a] = "shift s'":
        Push a and s' onto the stack
        Advance input
    Else if Action[s, a] = "reduce A → β":
        Pop 2 * |β| entries from the stack (removing β and its states)
        Let s' be the state now on top of the stack
        Push A and Goto[s', A] onto the stack
    Else if Action[s, a] = "accept":
        Success
    Else:
        Error
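To make the driver concrete, here is a Python sketch that runs it over ACTION/GOTO tables written out by hand for the toy grammar E → E + n | n (an SLR(1) table constructed for illustration; note that this stack holds states only, so a reduction pops |β| entries rather than 2·|β|):

```python
# Shift-reduce driver sketch for the toy grammar:
#   (1) E -> E + n      (2) E -> n
# ACTION/GOTO encode a hand-built SLR(1) table: ('s', k) = shift to state k,
# ('r', A, m) = reduce by a production A -> beta with |beta| = m, ('acc',) = accept.
ACTION = {
    (0, 'n'): ('s', 2),
    (1, '+'): ('s', 3), (1, '$'): ('acc',),
    (2, '+'): ('r', 'E', 1), (2, '$'): ('r', 'E', 1),   # reduce by E -> n
    (3, 'n'): ('s', 4),
    (4, '+'): ('r', 'E', 3), (4, '$'): ('r', 'E', 3),   # reduce by E -> E + n
}
GOTO = {(0, 'E'): 1}

def lr_parse(tokens):
    """Return True if the token list (e.g. ['n', '+', 'n']) is accepted."""
    stack = [0]                        # stack of states; grammar symbols are implicit
    buf = list(tokens) + ['$']
    pos = 0
    while True:
        s, a = stack[-1], buf[pos]
        act = ACTION.get((s, a))
        if act is None:                # empty table entry: syntax error
            return False
        if act[0] == 's':              # shift: push the new state, consume input
            stack.append(act[1])
            pos += 1
        elif act[0] == 'r':            # reduce A -> beta: pop |beta| states,
            _, A, m = act              # then follow GOTO from the exposed state
            del stack[len(stack) - m:]
            stack.append(GOTO[(stack[-1], A)])
        else:                          # ('acc',)
            return True

print(lr_parse(['n', '+', 'n']))   # True
print(lr_parse(['n', '+']))        # False
```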
LR Parsing Variants
- SLR (Simple LR): Uses FOLLOW sets to decide when to reduce, resolving conflicts left by the LR(0) automaton
- LALR (Look-Ahead LR): Merges similar states of canonical LR parsers
- CLR (Canonical LR): The most powerful deterministic bottom-up parsing technique
Chart Parsing Algorithms
CYK (Cocke-Younger-Kasami) Algorithm
The CYK algorithm is a dynamic programming approach for parsing in O(n³) time, where n is the length of the input string. It requires the grammar to be in Chomsky Normal Form.
For a string w = a_1 a_2 … a_n, we fill a table T where T[i, j] is the set of non-terminals that can derive the substring a_i … a_j.
Algorithm:
Initialize table T with empty sets
For i = 1 to n:
    For each rule A → a_i:
        Add A to T[i, i]
For l = 2 to n:
    For i = 1 to n - l + 1:
        j = i + l - 1
        For k = i to j - 1:
            For each rule A → BC:
                If B in T[i, k] and C in T[k+1, j]:
                    Add A to T[i, j]
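A direct Python transcription of this table-filling loop might look like the sketch below, using an illustrative CNF grammar for the language a^n b^n (the encoding of rules as sets of unary and binary rules is an assumption of the example):

```python
# CYK recognizer sketch for a grammar in Chomsky Normal Form.
# Illustrative CNF grammar for the language a^n b^n (n >= 1):
#   S -> A X | A B,   X -> S B,   A -> a,   B -> b
UNARY = {('A', 'a'), ('B', 'b')}                               # rules A -> terminal
BINARY = {('S', 'A', 'X'), ('S', 'A', 'B'), ('X', 'S', 'B')}   # rules A -> B C

def cyk(w):
    """Return True if string w is derivable from the start symbol 'S'."""
    n = len(w)
    if n == 0:
        return False
    # T[i][j] = set of non-terminals deriving w[i..j] (0-based, inclusive)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):                 # spans of length 1
        for A, a in UNARY:
            if a == ch:
                T[i][i].add(A)
    for l in range(2, n + 1):                  # span length
        for i in range(n - l + 1):             # span start
            j = i + l - 1                      # span end
            for k in range(i, j):              # split point
                for A, B, C in BINARY:
                    if B in T[i][k] and C in T[k + 1][j]:
                        T[i][j].add(A)
    return 'S' in T[0][n - 1]

print(cyk("aabb"))   # True
print(cyk("aab"))    # False
```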
Earley's Algorithm
Earley's algorithm is a chart parsing method that handles all context-free grammars, with O(n³) time complexity in the general case and O(n²) for unambiguous grammars.
It maintains, for each input position, a set of "Earley items" of the form:
[A → α • β, i]
where A → α • β represents a production with a dot indicating how much of the right-hand side has been recognized, and i is the position in the input where recognition of this item began.
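A compact recognizer built directly on these items might look like the Python sketch below; the grammar encoding is an assumption of the example, and the sketch side-steps the well-known ε-production subtlety by assuming a grammar without ε rules:

```python
# Earley recognizer sketch. An item is a tuple (head, body, dot, origin):
# `dot` marks how much of `body` has been recognized so far and `origin`
# is the chart position where recognition of the item began.
GRAMMAR = {                       # illustrative left-recursive grammar for sums
    'S': [('E',)],
    'E': [('E', '+', 'n'), ('n',)],
}
NONTERMS = set(GRAMMAR)
START = 'S'

def earley(tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for body in GRAMMAR[START]:
        chart[0].add((START, body, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in NONTERMS:                       # PREDICT
                    for rhs in GRAMMAR[sym]:
                        item = (sym, rhs, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < n and tokens[i] == sym:          # SCAN
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:                                         # COMPLETE
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
    # accept iff a finished start-symbol item spanning the whole input exists
    return any(h == START and d == len(b) and o == 0
               for h, b, d, o in chart[n])

print(earley(['n', '+', 'n']))   # True
print(earley(['n', '+']))        # False
```

Because the grammar above is left-recursive, the same input would defeat a naive recursive-descent parser; handling such grammars without transformation is exactly the strength of Earley's algorithm.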
Time Complexity Comparison
Algorithm | Grammar Class | Time Complexity |
---|---|---|
LL(k) | LL(k) grammars | O(n) |
LR(k) | LR(k) grammars | O(n) |
CYK | CFGs in Chomsky Normal Form | O(n³) |
Earley | All CFGs | O(n³) general, O(n²) unambiguous |
Error Recovery Strategies
Panic Mode
Skip symbols in the input until a synchronizing token is found.
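For instance, a hand-written parser might recover roughly as follows (the token representation and the particular choice of synchronizing tokens are assumptions of the sketch):

```python
# Panic-mode recovery sketch: on a syntax error, discard input tokens until a
# synchronizing token is reached, then let the parser resume from that point.
SYNC_TOKENS = {';', '}', '$'}        # illustrative synchronizers

def recover(tokens, pos, message):
    """Report the error, then skip ahead to the next synchronizing token."""
    print(f"syntax error at token {pos}: {message}")
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                     # discard the offending tokens
    return pos                       # parsing resumes at the synchronizing token
```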
Phrase Level
Attempt to replace, insert, or delete symbols to correct the error.
Error Productions
Augment the grammar with rules that describe common syntactic errors.
Global Correction
Find the minimum number of edit operations to transform the input into a valid sentence.
Formal Properties of Parsing
Decidability
The problem of determining whether a string is in a context-free language is decidable. This means there exists an algorithm that always terminates and correctly answers whether a given string is in the language generated by a given CFG.
Parser Generation
The theory of parser generation studies the automatic construction of parsers from grammar descriptions. Parser generators like YACC, Bison, and ANTLR implement this theory.
Applications in Compiler Design
In compiler design, parsing is a critical phase that builds the structural representation of the program:
- Lexical Analysis produces tokens
- Parsing organizes tokens into a parse tree
- Semantic Analysis adds meaning to the parse tree
- Intermediate Code Generation translates the parse tree into a form closer to the target language
The choice of parsing algorithm affects the types of languages that can be parsed, the efficiency of the parser, and the complexity of error reporting.
Conclusion
Parsing theory provides the mathematical foundation for understanding how to analyze and process structured languages. The rich variety of parsing algorithms offers different trade-offs between expressiveness, efficiency, and implementation complexity, allowing language designers and compiler developers to choose the approach best suited to their needs.