Syntax Analysis: Building a Parser
Syntax analysis, also known as parsing, is the second phase of the compilation process. It takes the tokens produced by the lexer and builds a hierarchical structure that represents the grammatical structure of the token stream according to the language's grammar.
What is a Parser?
A parser is a compiler component that builds a parse tree or abstract syntax tree (AST) from the tokens produced by the lexer. The parser checks if the token sequence conforms to the grammar rules of the programming language and organizes them into a structured representation.
Context-Free Grammars
Most programming languages are defined using context-free grammars (CFGs). A context-free grammar consists of:
- Terminal symbols: Tokens from the lexer (e.g., identifiers, keywords, operators)
- Non-terminal symbols: Grammar variables that can be replaced by other symbols
- Production rules: Rules that specify how non-terminals can be replaced
- Start symbol: The non-terminal that represents the complete program
Example Grammar
Here's a simple grammar for expressions:
expr → term (('+' | '-') term)*
term → factor (('*' | '/') factor)*
factor → INTEGER | '(' expr ')'
This grammar defines the rules for parsing mathematical expressions with proper operator precedence.
Types of Parsers
There are two main approaches to parsing:
Recursive Descent Parsing
Recursive descent parsing is a top-down parsing technique where a set of recursive procedures are used to process the input. Each procedure corresponds to a non-terminal in the grammar.
Implementing a Recursive Descent Parser in JavaScript
Let's implement a recursive descent parser for our simple expression grammar:
class Parser {
constructor(tokens) {
this.tokens = tokens;
this.current = 0;
}
// Helper methods
peek() {
return this.tokens[this.current];
}
previous() {
return this.tokens[this.current - 1];
}
isAtEnd() {
return this.peek().type === 'EOF';
}
advance() {
if (!this.isAtEnd()) this.current++;
return this.previous();
}
check(type, value = null) {
if (this.isAtEnd()) return false;
const token = this.peek();
if (value !== null) {
return token.type === type && token.value === value;
}
return token.type === type;
}
consume(type, value = null, errorMessage = 'Unexpected token') {
if (this.check(type, value)) {
return this.advance();
}
throw new Error(`${errorMessage}. Expected ${type}${value ? ` with value '${value}'` : ''}, got ${this.peek().type} with value '${this.peek().value}'`);
}
// Grammar rules implementation
// expr → term (('+' | '-') term)*
expression() {
let expr = this.term();
while (this.check('OPERATOR', '+') || this.check('OPERATOR', '-')) {
const operator = this.advance().value;
const right = this.term();
expr = {
type: 'BinaryExpression',
operator,
left: expr,
right
};
}
return expr;
}
// term → factor (('*' | '/') factor)*
term() {
let expr = this.factor();
while (this.check('OPERATOR', '*') || this.check('OPERATOR', '/')) {
const operator = this.advance().value;
const right = this.factor();
expr = {
type: 'BinaryExpression',
operator,
left: expr,
right
};
}
return expr;
}
// factor → NUMBER | '(' expr ')'
factor() {
if (this.check('NUMBER')) {
return {
type: 'Literal',
value: this.advance().value
};
}
if (this.check('IDENTIFIER')) {
return {
type: 'Identifier',
name: this.advance().value
};
}
if (this.check('PUNCTUATION', '(')) {
this.advance(); // Consume the opening parenthesis
const expr = this.expression();
this.consume('PUNCTUATION', ')', "Expected ')' after expression");
return {
type: 'GroupExpression',
expression: expr
};
}
throw new Error(`Unexpected token: ${this.peek().type} with value '${this.peek().value}'`);
}
// program → statement*
program() {
const statements = [];
while (!this.isAtEnd()) {
statements.push(this.statement());
}
return {
type: 'Program',
body: statements
};
}
// statement → expressionStatement | variableDeclaration | block
statement() {
if (this.check('KEYWORD', 'let') || this.check('KEYWORD', 'const')) {
return this.variableDeclaration();
}
if (this.check('PUNCTUATION', '{')) {
return this.block();
}
return this.expressionStatement();
}
// expressionStatement → expression ';'
expressionStatement() {
const expr = this.expression();
this.consume('PUNCTUATION', ';', "Expected ';' after expression");
return {
type: 'ExpressionStatement',
expression: expr
};
}
// variableDeclaration → ('let' | 'const') IDENTIFIER ('=' expression)? ';'
variableDeclaration() {
const kind = this.advance().value; // let or const
const name = this.consume('IDENTIFIER', null, 'Expected variable name').value;
let initializer = null;
if (this.check('OPERATOR', '=')) {
this.advance(); // Consume the '='
initializer = this.expression();
} else if (kind === 'const') {
throw new Error("Const declarations must be initialized");
}
this.consume('PUNCTUATION', ';', "Expected ';' after variable declaration");
return {
type: 'VariableDeclaration',
kind,
name,
initializer
};
}
// block → '{' statement* '}'
block() {
this.consume('PUNCTUATION', '{', "Expected '{'");
const statements = [];
while (!this.check('PUNCTUATION', '}') && !this.isAtEnd()) {
statements.push(this.statement());
}
this.consume('PUNCTUATION', '}', "Expected '}'");
return {
type: 'BlockStatement',
body: statements
};
}
parse() {
try {
return this.program();
} catch (error) {
console.error('Parsing error:', error.message);
return null;
}
}
}
// Example usage with tokens from the lexer
const sourceCode = `
let x = 10;
let y = 20;
const result = (x + y) * 2;
`;
// First, generate tokens with our lexer
const lexer = new Lexer(sourceCode); // Assuming Lexer is defined
const tokens = lexer.tokenize();
// Then parse the tokens
const parser = new Parser(tokens);
const ast = parser.parse();
console.log(JSON.stringify(ast, null, 2));
Parse Trees vs. Abstract Syntax Trees
While parsing, we can generate either:
- Parse Tree: A concrete representation that shows every step of the derivation according to the grammar.
- Abstract Syntax Tree (AST): A simplified tree that captures the essential structure and omits derivation details.
Handling Operator Precedence
The grammar rules we defined automatically handle operator precedence:
- Multiplication and division have higher precedence (handled in the
term
rule) - Addition and subtraction have lower precedence (handled in the
expression
rule)
Error Handling in Parsers
Robust parsers include error recovery strategies:
Conclusion
Syntax analysis is a crucial phase of compilation that organizes tokens into a structured representation. Recursive descent parsing is a powerful and intuitive approach for implementing parsers.
In the next section, we'll explore how to work with Abstract Syntax Trees and how they are used for semantic analysis and code generation.