Syntax Analysis: Building a Parser

Syntax analysis, also known as parsing, is the second phase of the compilation process. It takes the tokens produced by the lexer and builds a hierarchical structure that represents the grammatical structure of the token stream according to the language's grammar.

What is a Parser?

A parser is a compiler component that builds a parse tree or abstract syntax tree (AST) from the tokens produced by the lexer. The parser checks if the token sequence conforms to the grammar rules of the programming language and organizes them into a structured representation.

Context-Free Grammars

Most programming languages are defined using context-free grammars (CFGs). A context-free grammar consists of:

Terminal symbols: Tokens from the lexer (e.g., identifiers, keywords, operators)
Non-terminal symbols: Grammar variables that can be replaced by other symbols
Production rules: Rules that specify how non-terminals can be replaced
Start symbol: The non-terminal that represents the complete program

Example Grammar

Here's a simple grammar for expressions:

expr    → term (('+' | '-') term)*
term    → factor (('*' | '/') factor)*
factor  → INTEGER | '(' expr ')'

This grammar defines the rules for parsing mathematical expressions with proper operator precedence.

Types of Parsers

There are two main approaches to parsing:

Recursive Descent Parsing

Recursive descent parsing is a top-down parsing technique where a set of recursive procedures are used to process the input. Each procedure corresponds to a non-terminal in the grammar.

Implementing a Recursive Descent Parser in JavaScript

Let's implement a recursive descent parser for our simple expression grammar:

class Parser {
  constructor(tokens) {
    this.tokens = tokens;
    this.current = 0;
  }

  // Helper methods
  peek() {
    return this.tokens[this.current];
  }

  previous() {
    return this.tokens[this.current - 1];
  }

  isAtEnd() {
    return this.peek().type === 'EOF';
  }

  advance() {
    if (!this.isAtEnd()) this.current++;
    return this.previous();
  }

  check(type, value = null) {
    if (this.isAtEnd()) return false;
    
    const token = this.peek();
    if (value !== null) {
      return token.type === type && token.value === value;
    }
    return token.type === type;
  }

  consume(type, value = null, errorMessage = 'Unexpected token') {
    if (this.check(type, value)) {
      return this.advance();
    }
    
    throw new Error(`${errorMessage}. Expected ${type}${value ? ` with value '${value}'` : ''}, got ${this.peek().type} with value '${this.peek().value}'`);
  }

  // Grammar rules implementation
  
  // expr → term (('+' | '-') term)*
  expression() {
    let expr = this.term();
    
    while (this.check('OPERATOR', '+') || this.check('OPERATOR', '-')) {
      const operator = this.advance().value;
      const right = this.term();
      expr = {
        type: 'BinaryExpression',
        operator,
        left: expr,
        right
      };
    }
    
    return expr;
  }
  
  // term → factor (('*' | '/') factor)*
  term() {
    let expr = this.factor();
    
    while (this.check('OPERATOR', '*') || this.check('OPERATOR', '/')) {
      const operator = this.advance().value;
      const right = this.factor();
      expr = {
        type: 'BinaryExpression',
        operator,
        left: expr,
        right
      };
    }
    
    return expr;
  }
  
  // factor → NUMBER | '(' expr ')'
  factor() {
    if (this.check('NUMBER')) {
      return {
        type: 'Literal',
        value: this.advance().value
      };
    }
    
    if (this.check('IDENTIFIER')) {
      return {
        type: 'Identifier',
        name: this.advance().value
      };
    }
    
    if (this.check('PUNCTUATION', '(')) {
      this.advance(); // Consume the opening parenthesis
      const expr = this.expression();
      this.consume('PUNCTUATION', ')', "Expected ')' after expression");
      return {
        type: 'GroupExpression',
        expression: expr
      };
    }
    
    throw new Error(`Unexpected token: ${this.peek().type} with value '${this.peek().value}'`);
  }
  
  // program → statement*
  program() {
    const statements = [];
    
    while (!this.isAtEnd()) {
      statements.push(this.statement());
    }
    
    return {
      type: 'Program',
      body: statements
    };
  }
  
  // statement → expressionStatement | variableDeclaration | block
  statement() {
    if (this.check('KEYWORD', 'let') || this.check('KEYWORD', 'const')) {
      return this.variableDeclaration();
    }
    
    if (this.check('PUNCTUATION', '{')) {
      return this.block();
    }
    
    return this.expressionStatement();
  }
  
  // expressionStatement → expression ';'
  expressionStatement() {
    const expr = this.expression();
    this.consume('PUNCTUATION', ';', "Expected ';' after expression");
    
    return {
      type: 'ExpressionStatement',
      expression: expr
    };
  }
  
  // variableDeclaration → ('let' | 'const') IDENTIFIER ('=' expression)? ';'
  variableDeclaration() {
    const kind = this.advance().value; // let or const
    const name = this.consume('IDENTIFIER', null, 'Expected variable name').value;
    
    let initializer = null;
    if (this.check('OPERATOR', '=')) {
      this.advance(); // Consume the '='
      initializer = this.expression();
    } else if (kind === 'const') {
      throw new Error("Const declarations must be initialized");
    }
    
    this.consume('PUNCTUATION', ';', "Expected ';' after variable declaration");
    
    return {
      type: 'VariableDeclaration',
      kind,
      name,
      initializer
    };
  }
  
  // block → '{' statement* '}'
  block() {
    this.consume('PUNCTUATION', '{', "Expected '{'");
    
    const statements = [];
    while (!this.check('PUNCTUATION', '}') && !this.isAtEnd()) {
      statements.push(this.statement());
    }
    
    this.consume('PUNCTUATION', '}', "Expected '}'");
    
    return {
      type: 'BlockStatement',
      body: statements
    };
  }
  
  parse() {
    try {
      return this.program();
    } catch (error) {
      console.error('Parsing error:', error.message);
      return null;
    }
  }
}

// Example usage with tokens from the lexer
const sourceCode = `
let x = 10;
let y = 20;
const result = (x + y) * 2;
`;

// First, generate tokens with our lexer
const lexer = new Lexer(sourceCode); // Assuming Lexer is defined
const tokens = lexer.tokenize();

// Then parse the tokens
const parser = new Parser(tokens);
const ast = parser.parse();

console.log(JSON.stringify(ast, null, 2));

Parse Trees vs. Abstract Syntax Trees

While parsing, we can generate either:

Parse Tree: A concrete representation that shows every step of the derivation according to the grammar.
Abstract Syntax Tree (AST): A simplified tree that captures the essential structure and omits derivation details.

Handling Operator Precedence

The grammar rules we defined automatically handle operator precedence:

Multiplication and division have higher precedence (handled in the term rule)
Addition and subtraction have lower precedence (handled in the expression rule)

Error Handling in Parsers

Robust parsers include error recovery strategies:

Conclusion

Syntax analysis is a crucial phase of compilation that organizes tokens into a structured representation. Recursive descent parsing is a powerful and intuitive approach for implementing parsers.

In the next section, we'll explore how to work with Abstract Syntax Trees and how they are used for semantic analysis and code generation.

What is a Parser?​

Context-Free Grammars​

Example Grammar​

Types of Parsers​

Recursive Descent Parsing​

Implementing a Recursive Descent Parser in JavaScript​

Parse Trees vs. Abstract Syntax Trees​

Handling Operator Precedence​

Error Handling in Parsers​

Conclusion​