Lexical Analysis: Building a Lexer

Lexical analysis is the first phase of a compiler. It converts a sequence of characters into a sequence of tokens. These tokens are meaningful units of the programming language, such as keywords, identifiers, operators, and literals.

What is a Lexer?

A lexer (also called a scanner or tokenizer) breaks down the source code into the language's basic components (tokens). It reads the source code character by character to recognize patterns that match the token specifications of the language.

The Tokenization Process

Token Types

Common token types include:

  • Keywords: Reserved words like if, while, return
  • Identifiers: Variable and function names
  • Literals: Numbers, strings, booleans
  • Operators: +, -, *, /, etc.
  • Punctuation: ,, ;, (, ), etc.
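
As a concrete illustration, the single statement let x = 42; might map onto these categories as follows (the { type, value } token shape matches the lexer implemented below):

```javascript
// Token stream a lexer might produce for the statement: let x = 42;
const tokens = [
  { type: 'KEYWORD',     value: 'let' }, // reserved word
  { type: 'IDENTIFIER',  value: 'x' },   // variable name
  { type: 'OPERATOR',    value: '=' },   // assignment operator
  { type: 'NUMBER',      value: 42 },    // numeric literal
  { type: 'PUNCTUATION', value: ';' },   // statement terminator
];

console.log(tokens.map(t => t.type).join(' '));
// KEYWORD IDENTIFIER OPERATOR NUMBER PUNCTUATION
```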

Building a Lexer in JavaScript

Let's implement a lexer in JavaScript for a simple language:

class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    // ?? null handles empty input, so getNextToken terminates immediately
    this.currentChar = this.input[this.position] ?? null;
    this.keywords = ['let', 'const', 'if', 'else', 'while', 'for', 'function', 'return'];
  }

  advance() {
    this.position++;
    if (this.position < this.input.length) {
      this.currentChar = this.input[this.position];
    } else {
      this.currentChar = null; // End of input
    }
  }

  skipWhitespace() {
    while (this.currentChar !== null && /\s/.test(this.currentChar)) {
      this.advance();
    }
  }

  isAlpha(char) {
    return /[a-zA-Z_]/.test(char);
  }

  isAlphaNumeric(char) {
    return /[a-zA-Z0-9_]/.test(char);
  }

  isDigit(char) {
    return /[0-9]/.test(char);
  }

  identifier() {
    let result = '';
    while (this.currentChar !== null && this.isAlphaNumeric(this.currentChar)) {
      result += this.currentChar;
      this.advance();
    }

    // Check if the identifier is a keyword
    if (this.keywords.includes(result)) {
      return { type: 'KEYWORD', value: result };
    }

    return { type: 'IDENTIFIER', value: result };
  }

  number() {
    let result = '';
    let isFloat = false;

    while (this.currentChar !== null &&
           (this.isDigit(this.currentChar) || this.currentChar === '.')) {
      if (this.currentChar === '.') {
        // Ensure we only have one decimal point
        if (isFloat) break;
        isFloat = true;
      }
      result += this.currentChar;
      this.advance();
    }

    return {
      type: 'NUMBER',
      value: isFloat ? parseFloat(result) : parseInt(result, 10)
    };
  }

  string() {
    const quote = this.currentChar; // Store the opening quote type (' or ")
    let result = '';
    this.advance(); // Skip the opening quote

    while (this.currentChar !== null && this.currentChar !== quote) {
      // Handle escape sequences
      if (this.currentChar === '\\') {
        this.advance();
        if (this.currentChar === null) break; // Backslash at end of input
        if (this.currentChar === 'n') result += '\n';
        else if (this.currentChar === 't') result += '\t';
        else result += this.currentChar; // For \', \", \\, etc.
      } else {
        result += this.currentChar;
      }
      this.advance();
    }

    if (this.currentChar === quote) {
      this.advance(); // Skip the closing quote
    } else {
      throw new Error('Unterminated string');
    }

    return { type: 'STRING', value: result };
  }

  getNextToken() {
    while (this.currentChar !== null) {

      // Skip whitespace
      if (/\s/.test(this.currentChar)) {
        this.skipWhitespace();
        continue;
      }

      // Identify tokens
      if (this.isAlpha(this.currentChar)) {
        return this.identifier();
      }

      if (this.isDigit(this.currentChar)) {
        return this.number();
      }

      if (this.currentChar === '"' || this.currentChar === "'") {
        return this.string();
      }

      // Handle operators and punctuation
      switch (this.currentChar) {
        case '+':
          this.advance();
          return { type: 'OPERATOR', value: '+' };
        case '-':
          this.advance();
          return { type: 'OPERATOR', value: '-' };
        case '*':
          this.advance();
          return { type: 'OPERATOR', value: '*' };
        case '/':
          this.advance();
          return { type: 'OPERATOR', value: '/' };
        case '=':
          this.advance();
          if (this.currentChar === '=') {
            this.advance();
            return { type: 'OPERATOR', value: '==' };
          }
          return { type: 'OPERATOR', value: '=' };
        case ',': // Needed to lex parameter lists like (width, height)
          this.advance();
          return { type: 'PUNCTUATION', value: ',' };
        case ';':
          this.advance();
          return { type: 'PUNCTUATION', value: ';' };
        case '(':
          this.advance();
          return { type: 'PUNCTUATION', value: '(' };
        case ')':
          this.advance();
          return { type: 'PUNCTUATION', value: ')' };
        case '{':
          this.advance();
          return { type: 'PUNCTUATION', value: '{' };
        case '}':
          this.advance();
          return { type: 'PUNCTUATION', value: '}' };
        default:
          throw new Error(`Unexpected character: ${this.currentChar}`);
      }
    }

    return { type: 'EOF', value: null };
  }

  tokenize() {
    const tokens = [];
    let token = this.getNextToken();

    while (token.type !== 'EOF') {
      tokens.push(token);
      token = this.getNextToken();
    }

    tokens.push(token); // Push the EOF token
    return tokens;
  }
}

// Example usage
const sourceCode = `
function calculateArea(width, height) {
  let area = width * height;
  return area;
}
`;

const lexer = new Lexer(sourceCode);
const tokens = lexer.tokenize();
console.log(JSON.stringify(tokens, null, 2));

Understanding the Output

For the example above, the lexer would produce tokens like:

[
  { type: 'KEYWORD', value: 'function' },
  { type: 'IDENTIFIER', value: 'calculateArea' },
  { type: 'PUNCTUATION', value: '(' },
  { type: 'IDENTIFIER', value: 'width' },
  { type: 'PUNCTUATION', value: ',' },
  { type: 'IDENTIFIER', value: 'height' },
  { type: 'PUNCTUATION', value: ')' },
  { type: 'PUNCTUATION', value: '{' },
  { type: 'KEYWORD', value: 'let' },
  { type: 'IDENTIFIER', value: 'area' },
  { type: 'OPERATOR', value: '=' },
  { type: 'IDENTIFIER', value: 'width' },
  { type: 'OPERATOR', value: '*' },
  { type: 'IDENTIFIER', value: 'height' },
  { type: 'PUNCTUATION', value: ';' },
  { type: 'KEYWORD', value: 'return' },
  { type: 'IDENTIFIER', value: 'area' },
  { type: 'PUNCTUATION', value: ';' },
  { type: 'PUNCTUATION', value: '}' },
  { type: 'EOF', value: null }
]

Regular Expressions in Lexical Analysis

Many lexers use regular expressions for pattern matching: each token type is described by a pattern, and the lexer tries the patterns at the current input position to recognize the next token.
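
A minimal sketch of that approach (not the class above): each rule pairs a token type with a sticky (y-flagged) regular expression, so every match is anchored exactly where the previous token ended. Keyword lookup and error recovery are omitted for brevity, and values are kept as raw strings:

```javascript
// Regex-driven tokenizer sketch. Rules are tried in order at the
// current position; the sticky 'y' flag anchors each match at lastIndex.
const rules = [
  { type: 'WHITESPACE',  re: /\s+/y },
  { type: 'NUMBER',      re: /\d+(\.\d+)?/y },
  { type: 'IDENTIFIER',  re: /[a-zA-Z_]\w*/y },
  { type: 'OPERATOR',    re: /==|[+\-*\/=]/y },
  { type: 'PUNCTUATION', re: /[,;(){}]/y },
];

function regexTokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const rule of rules) {
      rule.re.lastIndex = pos; // Anchor the pattern at the current position
      const m = rule.re.exec(input);
      if (m) {
        if (rule.type !== 'WHITESPACE') {
          tokens.push({ type: rule.type, value: m[0] });
        }
        pos = rule.re.lastIndex; // Continue after the matched lexeme
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character: ${input[pos]}`);
  }
  return tokens;
}

console.log(regexTokenize('x = 1 + 2;').map(t => t.value).join(' '));
// x = 1 + 2 ;
```

Note that rule order matters: the two-character == must be listed before the single = alternative so the longest match wins.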

Lexer State Management

Lexers often maintain internal state to handle context-dependent constructs, such as nested comments, string interpolation, or indentation-based blocks.
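
As an example of why such state is needed: nestable block comments cannot be skipped with a single regular expression, because recognizing them requires counting nesting depth. A minimal sketch, assuming comments are delimited by /* and */ and may nest:

```javascript
// Sketch: skipping nestable /* ... */ comments requires a piece of
// lexer state (a nesting-depth counter) that one regex cannot express.
function skipBlockComment(input, pos) {
  // Assumes input at pos begins with the opening '/*'.
  let depth = 0;
  while (pos < input.length) {
    if (input.startsWith('/*', pos)) {
      depth++;
      pos += 2;
    } else if (input.startsWith('*/', pos)) {
      depth--;
      pos += 2;
      if (depth === 0) return pos; // Position just after the outer comment
    } else {
      pos++;
    }
  }
  throw new Error('Unterminated block comment');
}
```

A lexer would call this when it sees /* and resume normal tokenization at the returned position; for example, skipBlockComment('/* a /* b */ c */x', 0) returns the index of the x.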

Conclusion

The lexer is the foundation of a compiler's front-end. It breaks down the source code into tokens that the parser can use to build the syntax tree. A well-designed lexer makes the parsing phase more straightforward and efficient.

In the next section, we'll learn about syntax analysis and how to build a parser that can understand the structure of our programming language.