Lexical Analysis: Building a Lexer

Lexical analysis is the first phase of a compiler. It converts a sequence of characters into a sequence of tokens. These tokens are meaningful units of the programming language, such as keywords, identifiers, operators, and literals.

What is a Lexer?

A lexer (also called a scanner or tokenizer) breaks down the source code into the language's basic components (tokens). It reads the source code character by character to recognize patterns that match the token specifications of the language.

The Tokenization Process

Token Types

Common token types include:

  • Keywords: Reserved words like if, while, return
  • Identifiers: Variable and function names
  • Literals: Numbers, strings, booleans
  • Operators: +, -, *, /, etc.
  • Punctuation: ,, ;, (, ), etc.
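
As a concrete illustration, the single statement let x = 42; might map onto these categories as follows (the { type, value } token shape matches the lexer implemented below):

```javascript
// Token stream a lexer might produce for the statement: let x = 42;
const tokens = [
  { type: 'KEYWORD',     value: 'let' }, // reserved word
  { type: 'IDENTIFIER',  value: 'x' },   // variable name
  { type: 'OPERATOR',    value: '=' },   // assignment operator
  { type: 'NUMBER',      value: 42 },    // numeric literal
  { type: 'PUNCTUATION', value: ';' },   // statement terminator
];

console.log(tokens.map(t => t.type).join(' '));
// KEYWORD IDENTIFIER OPERATOR NUMBER PUNCTUATION
```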

Building a Lexer in JavaScript

Let's implement a lexer in JavaScript for a simple language:

class Lexer {
  constructor(input) {
    this.input = input;
    this.position = 0;
    // ?? null handles empty input, so getNextToken terminates immediately
    this.currentChar = this.input[this.position] ?? null;
    this.keywords = ['let', 'const', 'if', 'else', 'while', 'for', 'function', 'return'];
  }

  advance() {
    this.position++;
    if (this.position < this.input.length) {
      this.currentChar = this.input[this.position];
    } else {
      this.currentChar = null; // End of input
    }
  }

  skipWhitespace() {
    while (this.currentChar !== null && /\s/.test(this.currentChar)) {
      this.advance();
    }
  }

  isAlpha(char) {
    return /[a-zA-Z_]/.test(char);
  }

  isAlphaNumeric(char) {
    return /[a-zA-Z0-9_]/.test(char);
  }

  isDigit(char) {
    return /[0-9]/.test(char);
  }

  identifier() {
    let result = '';
    while (this.currentChar !== null && this.isAlphaNumeric(this.currentChar)) {
      result += this.currentChar;
      this.advance();
    }

    // Check if the identifier is a keyword
    if (this.keywords.includes(result)) {
      return { type: 'KEYWORD', value: result };
    }

    return { type: 'IDENTIFIER', value: result };
  }

  number() {
    let result = '';
    let isFloat = false;

    while (this.currentChar !== null &&
           (this.isDigit(this.currentChar) || this.currentChar === '.')) {
      if (this.currentChar === '.') {
        // Ensure we only have one decimal point
        if (isFloat) break;
        isFloat = true;
      }
      result += this.currentChar;
      this.advance();
    }

    return {
      type: 'NUMBER',
      value: isFloat ? parseFloat(result) : parseInt(result, 10)
    };
  }

  string() {
    const quote = this.currentChar; // Store the opening quote type (' or ")
    let result = '';
    this.advance(); // Skip the opening quote

    while (this.currentChar !== null && this.currentChar !== quote) {
      // Handle escape sequences
      if (this.currentChar === '\\') {
        this.advance();
        if (this.currentChar === null) break; // Backslash at end of input
        if (this.currentChar === 'n') result += '\n';
        else if (this.currentChar === 't') result += '\t';
        else result += this.currentChar; // For \', \", \\, etc.
      } else {
        result += this.currentChar;
      }
      this.advance();
    }

    if (this.currentChar === quote) {
      this.advance(); // Skip the closing quote
    } else {
      throw new Error('Unterminated string');
    }

    return { type: 'STRING', value: result };
  }

  getNextToken() {
    while (this.currentChar !== null) {

      // Skip whitespace
      if (/\s/.test(this.currentChar)) {
        this.skipWhitespace();
        continue;
      }

      // Identify tokens
      if (this.isAlpha(this.currentChar)) {
        return this.identifier();
      }

      if (this.isDigit(this.currentChar)) {
        return this.number();
      }

      if (this.currentChar === '"' || this.currentChar === "'") {
        return this.string();
      }

      // Handle operators and punctuation
      switch (this.currentChar) {
        case '+':
          this.advance();
          return { type: 'OPERATOR', value: '+' };
        case '-':
          this.advance();
          return { type: 'OPERATOR', value: '-' };
        case '*':
          this.advance();
          return { type: 'OPERATOR', value: '*' };
        case '/':
          this.advance();
          return { type: 'OPERATOR', value: '/' };
        case '=':
          this.advance();
          if (this.currentChar === '=') {
            this.advance();
            return { type: 'OPERATOR', value: '==' };
          }
          return { type: 'OPERATOR', value: '=' };
        case ',': // Needed to lex parameter lists like (width, height)
          this.advance();
          return { type: 'PUNCTUATION', value: ',' };
        case ';':
          this.advance();
          return { type: 'PUNCTUATION', value: ';' };
        case '(':
          this.advance();
          return { type: 'PUNCTUATION', value: '(' };
        case ')':
          this.advance();
          return { type: 'PUNCTUATION', value: ')' };
        case '{':
          this.advance();
          return { type: 'PUNCTUATION', value: '{' };
        case '}':
          this.advance();
          return { type: 'PUNCTUATION', value: '}' };
        default:
          throw new Error(`Unexpected character: ${this.currentChar}`);
      }
    }

    return { type: 'EOF', value: null };
  }

  tokenize() {
    const tokens = [];
    let token = this.getNextToken();

    while (token.type !== 'EOF') {
      tokens.push(token);
      token = this.getNextToken();
    }

    tokens.push(token); // Push the EOF token
    return tokens;
  }
}

// Example usage
const sourceCode = `
function calculateArea(width, height) {
  let area = width * height;
  return area;
}
`;

const lexer = new Lexer(sourceCode);
const tokens = lexer.tokenize();
console.log(JSON.stringify(tokens, null, 2));

Understanding the Output

For the example above, the lexer would produce tokens like:

[
  { type: 'KEYWORD', value: 'function' },
  { type: 'IDENTIFIER', value: 'calculateArea' },
  { type: 'PUNCTUATION', value: '(' },
  { type: 'IDENTIFIER', value: 'width' },
  { type: 'PUNCTUATION', value: ',' },
  { type: 'IDENTIFIER', value: 'height' },
  { type: 'PUNCTUATION', value: ')' },
  { type: 'PUNCTUATION', value: '{' },
  { type: 'KEYWORD', value: 'let' },
  { type: 'IDENTIFIER', value: 'area' },
  { type: 'OPERATOR', value: '=' },
  { type: 'IDENTIFIER', value: 'width' },
  { type: 'OPERATOR', value: '*' },
  { type: 'IDENTIFIER', value: 'height' },
  { type: 'PUNCTUATION', value: ';' },
  { type: 'KEYWORD', value: 'return' },
  { type: 'IDENTIFIER', value: 'area' },
  { type: 'PUNCTUATION', value: ';' },
  { type: 'PUNCTUATION', value: '}' },
  { type: 'EOF', value: null }
]

Regular Expressions in Lexical Analysis

Many lexers use regular expressions for pattern matching: each token type is described by a pattern, and the lexer tries the patterns at the current input position to recognize the next token.
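
A minimal sketch of that approach (not the class above): each rule pairs a token type with a sticky (y-flagged) regular expression, so every match is anchored exactly where the previous token ended. Keyword lookup and error recovery are omitted for brevity, and values are kept as raw strings:

```javascript
// Regex-driven tokenizer sketch. Rules are tried in order at the
// current position; the sticky 'y' flag anchors each match at lastIndex.
const rules = [
  { type: 'WHITESPACE',  re: /\s+/y },
  { type: 'NUMBER',      re: /\d+(\.\d+)?/y },
  { type: 'IDENTIFIER',  re: /[a-zA-Z_]\w*/y },
  { type: 'OPERATOR',    re: /==|[+\-*\/=]/y },
  { type: 'PUNCTUATION', re: /[,;(){}]/y },
];

function regexTokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const rule of rules) {
      rule.re.lastIndex = pos; // Anchor the pattern at the current position
      const m = rule.re.exec(input);
      if (m) {
        if (rule.type !== 'WHITESPACE') {
          tokens.push({ type: rule.type, value: m[0] });
        }
        pos = rule.re.lastIndex; // Continue after the matched lexeme
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character: ${input[pos]}`);
  }
  return tokens;
}

console.log(regexTokenize('x = 1 + 2;').map(t => t.value).join(' '));
// x = 1 + 2 ;
```

Note that rule order matters: the two-character == must be listed before the single = alternative so the longest match wins.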

Lexer State Management

Lexers often maintain internal state to handle context-dependent constructs, such as nested comments, string interpolation, or indentation-based blocks.
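
As an example of why such state is needed: nestable block comments cannot be skipped with a single regular expression, because recognizing them requires counting nesting depth. A minimal sketch, assuming comments are delimited by /* and */ and may nest:

```javascript
// Sketch: skipping nestable /* ... */ comments requires a piece of
// lexer state (a nesting-depth counter) that one regex cannot express.
function skipBlockComment(input, pos) {
  // Assumes input at pos begins with the opening '/*'.
  let depth = 0;
  while (pos < input.length) {
    if (input.startsWith('/*', pos)) {
      depth++;
      pos += 2;
    } else if (input.startsWith('*/', pos)) {
      depth--;
      pos += 2;
      if (depth === 0) return pos; // Position just after the outer comment
    } else {
      pos++;
    }
  }
  throw new Error('Unterminated block comment');
}
```

A lexer would call this when it sees /* and resume normal tokenization at the returned position; for example, skipBlockComment('/* a /* b */ c */x', 0) returns the index of the x.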

Conclusion

The lexer is the foundation of a compiler's front-end. It breaks down the source code into tokens that the parser can use to build the syntax tree. A well-designed lexer makes the parsing phase more straightforward and efficient.

In the next section, we'll learn about syntax analysis and how to build a parser that can understand the structure of our programming language.