Introduction to Formal Languages

Foundations of Language Theory

Formal language theory is a branch of mathematics and computer science that studies sets of strings (finite sequences of symbols) drawn from a finite alphabet. This field provides the mathematical framework for describing the syntax of programming languages, natural languages, and other structured communication systems.

Basic Definitions

Alphabets

An alphabet, denoted by $\Sigma$ (Sigma), is a non-empty finite set of symbols. Examples include:

Binary alphabet: $\Sigma = \{0, 1\}$
ASCII character set
Set of tokens in a programming language

Strings

A string is a finite sequence of symbols from an alphabet. The empty string, denoted by $\varepsilon$ (epsilon), is the string with zero symbols.

If $\Sigma = \{a, b\}$, then examples of strings over $\Sigma$ include $\varepsilon$, $a$, $b$, $ab$, $aba$, and $bab$.

String Operations

Concatenation: Joining two strings end to end.
- If $x = abc$ and $y = def$, then $xy = abcdef$
Length: The number of symbols in a string, denoted by $|w|$.
- $|abc| = 3$
- $|\varepsilon| = 0$
Reversal: The reverse of a string $w$, denoted by $w^R$, is $w$ written backward.
- If $w = abc$, then $w^R = cba$
Powers: Repeated concatenation of a string with itself.
- $w^0 = \varepsilon$
- $w^1 = w$
- $w^n = ww^{n-1}$ for $n > 1$
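
The four operations above map directly onto built-in string operations in most programming languages. The following is a minimal Python sketch, assuming each symbol is a single character; the helper name `power` is illustrative and does not come from the text above.

```python
def power(w: str, n: int) -> str:
    """Return w^n using the recursive definition w^0 = ε, w^n = w w^(n-1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ""  # the empty string ε
    return w + power(w, n - 1)


x, y = "abc", "def"
concatenation = x + y      # xy
length = len("abc")        # |abc|
reversal = "abc"[::-1]     # (abc)^R
cubed = power("ab", 3)     # (ab)^3
```

Here the empty Python string `""` plays the role of $\varepsilon$, so `len("")` is 0 and `power(w, 0)` returns `""` for any `w`.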

Language Mathematically Defined

A formal language $L$ over an alphabet $\Sigma$ is a subset of $\Sigma^*$, where $\Sigma^*$ (Sigma star) is the set of all finite strings over $\Sigma$, including $\varepsilon$.

Formally: $L \subseteq \Sigma^*$
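
Because $\Sigma^*$ is infinite for any non-empty alphabet, a program can only enumerate a finite part of it. The sketch below lists every string up to a chosen length and then selects a language as a subset of that sample. The function name `strings_up_to` and the example language are assumptions made for illustration, and symbols are assumed to be single characters.

```python
from itertools import product


def strings_up_to(sigma, max_len):
    """Yield every string over sigma of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple, which becomes ε.
        for symbols in product(sorted(sigma), repeat=n):
            yield "".join(symbols)


sigma = {"a", "b"}
sample = list(strings_up_to(sigma, 2))

# A language is any subset of Σ*; here, the sampled strings that end in "b".
ends_in_b = {w for w in sample if w.endswith("b")}
```

For this alphabet, `sample` holds the seven strings of length at most 2, starting with the empty string.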

Language Operations

Union

The union of two languages $L_1$ and $L_2$ is the set of strings that are in either $L_1$ or $L_2$ or both:

$L_1 \cup L_2 = \{w \mid w \in L_1 \text{ or } w \in L_2\}$

Concatenation

The concatenation of languages $L_1$ and $L_2$ is the set of strings formed by concatenating a string from $L_1$ with a string from $L_2$:

$L_1 L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$
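
For finite languages, union and concatenation can be tried out with Python sets. This is a small sketch under that assumption; the function name `concat` and the sample languages are illustrative.

```python
def concat(l1, l2):
    """Return the language concatenation {xy | x in l1 and y in l2}."""
    return {x + y for x in l1 for y in l2}


L1 = {"a", "ab"}
L2 = {"b", ""}  # "" stands for the empty string ε

union = L1 | L2
concatenation = concat(L1, L2)
```

In this example two different pairs, (`"a"`, `"b"`) and (`"ab"`, `""`), produce the same string `"ab"`, so the concatenation has three elements rather than four.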

Kleene Star

The Kleene star of a language $L$, denoted by $L^*$, is the set of all strings that can be formed by concatenating any number (including zero) of strings from $L$:

$L^* = \{\varepsilon\} \cup L \cup L^2 \cup L^3 \cup \ldots$

where $L^n = \underbrace{L \cdot L \cdot \ldots \cdot L}_{n \text{ times}}$ for $n \geq 1$
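
Since $L^*$ is infinite whenever $L$ contains a non-empty string, code can only compute a finite slice of it. The sketch below builds the union of the powers of $L$ up to a chosen bound. The names `language_power` and `kleene_star_up_to` are hypothetical helpers, and the zeroth power is taken to be $\{\varepsilon\}$, matching the first term of the union above.

```python
def language_power(lang, n):
    """Return L^n, starting from the zeroth power {ε}."""
    result = {""}
    for _ in range(n):
        result = {x + y for x in result for y in lang}
    return result


def kleene_star_up_to(lang, max_power):
    """Return the union of L^0, L^1, ..., L^max_power (a finite part of L*)."""
    result = set()
    for n in range(max_power + 1):
        result |= language_power(lang, n)
    return result


L = {"a", "bb"}
approx = kleene_star_up_to(L, 2)
```

With the bound set to 2, `approx` contains the empty string, the two strings of $L$, and the four strings of $L^2$.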

Types of Languages

Languages can be classified by the computational mechanisms needed to recognize or generate them. The Chomsky hierarchy, explored in the next section, classifies languages by the type of grammar that generates them, from regular languages up to recursively enumerable languages.

Examples of Formal Languages

Regular Language: $L = \{a^m b^n \mid m, n \geq 0\}$. This language consists of strings with any number of 'a's followed by any number of 'b's; it is described by the regular expression $a^*b^*$.
Context-Free Language: $L = \{a^n b^n \mid n \geq 1\}$. This language consists of strings with at least one 'a' followed by an equal number of 'b's. It is context-free but not regular, because recognizing it requires matching the two counts.
Programming Language: The set of all syntactically valid programs in a language like JavaScript or Python.
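
Membership in a language such as $\{a^n b^n \mid n \geq 1\}$ can be decided by a short program. The following is a minimal sketch; the function name `in_anbn` and the `min_n` parameter are illustrative and not part of the definitions above.

```python
def in_anbn(w: str, min_n: int = 1) -> bool:
    """Return True if w = a^n b^n for some n >= min_n."""
    n, remainder = divmod(len(w), 2)
    if remainder != 0 or n < min_n:
        return False
    # Rebuild the only candidate of this length and compare.
    return w == "a" * n + "b" * n
```

For example, `in_anbn("aabb")` is true, while `in_anbn("abab")` and `in_anbn("aab")` are false. The check relies on the two counts agreeing for arbitrarily large $n$, which is the part that cannot be done with a fixed, finite amount of memory.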

Computational Significance

Formal languages provide the theoretical foundation for:

Compiler Design: Grammar specifications for programming languages
Natural Language Processing: Mathematical models of human language
Pattern Matching: Regular expressions and search algorithms
Computability Theory: Understanding the limits of what can be computed

In the next section, we'll explore the Chomsky hierarchy, which classifies formal grammars based on their generative power and the types of languages they can describe.
