Skip to main content

Context-Free Grammars

Introduction to Context-Free Grammars

Context-free grammars (CFGs) are a formal grammar class in the Chomsky hierarchy that can describe the syntax of programming languages, natural language fragments, and other nested structures. They are particularly important in compiler design and language processing.

Formal Definition

A context-free grammar GG is defined as a 4-tuple:

G=(V,Σ,R,S)G = (V, \Sigma, R, S)

Where:

  • VV is a finite set of non-terminal symbols
  • Σ\Sigma is a finite set of terminal symbols (the alphabet), where VΣ=V \cap \Sigma = \emptyset
  • RR is a finite set of production rules of the form AγA \rightarrow \gamma where AVA \in V and γ(VΣ)\gamma \in (V \cup \Sigma)^*
  • SVS \in V is the start symbol

Example: Arithmetic Expression Grammar

A simple CFG for arithmetic expressions:

EE+TETTTTFT/FFF(E)number\begin{align} E &\rightarrow E + T \mid E - T \mid T \\ T &\rightarrow T * F \mid T / F \mid F \\ F &\rightarrow (E) \mid \text{number} \end{align}

This grammar generates expressions like:

  • 42
  • 2 + 3
  • (5 + 6) * 7
  • 2 * 3 + 4 * (5 - 6)

Derivations

A derivation is a sequence of production rule applications that transform the start symbol into a string of terminals.

Leftmost Derivation

In a leftmost derivation, at each step, the leftmost non-terminal is replaced.

Example for the string (5 + 6) * 7:

ETTFFF(E)F(E+T)F(F+T)F(5+T)F(5+F)F(5+6)F(5+6)7\begin{align} E &\Rightarrow T \\ &\Rightarrow T * F \\ &\Rightarrow F * F \\ &\Rightarrow (E) * F \\ &\Rightarrow (E + T) * F \\ &\Rightarrow (F + T) * F \\ &\Rightarrow (5 + T) * F \\ &\Rightarrow (5 + F) * F \\ &\Rightarrow (5 + 6) * F \\ &\Rightarrow (5 + 6) * 7 \end{align}

Rightmost Derivation

In a rightmost derivation, at each step, the rightmost non-terminal is replaced.

Parse Trees

A parse tree visually represents the derivation of a string from a CFG:

Ambiguity in CFGs

A CFG is ambiguous if there exists a string that has more than one leftmost derivation (or, equivalently, more than one parse tree).

Example: Ambiguous Expression Grammar

EE+EEE(E)number\begin{align} E &\rightarrow E + E \mid E * E \mid (E) \mid \text{number} \end{align}

This grammar is ambiguous because an expression like 2 + 3 * 4 can be parsed in two different ways:

  • (2 + 3) * 4 = 20
  • 2 + (3 * 4) = 14

Resolving Ambiguity

Ambiguity can be resolved by:

  1. Rewriting the grammar (as in our first example with EE, TT, and FF)
  2. Specifying precedence and associativity rules
  3. Using disambiguation rules in the parser

Applications of CFGs

Programming Languages

Almost all programming language syntax is defined using context-free grammars. For example, JavaScript's grammar defines constructs like:

if_statement → "if" "(" expression ")" statement ["else" statement]

Natural Language Processing

CFGs can model aspects of natural language, such as noun phrases and verb phrases:

SNPVPNPDetNDetNPPVPVNPVNPPPPPPNPDet"the""a"N"dog""cat""man"V"saw""chased"P"in""on""with"\begin{align} S &\rightarrow NP \: VP \\ NP &\rightarrow Det \: N \mid Det \: N \: PP \\ VP &\rightarrow V \: NP \mid V \: NP \: PP \\ PP &\rightarrow P \: NP \\ Det &\rightarrow \text{"the"} \mid \text{"a"} \\ N &\rightarrow \text{"dog"} \mid \text{"cat"} \mid \text{"man"} \\ V &\rightarrow \text{"saw"} \mid \text{"chased"} \\ P &\rightarrow \text{"in"} \mid \text{"on"} \mid \text{"with"} \end{align}

Markup Languages

XML, HTML, and other markup languages can be described using CFGs.

Limitations of CFGs

CFGs cannot express certain language features that require counting or checking equality:

  1. L={anbncnn1}L = \{a^n b^n c^n \mid n \geq 1\} (not context-free)
  2. L={www{a,b}}L = \{ww \mid w \in \{a, b\}^*\} (not context-free)

These limitations are proven using the Pumping Lemma for Context-Free Languages.

Closure Properties

Context-free languages are closed under:

  • Union
  • Concatenation
  • Kleene star
  • Substitution
  • Homomorphism

They are NOT closed under:

  • Intersection
  • Complement
  • Set difference

Normal Forms

Chomsky Normal Form (CNF)

All productions are of the form:

  • ABCA \rightarrow BC where B,CB, C are non-terminals
  • AaA \rightarrow a where aa is a terminal
  • SεS \rightarrow \varepsilon (only if SS doesn't appear on the right side)

Greibach Normal Form (GNF)

All productions are of the form:

  • AaαA \rightarrow a\alpha where aa is a terminal and α\alpha is a (possibly empty) string of non-terminals

Recognition Algorithms

  • CYK Algorithm: O(n3)O(n^3) time, requires grammar in CNF
  • Earley's Algorithm: O(n3)O(n^3) in general, O(n2)O(n^2) for unambiguous grammars
  • LL Parsing: Top-down parsing for LL(k) grammars
  • LR Parsing: Bottom-up parsing for LR(k) grammars

Conclusion

Context-free grammars are a powerful formalism for describing the syntax of programming languages and many other structured languages. While they have limitations, they strike a balance between expressiveness and computational tractability, making them ideal for compiler design and many other applications in computer science and linguistics.