Context-Free Grammars
Introduction to Context-Free Grammars
Context-free grammars (CFGs) are a formal grammar class in the Chomsky hierarchy that can describe the syntax of programming languages, natural language fragments, and other nested structures. They are particularly important in compiler design and language processing.
Formal Definition
A context-free grammar is defined as a 4-tuple:
Where:
- is a finite set of non-terminal symbols
- is a finite set of terminal symbols (the alphabet), where
- is a finite set of production rules of the form where and
- is the start symbol
Example: Arithmetic Expression Grammar
A simple CFG for arithmetic expressions:
This grammar generates expressions like:
42
2 + 3
(5 + 6) * 7
2 * 3 + 4 * (5 - 6)
Derivations
A derivation is a sequence of production rule applications that transform the start symbol into a string of terminals.
Leftmost Derivation
In a leftmost derivation, at each step, the leftmost non-terminal is replaced.
Example for the string (5 + 6) * 7
:
Rightmost Derivation
In a rightmost derivation, at each step, the rightmost non-terminal is replaced.
Parse Trees
A parse tree visually represents the derivation of a string from a CFG:
Ambiguity in CFGs
A CFG is ambiguous if there exists a string that has more than one leftmost derivation (or, equivalently, more than one parse tree).
Example: Ambiguous Expression Grammar
This grammar is ambiguous because an expression like 2 + 3 * 4
can be parsed in two different ways:
(2 + 3) * 4
= 202 + (3 * 4)
= 14
Resolving Ambiguity
Ambiguity can be resolved by:
- Rewriting the grammar (as in our first example with , , and )
- Specifying precedence and associativity rules
- Using disambiguation rules in the parser
Applications of CFGs
Programming Languages
Almost all programming language syntax is defined using context-free grammars. For example, JavaScript's grammar defines constructs like:
if_statement → "if" "(" expression ")" statement ["else" statement]
Natural Language Processing
CFGs can model aspects of natural language, such as noun phrases and verb phrases:
Markup Languages
XML, HTML, and other markup languages can be described using CFGs.
Limitations of CFGs
CFGs cannot express certain language features that require counting or checking equality:
- (not context-free)
- (not context-free)
These limitations are proven using the Pumping Lemma for Context-Free Languages.
Closure Properties
Context-free languages are closed under:
- Union
- Concatenation
- Kleene star
- Substitution
- Homomorphism
They are NOT closed under:
- Intersection
- Complement
- Set difference
Normal Forms
Chomsky Normal Form (CNF)
All productions are of the form:
- where are non-terminals
- where is a terminal
- (only if doesn't appear on the right side)
Greibach Normal Form (GNF)
All productions are of the form:
- where is a terminal and is a (possibly empty) string of non-terminals
Recognition Algorithms
- CYK Algorithm: time, requires grammar in CNF
- Earley's Algorithm: in general, for unambiguous grammars
- LL Parsing: Top-down parsing for LL(k) grammars
- LR Parsing: Bottom-up parsing for LR(k) grammars
Conclusion
Context-free grammars are a powerful formalism for describing the syntax of programming languages and many other structured languages. While they have limitations, they strike a balance between expressiveness and computational tractability, making them ideal for compiler design and many other applications in computer science and linguistics.