Module V·Article II·~5 min read

Context-Free Languages and Grammars

Automata and Formal Languages

Turn this article into a podcast

Pick voices, format, length — AI generates the audio

Context-Free Grammars

CFG and CFL

A context-free grammar G = (V, Σ, R, S): V — nonterminals, Σ — terminals, R — rules of the form A → α (A∈V, α∈(V∪Σ)*), S — start nonterminal.

Language L(G) = {w ∈ Σ*: S ⟹* w}.

Example: Language {aⁿbⁿ: n≥0}: S → ε | aSb.

Chomsky normal form: A → BC or A → a. Every CFG is equivalent to a grammar in CNF.

Pushdown Automata

PDA = (Q, Σ, Γ, δ, q₀, Z₀, F): a stack (pushdown store) with alphabet Γ is added.

Transitions: δ(q, a, Z) = {(p, γ)}: read a, remove Z, push γ.

Theorem: CFL = languages recognized by PDA.

CFLs are closed under: union, concatenation, Kleene star. Not closed under: intersection, complement (in general).

Chomsky Hierarchy

Type 0 (no restrictions) → recursively enumerable languages, Turing machines. Type 1 (context-sensitive) → LBA (linear-bounded automata). Type 2 (context-free) → PDA. Type 3 (regular) → DFA.

Each type is strictly contained in the previous one.

Algorithmic Problems

CFL: emptiness, finiteness, membership (CYK algorithm O(n³|G|)) — decidable. Equivalence of CFGs — undecidable. Intersection of two CFLs is not always a CFL.

Normal Forms of CFG and CYK Algorithm

Chomsky normal form (CNF): Each rule has the form A → BC or A → a. Any CFG can be converted to CNF in polynomial time. CNF forms the basis for the CYK (Cocke–Younger–Kasami) algorithm: parsing in O(n³|G|) using dynamic programming. Table dp[i][j][A] = 1 if nonterminal A generates the substring s[i..j].

Greibach normal form: Each rule A → aα, where a is a terminal, α is a string of nonterminals. Convenient for recursive descent (LL parsers). In programming, normal forms of grammars are fundamental for compilers (Yacc, Bison use LALR(1) grammars).

Chomsky Hierarchy

Regular languages (type 3) ⊂ Context-free (type 2) ⊂ Context-sensitive (type 1) ⊂ Recursively enumerable (type 0). The inclusions are strict:

  • Type 3: DFA / regular expressions.
  • Type 2: PD automata / CFG (grammars of programming languages).
  • Type 1: Linear-bounded TM (LBA).
  • Type 0: Arbitrary TM.

Type 1 languages are always decidable (LBA solves in exponential time), type 0 — recursively enumerable (not necessarily decidable).

Deterministic CFL and Parsing

DCFL (deterministic CFL) — a subclass of CFL recognized by deterministic PDA. Strictly smaller than CFL: {aⁿbⁿcⁿ: n≥0} — not a CFL (proved by the pumping lemma for CFL), so this is not a CFL example, but an example outside CFL. Example of DCFL: {aⁿbⁿ: n≥0}. Example of CFL, but not DCFL: {ww^R: w ∈ {a,b}*} (even-length palindromes) — DPDA cannot handle due to non-determinism in the center.

Programming languages: LL(k) and LR(k) grammars are deterministic subclasses for which efficient (linear-in-input-length) parsers exist. LR(1) grammars cover most practical languages.

Pumping Lemma for CFL

Pumping lemma (context-free): For any CFL L there exists p such that any string w ∈ L of length ≥ p can be decomposed w = uvxyz: |vy| ≥ 1, |vxy| ≤ p, uvⁿxyⁿz ∈ L for all n ≥ 0. Proof via parse tree: long string → tree of height > |G| → repeated nonterminal → "pump" between two occurrences. Example of application: {aⁿbⁿcⁿ} is not a CFL.

Attribute Grammars

An attribute grammar extends CFG with attributes for nonterminals. Synthesized attributes: calculated from children (bottom-up). Inherited attributes: passed from parent or siblings (top-down). Used for semantic analysis in compilers: type computations, checking variable declarations. Notation: S-attributed grammars (only synthesized) — particularly convenient for LR parsers.

Transducers and Translating Automata

Finite transducer: like a DFA, but on a transition outputs an output symbol (Moore machine or Mealy machine). Mealy: output on transition; Moore: output in state. Application: lexical transformations, transliteration, speech synthesis. Automata morphisms: for regular languages — closure under homomorphisms, inverse homomorphisms, and intersection with regulars → another method to prove nonregularity.

Comparative Complexity of Grammar Classes

CFLs are closed under: union, concatenation, Kleene star, homomorphism, inverse homomorphism, intersection with regular. Not closed under: intersection, complement. CFL ∩ CFL can be non-CFL: classic example — {aⁿbⁿcᵐ: n,m≥0} ∩ {aᵐbⁿcⁿ: n,m≥0} = {aⁿbⁿcⁿ: n≥0}, but {aⁿbⁿcⁿ} is not a CFL by the pumping lemma. Context-sensitive: closed under intersection and complement, but undecidable for emptiness. Recursively enumerable: only intersection and union.

CYK Algorithm and Probabilistic CFG

CYK (Cocke–Younger–Kasami): check whether a string w belongs to CFL in O(n³|G|). Requires Chomsky normal form (CNF): A → BC or A → a. PCFG (probabilistic CFG): each rule A → α has probability P(α|A). The sum of probabilities for all rules for A is 1. Most probable parse — Viterbi algorithm on PCFG. Parameter training: Inside–Outside algorithm (analogous to Baum–Welch for HMM). Used in statistical syntactic text analysis.

Numerical Example: Derivation in CFG and CYK Algorithm

Task: CFG G: S→AB, A→a, B→b. String w="ab". Does w belong to L(G)?

Step 1: CYK table (Chomsky normal form is satisfied). Steps of length 1: w₁=a: dp[1,1]={A}; w₂=b: dp[2,2]={B}.

Step 2: Step of length 2: dp[1,2]. Check: is there a rule X→YZ, where Y∈dp[1,1]={A} and Z∈dp[2,2]={B}? S→AB: A∈{A} ✓, B∈{B} ✓. Therefore, dp[1,2]={S}.

Step 3: S∈dp[1,2] → w="ab"∈L(G). Parse tree: S⟹AB⟹aB⟹ab. Check w="ba": dp[1,1]={B}, dp[2,2]={A}. dp[1,2]: S→AB — A∈{B}? No. dp[1,2]={}. S∉dp[1,2] → "ba"∉L(G).

Step 4: CYK complexity: O(n³·|G|). For a sentence with n=20 words and |G|=50 rules: 20³·50=400,000 operations. PCFG adds weights: P(S→AB)=0.7, P(A→a)=1.0, P(B→b)=1.0. Parse probability for "ab": 0.7·1·1=0.70. Viterbi algorithm on PCFG selects the most probable syntactic tree for the sentence.

§ Act · what next