Module V · Article I

Finite Automata and Regular Languages

Automata and Formal Languages


Finite Automata

Deterministic Finite Automaton (DFA)

DFA = (Q, Σ, δ, q₀, F): Q — the set of states, Σ — the alphabet, δ: Q×Σ → Q — the transition function, q₀ — the initial state, F ⊆ Q — accepting states.

A word w ∈ Σ* is accepted if δ̂(q₀, w) ∈ F (extended transition function).

Language L(A) — the set of all accepted words.

Example: a DFA for binary strings divisible by 3: states are the residues {0, 1, 2}, q₀ = 0, F = {0}, δ(q, 0) = 2q mod 3, δ(q, 1) = (2q + 1) mod 3 — reading a bit doubles the value so far and adds the bit.
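The divisibility-by-3 automaton fits in a few lines; a minimal sketch (the function name is illustrative, not from the text):

```python
def divisible_by_3(w: str) -> bool:
    """DFA for binary strings divisible by 3.
    States are residues {0, 1, 2}; q0 = 0, F = {0}."""
    q = 0                          # initial state: residue 0
    for b in w:
        q = (2 * q + int(b)) % 3   # delta(q, b) = (2q + b) mod 3
    return q == 0                  # accept iff residue is 0

# Cross-check the automaton against direct arithmetic:
for n in range(64):
    assert divisible_by_3(format(n, "b")) == (n % 3 == 0)
```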

Nondeterministic Automata (NFA)

NFA: δ: Q×(Σ∪{ε}) → 2^Q — each step yields a set of possible successor states, and ε-transitions consume no input.

An NFA accepts w if at least one path from q₀ labeled w ends in an accepting state.

Theorem (Rabin–Scott): DFA and NFA are equivalent in recognized languages (subset construction).
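The subset construction can be sketched directly; this version omits ε-transitions for brevity (with them, one would take the ε-closure of every set), and the example NFA at the bottom is a hypothetical one for "ends with 01":

```python
from collections import deque

def nfa_to_dfa(states, alphabet, delta, start, accept):
    """Rabin–Scott subset construction (no epsilon moves, for brevity).
    delta: dict mapping (state, symbol) -> set of successor states.
    Returns the reachable DFA states (sets of NFA states), its transition
    function, start state, and accepting states."""
    start_set = frozenset([start])
    dfa_delta, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # The successor of a set S is the union of successors of its members.
            T = frozenset(q2 for q in S for q2 in delta.get((q, a), set()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_accept = {S for S in seen if S & accept}
    return seen, dfa_delta, start_set, dfa_accept

# Example: 3-state NFA that guesses where the final "01" begins.
nfa_delta = {(0, "0"): {0, 1}, (0, "1"): {0}, (1, "1"): {2}}
dfa_states, dfa_delta, dfa_start, dfa_accept = nfa_to_dfa(
    {0, 1, 2}, "01", nfa_delta, 0, {2})
```

Here determinization is cheap (3 reachable subsets), but in the worst case the construction produces 2ⁿ states.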

Regular Languages and Expressions

Regular expressions are built from: ∅, ε, a∈Σ, R₁∪R₂, R₁·R₂, R*.

Kleene’s Theorem: A language is regular ⟺ recognized by a DFA ⟺ defined by a regular expression.

Pumping Lemma (the standard tool for proving non-regularity): if L is regular, there exists p ≥ 1 such that every w ∈ L with |w| ≥ p can be written as w = xyz with y ≠ ε and |xy| ≤ p, where xyⁱz ∈ L for all i ≥ 0.

The language {aⁿbⁿ : n ≥ 0} is not regular — the pumping lemma leads to a contradiction: since |xy| ≤ p, the pumped block y lies entirely inside the a's, so xy²z has more a's than b's.
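The contradiction can be verified exhaustively for small p — for w = aᵖbᵖ, no decomposition allowed by the lemma survives pumping (function names are illustrative):

```python
def in_L(w: str) -> bool:
    """Membership in L = { a^n b^n : n >= 0 }."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

def pumping_fails(p: int) -> bool:
    """For w = a^p b^p, check that every split w = xyz with |xy| <= p and
    y != epsilon yields a pumped word x y^2 z outside L."""
    w = "a" * p + "b" * p
    for i in range(p + 1):                 # x = w[:i]
        for j in range(i + 1, p + 1):      # y = w[i:j], nonempty, |xy| <= p
            x, y, z = w[:i], w[i:j], w[j:]
            if in_L(x + y * 2 + z):
                return False               # some decomposition survives
    return True
```

Every decomposition fails because y consists solely of a's, exactly as the proof argues.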

Minimal Automata and Syntactic Monoids

For each regular language there exists a unique (up to isomorphism) minimal DFA with the smallest number of states. Hopcroft’s algorithm minimizes a DFA in O(n log n). Idea: partition the states into equivalence classes, repeatedly refining the partition according to transition behavior.
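Hopcroft’s O(n log n) bookkeeping is intricate; the partition-refinement idea is easier to see in the simpler Moore-style variant (quadratic, but the same principle), sketched here under the assumption of a total transition function:

```python
def minimize(states, alphabet, delta, accept):
    """Moore-style partition refinement: split blocks until all states in a
    block agree, for every symbol, on which block the transition leads to.
    Returns the final partition (blocks of equivalent states)."""
    # Initial partition: accepting vs. non-accepting states.
    partition = [set(accept), set(states) - set(accept)]
    partition = [B for B in partition if B]
    while True:
        block_of = {q: i for i, B in enumerate(partition) for q in B}
        new_partition = []
        for B in partition:
            groups = {}
            for q in B:
                # Signature: tuple of target blocks, one per symbol.
                sig = tuple(block_of[delta[(q, a)]] for a in alphabet)
                groups.setdefault(sig, set()).add(q)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):   # stable: done
            return [frozenset(B) for B in new_partition]
        partition = new_partition

# The mod-3 DFA is already minimal: refinement ends with 3 singleton blocks.
delta3 = {(q, b): (2 * q + int(b)) % 3 for q in (0, 1, 2) for b in "01"}
blocks = minimize({0, 1, 2}, "01", delta3, {0})
```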

Syntactic monoid of a language L: the quotient monoid Σ*/≡_L, where u ≡_L v ⟺ for all x, y: xuy ∈ L ↔ xvy ∈ L. L is regular ⟺ its syntactic monoid is finite. L is star-free ⟺ the monoid is aperiodic, i.e. contains no nontrivial groups (Schützenberger’s theorem). This ties regular languages directly to finite semigroup theory.

Deterministic vs. Nondeterministic Automata

DFA and NFA recognize the same class of languages (the regular ones), but an NFA can be exponentially more compact: there are languages whose minimal NFA has n states while the minimal DFA requires 2ⁿ — the subset construction is tight in the worst case. This gap between determinism and nondeterminism is one of the key separations in complexity theory.
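The classic witness is Lₙ = { w ∈ {0,1}* : the n-th symbol from the end is 1 }: an NFA with n + 1 states guesses where that symbol is, while any DFA must remember the last n bits. A sketch that determinizes this NFA and counts the reachable (and here pairwise distinguishable) subsets:

```python
from collections import deque

def dfa_size_for_nth_from_end(n: int) -> int:
    """Size of the determinized (and in fact minimal) DFA for
    L_n = { w : n-th symbol from the end is 1 }, built from the
    (n+1)-state NFA: state 0 loops on 0/1 and may also move to state 1
    on reading a 1; state i < n moves to i+1 on any symbol; state n accepts."""
    def step(S, a):
        T = set()
        for q in S:
            if q == 0:
                T.add(0)
                if a == "1":
                    T.add(1)      # guess: this 1 is the n-th from the end
            elif q < n:
                T.add(q + 1)      # count down the remaining positions
        return frozenset(T)

    start = frozenset([0])
    seen, queue = {start}, deque([start])
    while queue:
        S = queue.popleft()
        for a in "01":
            T = step(S, a)
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return len(seen)
```

Each reachable subset encodes exactly the last n bits read, so the count is 2ⁿ.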

Two-way Automata (2DFA): automaton with a head moving in both directions on the input. Recognizes exactly the same languages as DFA (regular). Proof: constructive translation 2DFA → DFA (Shepherdson, 1959). However, 2DFA can be exponentially more compact than DFA.

Regular Languages and Boolean Algebra

Operations on regular languages (union, intersection, complement, concatenation, Kleene star) preserve regularity; the equational algebra of regular expressions is Kleene algebra. Kleene’s theorem (1956): a language is described by a regular expression if and only if it is recognized by a finite automaton — a fundamental equivalence that laid the foundation of formal language theory.

Algorithms: emptiness of a regular language — O(n) (DFS/BFS reachability from q₀ to F). Inclusion L₁ ⊆ L₂ — check L₁ ∩ ¬L₂ = ∅ via the product automaton. Equivalence of two DFAs — O(n²) by partition refinement, or O(n log n) via Hopcroft. All of these problems are decidable — in contrast to many questions about CFGs and Turing machines.
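The equivalence check reduces to reachability in the product automaton; a minimal sketch assuming total transition functions given as dicts:

```python
from collections import deque

def dfas_equivalent(delta1, start1, accept1,
                    delta2, start2, accept2, alphabet):
    """Decide L(A1) = L(A2) by BFS over the product automaton: the
    languages differ iff some reachable pair (p, q) has exactly one
    of p, q accepting."""
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in accept1) != (q in accept2):
            return False          # a distinguishing word exists
        for a in alphabet:
            pair = (delta1[(p, a)], delta2[(q, a)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# Example: the mod-3 DFA from earlier, with two different accepting sets.
delta3 = {(q, b): (2 * q + int(b)) % 3 for q in (0, 1, 2) for b in "01"}
```

The same traversal with `accept1` replaced by its complement decides inclusion.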

ω-Automata and Infinite Languages

Büchi automata accept infinite words (ω-words): an ω-word is accepted if some run visits an accepting state infinitely often — the ω-analogue of final-state acceptance for finite words. Deterministic vs. nondeterministic: deterministic Büchi automata are strictly weaker and cannot recognize all ω-regular languages. Muller, Rabin, and Streett automata use richer acceptance conditions; their deterministic versions do capture all ω-regular languages. Application: verification of non-terminating computations, LTL model checking.

Formal Grammars and Natural Languages

The Chomsky hierarchy is applied in linguistics: natural languages are not regular (unbounded nesting of dependencies across a sentence), and constructions such as cross-serial dependencies suggest they exceed context-free power as well. Chomsky’s thesis: transformational-generative grammars describe syntax better. In NLP, CFGs and probabilistic CFGs (PCFG) are the foundation of classical parsing (CYK algorithm, O(n³|G|)). Modern transformers (BERT, GPT) empirically outperform formal grammars in NLP tasks.

Regular Expressions and Their Computational Complexity

Matching a regular expression of length m in a text of length n: compiling to a deterministic FA costs up to O(2^m) states but gives O(n) search; simulating the nondeterministic FA directly costs O(m) construction and O(nm) search. The exponential-backtracking problem (ReDoS): a poorly written regular expression on a backtracking engine can take exponential time in the worst case. Safe engines — Google’s RE2 (Russ Cox) and Rust’s regex crate — are automaton-based with no backtracking, hence linear-time. Security relevance: catastrophic backtracking (ReDoS) has bitten real systems, e.g. Cloudflare and the Atom editor.

Automata and Cryptographic Stream Ciphers

Linear feedback shift register (LFSR): the sequence aₙ = Σ cᵢaₙ₋ᵢ mod 2, cᵢ ∈ {0, 1}. The maximum period 2ⁿ − 1 (an m-sequence) is attained exactly when the characteristic polynomial is primitive (in particular, irreducible). The off-peak autocorrelation of an m-sequence is nearly zero — good pseudorandom properties. Nonlinear combinations of several LFSRs underlie modern stream ciphers (Grain, Trivium). Correlation attacks (e.g. on the Geffe generator) exploit the low linear complexity of the component LFSRs — a standing threat to weakly nonlinear designs.
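The recurrence aₙ = Σ cᵢaₙ₋ᵢ mod 2 can be sketched directly; the example below assumes the primitive polynomial x³ + x + 1, i.e. aₙ = aₙ₋₂ + aₙ₋₃, which yields the full period 2³ − 1 = 7:

```python
def lfsr_sequence(c, a0, length):
    """Generate a_n = sum_i c_i * a_(n-i) mod 2 (the recurrence from the
    text). c = [c_1, ..., c_k] are the feedback coefficients, a0 the
    initial k bits (must not be all zero for a maximal period)."""
    a = list(a0)
    k = len(c)
    while len(a) < length:
        # a[-1:-k-1:-1] is [a_(n-1), a_(n-2), ..., a_(n-k)].
        a.append(sum(ci * aj for ci, aj in zip(c, a[-1:-k-1:-1])) % 2)
    return a

# x^3 + x + 1  ->  c = [0, 1, 1]; seed 1,0,0; period 7 (an m-sequence).
seq = lfsr_sequence([0, 1, 1], [1, 0, 0], 14)
```

Every nonzero 3-bit window appears exactly once per period — the m-sequence property behind the good autocorrelation.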

Numerical Example: Constructing a DFA and Its Operation

Task: Construct a DFA for L={w∈{0,1}*: w ends with “01”} and check several strings.

Step 1: States: q₀ (nothing significant), q₁ (last symbol is “0”), q₂ (last two — “01”, final).

Step 2: Transition table: δ(q₀,0)=q₁; δ(q₀,1)=q₀; δ(q₁,0)=q₁; δ(q₁,1)=q₂; δ(q₂,0)=q₁; δ(q₂,1)=q₀.

Step 3: Check w=“1001”: q₀→(1)q₀→(0)q₁→(0)q₁→(1)q₂∈F. ∈L ✓. Check w=“010”: q₀→(0)q₁→(1)q₂→(0)q₁∉F. ∉L ✓.

Step 4: Minimality: all three states are pairwise distinguishable, so the DFA is minimal. Regular expression: (0+1)*01. Connection to LFSRs: an LFSR of length 3 (period ≤ 7) is itself a finite automaton with 2³ = 8 states (the register contents); correlation attacks on LFSR-based ciphers exploit precisely this linear structure.
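The worked example translates directly into a table-driven sketch, including the two traces from Step 3:

```python
# DFA for L = { w in {0,1}* : w ends with "01" }.
# q0: no useful suffix; q1: last symbol is "0"; q2: last two are "01" (final).
DELTA = {("q0", "0"): "q1", ("q0", "1"): "q0",
         ("q1", "0"): "q1", ("q1", "1"): "q2",
         ("q2", "0"): "q1", ("q2", "1"): "q0"}

def accepts(w: str) -> bool:
    q = "q0"
    for b in w:
        q = DELTA[(q, b)]
    return q == "q2"               # F = {q2}

# Step 3 traces: "1001" is accepted, "010" is rejected.
assert accepts("1001") and not accepts("010")

# Cross-check against the defining property on all 6-bit strings:
for n in range(64):
    w = format(n, "06b")
    assert accepts(w) == w.endswith("01")
```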
