
Grammatical Inference

François Coste

SML, Master SIF

2020-2021

F. Coste (Inria) Grammatical Inference SML 2020-2021 1 / 123

Grammatical Inference

Learn the grammar of a language from correct (and incorrect) sentences

N. Chomsky, Syntactic Structures, Mouton, 1957, PhD thesis MIT 1955

E. M. Gold, Language Identification in the Limit, Information and Control, 1967

. . .

(targeted) Applications

Syntactic pattern recognition [Fu, 1982]

Natural language, Molecular biology, Structured texts, Web, action planning, intrusion detection . . .

Field

Theoretical (learnability)
Practical (algorithms)

F. Coste (Inria) Grammatical Inference SML 2020-2021 2 / 123

Formal languages theory

Sequence of symbols s1s2 . . . sp: word

Set of words {m1,m2, . . .}: language

Set of production rules generating a language: grammar

Learning a grammar by induction: Grammatical Inference

(covers more broadly inductive learning of languages, even if the representation is not grammatical)

F. Coste (Inria) Grammatical Inference SML 2020-2021 3 / 123

Grammar

Grammar: G = 〈Σ, N, S, R〉
Σ finite set of terminals (a, b, c, . . . )

N finite set of non-terminals (S,T,U,. . . )

S(∈ N) axiom (start symbol)

R set of rewriting rules
Each rule is written as:

α → β, α ∈ (N ∪ Σ)∗ N (N ∪ Σ)∗, β ∈ (N ∪ Σ)∗

When some rules have the same left hand side, we write:

α→ β1|β2| · · ·

F. Coste (Inria) Grammatical Inference SML 2020-2021 4 / 123

Grammars and languages

Elementary derivation: ⇒G :

µαδ ⇒G µβδ iff ∃ α→ β ∈ R, µ, δ ∈ (N ∪ Σ)∗

Derivation ⇒∗G : finite sequence of elementary derivations

Language generated by a grammar G, L(G) :

L(G) = {m ∈ Σ∗|S ⇒∗G m}

Free Monoid Σ∗ : set of all the words on Σ

Empty word: ε or λ

Empty language: ∅ (≠ {ε})

F. Coste (Inria) Grammatical Inference SML 2020-2021 5 / 123

Example
Dyck1’s grammar (balanced parentheses)

G = 〈Σ, N, S, R〉
Σ = {a, b}
N = {S}
R = {S → aSbS, S → ε}

Derivation

S ⇒ aSbS ⇒ aaSbSbS ⇒ aabSbS ⇒ aabbS ⇒ aabb

F. Coste (Inria) Grammatical Inference SML 2020-2021 6 / 123

Exercises

Find the grammars generating the following languages:

{aaba, aaa}
All the words on {a, b} (Σ∗)

Words on {a, b} beginning by a

Codons on {a, c, g, t} (word length is a multiple of 3)

Palindromes on {a, b}: R = {S → aSa|bSb|a|b|ε}
Biological palindromes (on {a, c, g, t}, a−t, c−g): exercise. . .

{a^n b^n c^n | n ≥ 1}: R = {S → abc|aSAc, bA → bb, cA → Ac}
S ⇒ aSAc ⇒ aabcAc ⇒ aabAcc ⇒ aabbcc

Copy: {ww | w ∈ {a, b}∗}: exercise. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 7 / 123

Chomsky Hierarchy

Hierarchy of recursively enumerable languages:

0 Unrestricted

1 Context sensitive (grammaires contextuelles)

α→ β, |α| ≤ |β|

2 context-free (grammaires algébriques)

A→ β, A ∈ N

3 regular (grammaires régulières, automates)

A→ aB or A→ a, A,B ∈ N, a ∈ Σ ∪ {ε}

F. Coste (Inria) Grammatical Inference SML 2020-2021 8 / 123

The Chomsky Hierarchy

F. Coste (Inria) Grammatical Inference SML 2020-2021 9 / 123

Regular languages are worth inferring

For practical applications, powerful recursive models may not be required

Regular languages can account for short-term dependencies (like N-grams), but also some long-term dependencies.

Any language can be approximated by a regular language (each finite language is regular!).

Properties of regular languages are well studied; this makes the development of inference methods easier.

Simple and efficient parsing of strings (O(|m|) for a DFA).

F. Coste (Inria) Grammatical Inference SML 2020-2021 10 / 123

Outline

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 11 / 123

Automata

A = 〈Σ, Q,Q0, QF , δ〉

Multiples of 3 (binary):

Σ finite alphabet {0, 1}
Q finite set of states {q0, q1, q2}

Q0 (⊆ Q) initial states {q0}
QF (⊆ Q) final states {q0}

δ transition function: Q × Σ → P(Q)
(δ∗ : P(Q) × Σ∗ → P(Q) denotes the extension of δ to words)

Language accepted by A

L(A) = {m ∈ Σ∗ | δ∗(Q0, m) ∩ QF ≠ ∅}
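For instance, the "multiples of 3" automaton above can be simulated directly. A minimal Python sketch (names are ours, not from the course); state q_i means "the binary prefix read so far is congruent to i modulo 3":

    # transition function of the "multiples of 3 in binary" DFA
    DELTA = {
        ("q0", "0"): "q0", ("q0", "1"): "q1",
        ("q1", "0"): "q2", ("q1", "1"): "q0",
        ("q2", "0"): "q1", ("q2", "1"): "q2",
    }

    def accepts(word, start="q0", finals=frozenset({"q0"})):
        """delta*: run the DFA on a whole word, then test membership in QF."""
        q = start
        for a in word:
            q = DELTA[(q, a)]
        return q in finals

    assert accepts("110")        # 6 is a multiple of 3
    assert not accepts("101")    # 5 is not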

F. Coste (Inria) Grammatical Inference SML 2020-2021 12 / 123

Automata and languages

Language accepted/recognized by an automaton: regular language (regular expression operators: +, ∗, ( ))

Exercises

Find automata on Σ = {a, b} recognizing:

- {abba, aab}. (show that each finite language is regular)

- all the words on Σ : (a+ b)∗ = {a, b}∗ = Σ∗

- all the words containing the motif aa

- all the words with 3 letters (extension to codons?)

- all the words with an even number of a.

Deterministic finite state automata (DFA): |δ(q, a)| ≤ 1
Any non-deterministic automaton (NFA) can be determinized

⇒ L(NFA) = L(DFA)

Canonical automaton of L, A(L): smallest DFA accepting L
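The determinization claim is the classical subset construction, where each DFA state is a set of NFA states. A minimal sketch (representation ours, assuming no ε-transitions):

    def determinize(delta, initials, finals, alphabet):
        """Subset construction: delta maps (state, symbol) to a set of states."""
        start = frozenset(initials)
        dfa, seen, todo = {}, {start}, [start]
        while todo:
            S = todo.pop()
            for a in alphabet:
                T = frozenset(t for q in S for t in delta.get((q, a), ()))
                if not T:
                    continue
                dfa[(S, a)] = T          # deterministic transition on set-states
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
        return dfa, start, {S for S in seen if S & finals}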

F. Coste (Inria) Grammatical Inference SML 2020-2021 13 / 123

Can we learn regular languages from positive examples only?

Theoretical framework: identification in the limit [Gold67]

Presentation : infinite sequence of examples

P :  x1   x2   x3   . . .   xk   . . .   xi   . . .
      ↓    ↓    ↓            ↓            ↓
      H1   H2   H3          Hk           Hi ≡ Hk ≡ H0

Identification in the limit of H0:

∀P, ∃k, ∀i > k, Hi ≡ H0

F. Coste (Inria) Grammatical Inference SML 2020-2021 14 / 123

Let’s try!

a, aa, aaa . . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 15 / 123

Limit point

If a limit point exists:

L1 ⊂ L2 ⊂ L3 ⊂ · · · ⊂ L∞ = ⋃i Li

Then

The class of languages is not identifiable in the limit from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 16 / 123

Results [Gold67]

No superfinite class of languages (⊃ regular) can be identified in the limit from text (i.e. positive examples only)

The class of primitive recursive functions (“fonctions récursives primitives”) can be identified in the limit from informant (examples and counter-examples)
(False for the class of total recursive functions)

→ rationale for using counter-examples

Time needed for learning ???

F. Coste (Inria) Grammatical Inference SML 2020-2021 17 / 123

Polynomial Time and Data Identification in the Limit
[Gold 78] [Pitt 89] [Higuera 95]

Identification in the limit from Polynomial Time and Data (IPTD)

A representation class R is identifiable in the limit from polynomial time and data iff there exist two polynomials p and q and a learning algorithm A s.t.:

Given any sample S = 〈S+, S−〉 of size m, A returns a representation R in R compatible with S in p(m) time

For each representation R of size n, there exists a characteristic sample of size less than q(n)

Characteristic sample CS = 〈CS+, CS−〉: for any S = 〈S+, S−〉 s.t. CS+ ⊆ S+ and CS− ⊆ S−, A returns a representation R′ equivalent to R

F. Coste (Inria) Grammatical Inference SML 2020-2021 18 / 123

Are automata IPTD?

Outline

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 19 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
        Problem definition
        RPNI
        Structural completeness hypothesis
        Utility of counter-examples
        EDSM heuristic
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 20 / 123

Remark: Given a sample S = 〈S+, S−〉, an infinite number of automata are compatible with S

Searching for the smallest compatible DFA

Smallest compatible DFA problem

Given S+ ⊂ Σ∗ (examples) and S− ⊂ Σ∗ (counter-examples),
find the smallest DFA A s.t. S+ ⊆ L(A) and S− ∩ L(A) = ∅

Application of Occam’s razor

Canonical automaton of the language . . .

NP-Complete problem [Gold78] [Angluin78]

Proof: reduction from SAT

Finding a DFA (only) polynomially bigger than the smallest DFA compatible with 〈S+, S−〉 is NP-Complete [Pitt, Warmuth 93]

PAC-Learning DFA is as hard as breaking the RSA cryptosystem [Pitt, Warmuth 88] [Kearns, Valiant 89]

F. Coste (Inria) Grammatical Inference SML 2020-2021 22 / 123

PAC (Probably Approximately Correct) Learning
[Valiant 84]

Approximately correct: error upper bound ε

Rreal(h) = P(h(o) ≠ f(o)) < ε

For any concept f in F,
for any error ε and any confidence 1 − δ,
there exists Nε,δ such that for any h learnt from Nε,δ examples:

P (Rreal(h) < ε) > 1− δ

F. Coste (Inria) Grammatical Inference SML 2020-2021 23 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
        Problem definition
        RPNI
        Structural completeness hypothesis
        Utility of counter-examples
        EDSM heuristic
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 24 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Maximal Canonical Automaton MCA(S+) Determinisation. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 25 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Prefix Tree Automaton PTA(S+). Rote learning! Generalisation through state merging under the control of S−. Merge 0 and 1
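A minimal sketch of the PTA construction on this S+ (representation ours: integer states, 0 for the empty prefix; successor sets rather than single states, so that later merges may create nondeterminism):

    def build_pta(positives):
        """Prefix tree acceptor: one state per distinct prefix of S+."""
        delta, finals, fresh = {0: {}}, set(), 1
        for w in sorted(positives):
            q = 0
            for a in w:
                if a not in delta[q]:
                    delta[q][a] = {fresh}      # new state for a new prefix
                    delta[fresh] = {}
                    fresh += 1
                q = next(iter(delta[q][a]))    # unique successor in a tree
            finals.add(q)
        return delta, finals

    delta, finals = build_pta({"aaa", "bba", "baaa"})   # 10 states (one per distinct prefix)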

F. Coste (Inria) Grammatical Inference SML 2020-2021 26 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Result of merging 0 and 1: non-deterministic automaton!

Merging for determinisation . . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 27 / 123

Merging for determinisation
How to consider only DFAs

Merging for determinisation

∀q ∈ Q,∀a ∈ Σ,∀s1, s2 ∈ δ(q, a),Merge(s1, s2)

(≠ determinisation algorithm of an NFA: the language can grow here!)

PTA(S+) = merging for determinisation of MCA(S+)

Deterministic merge

Merging states + merging for determinisation

F. Coste (Inria) Grammatical Inference SML 2020-2021 28 / 123
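Concretely, merging with merging for determinisation can be sketched with a union-find over states. A minimal sketch (ours), on the representation of build_pta above (delta[q]: symbol → set of successors; handling of final states is left out):

    def find(parent, x):
        """Union-find: representative of the merge class of state x."""
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]            # path halving
            x = parent[x]
        return x

    def det_merge(delta, parent, p, q):
        """Merge p and q, then keep merging successors until determinism is restored."""
        work = [(p, q)]
        while work:
            p, q = [find(parent, s) for s in work.pop()]
            if p == q:
                continue
            parent[q] = p
            for a, ts in delta.pop(q, {}).items():   # fold q's transitions into p's
                delta[p].setdefault(a, set()).update(ts)
            for a, ts in delta[p].items():
                reps = sorted({find(parent, t) for t in ts})
                delta[p][a] = set(reps)
                work += [(reps[0], t) for t in reps[1:]]   # nondeterminism: merge successors

Successor sets of other states are left stale and resolved lazily through find when the automaton is read.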


RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

After merging for determinisation: a counter-example is accepted!

F. Coste (Inria) Grammatical Inference SML 2020-2021 30 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Backtrack! Merge 0 and 2

F. Coste (Inria) Grammatical Inference SML 2020-2021 31 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Merged 0 and 2. Merging for determinisation. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 32 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

After merging for determinisation. Merge 0 and 3

F. Coste (Inria) Grammatical Inference SML 2020-2021 33 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

Merged 0 and 3. Merging for determinisation. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 34 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}

After merging for determinisation. No more possible merges. . . Solution!

F. Coste (Inria) Grammatical Inference SML 2020-2021 35 / 123

RPNI: Regular Positive Negative Inference
[Oncina, García 1992], [Lang 1992]

RPNI

A ← PTA(S+)
for all (p, q) in standard order¹ do
    A′ ← Deterministic merge(A, p, q)
    if A′ accepts no counter-example from S− then
        A ← A′
    end if
end for

Complexity: O((|S+| + |S−|) · |S+|²)

¹ Standard order u ≺ v : (|u| < |v|) ∨ (|u| = |v| ∧ ∃k, ∀i < k, ui = vi ∧ uk < vk)
F. Coste (Inria) Grammatical Inference SML 2020-2021 36 / 123
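For concreteness, a compact sketch of the whole loop in Python, reusing build_pta, find and det_merge from the sketches above (our code, not the original implementation). Caveats: states are numbered in insertion order, whereas RPNI enumerates pairs in standard order (building the PTA breadth-first would fix this), and backtracking is implicit in the copy:

    import copy

    def accepts(delta, parent, finals, w):
        """Membership test on the merged automaton (NFA-style simulation)."""
        cur = {find(parent, 0)}
        for a in w:
            cur = {find(parent, t) for q in cur
                   for t in delta.get(q, {}).get(a, ())}
        return any(find(parent, f) in cur for f in finals)

    def rpni(pos, neg):
        delta, finals = build_pta(pos)
        parent, states = {}, sorted(delta)
        for j, q in enumerate(states):
            for p in states[:j]:
                if find(parent, p) != p or find(parent, q) != q:
                    continue                        # already merged away
                d, par = copy.deepcopy(delta), dict(parent)
                det_merge(d, par, p, q)
                if not any(accepts(d, par, finals, w) for w in neg):
                    delta, parent = d, par          # keep the merge
                    break                           # otherwise: implicit backtrack
        return delta, parent, finals

    delta, parent, finals = rpni({"aaa", "bba", "baaa"},
                                 {"aaaa", "baab", "bbabab"})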

Success / amount of sequences in training sample

fig. from [Lang, 1992]

F. Coste (Inria) Grammatical Inference SML 2020-2021 37 / 123

Identification ?

Requirements for finding the solution with RPNI?

1. The target automaton has to be in the search space

and

2. The good merges have to be chosen

F. Coste (Inria) Grammatical Inference SML 2020-2021 38 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
        Problem definition
        RPNI
        Structural completeness hypothesis
        Utility of counter-examples
        EDSM heuristic
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 39 / 123

Structural completeness hypothesis

S+ is structurally complete wrt A if there exists an acceptance of S+ by A s.t.:

Every transition of A is used

Every final state of A is used for acceptance

S+ = {aaa, bba, baaa} A =

F. Coste (Inria) Grammatical Inference SML 2020-2021 40 / 123

Maximal Canonical Automaton

Rote learning of S+ = {aaa, bba, baaa}

Union :

MCA(S+)

Only one initial state (classical but not required):

MCA(S+)

F. Coste (Inria) Grammatical Inference SML 2020-2021 41 / 123

Merging states

Language generalisation operator

Preserve structural completeness

F. Coste (Inria) Grammatical Inference SML 2020-2021 42 / 123


Merging states

Language generalisation operator

Preservation of structural completeness


Theorem

All automata A s.t. S+ is structurally complete wrt A can be built by merging states of MCA(S+)

F. Coste (Inria) Grammatical Inference SML 2020-2021 48 / 123

Search space

F. Coste (Inria) Grammatical Inference SML 2020-2021 49 / 123

DFA search space

operator: deterministic merge

Theorem

All automata A s.t. S+ is structurally complete wrt A can be built by deterministic merges of states in MCA(S+) (or PTA(S+))

F. Coste (Inria) Grammatical Inference SML 2020-2021 50 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
        Problem definition
        RPNI
        Structural completeness hypothesis
        Utility of counter-examples
        EDSM heuristic
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 51 / 123

Limiting generalisation with a set of counter-examples S−

Border Set: set of most general elements
(greatest generalisation under the control of S−)

Occam’s razor → looking for smallest automaton

S− also guides the search. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 52 / 123


Characteristic sample for RPNI

How to ensure that RPNI returns A(L) ?

Ideas :

Sample has to be structurally complete wrt A(L)

Sample is informative enough to prevent merging distinct states

F. Coste (Inria) Grammatical Inference SML 2020-2021 54 / 123

Characteristic sample for RPNI
Short prefixes and Kernel

Let Pr(L) denote the set of prefixes of a language L: Pr(L) = {u ∈ Σ∗ : ∃v ∈ Σ∗, uv ∈ L}

Short prefixes
Smallest sequences enabling to reach each state of the target

Sp(L) = {u ∈ Pr(L) : ∄v ∈ Pr(L), v < u and δA(L)(q0, v) = δA(L)(q0, u)}

Kernel
Sequences of Sp concatenated with one letter allowing to reach a new state (exercising all the possible transitions)

N(L) = {ua ∈ Pr(L) : u ∈ Sp(L), a ∈ Σ} ∪ {ε}
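Both sets are easy to compute from a target DFA. A minimal sketch (representation ours: delta maps (state, symbol) to a state):

    def short_prefixes(delta, q0, alphabet):
        """Sp(L): BFS from q0 with sorted symbols, i.e. standard order."""
        sp, frontier = {q0: ""}, [q0]
        while frontier:
            nxt = []
            for q in frontier:
                for a in sorted(alphabet):
                    q2 = delta.get((q, a))
                    if q2 is not None and q2 not in sp:
                        sp[q2] = sp[q] + a      # first (shortest) prefix wins
                        nxt.append(q2)
            frontier = nxt
        return sp                               # state -> its short prefix

    def kernel(delta, q0, alphabet):
        """N(L): short prefixes extended by every existing transition, plus ε."""
        sp = short_prefixes(delta, q0, alphabet)
        return {""} | {sp[q] + a for (q, a) in delta if q in sp}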

What would be N(L) for the following DFA target ?

F. Coste (Inria) Grammatical Inference SML 2020-2021 55 / 123

Characteristic sample for RPNI

S = 〈S+, S−〉 is a characteristic sample of A(L) for RPNI if:

∀x ∈ N(L) : ∃u ∈ Σ∗, xu ∈ S+ (u = ε if x ∈ L)

∀x, y ∈ N(L), δA(L)(q0, x) ≠ δA(L)(q0, y) :

∃u ∈ Σ∗, ((xu ∈ S+ and yu ∈ S−) or (xu ∈ S− and yu ∈ S+))

What would be a characteristic sample for ?

Is the characteristic sample unique for an automaton?

It can be shown that:

- Adding new examples to the characteristic sample does not change the automaton returned by RPNI

- For each A(L), there exists a characteristic sample of size O(|A(L)|²)

F. Coste (Inria) Grammatical Inference SML 2020-2021 56 / 123

What about merging states in random order?
Trakhtenbrot and Barzdin 1973

Algorithm: deterministic merges, in random order, of pairs of states not resulting in incompatible automata

Algorithm complexity? At most |PTA| · |A|² [Lang 92] (where A is the target automaton)

Characteristic sample? {w ∈ Σ∗ : |w| ≤ d + 1 + ρ}
d: depth of the automaton
ρ: distinguishability degree (length of suffix required to distinguish pairs of states, i.e. allowing to reach a final state and a non-final state)

Worst case: d = ρ = |A| − 1
On average, ρ = log|Σ| log2 |A| and d = C log|Σ| (where C: constant wrt Σ)
For |Σ| = 2, the average size is ∼ 16|A|² − 1:
|A| = 32 → 16383 seq., 65 → 67599, 506 → 4096575 ...
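A quick sanity check of the quoted figures against the ∼ 16|A|² − 1 estimate:

    # recompute the average characteristic-sample sizes quoted above
    for n in (32, 65, 506):
        print(n, 16 * n * n - 1)       # 16383, 67599, 4096575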

F. Coste (Inria) Grammatical Inference SML 2020-2021 57 / 123

RPNI

The solution returned by RPNI is:

a DFA belonging to the Border Set

the canonical automaton of the language that it accepts

if the sample is characteristic, the smallest compatible DFA
(Contradiction with the NP-Completeness of the problem? No, the sample has to be characteristic!)

Complexity: O((|S+| + |S−|) · |S+|²); characteristic sample: O(n²) ⇒

DFA are identifiable in the limit from polynomial time and data (IPTD)

F. Coste (Inria) Grammatical Inference SML 2020-2021 58 / 123

Positive results

Deterministic automata are IPTD
⇒ Even linear grammars [Takada 88,94], [Sempere, García 94], [Mäkinen 96]

⇒ Sub-sequential transducers [Oncina, García, Vidal 93]

⇒ Context-free grammars from structure [Sakakibara 90]

⇒ Tree automata [Knuutila 93]

F. Coste (Inria) Grammatical Inference SML 2020-2021 59 / 123

Simple PAC

[Denis, D’Halluin, Gilleron 96]

PAC Learning but for “simple distribution” only

Simple examples have a higher probability in the training sample,
so unseen simple examples are counter-examples

DFA are Simple PAC learnable [Parekh, Honavar 97]

DFA are Simple PAC learnable from positive examples [Denis 98]

F. Coste (Inria) Grammatical Inference SML 2020-2021 60 / 123

Negative results

The classes below are not IPTD for |Σ| ≥ 2 :

Context-free grammars

Linear grammars

Non-deterministic automata

F. Coste (Inria) Grammatical Inference SML 2020-2021 61 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
        Problem definition
        RPNI
        Structural completeness hypothesis
        Utility of counter-examples
        EDSM heuristic
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 67 / 123

Unbiased/symmetrical learning

Defining a regular language ⇔ defining the complementary language

[Alquezar, Sanfeliu 95]:

Consider S+ and S− symmetrically → learn L+ and L−

Classification of words: +, - or ?

Related to learning Mealy and Moore finite state machines [Biermann, Feldman 72], and automata [Lang 92], [Oncina, García 92]

F. Coste (Inria) Grammatical Inference SML 2020-2021 68 / 123

Maximal Canonical Automaton

S+ = {aaa, bba, baaa} ; S− = {aaaa, baab, bbabab}
MCA(S+, S−):

Rote learning

F. Coste (Inria) Grammatical Inference SML 2020-2021 69 / 123

EDSM Heuristic

Evidence Driven State Merging
R. Price, K. Lang, Abbadingo One, 1998

Data-driven heuristic

Dynamic choice of the best pair of states to merge at each step, according to the evidence of a good merge

Evidence measure: maximise the count of final states merged during merging for determinization

(Rem.: → similarity between subtrees)
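A minimal sketch of the evidence score on the augmented PTA(S+, S−) (representation and names are ours: the PTA is a tree, label[q] ∈ {+1, −1, 0}):

    NEG_INF = float("-inf")

    def edsm_score(tree, label, p, q):
        """Evidence for merging the subtrees rooted at p and q in PTA(S+, S-).
        tree[q]: symbol -> unique child; label[q]: +1 accepting, -1 rejecting, 0 unknown."""
        score = 0
        if label[p] and label[q]:
            if label[p] != label[q]:
                return NEG_INF   # an accepting state would merge with a rejecting one
            score = 1            # one more pair of identically labelled states unified
        for a in tree[p].keys() & tree[q].keys():
            sub = edsm_score(tree, label, tree[p][a], tree[q][a])
            if sub == NEG_INF:
                return NEG_INF
            score += sub
        return score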

F. Coste (Inria) Grammatical Inference SML 2020-2021 70 / 123

Example

S+ = {a, aaa, ba, baaa} S− = {aab, baab, baba}

PTA(S+, S−)

f(1,2) = 0, f(0,2) = 3, . . . , f(3,8) = 2, . . . , f(8,9) = −∞, . . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 71 / 123


Example

S+ = {a, aaa, ba, baaa} S− = {aab, baab, baba}

PTA(S+, S−)

f({0, 2}, {1, 4}) = −∞, f({0, 2}, {3, 8}) = 1, . . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 73 / 123

Example

S+ = {a, aaa, ba, baaa} S− = {aab, baab, baba}

PTA(S+, S−)

f({1, 4, 6, 11}, {9}) = 1, . . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 74 / 123

Example

S+ = {a, aaa, ba, baaa} S− = {aab, baab, baba}

PTA(S+, S−)

F. Coste (Inria) Grammatical Inference SML 2020-2021 75 / 123

EDSM: a good heuristic . . . for automata randomly and uniformly generated

fig. from Merge order count K. Lang, 1997

Abbadingo: problem with 506 states, 60 000 seq. (R. Price);
would require ∼ 100 000 seq. with RPNI.

F. Coste (Inria) Grammatical Inference SML 2020-2021 76 / 123

. . . but expensive

Evaluation of O(n²) merges at each step of the algorithm!

Remark: scores for merging states far from the root are smaller

→ window w (Lang, Price?)
→ Blue-Fringe. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 77 / 123

Blue Fringe, H. Juille, Abbadingo One, 1998

fig. from Faster algorithms for finding minimal consistent DFA, K. Lang, 1999

Any state of B not mergeable with any state of R is promoted to R

Merge pairs of states in B × R
Easy to implement: states of B are roots of subtrees

Blue-Fringe + EDSM (+ SAGE, H. Juille)
Abbadingo: problem with 65 states, 1 521 seq.;
would require ∼ 4 000 seq. with RPNI
F. Coste (Inria) Grammatical Inference SML 2020-2021 78 / 123

Learning from positive and negative examples

[Gold 67]:

No superfinite class of languages can be identified in the limit from positive examples only
The class of primitive recursive functions can be identified in the limit from positive and negative examples

Efficient learning

DFA are IPTD from positive and negative examples (RPNI)
Extension to some closely related classes
NFA are not! CFG neither . . .
A heuristic (EDSM) that seems to perform better . . . (?)

What if negative examples are not available?

F. Coste (Inria) Grammatical Inference SML 2020-2021 79 / 123

Outline

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples

F. Coste (Inria) Grammatical Inference SML 2020-2021 80 / 123

Learning from positive examples (only)

Statistical criteria for not merging pairs of states: ALERGIA

“Characterizable” methods: k-RI, k-testable languages

Heuristic methods: ECGI

F. Coste (Inria) Grammatical Inference SML 2020-2021 81 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples
        Alergia
        k-reversible languages
        ECGI

F. Coste (Inria) Grammatical Inference SML 2020-2021 82 / 123

ALERGIA

[Carrasco, Oncina 99]

Input: S+, precision parameter α
Output: (probabilistic) DFA A

A ← PPTA(S+)
for all (p, q) in standard order do
    if compatible(p, q, α) then
        A ← deterministic merge(A, p, q)
    end if
end for

F. Coste (Inria) Grammatical Inference SML 2020-2021 83 / 123

ALERGIA

Compatibility between two states q1 and q2:

Transition probabilities are similar enough:

∀a ∈ Σ ∪ {#},   |C(q1, a)/C(q1) − C(q2, a)/C(q2)| < √((1/2) ln(2/α)) · (1/√C(q1) + 1/√C(q2))

Compatibility of successors:

∀a ∈ Σ, δ(q1, a) and δ(q2, a) are α-compatible
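The test is a Hoeffding-style bound; a minimal sketch (names ours):

    from math import log, sqrt

    def similar_enough(c1, c1a, c2, c2a, alpha):
        """ALERGIA's compatibility test: C(q) strings reach q, C(q, a) leave q
        on symbol a (with '#' standing for 'the string ends at q')."""
        gap = abs(c1a / c1 - c2a / c2)
        bound = sqrt(0.5 * log(2 / alpha)) * (1 / sqrt(c1) + 1 / sqrt(c2))
        return gap < bound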

F. Coste (Inria) Grammatical Inference SML 2020-2021 84 / 123

ALERGIA

Local measure of suffix language similarity

Other measures . . .
→ Learning probabilistic automata
→ Identification of probability distributions on words

See:

PAC-learnability of Probabilistic Deterministic Finite State Automata, A. Clark and F. Thollard, Journal of Machine Learning Research, 2004.
Towards feasible PAC-learning of probabilistic deterministic finite automata, J. Castro and R. Gavaldà, ICGI 2008.
Learning rational stochastic languages, F. Denis, Y. Esposito, A. Habrard, COLT 2006.
Spectral learning of weighted automata - A forward-backward perspective, B. Balle, X. Carreras, F. M. Luque, A. Quattoni, Machine Learning, 2014. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 85 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples
        Alergia
        k-reversible languages
        ECGI

F. Coste (Inria) Grammatical Inference SML 2020-2021 86 / 123

Characterizable learning

The negative result of [Gold67] applies to superfinite classes of languages. To avoid over-generalization, an approach performing a minimal generalisation at each step ensures identification for particular classes of languages.

F. Coste (Inria) Grammatical Inference SML 2020-2021 87 / 123

0-reversible languages

0-reversible automaton: deterministic automaton whose mirror is deterministic

0-reversible language = a language recognized by a 0-reversible automaton

Learnable from positive sample [Angluin 82]
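Since 0-reversibility of a DFA amounts to having at most one final state and a deterministic mirror, it can be checked directly. A minimal sketch (representation ours):

    def is_zero_reversible(delta, finals):
        """delta: (state, symbol) -> state. The mirror is deterministic iff
        there is at most one final state and no two transitions with the
        same symbol enter the same state."""
        if len(finals) > 1:
            return False
        seen = set()
        for (p, a), q in delta.items():
            if (a, q) in seen:
                return False
            seen.add((a, q))
        return True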

F. Coste (Inria) Grammatical Inference SML 2020-2021 88 / 123

k-reversible languages

k-reversible automaton: deterministic automaton A whose reverse Ar is deterministic with look-ahead k:
∀q, q′ ∈ Q, q ≠ q′, ((q, q′ ∈ Q0) ∨ (q, q′ ∈ δ(q′′, a))) ⇒ ∄u ∈ Σ^k : (δ(q, u) ≠ ∅) ∧ (δ(q′, u) ≠ ∅).
k-reversible language iff a k-reversible automaton recognizes it
(⇔ u1vw, u2vw ∈ L and |v| = k ⇒ SL(u1v) = SL(u2v)).

A : Ar : A is 1-reversible

Are the following languages 0-reversible, 1-reversible?

Σ∗, 1∗01, 0∗1+, 11∗

Find one non-1-reversible language. . . Do there exist non-reversible languages (i.e. non-k-reversible for all k)? 0∗(1 + ε)0∗

F. Coste (Inria) Grammatical Inference SML 2020-2021 89 / 123

k-RI [Angluin82]

Input : k, S+

Output: Ak, the canonical automaton accepting the smallest k-reversible language including S+

A ← PTA(S+)
while ∃(q1, q2) ← non-k-reversible(A) do
    A ← deterministic merge(A, q1, q2)
end while

Time complexity: O(Σ^k · |S+|^(k+3)) Source: [TD2013]

Memory complexity: O(|S+|)
Non-incremental algorithm

[TD2013] How Symbolic Learning Can Help Statistical Learning (and vice versa), I. Tellier and Y. Dupont, RANLP 2013

F. Coste (Inria) Grammatical Inference SML 2020-2021 90 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
Prefix tree acceptor (PTA)

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
Merging all final states

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
Merging for determinisation of states B

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
B predecessors of A by a, D predecessors of A by b have to be merged

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
C predecessors of B by b to merge

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI
Example, k = 0

S = {ε, aa, bb, aaaa, abab, abba, baba}
Solution

F. Coste (Inria) Grammatical Inference SML 2020-2021 91 / 123

k-RI

Remarks: k-RI returns the smallest language, not the smallest automaton!

[Angluin82]

the class Ck−rev is identifiable from positive examples (proof: existence of a characteristic sample)

see also: distinguishing functions [Fernau2000]

Choice of k?
Pertinence of the subclass for the application?
Exercise: automaton returned for S = {a, aa, aaa} (k = 0)

F. Coste (Inria) Grammatical Inference SML 2020-2021 92 / 123

1 Learning automata
    Definitions
    Learning automata from positive and negative examples
    Learning automata from positive examples
        Alergia
        k-reversible languages
        ECGI

F. Coste (Inria) Grammatical Inference SML 2020-2021 94 / 123

ECGI Heuristic

Error Correcting GI [Rulot, Vidal 88]

Learns regular grammars which are non-deterministic and without cycles, s.t. ∀A, B, C ∈ N, ∀b, a ∈ Σ:

if (B → aA) ∈ P and (C → bA) ∈ P then b = a

Positive examples

Incremental algorithm:

First grammar G0 = first example s0

Minimal modification of Gi−1 to accept new example si

F. Coste (Inria) Grammatical Inference SML 2020-2021 95 / 123

ECGI

Error rules:
Insertion of a: A → aA, ∀(A → bB) ∈ P, ∀a ∈ Σ
Substitution of b by a: A → aB, ∀(A → bB) ∈ P, ∀a ∈ Σ ; A → a, ∀(A → b) ∈ P, ∀a ∈ Σ
Deletion of b: A → B, ∀(A → bB) ∈ P ; A → ε, ∀(A → b) ∈ P

By extending Gi−1 with these error rules, one can compute (by dynamic programming) the optimal error-correcting parsing of si (using a minimal number of error rules)

Gi is Gi−1 extended with the minimal set of rules required to parse si
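A minimal sketch of the cost of this optimal error-correcting parsing, assuming the grammar is given as an acyclic automaton (our representation: delta[q] maps a symbol to the next state); the real algorithm also records which error rules fire in order to extend the grammar:

    from functools import lru_cache

    def ec_cost(delta, finals, start, s):
        """Minimal number of error rules needed to parse s
        (acyclic automaton, as ECGI grammars have no cycles)."""
        INF = float("inf")

        @lru_cache(maxsize=None)
        def f(q, i):
            best = 0 if (q in finals and i == len(s)) else INF
            if i < len(s):
                best = min(best, 1 + f(q, i + 1))                 # insertion of s[i]
            for a, q2 in delta.get(q, {}).items():
                if i < len(s):
                    best = min(best, (a != s[i]) + f(q2, i + 1))  # match (0) or substitution (1)
                best = min(best, 1 + f(q2, i))                    # deletion of expected a
            return best

        return f(start, 0)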

F. Coste (Inria) Grammatical Inference SML 2020-2021 96 / 123

ECGI

Example for S+ = {aabb, abbb, abbab, bbb} :

F. Coste (Inria) Grammatical Inference SML 2020-2021 97 / 123

ECGI

Input: I+
Output: a grammar “ECGI” G compatible with I+

x ← I1+ ; n ← |x|
N ← {A0, . . . , An−1} ; Σ ← {a1, . . . , an}
P ← {(Ai → ai Ai), i = 1, . . . , n − 1} ∪ {An−1 → an}
S ← A0 ; G1 ← (N, Σ, P, S)
for i = 2 to |I+| do
    G ← Gi−1 ; x ← Ii+ ; P′ ← optimal derivation(x, Gi−1)
    for j = 1 to |P′| do
        G ← extend gram(G, pj)
    end for
    Gi ← G
end for
Return G

F. Coste (Inria) Grammatical Inference SML 2020-2021 98 / 123

ECGI

No recursion

Heuristic capturing variations of a family of sequences

Order of examples can change result

Easy extension to stochastic grammars

Is there a link between the structural completeness hypothesis and ECGI?

F. Coste (Inria) Grammatical Inference SML 2020-2021 99 / 123

Learning languages: conclusion and perspective

Learning to classify sequences:

Classical machine learning approach
Transformation into attribute-value representations and use of classical ML.

Word embeddings in multiple dimensions. . .

Learning automata
Well studied. Established learnability results.

Recent advances on learning regular distributions. . .

Learning grammars
Hot topic nowadays. Substitutability as a central concept for practical algorithms and learnability results even beyond CFG (mildly context-sensitive languages).

Learning graphs is an emerging domain. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 120 / 123

Some references

Inférence grammaticale régulière : fondements théoriques et principaux algorithmes, Dupont, Miclet, RR-INRIA 3449, 1998

Recent advances of grammatical inference, Sakakibara, TCS vol 185, pp 15-45, 1997

A bibliographical study of Grammatical Inference, de la Higuera, 2002, http://pagesperso.lina.univ-nantes.fr/~cdlh/papers/bibliography_survey.pdf

Grammatical Inference, Colin de la Higuera

Inférence grammaticale, Chap. 7, support de cours de Laurent Miclet, ftp://ftp.irisa.fr/local/cordial/polyAC0304.ps

Learnable classes of categorial grammars, Kanazawa, Cambridge University Press, 1998

Topics in Grammatical Inference, Heinz, Jeffrey and Sempere, José M. (Eds.), Springer, 2016

Grammatical Inference Homepage : http://www.grammarlearning.org/

F. Coste (Inria) Grammatical Inference SML 2020-2021 121 / 123

Biological palindrome: S → aSt|cSg|tSa|gSc|ε
Derivation tree of atgttcgaacat?
Consequence of adding a new rewriting rule: S → SS|aSt|cSg|tSa|gSc|ε?
Derivation tree of caaatcgatcatcgaagagctcttgttg? Of gaatattcgaatattc?

Copy
S → AaS | CcS | GgS | TtS | X
X → ε
Aa → aA ; Ac → cA ; Ag → gA ; At → tA
Ca → aC ; Cc → cC ; Cg → gC ; Ct → tC
Ga → aG ; Gc → cG ; Gg → gG ; Gt → tG
Ta → aT ; Tc → cT ; Tg → gT ; Tt → tT
AX → Xa ; CX → Xc ; GX → Xg ; TX → Xt

Derivation tree of ctaacctaac ?

F. Coste (Inria) Grammatical Inference SML 2020-2021 122 / 123

What we have seen in SML so far

Introduction to machine learning

Generalisation, necessity of a bias. . .
How to define properly a machine learning problem: choice of object description, choice of hypothesis space, choice of ‘best’ hypothesis, i.e. setting biases
Exploration of the search space
Evaluation of the risk

Learning on sequences

Vectorization of texts and Naive Bayes
Automata and learnability

Next: state-of-the-art algorithms for attribute-value representations of instances. . .

F. Coste (Inria) Grammatical Inference SML 2020-2021 123 / 123