
Pascal Bootcamp, Marseille

Grammatical inference

Colin de la Higuera
University of Nantes


Cdlh 2010

Acknowledgements

Laurent Miclet, Jose Oncina, Tim Oates, Anne-Muriel Arigon, Leo Becerra-Bonache, Rafael Carrasco, Paco Casacuberta, Pierre Dupont, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Jean-Christophe Janodet, Satoshi Kobayashi, Thierry Murgue, Frédéric Tantini, Franck Thollard, Enrique Vidal, Menno van Zaanen,...

http://pages-perso.univ-nantes.fr/~cdlh/
http://videolectures.net/colin_de_la_higuera/


Outline

1. What is learning automata about?
2. A (detailed) introductory example
3. Validation issues
4. Some criteria
5. Learning from an informant
6. Learning from text
7. Learning by observing
8. Learning actively
9. Extensions (PFA, transducers, tree automata)
10. Conclusions


1 Grammatical inference

is about learning a grammar given information about a language
Information is strings, trees or graphs
Information can be
    Text: only positive information
    Informant: labelled data
    Actively sought (query learning, teaching)

Above lists are not exclusive


The functions/goals

Languages and grammars from the Chomsky hierarchy
Probabilistic automata and context-free grammars
Hidden Markov Models
Patterns
Transducers
…


The data: examples of strings

A string in Gaelic and its translation to English:

Tha thu cho duaichnidh ri èarr àirde de a’ coisichdeas damh
You are as ugly as the north end of a southward traveling ox


>A BAC=41M14 LIBRARY=CITB_978_SKBAAGCTTATTCAATAGTTTATTAAACAGCTTCTTAAATAGGATATAAGGCAGTGCCATGTAGTGGATAAAAGTAATAATCATTATAATATTAAGAACTAATACATACTGAACACTTTCAATGGCACTTTACATGCACGGTCCCTTTAATCCTGAAAAAATGCTATTGCCATCTTTATTTCAGAGACCAGGGTGCTAAGGCTTGAGAGTGAAGCCACTTTCCCCAAGCTCACACAGCAAAGACACGGGGACACCAGGACTCCATCTACTGCAGGTTGTCTGACTGGGAACCCCCATGCACCTGGCAGGTGACAGAAATAGGAGGCATGTGCTGGGTTTGGAAGAGACACCTGGTGGGAGAGGGCCCTGTGGAGCCAGATGGGGCTGAAAACAAATGTTGAATGCAAGAAAAGTCGAGTTCCAGGGGCATTACATGCAGCAGGATATGCTTTTTAGAAAAAGTCCAAAAACACTAAACTTCAACAATATGTTCTTTTGGCTTGCATTTGTGTATAACCGTAATTAAAAAGCAAGGGGACAACACACAGTAGATTCAGGATAGGGGTCCCCTCTAGAAAGAAGGAGAAGGGGCAGGAGACAGGATGGGGAGGAGCACATAAGTAGATGTAAATTGCTGCTAATTTTTCTAGTCCTTGGTTTGAATGATAGGTTCATCAAGGGTCCATTACAAAAACATGTGTTAAGTTTTTTAAAAATATAATAAAGGAGCCAGGTGTAGTTTGTCTTGAACCACAGTTATGAAAAAAATTCCAACTTTGTGCATCCAAGGACCAGATTTTTTTTAAAATAAAGGATAAAAGGAATAAGAAATGAACAGCCAAGTATTCACTATCAAATTTGAGGAATAATAGCCTGGCCAACATGGTGAAACTCCATCTCTACTAAAAATACAAAAATTAGCCAGGTGTGGTGGCTCATGCCTGTAGTCCCAGCTACTTGCGAGGCTGAGGCAGGCTGAGAATCTCTTGAACCCAGGAAGTAGAGGTTGCAGTAGGCCAAGATGGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTATGTCCAAAAAAAAAAAAAAAAAAAAGGAAAAGAAAAAGAAAGAAAACAGTGTATATATAGTATATAGCTGAAGCTCCCTGTGTACCCATCCCCAATTCCATTTCCCTTTTTTGTCCCAGAGAACACCCCATTCCTGACTAGTGTTTTATGTTCCTTTGCTTCTCTTTTTAAAAACTTCAATGCACACATATGCATCCATGAACAACAGATAGTGGTTTTTGCATGACCTGAAACATTAATGAAATTGTATGATTCTAT


<book>
  <part>
    <chapter>
      <sect1/>
      <sect1>
        <orderedlist numeration="arabic">
          <listitem/>
          <f:fragbody/>
        </orderedlist>
      </sect1>
    </chapter>
  </part>
</book>


<?xml version="1.0"?>
<?xml-stylesheet href="carmen.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE pagina [
  <!ELEMENT pagina (titulus?, poema)>
  <!ELEMENT titulus (#PCDATA)>
  <!ELEMENT auctor (praenomen, cognomen, nomen)>
  <!ELEMENT praenomen (#PCDATA)>
  <!ELEMENT nomen (#PCDATA)>
  <!ELEMENT cognomen (#PCDATA)>
  <!ELEMENT poema (versus+)>
  <!ELEMENT versus (#PCDATA)>
]>
<pagina>
  <titulus>Catullus II</titulus>
  <auctor>
    <praenomen>Gaius</praenomen>
    <nomen>Valerius</nomen>
    <cognomen>Catullus</cognomen>
  </auctor>


And also

Business processes
Bird songs
Images (contours and shapes)
Robot moves
Web services
Malware
…


2 An introductory example

D. Carmel and S. Markovitch. Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental and Theoretical Artificial Intelligence, 10(3):309–332, 1998
D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems, 2(2):141–172, 1999


The problem:

An agent must take cooperative decisions in a multi-agent world
His decisions will depend:
    on what he hopes to win or lose
    on the actions of other agents


Hypothesis: the opponent follows a rational strategy (given by a DFA/Moore machine):

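[Figure: a Moore machine with transitions labelled e and p and state outputs l and d. Example runs: e e p e p → l, e e e → d]

You: listen (l) or doze (d)

Me: equations (e) or pictures (p)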


Example: (the prisoner’s dilemma)

Each prisoner can admit (a) or stay silent (s)

If both admit: 3 years each
If A admits but not B: A=0 years, B=5 years
If B admits but not A: B=0 years, A=5 years
If neither admits: 1 year each


Payoff table (gain of A, gain of B):

              B: a        B: s
    A: a     -3, -3       0, -5
    A: s     -5, 0       -1, -1


Here an iterated version against an opponent that follows a rational strategy
Gain function: limit of means
A game is a string in (His_moves × My_moves)*!

Example [as] [as] [ss] [aa]
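
As a rough illustration of the gain function, the sketch below computes my mean gain over the finite example game. It assumes each pair is written (his move, my move), as in the definition of a game, and that the gains are the prisoner's dilemma penalties counted negatively. The names MY_GAIN, parse_game and mean_gain are hypothetical. A finite mean only approximates the limit of means, which is defined on an infinite game.

```python
# Hypothetical helper names; gains follow the prisoner's dilemma penalties.
# Keys are (his_move, my_move); values are MY gain in (negative) years.
MY_GAIN = {
    ("a", "a"): -3,  # both admit
    ("a", "s"): -5,  # he admits, I stay silent
    ("s", "a"): 0,   # I admit, he stays silent
    ("s", "s"): -1,  # neither admits
}


def parse_game(text):
    """Turn '[as] [ss]' into [('a', 's'), ('s', 's')] (him first, me second)."""
    return [(token[1], token[2]) for token in text.split()]


def mean_gain(game):
    """Average of my gains over the rounds of a finite game."""
    return sum(MY_GAIN[pair] for pair in game) / len(game)


game = parse_game("[as] [as] [ss] [aa]")
print(mean_gain(game))
```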


The general problem

We suppose that the strategy of the opponent is given by a deterministic finite automaton

Can we imagine an optimal strategy?


Suppose we know the opponent’s strategy:

Then (game theory):
Consider the opponent’s graph in which we value the edges by our own gain


[Figure: the payoff table (-3, 0, -5, -1) next to the opponent's automaton, whose edges are labelled a or s and valued by my gain]


[Figure: the opponent's automaton with edges valued by my gain. Labels on the figure: "Best path", "Mean = -0.5"]

1. Find the cycle of maximum mean weight
2. Find the best path leading to this cycle of maximum mean weight
3. Follow the path and stay in the cycle
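
Step 1 can be sketched with Karp's dynamic programme for the maximum mean cycle. This is one standard technique and not necessarily the one used in the talk. The function name max_mean_cycle and the two-state example opponent are assumptions: the opponent below simply repeats my previous move, in the spirit of Tit for Tat, and is not the automaton drawn on the slide. Steps 2 and 3 are not covered.

```python
def max_mean_cycle(num_states, edges):
    """Maximum mean weight over all cycles (Karp's dynamic programme).

    edges is a list of (source, target, weight), states numbered from 0.
    Returns None if the graph has no cycle.
    """
    neg_inf = float("-inf")
    # best[k][v] = largest total weight of a walk with exactly k edges ending in v
    best = [[neg_inf] * num_states for _ in range(num_states + 1)]
    best[0] = [0.0] * num_states  # a walk may start in any state
    for k in range(1, num_states + 1):
        for source, target, weight in edges:
            if best[k - 1][source] > neg_inf:
                candidate = best[k - 1][source] + weight
                if candidate > best[k][target]:
                    best[k][target] = candidate
    result = None
    for v in range(num_states):
        if best[num_states][v] == neg_inf:
            continue  # no walk of num_states edges ends here
        worst_ratio = min(
            (best[num_states][v] - best[k][v]) / (num_states - k)
            for k in range(num_states)
            if best[k][v] > neg_inf
        )
        if result is None or worst_ratio > result:
            result = worst_ratio
    return result


# Hypothetical opponent: state 0 outputs s, state 1 outputs a,
# and my move a leads to state 1, my move s leads to state 0.
# Edge weights are my gains from the payoff table.
copy_opponent_edges = [
    (0, 0, -1),  # he plays s, I play s
    (0, 1, 0),   # he plays s, I play a
    (1, 1, -3),  # he plays a, I play a
    (1, 0, -5),  # he plays a, I play s
]
print(max_mean_cycle(2, copy_opponent_edges))
```

Against this assumed opponent the best cycle is to stay silent forever in state 0.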


Question

Can we play a game against this opponent and…

can we reconstruct his strategy ?


Data (him, me) : {aa as sa aa as ss ss ss sa}

HIM  ME
a    a
a    s
s    a
a    a
a    s
s    s
s    s
s    s
s    a

I play asa, his move is a

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → s
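
The mapping from prefixes of my moves to his next move can be rebuilt mechanically from the data, and a candidate Moore machine can then be tested against it. The sketch below is a minimal illustration with hypothetical names (observations, run_moore, is_consistent) and two hypothetical candidate machines. It is only checked against the nine pairs listed above, not against the automaton built on the following slides.

```python
HIS_MOVES = "aasaassss"  # column HIM of the data
MY_MOVES = "asaassssa"   # column ME of the data


def observations(his_moves, my_moves):
    """Map each prefix of my moves to the move he played next."""
    return {my_moves[:i]: his_moves[i] for i in range(len(his_moves))}


def run_moore(machine, my_prefix):
    """Output of the state reached after reading my_prefix.

    machine = (initial_state, outputs, transitions), where
    outputs[state] is his move and transitions[(state, my_move)] the next state.
    """
    state, outputs, transitions = machine
    for move in my_prefix:
        state = transitions[(state, move)]
    return outputs[state]


def is_consistent(machine, obs):
    """True if the machine predicts his move for every observed prefix."""
    return all(run_moore(machine, prefix) == move for prefix, move in obs.items())


obs = observations(HIS_MOVES, MY_MOVES)

# Hypothetical candidate 1: a single state that always outputs a.
always_a = (0, {0: "a"}, {(0, "a"): 0, (0, "s"): 0})

# Hypothetical candidate 2: start with a, then repeat my previous move.
copy_last = (
    "A",
    {"A": "a", "S": "s"},
    {("A", "a"): "A", ("A", "s"): "S", ("S", "a"): "A", ("S", "s"): "S"},
)

print(is_consistent(always_a, obs), is_consistent(copy_last, obs))
```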


First move: I play a, he plays a

λ → a, a → ?

[Figure: partial automata under the labels "Sure:", "Have to deal with:" and "Try:"]


Second move: I play s, he plays a

λ → a, a → a, as → ?

[Figure: partial automata under the labels "Confirmed:", "Have to deal with:" and "Try:"]


Third move: I play a, he plays s

λ → a, a → a, as → s, asa → ?

[Figure: partial automata under the labels "Inconsistent:", "Consistent:", "Try:" and "Have to deal with:"]


Fourth move: I play a, he plays a

λ → a, a → a, as → s, asa → a, asaa → ?

[Figure: automaton under the label "Consistent:"]


Fifth move: I play s, he plays a

λ → a, a → a, as → s, asa → a, asaa → a, asaas → ?, asaass → s

[Figure: automaton under the label "Consistent:"]


Sixth move: I play s, he plays s

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → ?

[Figure: automata under the labels "Consistent:" and "But have to deal with:"]


Sixth move: I play s, he plays s

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → ?

[Figure: automaton under the label "Try this:"]


Seventh move: I play s, he plays s

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → s, asaasss → ?

[Figure: automata under the labels "Inconsistent:" and "Consistent:"]


Eighth move: I play a, he plays s

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → s, asaasss → s, asaasssa → ?

[Figure: automata under the labels "Inconsistent:" and "Consistent:"]


Ninth move: I play a, he plays s

λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → s, asaasss → s, asaasssa → s, asaasssas → ?

[Figure: automata under the labels "Inconsistent:" and "Consistent:"]


λ → a, a → a, as → s, asa → a, asaa → a, asaas → s, asaass → s, asaasss → s, asaasssa → s

[Figure: the hypothesis automaton, shown over six successive animation frames (slides 39 to 44)]


Result

[Figure: the automaton obtained for the opponent's strategy, with states and transitions labelled a and s]


How do we get hold of the learning data?

a) through observation
b) through exploration (like here)


An open problem

The strategy is probabilistic:

[Figure: an automaton with transitions labelled a and s, whose states output a move at random: (a: 70%, s: 30%), (a: 50%, s: 50%), (a: 20%, s: 80%)]


Tit for Tat

[Figure: the Tit for Tat strategy drawn as an automaton with states s and a and transitions labelled a and s]


3 What does learning mean?

Suppose we write a program that can learn FSM… are we done?
The first question is: « why bother? »
If my programme works, why do something more about it?
Why should we do something when other researchers in Machine Learning are not?


Motivating question #1

Is 17 a random number?

Is 0110110110110101011000111101 a random sequence?

(Is FSM A the correct FSM for sample S?)


Motivating question #2

Statement “I have learnt” does not make sense

Statement “I am learning” makes sense


Motivating question #3

In the case of languages, learning is an ongoing process.

Is there a moment where we can say we have learnt a language?


What usually is called “having learnt”

That the FSM is the smallest, best (re a score)
    Combinatorial characterisation
That some optimisation problem has been solved
That the “learning” algorithm has converged (EM)


What we would like to say

That having solved some complex combinatorial question we have an Occam, Compression, MDL, Kolmogorov complexity like argument which gives us some guarantee with respect to the future
Computational learning theory is full of such results


Why should we bother and those working in statistical machine learning not?

Whether with numerical functions or with symbolic functions, we are all trying to do some sort of optimisation
The difference is (perhaps) that numerical optimisation works much better than combinatorial optimisation!
[they actually do bother, only differently]
Combinatorics are harder (in this case) than optimisation


4 Some convergence criteria

What would we like to say?
That in the near future, given some string, we can predict if this string belongs to the language or not
It would be nice to be able to bet €1000 on this


(if not) What would we like to say?

That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased)
Idea: blame the data, not the algorithm


Suppose we cannot say anything of the sort?

Then that means that we may be terribly wrong even in a favourable setting
Thus there is a hidden bias


4.1 Non probabilistic setting

Identification in the limit
Resource bounded identification in the limit
Active learning (query learning)


Example

[Figure: the numbers 2, 3, 7, 5, 103, 31, 23, 11 shown with the candidate hypotheses {2}, {2, 3}, "Fibonacci numbers" and "Prime numbers"]


Identification in the limit

E. M. Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967
E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37:302–320, 1978


The general idea

Information is presented to the learner who updates its hypothesis after each piece of data
At some point, always, the learner will have found the correct concept and not move from it


A presentation is
a function ϕ : ℕ → X, where X is some set,
and such that ϕ is associated to a language L through a function yields: yields(ϕ) = L.
If ϕ(ℕ) = ψ(ℕ) then yields(ϕ) = yields(ψ)


Some types of presentations (1)

A text presentation of a language L⊆Σ* is a function ϕ : ℕ → Σ* such that ϕ(ℕ)=L

ϕ is an infinite succession of all the elements of L

(note : small technical difficulty with ∅)


Some types of presentations (2)

An informed presentation (or an informant) of L⊆Σ* is a function ϕ : ℕ → Σ* × {-,+} such that ϕ(ℕ) = (L,+) ∪ (Σ*∖L,-)
ϕ is an infinite succession of all the elements of Σ* labelled to indicate if they belong or not to L


Presentation for {aⁿbⁿ : n ∈ ℕ}

Legal presentation from text: λ, a²b², a⁷b⁷, …
Illegal presentation from text: ab, ab, ab, …
Legal presentation from informant: (λ,+), (abab,-), (a²b²,+), (a⁷b⁷,+), (aab,-), …
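
The two kinds of presentation can be sketched as Python generators. The function names are hypothetical, and the orders chosen (by increasing n for the text, by increasing length for the informant) are just one legal choice among many, since a presentation only has to list every required element at least once.

```python
from itertools import count, islice, product


def in_language(word):
    """Membership in the language of strings a^n b^n, n >= 0."""
    half = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * half + "b" * half


def text_presentation():
    """Yield every string of the language, here in order of n."""
    for n in count():
        yield "a" * n + "b" * n


def informed_presentation():
    """Yield every string over {a, b}, by increasing length, with its label."""
    for length in count():
        for letters in product("ab", repeat=length):
            word = "".join(letters)
            yield (word, "+" if in_language(word) else "-")


print(list(islice(text_presentation(), 3)))
print(list(islice(informed_presentation(), 7)))
```

The empty Python string stands for λ.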


Learning function

Given a presentation ϕ, ϕₙ is the set of the first n elements in ϕ
A learning algorithm a is a function that takes as input a set ϕₙ and returns a representation of a language
Given a grammar G, L(G) is the language generated/recognised/represented by G


Convergence to a hypothesis

Let L be a language from a class ℒ, let ϕ be a presentation of L and let ϕₙ be the first n elements in ϕ.
a converges to G with ϕ if:
    ∀n ∈ ℕ: a(ϕₙ) halts and gives an answer
    ∃n₀ ∈ ℕ: n ≥ n₀ ⇒ a(ϕₙ) = G
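
A toy case makes the definition concrete. The sketch below assumes the class of finite languages, learned from text, with a learner whose hypothesis is simply the set of strings seen so far; both the learner and the helper convergence_point are hypothetical names. On a finite prefix one can only observe where the hypothesis settles, which does not by itself prove convergence in the limit.

```python
def learner(prefix):
    """Hypothesis after a finite prefix of a text: exactly the strings seen."""
    return frozenset(prefix)


def convergence_point(presentation_prefix, target):
    """Smallest n0 such that learner(first n elements) == target for every
    n >= n0 within the given prefix, or None if there is no such n0."""
    n0 = None
    for n in range(len(presentation_prefix) + 1):
        if learner(presentation_prefix[:n]) == target:
            if n0 is None:
                n0 = n
        else:
            n0 = None
    return n0


target = frozenset({"a", "ab", "abb"})
text_prefix = ["a", "ab", "a", "abb", "ab", "a"]
print(convergence_point(text_prefix, target))
```

For this particular learner the hypothesis can only grow, so once it equals the target it stays there for any text of the target.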


Identification in the limit

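[Diagram relating the following objects]
ℒ: a class of languages
G: a class of grammars
Pres ⊆ ℕ → X: the presentations
L: the naming function, from grammars to languages
yields: from presentations to languages
a: a learner, from presentations to grammars

ϕ(ℕ) = ψ(ℕ) ⇒ yields(ϕ) = yields(ψ)
L(a(ϕ)) = yields(ϕ)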


Consistency

We say that the learning function a is consistent if ϕₙ is consistent with a(ϕₙ) for all n

A consistent learner is always consistent with the past


Conservatism

We say that the learning function a is conservative if, whenever ϕ(n+1) is consistent with a(ϕₙ), we have a(ϕₙ) = a(ϕₙ₊₁)

A conservative learner doesn’t change his mind needlessly
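
Consistency and conservatism can both be checked on a finite run. The sketch below assumes learning from text, with hypotheses represented directly as finite sets of strings, so that "consistent with" means "contained in". All names are hypothetical, and the second learner is deliberately artificial: it adds a fresh junk string at every step.

```python
def hypotheses(learner, text_prefix):
    """Hypotheses after 0, 1, ..., len(text_prefix) elements."""
    return [learner(text_prefix[:n]) for n in range(len(text_prefix) + 1)]


def is_consistent_run(learner, text_prefix):
    """Every string seen so far belongs to the current hypothesis."""
    return all(
        set(text_prefix[:n]) <= hyp
        for n, hyp in enumerate(hypotheses(learner, text_prefix))
    )


def is_conservative_run(learner, text_prefix):
    """The hypothesis is kept whenever the next string already belongs to it."""
    hyps = hypotheses(learner, text_prefix)
    return all(
        hyps[n] == hyps[n + 1]
        for n in range(len(text_prefix))
        if text_prefix[n] in hyps[n]
    )


def seen_so_far(prefix):
    return frozenset(prefix)


def restless(prefix):
    # Consistent, but changes its mind at every step.
    return frozenset(prefix) | {"x" * len(prefix)}


text_prefix = ["a", "a", "ab"]
print(is_consistent_run(seen_so_far, text_prefix), is_conservative_run(seen_so_far, text_prefix))
print(is_consistent_run(restless, text_prefix), is_conservative_run(restless, text_prefix))
```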


What about efficiency?

We can try to bound
    global time
    update time
    errors before converging (IPE)
    mind changes (MC)
    queries
    good examples needed


More precise definition of convergence

∃n ∈ ℕ such that ∀k ≥ n: L(a(ϕₖ)) = L(a(ϕₙ)) = yields(ϕ)

ϕₖ is the sequence of the first k elements of presentation ϕ


Resource bounded identification in the limit

Definitions of IPE, CS, MC, update time, etc…
What should we try to measure?
    The size of M ?
    The size of L ?
    The size of ϕ ?
    The size of ϕₙ ?


4.2 Probabilistic settings

PAC learning
Identification with probability 1
PAC learning distributions


Learning a language from sampling

We have a distribution over Σ*
We sample twice:
    Once to learn
    Once to see how well we have learned

The PAC setting


PAC-learning (Valiant 84, Pitt 89)

L a class of languages

M a class of machines

ε > 0 and δ > 0
m a maximal length over the strings
n a maximal size of machines


H is ε-AC (approximately correct)* if

Pr_D[H(x) ≠ G(x)] < ε
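
The probability Pr_D[H(x) ≠ G(x)] can be estimated by drawing a second sample from D and counting disagreements. The sketch below is only an illustration: the distribution, the target G (even number of a's) and the hypothesis H (even length) are all invented for the example, and the estimate is a sample frequency, not the exact probability.

```python
import random


def sample_string(rng, stop_probability=0.3):
    """Draw a string over {a, b}: stop with a fixed probability, else add a letter."""
    letters = []
    while rng.random() >= stop_probability:
        letters.append(rng.choice("ab"))
    return "".join(letters)


def target(word):
    """G: strings with an even number of a's (hypothetical target)."""
    return word.count("a") % 2 == 0


def hypothesis(word):
    """H: strings of even length (hypothetical, deliberately imperfect)."""
    return len(word) % 2 == 0


def estimated_error(h, g, rng, sample_size=10_000):
    """Fraction of sampled strings on which h and g disagree."""
    disagreements = 0
    for _ in range(sample_size):
        word = sample_string(rng)
        if h(word) != g(word):
            disagreements += 1
    return disagreements / sample_size


rng = random.Random(0)  # fixed seed so the run is repeatable
print(estimated_error(hypothesis, target, rng))
```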


[Figure: the languages L(G) and L(H)]

Errors: we want L₁(D(G), D(H)) < ε


(French radio)

Unless there is a surprise there should be no surprise
(after the last primary elections, on 3rd of June 2008)


Results

Using cryptographic assumptions, we cannot PAC-learn DFA
Cannot PAC-learn NFA, CFGs with membership queries either


Alternatively

Instead of learning classifiers in a probabilistic world, learn the distributions directly!
Learn probabilistic finite automata (deterministic or not)


No error

This calls for identification in the limit with probability 1
This means that the probability of not converging is 0


Results

If probabilities are computable, we can learn finite state automata with probability 1
But not with bounded (polynomial) resources
Or it becomes very tricky (with added information)


With error

PAC definition
But error should be measured by a distance between the target distribution and the hypothesis
L1, L2, L∞?
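
To make the three candidate distances concrete, here is a minimal Python sketch. It assumes, purely for illustration, that both distributions have finite support and are given as dictionaries mapping strings to probabilities; the function names are hypothetical.

```python
import math

def l1(p, q):
    """Sum of absolute differences over the union of the supports."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

def l2(p, q):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2
                         for w in set(p) | set(q)))

def linf(p, q):
    """Largest difference on any single string."""
    return max(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

# Two toy distributions that agree on the empty string only
target = {"": 0.5, "a": 0.5}
hypothesis = {"": 0.5, "b": 0.5}
```

On this toy pair the L∞ distance is half the L1 distance: L∞ only looks at the worst single string, whereas L1 accumulates every disagreement.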


Results

Too easy with L∞

Too hard with L1

Nice algorithms for biased classes of distributions


Conclusion

A number of paradigms to study identification of learning algorithms
Some to learn classifiers
Some to learn distributions


5 Learning from an informant

Algorithm RPNI
Regular Positive and Negative Grammatical Inference

Inferring regular languages in polynomial time. Jose Oncina & Pedro García. Pattern recognition and image analysis, 1992

http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/
Chapter 12


Motivation

We are given a set of strings S+ and a set of strings S-
Goal is to build a classifier
This is a traditional (or typical) machine learning question
How should we solve it?

[Figure: the samples inside Σ*]


Ideas

Use a distance between strings and try k-NN
Embed strings into vectors and use some off-the-shelf technique (decision trees, SVMs, other kernel methods)


Alternative

Suppose the classifier is some grammatical formalism
Thus we have L and Σ*\L

[Figure: the language L inside Σ*]


Obviously many possible candidates

Any grammar G such that S+ ⊆ L(G) and S- ∩ L(G) = ∅


Two types of final states

[Figure: automaton with states 1, 2, 3 and three transitions labeled a]

S+ = {λ, aaa}
S- = {aa, aaaaa}

1 is accepting
3 is rejecting
What about state 2?


What is determinism about?

[Figure: automaton with states 1, 2, 3, 4 and transitions labeled a. Merge 1 and 3? The merged automaton shown has states 1 and {2,4}. But…]


The prefix tree acceptor

The smallest tree-like DFA consistent with the data
Is a solution to the learning problem
Corresponds to a rote learner
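
A prefix tree acceptor can be sketched in a few lines of Python. The representation below (states numbered from 1 in length-lex order of the prefixes, transitions stored in a dictionary) and the function name `build_pta` are assumptions made for this illustration.

```python
def build_pta(positives, negatives=()):
    """Build a prefix tree acceptor.

    Returns (delta, accepting, rejecting) where delta maps (state, symbol)
    to a state. States are numbered from 1 in length-lex order of the
    prefixes they stand for; state 1 is the empty prefix.
    """
    words = list(positives) + list(negatives)
    prefixes = sorted({w[:i] for w in words for i in range(len(w) + 1)} | {""},
                      key=lambda u: (len(u), u))
    number = {u: i + 1 for i, u in enumerate(prefixes)}
    # One transition per non-empty prefix: from the prefix minus its last symbol
    delta = {(number[u[:-1]], u[-1]): number[u] for u in prefixes if u}
    accepting = {number[w] for w in positives}
    rejecting = {number[w] for w in negatives}
    return delta, accepting, rejecting

def n_states(delta):
    """Number of states of a tree-shaped automaton rooted in state 1."""
    return len({1} | set(delta.values()))

# λ is written as the empty string
S_plus = ["", "aaa", "aaba", "ababa", "bb", "bbaaa"]
S_minus = ["aa", "ab", "aaaa", "ba"]
pta_plus = build_pta(S_plus)
pta_full = build_pta(S_plus, S_minus)
```

With this sample, the tree built from S+ alone has 15 states and the tree built from S+ and S- has 17, one state per distinct prefix.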


From the sample to the PTA

[Figure: PTA(S+), a prefix tree with states 1 to 15 and transitions labeled a and b]

S+ = {λ, aaa, aaba, ababa, bb, bbaaa}
S- = {aa, ab, aaaa, ba}


From the sample to the PTA (full PTA)

[Figure: PTA(S+, S-), a prefix tree with states 1 to 17 and transitions labeled a and b]

S+ = {λ, aaa, aaba, ababa, bb, bbaaa}
S- = {aa, ab, aaaa, ba}


Red, Blue and White states

[Figure: automaton with transitions labeled a and b, states coloured Red, Blue and White]

Red states are confirmed states
Blue states are the (non Red) successors of the Red states
White states are the others


Merge and fold

[Figure: automaton with states 1 to 9 and transitions labeled a and b]

Suppose we want to merge state 3 with state 2


Merge and fold

[Figure: the same automaton, with the transition entering state 3 redirected to state 2]

First disconnect 3 and reconnect to 2


Merge and fold

[Figure: the same automaton, with state 3 disconnected and its subtree about to be folded]

Then fold subtree rooted in 3 into the DFA starting in 2


Merge and fold

[Figure: the automaton after folding, with remaining states 1, 2, 4, 6, 8, 9]

Then fold subtree rooted in 3 into the DFA starting in 2


RPNI is a state merging algorithm
RPNI identifies any regular language in the limit
RPNI works in polynomial time
RPNI admits polynomial characteristic sets


A = PTA(S+); Blue = {δ(qI,a): a∈Σ}; Red = {qI}
While Blue ≠ ∅ do
    choose q from Blue
    if ∃p∈Red: L(merge_and_fold(A,p,q)) ∩ S- = ∅
        then A = merge_and_fold(A,p,q)
        else Red = Red ∪ {q}
    Blue = {δ(q,a): q∈Red, a∈Σ} – Red
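
The loop above can be turned into a short runnable Python sketch. Several choices here are assumptions made for the illustration and are not fixed by the slides: states are numbered from 1 in length-lex order, "choose q from Blue" picks the smallest state number, Red states are tried in the order in which they were promoted, and the helper names are hypothetical.

```python
def build_pta(positives):
    """PTA(S+): states numbered from 1 in length-lex order of the prefixes."""
    prefixes = sorted({w[:i] for w in positives for i in range(len(w) + 1)} | {""},
                      key=lambda u: (len(u), u))
    number = {u: i + 1 for i, u in enumerate(prefixes)}
    delta = {(number[u[:-1]], u[-1]): number[u] for u in prefixes if u}
    return delta, {number[w] for w in positives}

def accepts(delta, accepting, w):
    """Run w from state 1; a missing transition means rejection."""
    q = 1
    for a in w:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in accepting

def merge_and_fold(delta, accepting, p, q, alphabet):
    """Merge Blue state q (root of a tree) into Red state p, on copies."""
    delta, accepting = dict(delta), set(accepting)
    for key, t in list(delta.items()):      # disconnect q and reconnect to p
        if t == q:
            delta[key] = p

    def fold(p, q):                         # fold the subtree rooted in q into p
        if q in accepting:
            accepting.add(p)
        for a in alphabet:
            t = delta.pop((q, a), None)
            if t is None:
                continue
            if (p, a) in delta:
                fold(delta[(p, a)], t)
            else:
                delta[(p, a)] = t

    fold(p, q)
    return delta, accepting

def rpni(positives, negatives, alphabet):
    delta, accepting = build_pta(positives)
    red = [1]
    while True:
        blue = sorted({delta[(r, a)] for r in red for a in alphabet
                       if (r, a) in delta} - set(red))
        if not blue:
            return delta, accepting, red
        q = blue[0]
        for p in red:
            d2, acc2 = merge_and_fold(delta, accepting, p, q, alphabet)
            if not any(accepts(d2, acc2, w) for w in negatives):
                delta, accepting = d2, acc2     # the merge is kept
                break
        else:
            red.append(q)                       # no merge possible: promote q

S_plus = ["", "aaa", "aaba", "ababa", "bb", "bbaaa"]
S_minus = ["aa", "ab", "aaaa", "ba"]
delta, accepting, red = rpni(S_plus, S_minus, "ab")
```

On this sample the sketch ends with the three Red states 1, 2 and 4, and the resulting DFA accepts exactly the strings whose number of a's is a multiple of 3.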


S+ = {λ, aaa, aaba, ababa, bb, bbaaa}
S- = {aa, ab, aaaa, ba}

[Figure: PTA(S+), with states 1 to 15 and transitions labeled a and b]


Try to merge 2 and 1

[Figure: PTA(S+), with states 1 to 15 and transitions labeled a and b]

S- = {aa, ab, aaaa, ba}


First merge, then fold

[Figure: the PTA (states 1 to 15) with the merge applied, before folding]

S- = {aa, ab, aaaa, ba}


But now string aaaa is accepted, so the merge must be rejected, and state 2 is promoted

[Figure: automaton after merging and folding, with states {1,2,4,7}, {3,5,8}, 6, {9,11}, 10, 12, 13, 14, 15]

S- = {aa, ab, aaaa, ba}


Try to merge 3 and 1

[Figure: PTA(S+), with states 1 to 15 and transitions labeled a and b]

S- = {aa, ab, aaaa, ba}


First merge, then fold

[Figure: the PTA (states 1 to 15) with the merge applied, before folding]

S- = {aa, ab, aaaa, ba}


No counter example is accepted so the merge is kept

[Figure: automaton with states {1,3,6}, {2,10}, {4,13}, 5, {7,15}, 8, 9, 11, 12, 14]

S- = {aa, ab, aaaa, ba}


Next possible merge to be checked is {4,13} with {1,3,6}

[Figure: automaton with states {1,3,6}, {2,10}, {4,13}, 5, {7,15}, 8, 9, 11, 12, 14]

S- = {aa, ab, aaaa, ba}


Merged. Needs folding subtree in {4,13} with {1,3,6}

[Figure: the same automaton, with {4,13} merged into {1,3,6} and its subtree still to be folded]

S- = {aa, ab, aaaa, ba}


[Figure: automaton after folding, with states {1,3,4,6,8,13}, {2,7,10,11,15}, 5, 9, 12, 14]

S- = {aa, ab, aaaa, ba}

But now aa is accepted


So we try {4,13} with {2,10}

[Figure: automaton with states {1,3,6}, {2,10}, {4,13}, 5, {7,15}, 8, 9, 11, 12, 14]

S- = {aa, ab, aaaa, ba}


Negative string aa is again accepted.
Since we have tried all Red states for merging, state 4 is promoted.

[Figure: automaton after merging and folding, with states {1,3,6}, {2,4,7,10,13,15}, {5,8}, {9,11}, 12, 14]

S- = {aa, ab, aaaa, ba}


So we try 5 with {1,3,6}

[Figure: automaton with states {1,3,6}, {2,10}, {4,13}, 5, {7,15}, 8, 9, 11, 12, 14]

S- = {aa, ab, aaaa, ba}


But again we accept ab

[Figure: automaton after merging and folding, with states {1,3,5,6,12}, {2,9,10,14}, {4,13}, {7,15}, 8, 11]

S- = {aa, ab, aaaa, ba}


So we try 5 with {2,10}

[Figure: automaton with states {1,3,6}, {2,10}, {4,13}, 5, {7,15}, 8, 9, 11, 12, 14]

S- = {aa, ab, aaaa, ba}


Which is OK. So next possible merge is {7,15} with {1,3,6}

[Figure: automaton with states {1,3,6}, {2,5,10}, {4,9,13}, {7,15}, {8,12}, {11,14}]

S- = {aa, ab, aaaa, ba}


Which is OK. Now try to merge {8,12} with {1,3,6,7,15}

[Figure: automaton with states {1,3,6,7,15}, {2,5,10}, {4,9,13}, {8,12}, {11,14}]

S- = {aa, ab, aaaa, ba}


And ab is accepted

[Figure: automaton after merging and folding, with states {1,3,6,7,8,12,15}, {2,5,10,11,14}, {4,9,13}]

S- = {aa, ab, aaaa, ba}


Now try to merge {8,12} with {4,9,13}

[Figure: automaton with states {1,3,6,7,15}, {2,5,10}, {4,9,13}, {8,12}, {11,14}]

S- = {aa, ab, aaaa, ba}


This is OK and no more merge is possible so the algorithm halts

[Figure: final automaton, with states {1,3,6,7,11,14,15}, {2,5,10}, {4,8,9,12,13}]

S- = {aa, ab, aaaa, ba}


A characteristic sample

A sample is characteristic (for RPNI) if, whenever it is included in the learning sample, the algorithm returns the correct DFA
Particularity: the characteristic sample is of polynomial size
There is an algorithm which, given a DFA, builds a characteristic sample


About characteristic samples

If you add more strings to a characteristic sample it is still characteristic
There can be many different characteristic samples (EDSM, tree version,…)
Change the ordering (or the exploring function in RPNI) and the characteristic sample will change


Exercises

Run RPNI on
S+ = {a, bba, bab, aabb}
S- = {b, ab, baa, baabb}

Find a characteristic sample for:

[Figure: target automaton with transitions labeled a and b]


Open problems

RPNI’s complexity is not a tight upper bound. Find the correct complexity
The definition of the characteristic sample is not tight either. Find a better definition


Conclusion

RPNI identifies any regular language in the limit
RPNI works in polynomial time
There are many significant variants of RPNI
Parallel version can be efficient
RPNI can be extended to other classes of grammars


6 Learning from text

Only positive examples are available
Danger of over-generalization: why not return Σ*?
The problem is “basic”:

Negative examples might not be available
Or they might be heavily biased: near-misses, absurd examples…

Base line: all the rest is learning with help


GI as a search problem

[Figure: search space showing Σ, the PTA and an unknown point marked ?]


The theory

Gold 67: No super-finite class can be identified from positive examples (or text) only

Necessary and sufficient conditions for learning

Literature: inductive inference, ALT series, …


Limit point

A class 𝓛 of languages has a limit point iff there exists an infinite sequence (Ln)n∈ℕ of languages in 𝓛 such that L0 ⊂ L1 ⊂ … ⊂ Ln ⊂ …, and there exists another language L ∈ 𝓛 such that L = ∪n∈ℕ Ln

L is called a limit point of 𝓛


L is a limit point

[Figure: nested languages L0, L1, L2, L3, …, Li, …, all contained in L]


Theorem

If 𝓛 admits a limit point, then 𝓛 is not learnable from text

Proof: Let s^i be a presentation in length-lex order for Li, and s be a presentation in length-lex order for L. Then ∀n∈ℕ ∃i / ∀k≤n, s^i_k = s_k

Note: having a limit point is a sufficient condition for non learnability; not a necessary condition


Mincons classes

A class is mincons if there is an algorithm which, given a sample S, builds a G ∈ 𝓖 such that S ⊆ L ⊆ L(G) ⇒ L = L(G)


Existence of an accumulation point (Kapur 91)

A class 𝓛 of languages has an accumulation point iff there exists an infinite sequence (Sn)n∈ℕ of sets such that S0 ⊆ S1 ⊆ … ⊆ Sn ⊆ …, and L = ∪n∈ℕ Sn ∈ 𝓛, and for any n∈ℕ there exists a language Ln’ in 𝓛 such that Sn ⊆ Ln’ ⊂ L. The language L is called an accumulation point of 𝓛


L is an accumulation point

[Figure: nested sets S0, S1, S2, S3, …, Sn inside a language Ln’, itself inside L]


Theorem (for Mincons classes)

𝓛 admits an accumulation point iff 𝓛 is not learnable from text


Infinite Elasticity

If a class of languages has a limit point, there exists an infinite ascending chain of languages L0 ⊂ L1 ⊂ … ⊂ Ln ⊂ …
This property is called infinite elasticity


Infinite Elasticity

[Figure: sequence x0, x1, x2, x3, …, xi, xi+1, xi+2, xi+3, xi+4]


Finite elasticity

𝓛 has finite elasticity if it does not have infinite elasticity


Theorem (Wright)

If 𝓛(𝓖) has finite elasticity and is mincons, then 𝓖 is learnable.


Tell tale sets

[Figure: L(G) with its tell-tale set TG and strings x1, x2, x3, x4; a language L(G’) lying between TG and L(G) is marked Forbidden]


Theorem (Angluin)

𝓖 is learnable iff there is a computable partial function ψ: 𝓖×ℕ→Σ* such that:

1) ∀n∈ℕ, ψ(G,n) is defined iff G∈𝓖 and L(G)≠∅;
2) ∀G∈𝓖, TG = {ψ(G,n): n∈ℕ} is a finite subset of L(G) called a tell-tale subset;
3) ∀G,G’∈𝓖, if TG ⊆ L(G’) then L(G’) ⊄ L(G).


Proposition (Kapur 91)

A language L in 𝓛 has a tell-tale subset iff L is not an accumulation point.

(for mincons)


7 Learning by observing

Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. García & Vidal et al. 1990


Definition

Let k≥0, a k-testable language in the strict sense (k-TSS) is defined by a 5-tuple Zk = (Σ, I, F, T, C) with:

Σ a finite alphabet
I, F ⊆ Σ^(k-1) (allowed prefixes of length k-1 and suffixes of length k-1)
T ⊆ Σ^k (allowed segments)
C ⊆ Σ^(<k) contains all strings of the language of length less than k

Note that I∩F = C∩Σ^(k-1)


The k-testable language is L(Zk) = (IΣ* ∩ Σ*F − Σ*(Σ^k − T)Σ*) ∪ C

Strings (of length at least k) have to use a good prefix and a good suffix of length k-1, and all substrings of length k have to belong to T. Strings of length less than k should be in C
Or: Σ^k − T defines the prohibited segments
Key idea: use a window of size k
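
The membership condition reads directly as a sliding-window test. The Python sketch below is an illustration only: the function name `ktss_accepts` is hypothetical, and the sets used in the example (k=2, I={a}, F={a}, T={aa, ab, ba}, C={λ, a}) are simply one small 2-testable machine, with λ written as the empty string.

```python
def ktss_accepts(w, k, I, F, T, C):
    """Decide whether w belongs to L(Zk) for Zk = (Σ, I, F, T, C)."""
    if len(w) < k:
        # Short strings are accepted only if listed in C
        return w in C
    prefix_ok = w[:k - 1] in I
    suffix_ok = w[len(w) - k + 1:] in F
    # Slide a window of size k over w: every segment must be allowed
    segments_ok = all(w[i:i + k] in T for i in range(len(w) - k + 1))
    return prefix_ok and suffix_ok and segments_ok

k = 2
I, F = {"a"}, {"a"}
T = {"aa", "ab", "ba"}
C = {"", "a"}
```

With these sets, aba is accepted, abba is rejected because the segment bb is prohibited, and ab is rejected because b is not an allowed suffix.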


An example (2-testable)

I = {a}
F = {a}
T = {aa, ab, ba}
C = {λ, a}

[Figure: the corresponding automaton, with initial state λ and transitions labeled a and b]


Window language

By sliding a window of size 2 over a string we can parse

ababaaababababaaaab OK
aaabbaaaababab not OK


The hierarchy of k-TSS languages

k-TSS(Σ) = {L ⊆ Σ*: L is k-TSS}
All finite languages are in k-TSS(Σ) if k is large enough!
k-TSS(Σ) ⊂ [k+1]-TSS(Σ)
(ba^k)* ∈ [k+1]-TSS(Σ)
(ba^k)* ∉ k-TSS(Σ)


A language that is not k-testable

[Figure: automaton with initial state λ and transitions labeled a and b]


K-TSS inference

Given a sample S, ak-TSS(S) = L(Zk) where Zk = (Σ(S), I(S), F(S), T(S), C(S)) and

Σ(S) is the alphabet used in S
C(S) = Σ(S)^(<k) ∩ S
I(S) = Σ(S)^(k-1) ∩ Pref(S)
F(S) = Σ(S)^(k-1) ∩ Suff(S)
T(S) = Σ(S)^k ∩ {v: uvw∈S}
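
These five definitions translate almost line for line into Python. The sketch below assumes k ≥ 1 and uses a hypothetical function name `infer_ktss`; it returns the sets only and does not build the automaton.

```python
def infer_ktss(sample, k):
    """Compute (Σ(S), I(S), F(S), T(S), C(S)) from a sample S, assuming k >= 1."""
    sigma = {a for w in sample for a in w}
    # C(S): the strings of S that are shorter than k
    C = {w for w in sample if len(w) < k}
    # I(S) and F(S): prefixes and suffixes of length k-1 seen in S
    I = {w[:k - 1] for w in sample if len(w) >= k - 1}
    F = {w[len(w) - k + 1:] for w in sample if len(w) >= k - 1}
    # T(S): every segment of length k seen in S
    T = {w[i:i + k] for w in sample for i in range(len(w) - k + 1)}
    return sigma, I, F, T, C

S = ["a", "aa", "abba", "abbbba"]
sigma, I, F, T, C = infer_ktss(S, 3)
```

For this sample and k=3 the sketch yields I = {aa, ab}, F = {aa, ba}, T = {abb, bbb, bba} and C = {a, aa}.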


Example

S = {a, aa, abba, abbbba}
Let k=3
Σ(S) = {a, b}
I(S) = {aa, ab}
F(S) = {aa, ba}
C(S) = {a, aa}
T(S) = {abb, bbb, bba}

Hence ak-TSS(S) = a + aa + abbb*a (aba is excluded, since the segment aba is not in T(S))


Building the corresponding automaton

Each string in I∪C and PREF(I∪C) is a state
Each substring of length k-1 of strings in T is a state
λ is the initial state
Add a transition labeled b from u to ub for each state ub
Add a transition labeled b from au to ub for each aub in T
Each state/substring that is in F is a final state
Each state/substring that is in C is a final state
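
The construction steps above can be sketched in Python as follows. The names `build_ktss_dfa` and `run` are hypothetical, states are represented by the strings themselves (λ is the empty string), and the first transition rule is applied to the prefix states only, which is an interpretation of the rule as stated.

```python
def build_ktss_dfa(I, F, T, C):
    """Build (delta, finals) from the sets I, F, T, C of a k-TSS machine."""
    # Tree part: every prefix of a string in I ∪ C is a state
    tree = {u[:i] for u in I | C for i in range(len(u) + 1)} | {""}
    delta = {}
    for u in tree:
        if u:
            delta[(u[:-1], u[-1])] = u      # transition labeled b from u to ub
    for v in T:
        delta[(v[:-1], v[-1])] = v[1:]      # transition labeled b from au to ub
    finals = set(F) | set(C)
    return delta, finals

def run(dfa, w):
    """Parse w from the initial state λ; a missing transition rejects."""
    delta, finals = dfa
    q = ""
    for a in w:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in finals

# Sets obtained for S = {a, aa, abba, abbbba} with k = 3
dfa = build_ktss_dfa(I={"aa", "ab"}, F={"aa", "ba"},
                     T={"abb", "bbb", "bba"}, C={"a", "aa"})
```

The resulting automaton accepts every string of the sample and generalises to strings such as abbba, while aba is rejected because no transition labeled a leaves state ab.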


Running the algorithm

S = {a, aa, abba, abbbba}
I = {aa, ab}
F = {aa, ba}
T = {abb, bbb, bba}
C = {a, aa}

[Figure: the automaton built from I, F, T and C, with states a, aa, ab, bb, ba and transitions labeled a and b]


Properties (1)

S ⊆ ak-TSS(S)

ak-TSS(S) is the smallest k-TSS language that contains S

If there is a smaller one, some prefix, suffix or substring has to be absent


Properties (2)

ak-TSS identifies any k-TSS language in the limit from polynomial data

Once all the prefixes, suffixes and substrings have been seen, the correct automaton is returned

If Y⊆S, ak-TSS(Y) ⊆ ak-TSS(S)


Properties (3)

ak+1-TSS(S) ⊆ ak-TSS(S)
In Ik+1 (resp. Fk+1 and Tk+1) there are fewer allowed prefixes (resp. suffixes or substrings) than in Ik (resp. Fk and Tk)

∀k > max{|x|: x∈S}, ak-TSS(S) = S
Because for a large k, Tk(S) = ∅


Extensions

These languages have been studied and adapted to:

Local languages
N-grams
Tree languages


8 Learning actively

Learning regular sets from queries and counter-examples, D. Angluin, Information and computation, 75, 87-106, 1987
Queries and Concept learning, D. Angluin, Machine Learning, 2, 319-342, 1988
Negative results for Equivalence Queries, D. Angluin, Machine Learning, 5, 121-150, 1990

8.1 About learning with queries

Ideas:
- define a credible learning model
- make use of additional information that can be measured
- explain thus the difficulty of learning certain classes

The Oracle

- knows the language and has to answer correctly
- no probabilities
- worst-case policy: the Oracle does not want to help

Some queries

- membership queries
- equivalence queries (weak)
- equivalence queries (strong)
- inclusion queries

Membership queries.

Query: a string x
Answer: Yes if x∈L, No otherwise

L is the target language

Equivalence (weak) queries.

Query: a hypothesis h
Answer: Yes if L ≡ L(h); No if ∃x∈Σ*: x ∈ L(h)⊕L

A⊕B is the symmetric difference

Equivalence (strong) queries.

Query: a hypothesis h
Answer: Yes if L ≡ L(h); if not, a string x∈Σ* such that x ∈ L(h)⊕L

Subset queries.

Query: a hypothesis h
Answer: Yes if L(h) ⊆ L; if not, a string x∈Σ* such that x∈L(h) ∧ x∉L
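
A possible way to simulate the four kinds of queries, assuming the target and the hypotheses are complete DFAs over the same alphabet. The `DFA` and `Oracle` classes and their method names are hypothetical. This oracle returns a shortest counter-example, which is friendlier than the worst-case Oracle described above.

```python
from collections import deque

class DFA:
    def __init__(self, alphabet, delta, start, finals):
        self.alphabet, self.delta, self.start, self.finals = alphabet, delta, start, finals

    def accepts(self, w):
        q = self.start
        for c in w:
            q = self.delta[(q, c)]
        return q in self.finals

class Oracle:
    def __init__(self, target):
        self.target = target                      # DFA for the target language L

    def membership(self, x):
        return self.target.accepts(x)             # is x in L?

    def _witness(self, h, bad):
        """Breadth-first search of the product automaton for a shortest string
        whose pair (in L(h), in L) satisfies bad; None if there is none."""
        t = self.target
        start = (h.start, t.start)
        seen, queue = {start}, deque([(start, "")])
        while queue:
            (p, q), w = queue.popleft()
            if bad(p in h.finals, q in t.finals):
                return w
            for c in t.alphabet:
                nxt = (h.delta[(p, c)], t.delta[(q, c)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, w + c))
        return None

    def weak_equivalence(self, h):                # Yes / No only
        return self._witness(h, lambda in_h, in_l: in_h != in_l) is None

    def strong_equivalence(self, h):              # None means Yes, else a counter-example
        return self._witness(h, lambda in_h, in_l: in_h != in_l)

    def subset(self, h):                          # None means L(h) ⊆ L, else x in L(h) \ L
        return self._witness(h, lambda in_h, in_l: in_h and not in_l)

# Hypothetical target: strings over {a, b} with an even number of a's.
target = DFA("ab", {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
everything = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, 0, {0})
nothing = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, 0, set())
oracle = Oracle(target)
print(oracle.membership("aba"), oracle.strong_equivalence(everything), oracle.subset(nothing))
```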

Correct learning

A class C is identifiable with a polynomial number of queries of type T if there exists an algorithm a that:

1) ∀L∈C identifies L with a polynomial number of queries of type T

2) does each update in time polynomial in |f| and in Σ|xi|, where {xi} are the counter-examples seen so far

8.2 The Minimal Adequate Teacher

You are allowed:
- strong equivalence queries
- membership queries

General idea of L*

- find a consistent table (representing a DFA)
- submit it as an equivalence query
- use counterexample to update the table
- submit membership queries to make the table complete
- iterate

8.3 An observation table

        λ    a
  λ     1    0
  a     0    0
  -------------
  b     1    0
  aa    0    0
  ab    1    0

The states (S): rows λ and a

The transitions (T): rows b, aa and ab

The experiments (E): columns λ and a

Meaning

δ(qI, λ.λ) ∈ F ⇔ λ ∈ L

δ(qI, ab.a) ∉ F ⇔ aba ∉ L

Equivalent prefixes

These two rows (λ and ab) are equal, hence δ(qI, λ) = δ(qI, ab)

Building a DFA from a table

[Figure: the observation table, with the first states of the DFA being built from it]

[Figure: the observation table and the resulting DFA, with states λ and a and transitions labelled a and b]

Some rules

- The set of states (Red) is prefix-closed
- The set of experiments (E) is suffix-closed
- RedΣ \ Red = Blue

An incomplete table

[Figure: the observation table with some entries missing, next to the DFA]

Good idea

We can complete the table by submitting membership queries...

Membership query: for the cell in row u and column v, ask "uv∈L?"
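
Completing the table could look like the sketch below. It assumes the table is stored as a dictionary indexed by the concatenated string uv, so that a string shared by several cells is queried only once. The function `fill_table` and the target language are illustrative assumptions.

```python
def fill_table(red, blue, exps, member, table=None):
    """Ask a membership query for every cell uv that is still unknown."""
    table = dict(table or {})
    for u in list(red) + list(blue):
        for v in exps:
            if u + v not in table:
                table[u + v] = member(u + v)   # membership query "uv in L?"
    return table

# Hypothetical target: strings over {a, b} with an even number of a's.
def member(w):
    return w.count("a") % 2 == 0

red, blue, exps = [""], ["a", "b"], ["", "a"]   # "" stands for λ
table = fill_table(red, blue, exps, member)
print(table)
```

Six cells are filled with five queries here, because the cells (λ, a) and (a, λ) both correspond to the string a.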

A table is closed if any row of Blue corresponds to some row in Red

[Figure: an observation table in which a row of Blue matches no row of Red]

Not closed

And a table that is not closed

[Figure: the non-closed table and the DFA under construction; one transition has no target state (marked "?")]

What do we do when we have a table that is not closed?

- Let s be the row (of Blue) that does not appear in Red
- Add s to Red, and ∀a∈Σ add sa to Blue
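
A minimal sketch of the closedness test and of the repair step, assuming the table is a dictionary from strings to membership values and Red, Blue and the experiments are lists of strings. The helper names and the example table are hypothetical.

```python
def row(s, exps, table):
    return tuple(table[s + e] for e in exps)

def find_unclosed(red, blue, exps, table):
    """Return a Blue string whose row matches no Red row, or None if the table is closed."""
    red_rows = {row(s, exps, table) for s in red}
    for s in blue:
        if row(s, exps, table) not in red_rows:
            return s
    return None

def promote(s, red, blue, alphabet):
    """Add s to Red and every sa (a in the alphabet) to Blue."""
    red = red + [s]
    blue = [t for t in blue if t != s]
    for a in alphabet:
        if s + a not in red and s + a not in blue:
            blue.append(s + a)
    return red, blue

# Hypothetical table for "even number of a's", with the single experiment λ.
red, blue, exps = [""], ["a", "b"], [""]
table = {"": True, "a": False, "b": True}
s = find_unclosed(red, blue, exps, table)
new_red, new_blue = promote(s, red, blue, "ab")
print(s, new_red, new_blue)
```

The new Blue rows (here aa and ab) are still empty after this step and have to be completed with membership queries.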

An inconsistent table

[Figure: an observation table with Red rows λ, a and b, and Blue rows aa, ab, ba and bb, over the experiments λ and a]

Are a and b equivalent?

A table is consistent if every equivalent pair of rows in Red remains equivalent in Red ∪ Blue after appending any symbol:

row(s1)=row(s2) ⇒ ∀a∈Σ, row(s1a)=row(s2a)

What do we do when we have an inconsistent table?

- Let a∈Σ be such that row(s1)=row(s2) but row(s1a)≠row(s2a)
- If row(s1a)≠row(s2a), it is so for some experiment e
- Then add experiment ae to the table
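
The consistency test can be sketched as follows, again assuming a dictionary-based table. The function name `find_inconsistency` and the target language (strings ending in aa) are illustrative assumptions.

```python
def row(s, exps, table):
    return tuple(table[s + e] for e in exps)

def find_inconsistency(red, exps, table, alphabet):
    """Return a new experiment ae separating two equal Red rows, or None if consistent."""
    for i, s1 in enumerate(red):
        for s2 in red[i + 1:]:
            if row(s1, exps, table) != row(s2, exps, table):
                continue
            for a in alphabet:
                for e in exps:
                    if table[s1 + a + e] != table[s2 + a + e]:
                        return a + e
    return None

# Hypothetical target: strings ending in aa; single experiment λ.
member = lambda w: w.endswith("aa")
red, blue, exps = ["", "a"], ["b", "aa", "ab"], [""]
table = {s + e: member(s + e) for s in red + blue for e in exps}
new_exp = find_inconsistency(red, exps, table, "ab")
print(repr(new_exp))
```

Here row(λ) = row(a), but appending a gives rows that differ on experiment λ, so the experiment a is added. With the columns λ and a, the rows of λ and a become (0, 0) and (0, 1) and the two states are separated.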

What do we do when we have a closed and consistent table?

We build the corresponding DFA

We make an equivalence query!!!
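
Building the DFA from a closed and consistent table might be written as below: one state per distinct Red row, transitions read off the rows of the successors, and final states read off the column λ. The function names and the target language (λ or any string ending in b) are assumptions made for the example.

```python
def row(s, exps, table):
    return tuple(table[s + e] for e in exps)

def build_dfa(red, exps, table, alphabet):
    """Assumes a closed and consistent table, with "" (λ) in red and in exps."""
    rep = {}
    for s in red:
        rep.setdefault(row(s, exps, table), s)    # one state per distinct Red row
    states = list(rep.values())
    delta = {(q, a): rep[row(q + a, exps, table)] for q in states for a in alphabet}
    finals = {q for q in states if table[q]}      # entry in column λ is 1
    start = rep[row("", exps, table)]
    return start, delta, finals

def accepts(dfa, w):
    start, delta, finals = dfa
    q = start
    for c in w:
        q = delta[(q, c)]
    return q in finals

# Hypothetical target: λ, or any string ending in b.
member = lambda w: w == "" or w.endswith("b")
red, blue, exps = ["", "a"], ["b", "aa", "ab"], ["", "a"]
table = {s + e: member(s + e) for s in red + blue for e in exps}
dfa = build_dfa(red, exps, table, "ab")
print(dfa)
```

The resulting hypothesis is what gets submitted to the Oracle as an equivalence query.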

What do we do if we get a counter-example?

Let u be this counter-example

∀w∈Pref(u) do
    add w to Red
    ∀a∈Σ such that wa∉Red, add wa to Blue
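
The update above can be sketched in a few lines, assuming Red and Blue are kept as lists of strings. The function name is hypothetical; the counter-example baa is the one used in the run of Section 8.4.

```python
def add_counterexample(u, red, blue, alphabet):
    """Add every prefix of u to Red, then every missing successor wa to Blue."""
    red = list(red)
    for i in range(len(u) + 1):
        w = u[:i]
        if w not in red:
            red.append(w)
    blue = [s for s in blue if s not in red]      # strings promoted to Red leave Blue
    for w in red:
        for a in alphabet:
            if w + a not in red and w + a not in blue:
                blue.append(w + a)
    return red, blue

red, blue = add_counterexample("baa", [""], ["a", "b"], "ab")
print(red, blue)
```

The new rows then have to be completed with membership queries before closedness and consistency are checked again.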

8.4 Run of the algorithm

        λ
  λ     1
  --------
  a     1
  b     1

Table is now closed and consistent

[Figure: the corresponding DFA, a single state λ with transitions a and b]

An equivalence query is made!

[Figure: the one-state DFA submitted as hypothesis]

Counter-example baa is returned

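[Figure: the observation table after adding the prefixes of baa to Red: Red rows λ, b, ba and baa; Blue rows a, bb, bab, baaa and baab; single experiment λ]

Not consistent: two equal rows of Red no longer agree after appending a symbol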

[Figure: the observation table with the experiments λ and a; Red rows λ, b, ba and baa; Blue rows a, bb, bab, baaa and baab]

Table is now closed and consistent

[Figure: the corresponding DFA, with states λ, ba and baa]

Polynomial

With n the number of states of the target DFA:
- |E| ≤ n
- at most n-1 equivalence queries
- |membership queries| ≤ n(n-1)m, where m is the length of the longest counter-example returned by the oracle
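
To see the whole loop at work, here is a compact sketch of L* following the steps of this section. Two assumptions: the equivalence query is simulated by comparing the hypothesis with the target on all strings up to a fixed length (a real Oracle would answer exactly), and the target language is a made-up example. All names are illustrative.

```python
from itertools import product

def lstar(alphabet, member, equivalent):
    red, blue, exps, table = [""], list(alphabet), [""], {}
    eq_queries = 0

    def row(s):
        return tuple(table[s + e] for e in exps)

    def add_red(s):
        if s not in red:
            red.append(s)
        if s in blue:
            blue.remove(s)
        for a in alphabet:
            if s + a not in red and s + a not in blue:
                blue.append(s + a)

    while True:
        for s in red + blue:                      # complete the table
            for e in exps:
                if s + e not in table:
                    table[s + e] = member(s + e)
        red_rows = {row(s) for s in red}
        unclosed = next((s for s in blue if row(s) not in red_rows), None)
        if unclosed is not None:                  # not closed: promote the row
            add_red(unclosed)
            continue
        new_exp = next((a + e for s1 in red for s2 in red if row(s1) == row(s2)
                        for a in alphabet for e in exps
                        if table[s1 + a + e] != table[s2 + a + e]), None)
        if new_exp is not None:                   # not consistent: add experiment ae
            exps.append(new_exp)
            continue
        rep = {}                                  # closed and consistent: build the DFA
        for s in red:
            rep.setdefault(row(s), s)
        delta = {(q, a): rep[row(q + a)] for q in rep.values() for a in alphabet}
        finals = {q for q in rep.values() if table[q]}

        def accepts(w, delta=delta, finals=finals, start=rep[row("")]):
            q = start
            for c in w:
                q = delta[(q, c)]
            return q in finals

        eq_queries += 1
        cex = equivalent(accepts)                 # equivalence query
        if cex is None:
            return accepts, len(rep), eq_queries
        for i in range(len(cex) + 1):             # add all prefixes of the counter-example
            add_red(cex[:i])

def bounded_equivalence(member, alphabet, max_len):
    """Approximate equivalence oracle: shortest disagreement up to max_len, or None."""
    def equivalent(accepts):
        for n in range(max_len + 1):
            for p in product(alphabet, repeat=n):
                w = "".join(p)
                if accepts(w) != member(w):
                    return w
        return None
    return equivalent

member = lambda w: w.count("a") % 3 == 0          # hypothetical target language
accepts, n_states, eq_queries = lstar("ab", member, bounded_equivalence(member, "ab", 8))
print(n_states, eq_queries)
```

On this example the first hypothesis has two states, the counter-example aaa is returned, and the second hypothesis has the three states of the target.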

Conclusion (1)

- With an MAT you can learn DFA
- but also a variety of other classes of grammars
- it is difficult to see how powerful an MAT really is
- probably as much as PAC learning
- Easy to find a class, a set of queries and provide an algorithm that learns with them
- more difficult for it to be meaningful

Discussion: why are these queries meaningful?

Conclusion (2)

- Active learning is an exciting topic, and good strategies for choosing the queries are still largely unexplored
- Zulu competition can be a great opportunity to start research in this area
- http://cian.univ-st-etienne.fr/zulu/

9 Extensions (PFA, transducers, tree automata)

Theory, algorithms and applications have extended to:

- Transducers
- Probabilistic finite automata
- Context free grammars (with special interest in linear grammars)
- String kernels
- Regular expressions
- Patterns

Main results for learning PFA

There are now several DPFA learning algorithms

- ALERGIA (Carrasco & Oncina 94)
- DSAI (Ron et al. 94)
- MDI (Thollard et al. 99)
- DEES (Denis et al. 05) [also PFA]

Main results for learning transducers

- One basic algorithm: OSTIA (Oncina et al. 93)
- State merging algorithm, based on a normal form for subsequential transducers

10 Conclusions

Why should one pick up grammatical inference as a research topic?
- Nice community
- Broad field
- Can use ideas from algorithmics, formal language theory, combinatorics, statistics, machine learning, natural language processing, bio-informatics, pattern recognition…
- Theory and applications

Open problems

- C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348, 2005
- C. de la Higuera. Ten open problems in grammatical inference. In proceedings of ICGI 2006, pages 32–44

Some addresses to start working
- http://pages-perso.univ-nantes.fr/~cdlh/
- http://videolectures.net/colin_de_la_higuera/
- http://cian.univ-st-etienne.fr/zulu/

Grammatical Inference: Learning Automata and Grammars, Colin de la Higuera, Cambridge University Press