Interconnecting lexicographic resources. In search for a model
Dan Cristea
“Alexandru Ioan Cuza” University of Iași
Institute of Computer Science of the Romanian Academy
Topics
• Why would one want to connect linguistic resources?
• Parameterising the needs
• Standardisation helps interconnecting
• A bunch of notorious resources
• How would this work?
• Final remarks
COST-ENeL, Bled, 29-30 September 2014
Why would one want to connect linguistic resources?
• Use case 1: 100 Romanian dictionaries aligned
– CLRE, the Essential Romanian Lexicographical Corpus: 100 dictionaries aligned at entry and, partially, sense levels (2010–2013, at the “A. Philippide” Institute of the Romanian Academy, in Iași)
• dictionaries’ list at: http://85.122.23.90/resurse/Lista-dictionarelor.doc
• written in 3 types of alphabets: Cyrillic, transition and Latin
• large diversity of formatting styles
Bucharest, 14-15 December 2012 COST-ENeL, Bled, 29-30 September 2014
CLRE: Essential Romanian Lexicographical Corpus
The CLRE project
• Sample pages shown from: ISER, Petri, DN II, Bălăşescu, Lexicon Militar, Dicționar de informatică (page images not reproduced)
Processing in CLRE
• Scanning
• OCR – ABBYY FineReader 9
• Parsing entries => XML
• Manual verification
• Indexing and alignment
Iași, 25-26 September 2013 COST-ENeL, Bled, 29-30 September 2014
CLRE – manual verification
Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
– a WN synset: $pos\ (def, ex, w_1^{s_1} \dots w_k^{s_k} \dots w_n^{s_n})$
– an explanatory dictionary entry: $w_k, pos, \langle w_k^{s_1}, def_1, ex_1\rangle \dots \langle w_k^{s_k}, def_k, ex_k\rangle \dots \langle w_k^{s_m}, def_m, ex_m\rangle$
Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
– synsets of a word $w_k$, pos:
$(def_1, ex_1, \dots w_k^{s_1} \dots)$
…
$(def_k, ex_k, \dots w_k^{s_k} \dots)$
…
$(def_m, ex_m, \dots w_k^{s_m} \dots)$
– the explanatory dictionary entry of the word $w_k$, pos: $w_k, pos, \langle w_k^{s_1}, def_1, ex_1\rangle \dots \langle w_k^{s_k}, def_k, ex_k\rangle \dots \langle w_k^{s_m}, def_m, ex_m\rangle$
Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
– a WN synset: $pos\ (def, ex, w_1^{s_1} \dots w_k^{s_k} \dots w_n^{s_n})$
– explanatory dictionary entries:
$w_1, pos, \dots \langle w_1^{s_1}, def, ex\rangle \dots$
…
$w_k, pos, \dots \langle w_k^{s_k}, def, ex\rangle \dots$
…
$w_n, pos, \dots \langle w_n^{s_n}, def, ex\rangle \dots$
Why would one want to connect linguistic resources?
• Use case 3: the TOT (tip-of-the-tongue) problem, or the forgotten word (Michael Zock)
The first thought: standardisation
• Lexical Markup Framework (LMF)
– What is it?
• a common model for creation and use of lexical resources
– With what goal?
• to manage the exchange of data between and among these resources
• to enable the merging of a large number of individual electronic resources to form extensive global electronic resources
Near-standard
• Text Encoding Initiative (TEI)
– What is it?
• an inventory of the features most often deployed for computer-based text processing
• recommendations about suitable ways of representing these features
– With what goal?
• to facilitate processing by computer programs
• to facilitate the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software
Standardisation
• Text Encoding Initiative (TEI)
– Example of a dictionary entry serialisation (from TEI Guidelines)
disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. CED
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <sense n="1">
    <def>facts that disprove something.</def>
  </sense>
  <sense n="2">
    <def>the act of disproving.</def>
  </sense>
</entry>
LMF and TEI content and… discontent
• The TEI format may be used as an interchange format, permitting sharing of resources even when their local encoding schemes differ.
• Both LMF and TEI model lexical material in deep representational detail…
LMF and TEI content and… discontent
• TEI intention:
– guidance for individual or local practice in text creation and data capture
– support of data interchange
– support of application-independent local processing
• Opening good possibilities of querying
• But how would the interconnection function?...
Parameterising the needs
• If I want to connect two resources, simply merge the contents
• Then be able to interrogate the merged resource by taking advantage of peculiarities in each resource
Parameterising the needs
• Able to represent variations in word forms, alternate orthography, diachronic morphology
• Easy navigation by applying various filtering criteria
Parameterising the needs
• Very often lexicographic data is hierarchical
– for instance, a sense of a dictionary entry contains a definition, examples, but also sub-senses
• Organise even recursive searches
– give me the definition neighbouring sphere of depth 2 of the word captain (take all senses of the entry captain and form the list of words in the corresponding definitions, then for each of them take all their senses and collect again words in their definitions)
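The recursive search for the word captain could be sketched as follows. This is only an illustration: the toy dictionary, its definitions, and the names `TOY_DICT`, `definition_words` and `neighbourhood` are invented here, and no lemmatisation or stop-word filtering is attempted.

```python
import re

# Toy dictionary: lemma -> list of sense definitions (invented data).
TOY_DICT = {
    "captain": ["the officer in command of a ship", "the leader of a team"],
    "officer": ["a person holding a position of command"],
    "ship": ["a large vessel for sea transport"],
    "leader": ["a person who leads a group"],
    "team": ["a group of players"],
}

def definition_words(lemma, dictionary):
    """Collect the words occurring in the definitions of all senses of `lemma`."""
    words = set()
    for definition in dictionary.get(lemma, []):
        words.update(re.findall(r"[a-z]+", definition.lower()))
    return words

def neighbourhood(lemma, dictionary, depth):
    """Words reachable from `lemma` through definitions in at most `depth` steps."""
    frontier, seen = {lemma}, set()
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            next_frontier |= definition_words(word, dictionary)
        next_frontier -= seen          # expand each word only once
        seen |= next_frontier
        frontier = next_frontier
    return seen

depth1 = neighbourhood("captain", TOY_DICT, 1)
depth2 = neighbourhood("captain", TOY_DICT, 2)
```

At depth 1 only the words of the definitions of captain are collected; at depth 2 the words defining officer, ship, leader and team are added as well.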
The idea
• Representing lexical information as feature structures centred on word lemmas
– disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. CED
[lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED]]
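A minimal sketch of how the TEI serialisation of disproof might be turned into such a feature structure. Several choices are assumptions made for illustration: the entry carries no XML namespace (as in the snippet from the TEI Guidelines), the repeated sense feature is stored as a list, and the function name `tei_to_feature_structure` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_ENTRY = """
<entry>
  <form><orth>disproof</orth><pron>dIs"pru:f</pron></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1"><def>facts that disprove something.</def></sense>
  <sense n="2"><def>the act of disproving.</def></sense>
</entry>
"""

def tei_to_feature_structure(xml_text, resource):
    """Build a nested dict centred on the lemma from one TEI <entry>."""
    entry = ET.fromstring(xml_text)
    senses = [
        # Each <sense> becomes a small feature structure with n and def.
        {"n": int(s.get("n")), "def": s.findtext("def").rstrip(".")}
        for s in entry.findall("sense")
    ]
    return {
        "lemma": entry.findtext("form/orth"),
        "entry": {
            "pron": entry.findtext("form/pron"),
            "pos": entry.findtext("gramGrp/pos"),
            "sense": senses,
            "res": resource,  # label of the source dictionary
        },
    }

fs = tei_to_feature_structure(TEI_ENTRY, "CED")
```

A real TEI document would normally declare the TEI namespace, so the element paths would need to be qualified accordingly.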
Representing lexical entries as feature structures
(Cambridge English Dictionary)
lemma=disproof
entry=
  pron=dIs"pru:f
  pos=n
  sense=
    n=1
    def=facts that disprove something
  sense=
    n=2
    def=the act of disproving
  res=CED
Representing lexical entries as feature structures
• Graph representation (figure not reproduced): the same feature structure drawn as a graph, with edges lemma → disproof, entry → pron (dIs"pru:f), pos (n), sense (n=1, def=facts that disprove something), sense (n=2, def=the act of disproving) and res (CED)
Representing lexical entries as feature structures
(Merriam-Webster’s Collegiate Dictionary)
lemma=disproof
entry=
  pron=dIs"pru:f
  pos=n
  sense=
    n=1
    def=the action of disproving
  sense=
    n=2
    def=evidence that disproves
  res=MWCD
Representing lexical entries as feature structures
• Graph representation (figure not reproduced): the same feature structure drawn as a graph, with edges lemma → disproof, entry → pron (dIs"pru:f), pos (n), sense (n=1, def=the action of disproving), sense (n=2, def=evidence that disproves) and res (MWCD)
How could lexical entries be merged?
• Entries of the same word from different dictionaries
(Figure not reproduced: the CED graph and the MWCD graph of disproof, side by side.)
Merging lexical entries
• Distinct parts
(Figure not reproduced: the CED graph and the MWCD graph of disproof, side by side, with the parts that differ between them marked: the senses and the res feature.)
Merging lexical entries
(Figure not reproduced: the merged graph. The lemma, pron and pos nodes appear once; the senses and res of CED, and the senses and res of MWCD, are each attached under a new feature X.)
Representation of the merged feature structure
lemma=disproof
entry=
  pron=dIs"pru:f
  pos=n
  X=
    sense=
      n=1
      def=facts that disprove something
    sense=
      n=2
      def=the act of disproving
    res=CED
  X=
    sense=
      n=1
      def=the action of disproving
    sense=
      n=2
      def=evidence that disproves
    res=MWCD
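One possible reading of this merge, sketched in code. The slides do not fix an algorithm, so everything below is an assumption: features listed in `shared` are kept once when both sources agree on them, everything else is grouped per source under the new feature X, and the function name `merge_entries` is hypothetical.

```python
def merge_entries(a, b, shared=("pron", "pos")):
    """Merge two feature structures describing the same lemma."""
    if a["lemma"] != b["lemma"]:
        raise ValueError("entries describe different lemmas")
    merged_entry = {}
    for feature in shared:
        # Keep a shared feature once, but only when both sources agree on it.
        if feature in a["entry"] and a["entry"].get(feature) == b["entry"].get(feature):
            merged_entry[feature] = a["entry"][feature]
    # Everything not factored out stays with its source, under X.
    merged_entry["X"] = [
        {k: v for k, v in src["entry"].items() if k not in merged_entry}
        for src in (a, b)
    ]
    return {"lemma": a["lemma"], "entry": merged_entry}

ced = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "CED",
    "sense": [{"n": 1, "def": "facts that disprove something"},
              {"n": 2, "def": "the act of disproving"}]}}
mwcd = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "MWCD",
    "sense": [{"n": 1, "def": "the action of disproving"},
              {"n": 2, "def": "evidence that disproves"}]}}

merged = merge_entries(ced, mwcd)
```

If the two dictionaries disagreed on the pronunciation, pron would not be factored out and would remain inside each X.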
The WordNet search for disproof
Feature structures representation for the WN synsets of disproof
lemma=disproof
synsets=
  pos=n
  synset=
    lex=(*, falsification, refutation)
    gloss=any evidence that helps to establish the falsity of something
  synset=
    lex=(falsification, falsifying, *, refutation, refutal)
    gloss=the act of determining that something is false
The WordNet search for discount
Representing WordNet synsets
lemma=discount
synsets=
  pos=n
  synset=
    lex=(*, price reduction, deduction)
    gloss=the act of reducing the selling price of merchandise
  synset=
    lex=(discount rate, *, bank discount)
    gloss=interest on an annual basis deducted in advance on a loan
  …
synsets=
  pos=v
  synset=
    lex=(dismiss, disregard, brush aside, brush off, *, push aside, ignore)
    gloss=bar from attention or consideration
    ex=“She dismissed his advances”
  …
How could dictionary entries be merged with WN synsets?
(Figure not reproduced: the WN synsets graph of disproof and the CED entry graph of disproof, side by side.)
Merging a dictionary entry with a WN entry
(Figure not reproduced: the CED entry graph and the WN synsets graph of disproof, each still with its own lemma node.)
Merging a dictionary entry with a WN entry
(Figure not reproduced: the merged graph has a single lemma node, disproof, which carries both the entry feature coming from CED and the synsets feature coming from WN.)
Going one step further
• Feature structures are hierarchical data
• Codd: hierarchical data can be represented as relational tables
– Codd, E.F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.
Representing feature structures as relational tables
(Figure not reproduced: the feature structure graph of a word w, with the features lemma, entry, pron, pos, sense, n, def, cit, orth, auth, yr and res, is mapped step by step onto four tables. Table drawing from http://en.wikipedia.org/wiki/Relational_database)
WORD: id, lemma, entry
ENTRY: entry, pron, pos, sense, res
SENSE: sense, n, def, cit
CIT: cit, orth, auth, yr
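A rough sketch of how one feature structure might be flattened into rows of the WORD, ENTRY, SENSE and CIT tables. The slides give only the column names, so the key scheme is an assumption: the sense column of ENTRY and the cit column of SENSE are read here as group identifiers shared by all rows belonging together. The function name `flatten` and the citation are invented placeholders.

```python
from itertools import count

def flatten(fs, tables, ids):
    """Append the rows describing one feature structure to the tables."""
    entry_id, sense_group = next(ids), next(ids)
    entry = fs["entry"]
    tables["WORD"].append(
        {"id": next(ids), "lemma": fs["lemma"], "entry": entry_id})
    tables["ENTRY"].append(
        {"entry": entry_id, "pron": entry["pron"], "pos": entry["pos"],
         "sense": sense_group, "res": entry["res"]})
    for sense in entry["sense"]:
        cit_group = next(ids)  # one citation group per sense
        tables["SENSE"].append(
            {"sense": sense_group, "n": sense["n"], "def": sense["def"],
             "cit": cit_group})
        for cit in sense.get("cit", []):
            tables["CIT"].append({"cit": cit_group, **cit})

tables = {"WORD": [], "ENTRY": [], "SENSE": [], "CIT": []}
disproof = {"lemma": "disproof", "entry": {
    "pron": 'dIs"pru:f', "pos": "n", "res": "CED",
    "sense": [
        {"n": 1, "def": "facts that disprove something",
         # Invented placeholder citation, not taken from any dictionary.
         "cit": [{"orth": "placeholder citation", "auth": "N.N.", "yr": 1800}]},
        {"n": 2, "def": "the act of disproving"},
    ]}}
flatten(disproof, tables, count(1))
```

With this key scheme the tables share exactly one column name with their neighbour (entry, sense, cit), which is what a natural join over WORD, ENTRY, SENSE and CIT relies on.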
Relational operators
• Projection: $\pi_{a_1,\dots,a_n}(R)$ => a relation containing only the values of attributes $a_1,\dots,a_n$ from the relation R
• Selection: $\sigma_{\varphi}(R)$, where $\varphi$ is a logical condition => only the tuples satisfying the condition $\varphi$ are retained from the relation (or the set) R
• Join: $R \bowtie S$ => the set of all combinations of tuples in R and S that are equal on their common attributes
• Union: $R \cup S$ => a table representing the union of the two relations
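For illustration, the four operators can be written in a few lines over relations stored as lists of dictionaries. This is a didactic sketch rather than how a DBMS implements them; the function names and the two toy relations are invented here.

```python
def project(relation, attrs):
    """Projection: keep only `attrs`, dropping duplicate tuples."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def select(relation, condition):
    """Selection: keep the tuples for which `condition` holds."""
    return [row for row in relation if condition(row)]

def natural_join(r, s):
    """Join: combine tuples of r and s that agree on all common attributes."""
    out = []
    for x in r:
        for y in s:
            common = x.keys() & y.keys()
            if all(x[a] == y[a] for a in common):
                out.append({**x, **y})
    return out

def union(r, s):
    """Union: all tuples of r and s, without duplicates."""
    out = []
    for row in list(r) + list(s):
        if row not in out:
            out.append(row)
    return out

# Toy relations (invented rows) sharing the attribute "entry".
WORD = [{"id": 1, "lemma": "disproof", "entry": 10},
        {"id": 2, "lemma": "discount", "entry": 11}]
ENTRY = [{"entry": 10, "pos": "n", "res": "CED"},
         {"entry": 11, "pos": "v", "res": "CED"}]

nouns = project(select(natural_join(WORD, ENTRY), lambda t: t["pos"] == "n"),
                ["lemma"])
```

The last expression reads as $\pi_{lemma}(\sigma_{pos=n}(WORD \bowtie ENTRY))$.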
Interrogating a dictionary
• Citations before 1850 of the entry “symphony”.
(Figure not reproduced: the feature structure graph of a word w, with the regions corresponding to the tables WORD, ENTRY, SENSE and CIT marked.)
$\pi_{orth}(\sigma_{lemma=\text{“symphony”} \,\wedge\, yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))$
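The same interrogation could be tried out in SQL, for instance with SQLite from Python. This is a sketch under assumptions: the column types, the reading of sense and cit as shared group identifiers, and all inserted rows (including both citations) are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE WORD  (id INTEGER, lemma TEXT, entry INTEGER);
CREATE TABLE ENTRY (entry INTEGER, pron TEXT, pos TEXT, sense INTEGER, res TEXT);
CREATE TABLE SENSE (sense INTEGER, n INTEGER, "def" TEXT, cit INTEGER);
CREATE TABLE CIT   (cit INTEGER, orth TEXT, auth TEXT, yr INTEGER);

-- Invented toy rows: one entry, one sense, two citations.
INSERT INTO WORD  VALUES (1, 'symphony', 10);
INSERT INTO ENTRY VALUES (10, NULL, 'n', 20, 'toy');
INSERT INTO SENSE VALUES (20, 1, '(definition omitted)', 30);
INSERT INTO CIT   VALUES (30, 'invented citation, early', 'N.N.', 1790);
INSERT INTO CIT   VALUES (30, 'invented citation, late',  'N.N.', 1900);
""")

# Projection on orth, selection on lemma and yr, natural joins on the
# shared column names entry, sense and cit.
rows = con.execute(
    """
    SELECT orth
    FROM WORD NATURAL JOIN ENTRY NATURAL JOIN SENSE NATURAL JOIN CIT
    WHERE lemma = ? AND yr < 1850
    """,
    ("symphony",),
).fetchall()
```

Only the citation dated 1790 is returned; the one dated 1900 is filtered out by the selection on yr.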
Interrogating a combination between a dictionary and a wordnet
• All synonyms of nouns belonging to citations dated before 1850, sorted lexicographically.
(Figure not reproduced: the feature structure graph of a word w with the tables WORD, ENTRY, SENSE and CIT, later extended with the wordnet part: synsets and pos in table SYNS; synset, lex and gloss in table SYN.)
– The citations of the title word w:
$\pi_{orth}(\sigma_{lemma=w \,\wedge\, yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))$
– Unify and lemmatise words belonging to the citations:
$lem(U(\pi_{orth}(\sigma_{lemma=w \,\wedge\, yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))))$
– Select the nouns among them and project their synonyms:
$\pi_{lex}(\sigma_{pos=n \,\wedge\, lemma \in lem(U(\pi_{orth}(\sigma_{lemma=w \,\wedge\, yr<1850}(WORD \bowtie ENTRY \bowtie SENSE \bowtie CIT))))}(WORD \bowtie SYNS \bowtie SYN))$
Conclusions
• I did not propose any model, I simply made some observations (nothing is really new)
• Linking lexicographic resources:
– one resource => TEI representation => as feature structures => hierarchical graphs => relational tables
– more resources => unifications of tables
– use query and relational operators for interrogation
Discussion
• Only a sketch
– a lot of details should still be filled in
– the good news: XML structures (the native language of TEI) accept direct representations as database records: XSLT => opening direct access to a complex querying language: XQuery => mimicking the relational operators and adding more facilities
Discussion
• More good news
– representing variable depth structures
– recursive hierarchies: Kamfonas
• Fixed depth dimensions are simpler to implement, maintain and query… Hierarchies that have variable depth or an uncertain number of levels… can often benefit if implemented as recursive hierarchies. http://www.kamfonas.com/id3.html
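A small sketch of a recursive hierarchy for senses and sub-senses of variable depth, queried with a recursive common table expression in SQLite. The adjacency-list layout (a parent column pointing to the enclosing sense) and the sense labels are assumptions made for this example, not something the slides prescribe.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Each sense points to its parent sense; top-level senses have no parent.
CREATE TABLE SENSE (id INTEGER PRIMARY KEY, parent INTEGER, label TEXT);
INSERT INTO SENSE VALUES
  (1, NULL, '1'), (2, 1, '1.a'), (3, 1, '1.b'), (4, 3, '1.b.i'), (5, NULL, '2');
""")

# Collect sense 1 together with all its sub-senses, whatever their depth.
rows = con.execute(
    """
    WITH RECURSIVE subtree(id, label, depth) AS (
        SELECT id, label, 0 FROM SENSE WHERE id = ?
        UNION ALL
        SELECT s.id, s.label, t.depth + 1
        FROM SENSE AS s JOIN subtree AS t ON s.parent = t.id
    )
    SELECT label, depth FROM subtree ORDER BY label
    """,
    (1,),
).fetchall()
```

The query does not need to know in advance how deeply sub-senses are nested, which is the point of a recursive hierarchy.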
Discussion
• Even more good news (hopes)
– interrogations can be formulated in natural language => an interpreter translates them into the query language of a DBMS
– as such, a handy tool for the benefit of lexicographers
Acknowledgements
• Work partially supported by the project The Computational Representative Corpus of Contemporary Romanian Language, a project of the Romanian Academy and partially by the COST-ENeL project
• I thank Isabelle Tamba and Mădălin Pătrașcu for the slides describing the CLRE project
Thank you!