Asian Language Resources Summit, Phuket, March, 2009
KYOTO (ICT-211423)
Knowledge Yielding Ontologies for Transition-Based Organization
FP7: Intelligent Content and Semantics
http://www.kyoto-project.eu/
Piek Vossen, VU University Amsterdam
Overview
• Background information
• Baseline for retrieval in environment domain
• System architecture
• Knowledge mining
• Conclusions
KYOTO (ICT-211423) Overview
• Title: Knowledge Yielding Ontologies for Transition-Based Organization
• Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
– Taiwan and Japan funded by national grants
• Goal:
– Open and free platform for knowledge sharing across languages and cultures
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
– Bootstrap through open text mining & concept learning
– Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries
– Enables deep semantic search for facts and knowledge
• URL: http://www.kyoto-project.eu/
• Duration:
– March 2008 – March 2011
• Effort:
– 364 person months of work
Consortium
1. Vrije Universiteit Amsterdam (Amsterdam, The Netherlands)
2. Consiglio Nazionale delle Ricerche (Pisa, Italy)
3. Berlin-Brandenburg Academy of Sciences and Humanities (Berlin, Germany)
4. Euskal Herriko Unibertsitatea (San Sebastian, Spain)
5. Academia Sinica (Taipei, Taiwan)
6. National Institute of Information and Communications Technology (Kyoto, Japan)
7. Irion Technologies (Delft, The Netherlands)
8. Synthema (Rome, Italy)
9. European Centre for Nature Conservation (Tilburg, The Netherlands)
• Subcontractors:
– World Wide Fund for Nature (Zeist, The Netherlands)
– Masaryk University (Brno, Czech Republic)
KYOTO (ICT-211423) Overview
• Languages:
– English, Dutch, Italian, Spanish, Basque, Chinese, Japanese
• Domain:
– Environmental domain, BUT usable in any domain
• Global:
– Both European and non-European languages
• Available:
– Free: as open source system and data (GPL)
• Future perspective:
– Content standardization that supports world wide communication
State of the art in the environment domain
Baseline for environment domain
• Mainly use Google, first 10 hits, no advanced options
• Textual search with linguistic enhancements but no real semantic search:
– polluted water….
– polluting water….
• Growing time & information pressure:
– deliver up-to-date information from diverse & dynamic sources
– regional, local situations ► no general source
– various subdomains ► government, legal, biology, health, industry
– difficult access ► scientific publications
– no time to read ► too much information and work pressure
– dependent on trust: scientists ► environmentalists ► government ► general public
High-level targets & low-level questions
• High-level target (about 300 questions collected):
– Are there huge negative effects with regard to ecological networks and alien invasive species?
• Low-level facts that support answering the high-level targets:
– cases of alien invasion
– number of species
– causal relations associated with these (increments of) invasions
– causes related to ecological networks
– limited to the same time and location boundaries
Baseline retrieval results: 6 persons, 30 high-level questions

Result Rank  CONFIRMED      DISAPPROVED    UNDECIDED      Total
0             13  20.31%     27  20.30%     10  15.87%     50  19.23%
1              6   9.38%      9   6.77%      9  14.29%     24   9.23%
2              8  12.50%     13   9.77%      7  11.11%     28  10.77%
3              5   7.81%      6   4.51%      3   4.76%     14   5.38%
4              8  12.50%      6   4.51%      2   3.17%     16   6.15%
5              2   3.13%      7   5.26%      3   4.76%     12   4.62%
6              2   3.13%      6   4.51%      4   6.35%     12   4.62%
7              2   3.13%      2   1.50%      1   1.59%      5   1.92%
8              4   6.25%      3   2.26%      1   1.59%      8   3.08%
9              1   1.56%      5   3.76%      0   0.00%      6   2.31%
-1            13  20.31%     49  36.84%     23  36.51%     85  32.69%
Total         64  24.62%    133  51.15%     63  24.23%    260
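The percentages in the table can be re-derived from the raw counts. The sketch below is only a consistency check written for this transcript, not part of the KYOTO evaluation tooling: per-rank percentages are taken as shares of their own column total, while the Total row gives each column's share of all 260 judgements. The helper name `share` is hypothetical.

```python
# Counts per result rank, copied from the table (rank -1 as given on the slide).
confirmed = {0: 13, 1: 6, 2: 8, 3: 5, 4: 8, 5: 2, 6: 2, 7: 2, 8: 4, 9: 1, -1: 13}
disapproved = {0: 27, 1: 9, 2: 13, 3: 6, 4: 6, 5: 7, 6: 6, 7: 2, 8: 3, 9: 5, -1: 49}
undecided = {0: 10, 1: 9, 2: 7, 3: 3, 4: 2, 5: 3, 6: 4, 7: 1, 8: 1, 9: 0, -1: 23}


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


column_totals = {
    "CONFIRMED": sum(confirmed.values()),
    "DISAPPROVED": sum(disapproved.values()),
    "UNDECIDED": sum(undecided.values()),
}
grand_total = sum(column_totals.values())

# Per-rank cells are shares of their own column.
rank0_confirmed_pct = share(confirmed[0], column_totals["CONFIRMED"])
# Total-row cells are shares of all judgements.
confirmed_of_all_pct = share(column_totals["CONFIRMED"], grand_total)
```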
KYOTO's Solution
• Text mining:
– Massive and accurate indexing of facts from vast amounts of text;
– In any language/culture from scattered sources;
– Again and again to detect trends and changes;
– Direct relation between knowledge modeling effort and text mining
• Knowledge modeling:
– automatic learning of terms and concepts from text in any language;
– formalization of knowledge in computer usable format -> wordnets & ontologies
• Community software:
– For experts in the field and not knowledge engineers
– Continuous and collaborative effort:
  • adapt to the changing domain;
  • consensus in the field;
  • consensus across languages and cultures
– Produce interoperable, formal, standardized knowledge structures;
– Relate knowledge structure to expressions in languages
[Figure: KYOTO overview. Distributed, diverse & dynamic data from environmental organizations is processed in six numbered steps, supported by wordnets and an ontology with Top, Middle and Domain levels (Abstract, Physical, Process, Substance, H2O, CO2, CO2 Emission, H2O Pollution, Greenhouse Gas).]
1. Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
2. Tybot (term yielding robot): CO2 emission
3. Wikyoto: maintain terms & concepts
4. Kybot (knowledge yielding robot): index facts: Process: Emission; Involves: CO2; Property: increase, sudden; When: 2008; Where: Europe
5. Text & Fact Index
6. Semantic Search by citizens, governments and companies
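The fact indexed for the example sentence ("Sudden increase of CO2 emissions in 2008 in Europe") can be pictured as a small structured record that a semantic search then filters on. The following is a minimal sketch under that assumption; the `Fact` class, its field names and the `search` helper are hypothetical and do not reflect KYOTO's actual fact format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One indexed fact, using the slots shown on the slide (names are illustrative)."""
    process: str
    involves: str
    properties: tuple
    when: str
    where: str


# The single example fact from the slide.
FACT_INDEX = [
    Fact(process="Emission", involves="CO2",
         properties=("increase", "sudden"), when="2008", where="Europe"),
]


def search(index, **criteria):
    """Return all facts whose slots equal every given criterion."""
    return [fact for fact in index
            if all(getattr(fact, slot) == value for slot, value in criteria.items())]


hits = search(FACT_INDEX, involves="CO2", where="Europe")
```

Searching on slots rather than on words is what separates this from keyword retrieval: "polluted water" and "polluting water" would fill different slots even though they share the same strings.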
System architecture
[Figure: Data Flow Diagram of Kyoto System, with the flow numbered 1 to 3. Components: Original Document Base; Linguistic Processor producing Kyoto Annotation Format (KAF); Semantic & Syntactic Base; Term Extractor (Tybot) and Term Base; Fact Extractor (Kybot) and Fact Base; Wiki Term Editor (Wikyoto); Multilingual Knowledge Base (wordnets and ontologies, interlinked). Users: End User (Keyword Search), End User (Semantic Search), Fact User, Concept User.]
Kyoto Annotation Format KAF
• Kyoto Annotation Format (Level 1): a multi-layered annotation format for:
– Tokenization and word form segmentation
– POS tagging
– Lemmatization and term extraction
– Constituency tagging
– Dependency tagging
ENG-3.0-107695012-N
Semantic Annotation
• Semantic Annotation Format for:
– Named Entity Recognition (time, events, quant. …)
– Word Sense Disambiguation (D-WSD)
– Semantic Role Labeling (SRL)
no synsets
KAF level 2 (SemKAF)
ENG-3.0-107630294-N
KAF annotation: WSD
<term tid="t4" type="open" lemma="population" pos="N">
  <span> <target id="w4"/> </span>
  <senseAlt>
    <sense sensecode="EN-17-00861095-n"/>
    <sense sensecode="EN-17-00859568-n"/>
    .......
<term tid="t4" type="open" lemma="population" pos="N">
  <span> <target id="w4"/> </span>
  <senseAlt>
    <sense sensecode="EN-17-00859568-n" confidence="0.80"/>
    <sense sensecode="EN-17-00257849-n" confidence="0.13"/>
    <sense sensecode="EN-17-00962397-n" confidence="0.07"/>
  </senseAlt>
</term>
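A term element with sense confidences can be read with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a cleaned-up copy of the slide's `population` term and picks the highest-confidence sense. The function name `best_sense` is hypothetical, and treating the highest confidence as the chosen sense is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the disambiguated term element shown on the slide.
KAF_TERM = """
<term tid="t4" type="open" lemma="population" pos="N">
  <span> <target id="w4"/> </span>
  <senseAlt>
    <sense sensecode="EN-17-00859568-n" confidence="0.80"/>
    <sense sensecode="EN-17-00257849-n" confidence="0.13"/>
    <sense sensecode="EN-17-00962397-n" confidence="0.07"/>
  </senseAlt>
</term>
"""


def best_sense(term_xml):
    """Return (lemma, sensecode, confidence) for the most confident sense."""
    term = ET.fromstring(term_xml.strip())
    senses = [(sense.get("sensecode"), float(sense.get("confidence")))
              for sense in term.iter("sense")]
    code, confidence = max(senses, key=lambda pair: pair[1])
    return term.get("lemma"), code, confidence


result = best_sense(KAF_TERM)
```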
Data formats

Level of annotation            Standard format
1. Morpho-syntax annotation    KAF <= (MAF, SYNAF, SEMAF)
2. Semantic annotation         KAF <= (MAF, SYNAF, SEMAF)
3. Terms representation        TMF
4. Facts annotation            KAF
5. Wordnets                    Wordnet-LMF
6. Ontologies                  OWL
Knowledge mining
Knowledge mining
• Concept mining (Tybots):
– Extract terms and relations in a language
– Map the terms to an existing wordnet
– Ontologize terms to concepts and axioms
• Fact mining (Kybots):
– Define logical patterns
– Define expression rules in a language
What Tybots do...
• Input are text documents
• Linguistic processors generate KAF annotation (sequential):
– morpho-syntactic analysis
– semantic roles
– named entities
– wordnet and ontology mappings
• Output are term hierarchies in TMF (generic):
– structural parent relations
– quantified structural and semantic relations
– statistical data
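One simple way to picture "structural parent relations" is to link each multiword term to a shorter term that forms its head. The sketch below assumes English head-final compounds (the parent is the longest known suffix of the term). This heuristic, the function name and the frequencies are illustrative assumptions, not the documented Tybot algorithm.

```python
def build_term_hierarchy(term_counts):
    """Map each term to its structural parent: the longest known proper suffix, or None."""
    parents = {}
    for term in term_counts:
        words = term.split()
        parent = None
        # Try suffixes from longest to shortest, e.g. "greenhouse gas" -> "gas".
        for start in range(1, len(words)):
            candidate = " ".join(words[start:])
            if candidate in term_counts:
                parent = candidate
                break
        parents[term] = parent
    return parents


# Made-up frequencies standing in for the "statistical data" of a term base.
terms = {"emission": 6, "gas": 5, "greenhouse gas": 3, "area": 4, "agricultural area": 2}
hierarchy = build_term_hierarchy(terms)
```

A head-final rule would not carry over unchanged to languages with head-initial compounds, which is one reason term extraction has to be configured per language.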
[Figure: Tybot worked example. Source Documents pass through Linguistic Processors (morpho-syntactic analysis) into the TYBOT Concept Miners, followed by conceptual modeling in three steps: Synthesize, Ontologize, Axiomatize.]
– Morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP
– Term hierarchy: emission, gas, greenhouse gas, area, agricultural area, CO2 (linked by "of" and "in")
– English Wordnet: natural process:1, emission:2, emission:3, gas:1, greenhouse gas:1, substance:1, area:1, rural area:1, geographical area:1, region:3, location:3, farmland:2, CO2
– Ontology: Abstract, Physical, Process, Substance, Chemical Reaction, H2O, CO2, CO2Emission, WaterPollution, GlobalWarming, GreenhouseGas
– Axiomatize: (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)
What Kybots do
• Input:
– KAF annotations of text: sequential & encoded by language
– Conceptual frame from the ontology
– Expression rules for frame to language mapping:
  • Wordnet in a language
  • Morpho-syntactic mapping rules
• Output is a database of facts in FactAF (generic):
– aggregated facts
– inferred facts
– language neutral
Fact mining
• KYBOT = Knowledge Yielding Robot
• Logical expression:
– (instance, e1, Burn) (instance, e2, Warming) (cause, e1, e2)
– (instance, s1, CO2) (instance, e1, GlobalWarming) (katalyist, s1, e1)
• Expression rules per language:
– [N[s1]V[e1]]S e.g. "CO2 is emitted", "fine dust blocks sun-light"
– [N[s1]N[e1]]N e.g. "CO2 emission", "sun-light blocking"
– [[N[e1]][prep][N[s2]]]NP e.g. "emission of CO2", "sun-light blocking by fine dust"
• Ontology * Wordnets:
– Capabilities: WNT -> adjectives ("explosive", "toxic"), WNT -> nouns ("explosive", "poison")
– Causes: WNT -> verbs ("eat"), WNT -> nouns ("consumption")
– Process: DamageProcess, ProduceProcess
• Kybot compiler:
– kybots = logical pattern + ontology + WN[Lx] + ER[Lx]
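The combination "logical pattern + ontology + wordnet + expression rules" can be illustrated with a toy matcher in which different surface patterns yield the same language-neutral fact. This is a sketch under strong assumptions: the two-entry lexicon stands in for wordnet-to-ontology mappings, the POS tags `N` and `P` and the rule table are hypothetical, and the real Kybot compiler is not described in enough detail here to reproduce.

```python
# Hypothetical mini-lexicon: word -> (ontology class, ontology type).
LEXICON = {
    "emission": ("Emission", "Process"),
    "CO2": ("CO2", "Substance"),
}

# Each expression rule: a POS sequence plus the positions of the
# process word (e1) and the substance word (s1) inside that sequence.
EXPRESSION_RULES = {
    "N-N": (("N", "N"), 1, 0),            # "CO2 emission"
    "N-prep-N": (("N", "P", "N"), 0, 2),  # "emission of CO2"
}


def mine_facts(tagged_tokens):
    """Slide every rule over (word, pos) pairs and collect the matching facts."""
    facts = []
    for name, (pattern, e_pos, s_pos) in EXPRESSION_RULES.items():
        size = len(pattern)
        for start in range(len(tagged_tokens) - size + 1):
            window = tagged_tokens[start:start + size]
            if tuple(pos for _, pos in window) != pattern:
                continue
            event = LEXICON.get(window[e_pos][0])
            substance = LEXICON.get(window[s_pos][0])
            # The logical pattern requires a Process that involves a Substance.
            if event and substance and event[1] == "Process" and substance[1] == "Substance":
                facts.append({"process": event[0], "involves": substance[0], "rule": name})
    return facts


compound = mine_facts([("CO2", "N"), ("emission", "N")])
prepositional = mine_facts([("emission", "N"), ("of", "P"), ("CO2", "N")])
```

Both phrasings produce a fact with process `Emission` involving `CO2`; only the rule that fired differs, which is why the logical pattern can stay fixed while the expression rules vary per language.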
Fact mining by Kybots
[Figure: Source Documents pass through Linguistic Processors (morpho-syntactic analysis, KAF). The ontology (Abstract, Physical, Process, Substance, Chemical Reaction, H2O, CO2, CO2 emission, water pollution), wordnets & linguistic expressions, and generic logical expressions drive the fact analysis.]
– Morpho-syntactic analysis (KAF): [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP
– Fact analysis: [the emission]NP Process: e1; [of greenhouse gases]PP Patient: s2; [in agricultural areas]PP Location: a3
• semantic role labelling
• time & place
• aggregation from all relevant phrases and documents
• inferencing
• adding trust and reliability
Wikyoto
[Figure: Wikyoto interview. Hidden on the Kyoto Server: a simplified term fragment (population, marine species, terrestrial species), a simplified ontology fragment (?Population, Group) and the supporting text.]
– ".... populations declined"
– "..... terrestrial and marine species .."
– "in forests ..... declined"
– ".... populations such as terrestrial and marine species ....."
Shown to the user: a term list (A ..... decline ... population ..... Z) and interview questions:
– Do populations always consist of marine species?
– Do populations consist of marine species?
– Are terrestrial species never marine species?
– Are terrestrial species a type of populations?
[Figure: Smart Kytext. Labels: pdf, KAF, Tybots, DE-TN, Kybots, FactAF, Facts in RDF, plugin, DE-WN, DE-KON, G-WN, G-KON, Wordnets in LMF, Ontologies in OWL-DL, WIKIPEDIA, SUMO, DOLCE, GEO, FRAMENET.]
[Figure: Kyoto Knowledge Base. Wordnets WnIT, WnEN, WnEU, WnNL, WnJP, WnCH and WnES, each with a Domain part, together with the Ontology and the Domain Ontology.]
Potential impact
Ultimate goal
• Global standardization and anchoring of meaning such that:
– Machines can start to approach text understanding -> semantic web connects to the current web
– Communities can dynamically maintain knowledge, concepts and their terms in an easy to use system
– Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled
• Establish a Global-Wordnet-Grid: formalization of Wikipedia for humans AND machines across languages
Global WordNet Grid
[Figure: words in each language are linked (links numbered 1 to 3) to a shared Inter-Lingual Ontology containing Object, Device and TransportDevice.]
• English words: vehicle; car, train
• Czech words: dopravní prostředník; auto, vlak
• French words: véhicule; voiture, train
• Estonian words: liiklusvahend; auto, killavoor
• German words: Fahrzeug; Auto, Zug
• Spanish words: vehículo; auto, tren
• Italian words: veicolo; auto, treno
• Dutch words: voertuig; auto, trein
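The idea behind the grid, words in different languages anchored to one shared concept, can be sketched as a lookup that goes from a word to its concept and back out to another language. The concept labels `Vehicle`, `Car` and `Train` and the `translate` helper are hypothetical; the slide does not state which ontology node each word is linked to, so this is an illustration of the mechanism only.

```python
# Per-language word -> shared concept label (labels are illustrative assumptions).
GRID = {
    "en": {"vehicle": "Vehicle", "car": "Car", "train": "Train"},
    "nl": {"voertuig": "Vehicle", "auto": "Car", "trein": "Train"},
    "de": {"Fahrzeug": "Vehicle", "Auto": "Car", "Zug": "Train"},
    "fr": {"véhicule": "Vehicle", "voiture": "Car", "train": "Train"},
    "es": {"vehículo": "Vehicle", "auto": "Car", "tren": "Train"},
    "it": {"veicolo": "Vehicle", "auto": "Car", "treno": "Train"},
}


def translate(word, source, target):
    """Find words in `target` that share a concept with `word` in `source`."""
    concept = GRID[source].get(word)
    if concept is None:
        return []
    return sorted(w for w, c in GRID[target].items() if c == concept)


dutch_to_german = translate("trein", "nl", "de")
```

No language pair needs its own dictionary: adding a language means linking its words to the shared concepts once, after which it connects to every other language in the grid.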
Linking Open Data dataset cloud
http://richard.cyganiak.de/2007/10/lod/
[Figure: the Linking Open Data cloud extended with domain resources.]
– Wordnets (terms): environment (shown five times), medical, legal, sailing
– Ontologies (concepts): environment, medical, legal, sailing
– Facts: environment, medical, legal
Conclusions
Kyoto main assets
• Wiki platform (WIKYOTO) for connecting, transferring and controlling knowledge and information across people and computers
• Term yielding robots (TYBOT): software that extracts terms and concepts from documents
• Knowledge yielding robots (KYBOT): fact extraction software that generates a comprehensive list of facts from a collection of sources
• Fact repositories & fact alert: reports changes in facts on a collection of sources
• Domain WORDNETS and domain ONTOLOGIES
• Create the backbone for the Global Wordnet Grid
What makes KYOTO unique?
• Integrates & combines all ► knowledge engineering, language engineering, wikis, term & concept learning, fact mining from text in and across languages, & standardization
• Direct relation between concept modeling and text mining ► make it worth the effort
• Wikyoto community tool ► hides technology and complex knowledge and language representation
• Operated by community people and not by knowledge engineers and language technology people ► exploits massive labor force of communities all over the world
What makes KYOTO unique?
• Text mining and ontology learning are developed for separate languages
– ► KYOTO is multi- and cross-lingual & cross-cultural
– ► cross-lingual and cross-cultural semantic interoperability
• Text mining and ontology learning are often limited to a specific domain and/or application ► KYOTO for any domain and application
• Text mining and ontology learning do not relate the terms and concepts to generic language and knowledge resources ► KYOTO anchors knowledge from a community to general vocabulary and likewise to other communities
Free, open source license (GPL)
Thank you for your attention
Contribution of KYOTO
• Sources (html, xls):
– hundreds of thousands of sources in the environment domain
– in many different languages
– spread all over the world
– changing every day
• TYBOT:
– KYOTO learns terms and concepts from text documents
– Stored as structures that people and computers understand (wordnets: environment terms; ontology: environment concepts)
• WIKYOTO:
– KYOTO delivers a Web 2.0 environment for community-based control
– Connects people across languages and cultures
– Establishes consensus and knowledge transition
• KYBOT:
– KYOTO enables semantic search and fact extraction (environment facts)
– Software can partially understand language and exploit web 1 data
– Understanding is helped by the terms and concepts defined for each language