28.06.2013 DIMA – TU Berlin 1
Database Systems and Information Management Group (DIMA), Technische Universität Berlin
http://www.dima.tu-berlin.de/
Automated Construction of a Large Semantic Network of Related Terms for Domain-Specific Modeling
CAiSE 2013, June 21st, Valencia
Henning Agt and Ralf-Detlef Kutsche
Technische Universität Berlin
Motivation
■ Autocompletion applications
■ Predict what the user wants to model next
[Figure: list of suggested terms: nurse, treatment, medicine, emergency, ...]
Research Goals
■ Our vision: Provide automated suggestions of semantically related model elements for domain modeling [5], [19]
□ Focus on domain terminology and conceptual design
□ Query domain and common-sense ontologies
□ Information extraction from text
■ Requirements for the intended application
□ Dictionary of terms
□ Relations between terms
□ Query interface and ranking functions
[Figure: architecture overview with the components Ontologies, Text Analysis, Terminology, Knowledge Service, and Modeling Tools, connected by the steps Extract, Retrieve/Integrate, Generate, Query, Provide (suggestions), and Use; example suggestions: nurse, treatment, medicine, emergency, ...]
Agenda
■ Input dataset
■ Text analysis process
■ Application of SemNet
■ Evaluation of SemNet
■ Conclusions and Future Work
[Figure: processing pipeline with the stages Text Corpus, N-Gram Statistics, N-Gram DB, POS DB, Normalized N-Gram DB, SemNet, and Applications, connected by the steps Analyse, Parse, Tag, Normalize, Analyse Co-occurrences, Query, and Retrieve]
Google Books N-Gram Dataset
■ Large amounts of text data
■ N-grams
□ Sequence of n consecutive words/tokens and its frequency
□ Google provides 1-, 2-, 3-, 4-, and 5-grams in several languages
■ We work on the English-All dataset V2 (1-grams and 5-grams) [11]
[Figure: 5 million books, corpus of 500 billion words, n-gram analysis, n-gram dataset (CSV text files with word frequencies)]
Example 5-grams with their frequencies:
...
to go to the hospital                 46,410
general condition of the patient      28,198
I was in the hospital                 19,268
discharge from the hospital .         12,476
admission to the hospital .           10,558
the patient to the hospital            6,422
by placing the patient in              6,026
between doctor and patient .           5,908
...
able to leave the hospital             4,629
patient admitted to the hospital       4,303
a patient in the hospital              3,844
the symptom of the patient             2,559
the patient under local anesthesia     2,536
a patient is suffering from            2,475
the doctor and the hospital            1,362
the hospital and the doctor            1,017
...
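To make the n-gram notion on this slide concrete, here is a minimal Python sketch that counts n-grams in a tiny token list. It is only an illustration: the Google Books dataset already ships precomputed counts, and the function name `count_ngrams` and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every sequence of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text (not taken from the corpus).
tokens = "the patient went to the hospital and the patient stayed".split()
bigrams = count_ngrams(tokens, 2)
fivegrams = count_ngrams(tokens, 5)
print(bigrams[("the", "patient")], len(fivegrams))
```

A `Counter` keyed by token tuples mirrors the dataset layout shown on the slide: one n-gram per row together with its frequency.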
Preprocessing
■ N-gram database: make the data manageable
□ Input: 2.5 terabytes of text
□ Output: Tables with 10 million 1-grams and 710 million 5-grams (21 gigabytes)
■ Part-of-speech tagging [8], [9]: identify the lexical category of each text token
□ Output: Table with POS tags for each 5-gram (14 gigabytes)
□ Examples: "general condition of the patient" is tagged JJ NN IN DT NN; "drug store pharmacist or doctor" is tagged NN NN NN CC NN
□ Tags: JJ = adjective, NN = normal noun, IN = preposition, DT = determiner, CC = coordinating conjunction
■ Normalization: reduce the number of word variations
□ Plural stemming, lowercasing of adjectives and normal nouns
□ Proper nouns are not touched
□ Examples: doctors → doctor; Medical practitioner → medical practitioner; hospitals in Valencia → hospital in Valencia
■ Result: 710 million normalized and tagged 5-grams
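The normalization step can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the plural stemmer `strip_plural` is a deliberately naive stand-in, and the use of the Penn Treebank tags NNS (plural noun) and NNP (proper noun) is an assumption beyond the tags listed on the slide.

```python
def strip_plural(word):
    """Very naive plural stemmer; an assumption, not the stemmer used for SemNet."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("sses", "ches", "shes", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def normalize(tagged_tokens):
    """Lowercase adjectives and common nouns, stem plural common nouns.

    Tokens with any other tag (e.g. NNP for proper nouns) are left untouched.
    """
    result = []
    for word, tag in tagged_tokens:
        if tag in ("NN", "NNS", "JJ"):
            word = word.lower()
        if tag == "NNS":
            word = strip_plural(word)
        result.append(word)
    return result


print(normalize([("hospitals", "NNS"), ("in", "IN"), ("Valencia", "NNP")]))
```

Driving the normalization by the POS tag rather than by capitalization is what keeps proper nouns such as "Valencia" intact.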
Lexical Patterns
■ Goal: Detect domain terminology using syntactic patterns [12]
■ Analysis of existing dictionaries
□ 75% of terms are noun, noun-noun, or adjective-noun combinations
■ Excerpt of the 20 patterns used: [table not preserved in this transcript]
■ No proper nouns: "Stanford University" is excluded, "university professor" is kept
□ Our focus is conceptual design on schema level
■ Limitation: a 5-gram contains only 5 words
□ Maximum length of a term: 3 words
□ Example: "doctor or mental health professional" = term, separator, term
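A minimal Python sketch of matching the "term, separator, term" structure on a POS-tagged 5-gram follows. The pattern table did not survive in this transcript, so `TERM_PATTERNS` and `SEPARATOR_TAGS` are hypothetical stand-ins for the 20 patterns actually used, and the longest-match rule is an assumption.

```python
# Hypothetical term patterns over Penn Treebank tags (at most 3 words per term).
TERM_PATTERNS = {
    ("NN",), ("NN", "NN"), ("JJ", "NN"),
    ("NN", "NN", "NN"), ("JJ", "NN", "NN"),
}
SEPARATOR_TAGS = {"CC"}  # assumption: only coordinating conjunctions separate terms


def longest_term(tokens, tags, start, step):
    """Return the longest term beginning (step=1) or ending (step=-1) at start."""
    for length in (3, 2, 1):
        if step == 1:
            lo, hi = start, start + length
        else:
            lo, hi = start - length + 1, start + 1
        if lo < 0 or hi > len(tags):
            continue
        if tuple(tags[lo:hi]) in TERM_PATTERNS:
            return " ".join(tokens[lo:hi])
    return None


def extract_pair(tokens, tags):
    """Find 'term separator term' in a tagged 5-gram; None if nothing matches."""
    for i, tag in enumerate(tags):
        if tag in SEPARATOR_TAGS:
            left = longest_term(tokens, tags, i - 1, -1)
            right = longest_term(tokens, tags, i + 1, 1)
            if left and right:
                return left, right
    return None


print(extract_pair("doctor or mental health professional".split(),
                   ["NN", "CC", "JJ", "NN", "NN"]))
```

The example also shows where the 3-word limit comes from: with one word for the separator and at least one word for the other term, at most three of the five positions remain for a single term.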
Co-Occurring Terms
■ Hierarchical pattern matching
■ Distributional semantics [13], [22]
□ "Words that occur in the same contexts tend to have similar meanings." (Distributional Hypothesis by Z. Harris)
■ Example: the 5-gram (context) "your doctor or pharmacist ." has the absolute frequency 9271
□ "doctor" and "pharmacist" co-occurred 9271 times
[Figure: pattern matching examples with the annotations "Easiest case", "Highest level remains", "No idiomatic phrases", and "No consecutive patterns"]
SemNet Construction
■ Discard 5-grams that contain 4 or 5 stopwords
□ Examples: "to go to the doctor", "I am what I am", "a ) ( 2 )"
■ Apply pattern matching on the remaining 5-grams
□ Result: Large table of binary relations
■ Frequency aggregation
□ Many terms co-occurred in different contexts
■ Relative frequency computation
□ For each term with respect to its related terms
■ Graph construction
□ Directed, weighted edges
□ Relational database and graph database serialization (SQLite / Neo4J)
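The construction steps on this slide can be sketched in Python as follows. Several points are assumptions: the stopword list is invented, punctuation and digits are counted like stopwords, and the relative frequency is taken to be a pair's aggregated count divided by the source term's total count. The frequency 729 in the usage example is a made-up value.

```python
from collections import defaultdict

# Hypothetical stopword list; the list used for SemNet is not given on the slides.
STOPWORDS = {"to", "go", "the", "i", "am", "what", "a", "of", "in", "or", "and", "your"}


def too_many_stopwords(tokens, limit=4):
    """True if the 5-gram contains `limit` or more stopwords or non-word tokens."""
    hits = sum(1 for t in tokens if t.lower() in STOPWORDS or not t.isalpha())
    return hits >= limit


def build_edges(pairs):
    """Aggregate (term, related, frequency) triples into relative-frequency edges."""
    totals = defaultdict(int)   # term -> summed co-occurrence count
    counts = defaultdict(int)   # (term, related) -> summed count over all contexts
    for term, related, freq in pairs:
        # Store both directions so every term can be looked up as a source node.
        for src, dst in ((term, related), (related, term)):
            counts[(src, dst)] += freq
            totals[src] += freq
    return {edge: freq / totals[edge[0]] for edge, freq in counts.items()}


# 9271 is the frequency from the slides; 729 is a hypothetical value.
edges = build_edges([("doctor", "pharmacist", 9271), ("doctor", "nurse", 729)])
print(edges[("doctor", "pharmacist")], edges[("pharmacist", "doctor")])
```

Because each weight is normalized by the source term's total, the edge from "doctor" to "pharmacist" differs from the reverse edge, which is one plausible reading of why the graph uses directed edges.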
Statistics
■ Properties of SemNet
□ 268,937 distinct single-word terms
□ 2,115,494 distinct double-word terms
□ 355,689 distinct triple-word terms
□ 2.7 million terms and 37.5 million relations
□ 2.2 GB disk space
■ Lessons learned from the analysis process
[Figure: "N-Gram Information Content" chart with the categories "4 or 5 stopwords", "Only 1 term", "No pattern match", and "N-grams with a semantic relationship"; shares shown: 41.6%, 15.7%, 32.6%, 10.1% (the assignment of shares to categories is not recoverable from the transcript)]
[Figure: "Semantic relatedness: Zipf's law", degree of relatedness plotted over rank]
Querying SemNet
■ Query interfaces
□ SQL: Query the relational database
□ Cypher: Query the Neo4J database
□ Java: Use SemNet in your applications
□ PHP: Explore the data in a web interface
■ Examples of top 10 automatically identified related terms [table not preserved in this transcript]
(f: absolute term frequency in the original text corpus, #r: number of related terms)
SQL example:
select * from nouncooccurrences
where termw1 = 5824331 and termw2 is null and termw3 is null
order by relfreq desc limit 20;
Java example:
public ArrayList<String> getRelatedStringTerms(ArrayList<String> inputTerms) { … }
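As a rough illustration of the SQL interface, the following Python sketch runs the same kind of query against an in-memory SQLite database. Only the table name `nouncooccurrences` and the columns `termw1`, `termw2`, `termw3`, and `relfreq` appear on the slide; the `reltermw*` columns, the sample rows, and the helper `related_terms` are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: the reltermw* columns are invented here.
conn.execute(
    """CREATE TABLE nouncooccurrences (
           termw1 INTEGER, termw2 INTEGER, termw3 INTEGER,
           reltermw1 INTEGER, reltermw2 INTEGER, reltermw3 INTEGER,
           relfreq REAL)"""
)
rows = [
    (5824331, None, None, 101, None, None, 0.036),
    (5824331, None, None, 102, None, None, 0.031),
    (5824331, 7, None, 103, None, None, 0.500),  # two-word term, must not match
    (42, None, None, 101, None, None, 0.900),    # different term
]
conn.executemany("INSERT INTO nouncooccurrences VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def related_terms(connection, word_id, limit=20):
    """Top related terms of a single-word term, ordered by relative frequency."""
    query = """SELECT reltermw1, relfreq FROM nouncooccurrences
               WHERE termw1 = ? AND termw2 IS NULL AND termw3 IS NULL
               ORDER BY relfreq DESC LIMIT ?"""
    return connection.execute(query, (word_id, limit)).fetchall()


print(related_terms(conn, 5824331))
```

The `termw2 IS NULL AND termw3 IS NULL` condition restricts the lookup to single-word terms, as in the query on the slide.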
Ranking Results of Multiple Input Terms
■ Challenge: Methods based on matrices and vectors are too slow
■ Strategy: Intersect the related-term sets and multiply the relative frequencies
Related terms of "table" (relative frequency):
chair 0.0441, contents 0.0359, end 0.0221, front 0.0194, figure 0.0189, head 0.0189, side 0.0180, data 0.0157, hand 0.0132, column 0.0131, page 0.0118, edge 0.0112, result 0.0100, value 0.0099, place 0.0087, row 0.0086, show 0.0082, elbow 0.0072, list 0.0071, bed 0.0071, …
Related terms of "database" (relative frequency):
data 0.0735, information 0.0569, record 0.0376, table 0.0334, access 0.0310, spreadsheet 0.0252, name 0.0201, object 0.0164, retrieval system 0.0163, file 0.0158, example 0.0153, use 0.0150, connection 0.0146, structure 0.0139, field 0.0125, user 0.0124, change 0.0112, type 0.0107, size 0.0104, transaction 0.0102, …
Related terms of "transaction": …
Combined ranking for "table" + "database" (set intersection ∩, relative frequencies multiplied *):
data 0.001155, contents 0.000359, information 0.000190, record 0.000091, use 0.000077, end 0.000060, example 0.000055, name 0.000050, figure 0.000047, value 0.000045, result 0.000037, list 0.000037, column 0.000034, row 0.000033, object 0.000024, field 0.000023, book 0.000016, order 0.000016, size 0.000014, query 0.000012, …
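The intersection-and-multiplication strategy can be sketched in Python as below. The function name `rank_combined` is an assumption. The two dictionaries hold only a few entries from the top-20 excerpts on the slide, so the result contains a single shared term, whereas the full lists yield many more.

```python
def rank_combined(*related):
    """Intersect related-term sets and multiply their relative frequencies."""
    if not related:
        return []
    common = set(related[0])
    for terms in related[1:]:
        common &= set(terms)
    scores = {}
    for term in common:
        score = 1.0
        for terms in related:
            score *= terms[term]
        scores[term] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Excerpts of the related-term lists shown on the slide.
table = {"chair": 0.0441, "contents": 0.0359, "data": 0.0157, "column": 0.0131}
database = {"data": 0.0735, "information": 0.0569, "record": 0.0376, "field": 0.0125}
ranking = rank_combined(table, database)
print(ranking)
```

For "data" the product of the rounded values is 0.0157 * 0.0735, roughly 0.00115, which agrees with the 0.001155 on the slide up to rounding of the inputs. Set intersection plus one multiplication per shared term avoids building term vectors or matrices, which the slide names as too slow.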
Modeling With Semantic Autocompletion
■ Prototype: Ecore Diagram Editor with class name suggestions [15]
■ Automated adaptation of the suggestions to the content of the model
Evaluation Setup
■ Challenge
□ No gold standard available for many information extraction tasks
■ Our strategy: Compare SemNet to existing knowledge bases
□ Measure how much of the information in WordNet and ConceptNet is contained in SemNet
■ WordNet V3.0: Lexical database for the English language [16]
□ Synsets: Grouped terms that share the same sense
□ Relations: Mainly taxonomic, part-whole, and synonyms
■ ConceptNet V5.1: Semantic graph for general human knowledge [17]
□ Nodes: Any natural language phrase that expresses a concept
□ Relations: Taxonomic, part-whole, related-to, and several others
■ SemNet: Semantic Network of Related Terms
□ Nodes: Noun terminology
□ Relations: Probabilistic links
[Figure: Word sense "pregnancy" in WordNet (7 out of 32 relations). Terms shown: maternity, parturiency, morning sickness, physical condition, ectopic pregnancy, entopic pregnancy. Relation types shown: synonym, part meronym, hyponym, hypernym]
[Figure: Concept "pregnancy" in ConceptNet (7 out of 58 relations). Concepts shown: expect, morning sickness, physical condition, go to bed, ectopic pregnancy, stretch, start family. Relation types shown: ConceptuallyRelatedTo, PartOf, IsA, RelatedTo, Causes, HasSubevent]
[Figure: Term "pregnancy" in SemNet (first 10 out of 4039 relations). Terms shown: mother, termination, birth, woman, trimester, stage, week, childbirth, lactation, month. Relative frequencies for ranks 1 to 10: 0.036, 0.031, 0.030, 0.030, 0.026, 0.025, 0.020, 0.018, 0.017, 0.016 (the assignment of terms to ranks is not recoverable from the transcript)]
[Figure: overlap diagram of the sets S, W, and C (presumably SemNet, WordNet, and ConceptNet)]
Noun terminology coverage
■ WordNet
□ Iterate through all noun synsets (72,994 synsets evaluated)
□ Check whether the nouns are contained in SemNet (98,681 nouns evaluated)
□ Results: 77.16% of WordNet's synsets and 62.17% of WordNet's nouns are contained in SemNet
□ Example synsets: (doctor, doc, physician, MD, Dr., medico); (ear doctor, ear specialist, otologist); (sleep talking, somniloquy, somniloquism)
■ ConceptNet
□ Problem: Concepts can be expressed using any natural language phrase
□ First determine noun terminology
□ Check whether the nouns are contained in SemNet (49,301 concepts evaluated)
□ Result: 82.40% of ConceptNet's nouns are contained in SemNet
□ Example concepts: doctor, go to bed, pregnancy, beautiful
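The coverage measurement can be sketched in Python as follows. The slides do not say when a synset counts as "contained"; this sketch assumes that one matching noun is enough, which would be consistent with synset coverage being higher than noun coverage. The SemNet vocabulary used here is a hypothetical toy set.

```python
def coverage(synsets, semnet_terms):
    """Return (synset coverage, noun coverage) as fractions between 0 and 1.

    Assumes a non-empty list of non-empty synsets.
    """
    nouns = {noun for synset in synsets for noun in synset}
    covered_synsets = sum(1 for s in synsets if any(n in semnet_terms for n in s))
    covered_nouns = sum(1 for n in nouns if n in semnet_terms)
    return covered_synsets / len(synsets), covered_nouns / len(nouns)


# Example synsets from the slide.
synsets = [
    ("doctor", "doc", "physician", "MD", "Dr.", "medico"),
    ("ear doctor", "ear specialist", "otologist"),
    ("sleep talking", "somniloquy", "somniloquism"),
]
# Hypothetical SemNet vocabulary, for illustration only.
semnet_terms = {"doctor", "physician", "ear specialist"}
synset_cov, noun_cov = coverage(synsets, semnet_terms)
print(synset_cov, noun_cov)
```

With the figures reported in the talk, 56,321 of the 72,994 evaluated synsets are found in SemNet, which corresponds to the stated 77.16%.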
Relation coverage
■ WordNet / ConceptNet
□ Iterate through all previously found noun synsets (56,321 synsets used) and concepts (40,625 concepts used)
□ Check whether the relations between synsets are contained in SemNet (61,931 WordNet relations evaluated and 256,213 ConceptNet relations evaluated)
□ Example: the synset (doctor, doc, physician, MD, Dr., medico) has the hypernym (medical practitioner, medical man) and the hyponyms (surgeon) and (allergist)
■ Relation evaluation results: [table not preserved in this transcript]
Conclusions and Future Work
■ Summary
□ Input: 710 million 5-grams and 20 part-of-speech patterns
□ Hierarchical pattern matching, distributional semantics
□ Output: 2.7M terms (single- and multi-word) and 37.5M weighted relations
□ Only a window of 5 words can be analyzed to detect relations
□ Applications: Domain-specific modeling, keyword expansion, background knowledge for NLP tasks
■ Current and future work
□ Support additional languages
□ Improve ranking functions (pointwise mutual information)
□ Relax the 3-word limitation, derive our own n-gram datasets
□ Combine probabilistic information with specific relations
□ Domain clustering in the semantic network
□ Additional modeling support: relations/associations, attributes
[5] Agt, H.: Supporting Software Language Engineering by Automated Domain Knowledge Acquisition. In: MODELS 2011 Workshops. LNCS, vol. 7167. Springer (2012)
[8] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of NAACL 2003, pp. 173–180 (2003)
[9] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
[11] Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), 176–182 (2011)
[12] Hearst, M.A.: Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th Conference on Computational Linguistics, COLING 1992, vol. 2 (1992)
[13] Harris, Z.: Distributional Structure. Word 10(2–3), 146–162 (1954)
[15] Agt, H.: SemAcom: A System for Modeling with Semantic Autocompletion. In: Model Driven Engineering Languages and Systems, 15th International Conference, MODELS 2012, Demo Track, Innsbruck, Austria (2012)
[16] Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
[17] Speer, R., Havasi, C.: Representing General Relational Knowledge in ConceptNet 5. In: LREC 2012 (2012)
[19] Agt, H., Kutsche, R.D., Wegeler, T.: Guidance for Domain Specific Modeling in Small and Medium Enterprises. In: SPLASH 2011 Workshops, DSM 2011, Portland, OR, USA (2011)
[22] Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
Thank You For Your Attention!
MODELS?
Try out SemNet: http://www.bizware.tu-berlin.de/semnet/
Contact: henning.agt@tu-berlin.de