henning agt talk-caise-semnet



Page 1: Henning agt   talk-caise-semnet

28.06.2013 DIMA – TU Berlin 1

Database Systems and Information Management Group (Fachgebiet Datenbanksysteme und Informationsmanagement)

Technische Universität Berlin

http://www.dima.tu-berlin.de/

Automated Construction of a Large Semantic Network of Related Terms for Domain-Specific Modeling

CAiSE 2013, June 21st, Valencia

Henning Agt and Ralf-Detlef Kutsche

Technische Universität Berlin

Page 2: Henning agt   talk-caise-semnet

Motivation

■ Autocompletion applications

■ Predict what the user wants to model next

[Figure: suggestion list with the entries nurse, treatment, medicine, emergency, ...]

Page 3: Henning agt   talk-caise-semnet

Research Goals

■ Our Vision: Provide automated suggestions of semantically related model elements for domain modeling [5], [19]
□ Focus on domain terminology and conceptual design
□ Query domain and common sense ontologies
□ Information extraction from text

■ Requirements for the intended application
□ Dictionary of terms
□ Relations between terms
□ Query interface and ranking functions

[Figure: architecture diagram with the components Text Analysis, Terminology, Ontologies, Knowledge Service, Modeling Tools and Suggestions (nurse, treatment, medicine, emergency, ...); arrows labeled Extract, Retrieve/Integrate, Generate, Provide, Query and Use]

Page 4: Henning agt   talk-caise-semnet

Agenda

■ Input dataset

■ Text analysis process

■ Application of SemNet

■ Evaluation of SemNet

■ Conclusions and Future Work

[Figure: processing pipeline with the stages Text Corpus, N-Gram Statistics, N-Gram DB, POS DB, Norm. N-Gram DB, SemNet and Applications; steps labeled Analyse, Parse, Tag, Normalize, Analyse Co-occurrences, Query and Retrieve]

Page 6: Henning agt   talk-caise-semnet

Google Books N-Gram Dataset

■ Large amounts of text data

■ N-Grams
□ Sequence of n consecutive words/tokens and its frequency
□ Google provides 1-, 2-, 3-, 4- and 5-grams in several languages

■ We work on the English-All dataset V2 (1-grams and 5-grams) [11]

[Figure: corpus of 5 million books (500 billion words), n-gram analysis, n-gram dataset (CSV text files with word frequencies)]

Example 5-grams with frequencies:

...
to go to the hospital                 46,410
general condition of the patient      28,198
I was in the hospital                 19,268
discharge from the hospital .         12,476
admission to the hospital .           10,558
the patient to the hospital            6,422
by placing the patient in              6,026
between doctor and patient .           5,908
...
able to leave the hospital             4,629
patient admitted to the hospital       4,303
a patient in the hospital              3,844
the symptom of the patient             2,559
the patient under local anesthesia     2,536
a patient is suffering from            2,475
the doctor and the hospital            1,362
the hospital and the doctor            1,017
...
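The step from raw n-gram files to "n-gram plus total frequency" rows like the ones above can be sketched as follows. This is only an illustration: it assumes the tab-separated record layout of the Version 2 files (ngram, year, match_count, volume_count), the function name `total_frequencies` is hypothetical, and the sample counts are invented rather than taken from the dataset.

```python
from collections import Counter
from typing import Iterable


def total_frequencies(lines: Iterable[str]) -> Counter:
    """Sum the per-year match counts of each n-gram.

    Assumed record layout: ngram<TAB>year<TAB>match_count<TAB>volume_count
    """
    totals: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed records
        ngram, _year, match_count, _volume_count = fields
        totals[ngram] += int(match_count)
    return totals


# Invented sample records (not real dataset values).
sample = [
    "to go to the hospital\t1990\t120\t80",
    "to go to the hospital\t1991\t130\t85",
    "between doctor and patient .\t1990\t15\t12",
]
totals = total_frequencies(sample)
```

Collapsing the year dimension first keeps one row per n-gram, which is the form the later pattern matching works on.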

Page 8: Henning agt   talk-caise-semnet

Preprocessing

■ N-gram database: make the data manageable
□ Input: 2.5 terabytes of text
□ Output: Tables with 10 million 1-grams and 710 million 5-grams (21 gigabytes)

■ Part-of-speech tagging [8], [9]: identify the lexical category of each text token
□ Output: Table with POS tags for each 5-gram (14 gigabytes)

■ Normalization: reduce the amount of word variations
□ Plural stemming, lowercasing of adjectives and normal nouns
□ Proper nouns are not touched

■ Result: 710 million normalized and tagged 5-grams

POS tagging examples:

JJ      NN        IN DT  NN
general condition of the patient

NN   NN    NN         CC NN
drug store pharmacist or doctor

(JJ: adjective, NN: normal noun, IN: preposition, DT: determiner, CC: coordinating conjunction)

Normalization examples:

doctors → doctor
Medical practitioner → medical practitioner
hospitals in Valencia → hospital in Valencia
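The normalization rules above (plural stemming, lowercasing of adjectives and normal nouns, proper nouns untouched) can be illustrated with a small sketch. It assumes Penn Treebank tags as input; `strip_plural` is a deliberately naive stand-in and not the stemmer used for SemNet, and both function names are hypothetical.

```python
from typing import List, Tuple

# Penn Treebank tags: JJ adjective, NN/NNS normal noun, NNP/NNPS proper noun.
LOWERCASE_TAGS = {"JJ", "NN", "NNS"}


def strip_plural(word: str) -> str:
    """Very naive plural stemmer (illustrative assumption only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Lowercase adjectives and normal nouns, stem plural nouns, keep the rest."""
    result = []
    for token, tag in tagged:
        if tag in LOWERCASE_TAGS:
            token = token.lower()
        if tag == "NNS":
            token, tag = strip_plural(token), "NN"
        result.append((token, tag))  # proper nouns (NNP) pass through unchanged
    return result
```

Applied to the three slide examples, this sketch maps "doctors" to "doctor", "Medical practitioner" to "medical practitioner", and leaves "Valencia" capitalized.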

Page 10: Henning agt   talk-caise-semnet

Lexical Patterns

■ Goal: Detect domain terminology using syntactical patterns [12]

■ Analysis of existing dictionaries
□ 75% of terms: noun, noun-noun, adjective-noun combinations

■ Excerpt of the 20 patterns used:
[Pattern table not preserved in the transcript]

■ No proper nouns: "Stanford University" (excluded) vs. "university professor" (kept)
□ Our focus is conceptual design on schema level

■ Limitation: a 5-gram contains only 5 words
□ Maximum length of a term: 3 words

Example: "doctor or mental health professional" = term ("doctor") + separation ("or") + term ("mental health professional")
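A pattern of the form "term, separation, term" can be matched on the POS tag sequence of a 5-gram. The sketch below is an assumption-laden illustration: the 20 patterns of the talk are not preserved here, so the single pattern (term, coordinating conjunction, term), the one-letter tag encoding and the name `match_term_pair` are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One-letter codes for the tags used here; every other tag becomes "x".
# Proper nouns (NNP) are therefore never part of a term.
TAG_CODES = {"JJ": "J", "NN": "N", "CC": "C"}

# A term: at most 3 words made of adjectives/nouns, ending in a noun.
TERM = r"[JN]{0,2}N"
# Hypothetical pattern: term, coordinating conjunction, term.
PATTERN = re.compile(rf"({TERM})C({TERM})")


def match_term_pair(tagged: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the two terms of the first pattern match, or None."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    match = PATTERN.search(codes)
    if match is None:
        return None
    words = [word for word, _ in tagged]
    # One code character per token, so match offsets are token indices.
    left = " ".join(words[match.start(1):match.end(1)])
    right = " ".join(words[match.start(2):match.end(2)])
    return left, right
```

Encoding each tag as a single character keeps regex offsets aligned with token positions, which avoids a hand-written matcher for this small case.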

Page 11: Henning agt   talk-caise-semnet

Co-Occurring Terms

■ Hierarchical pattern matching

■ Distributional Semantics [13], [22]
□ "Words that occur in the same contexts tend to have similar meanings." (Distributional Hypothesis by Z. Harris)

Example (context and absolute frequency):

your doctor or pharmacist .      9271

"doctor" and "pharmacist" co-occurred 9271 times

[Figure annotations: Easiest case; Highest level remains; No idiomatic phrases; No consecutive patterns]

Page 12: Henning agt   talk-caise-semnet

SemNet Construction

■ Discard 5-grams that contain 4 or 5 stopwords

■ Apply pattern matching on the remaining 5-grams
□ Result: Large table of binary relations

■ Frequency aggregation
□ Many terms co-occurred in different contexts

■ Relative frequency computation
□ For each term with respect to its related terms

■ Graph construction
□ Directed, weighted edges
□ Relational database and graph database serialization (SQLite / Neo4J)

[Figure: example 5-grams "to go to the doctor", "I am what I am", "a ) ( 2 )"]
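The construction steps above can be sketched in a few functions. Several points are assumptions rather than details from the talk: the tiny stopword list, counting each co-occurrence in both directions, and defining the relative frequency of an edge as its aggregated count divided by the sum of all counts of the source term. The frequencies in the example are taken from the 5-gram excerpts in this deck; whether the original patterns match those 5-grams is not known.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"a", "am", "and", "go", "i", "or", "the", "to", "what", "your"}


def has_too_many_stopwords(tokens: List[str], limit: int = 4) -> bool:
    """True if the n-gram contains `limit` or more stopwords."""
    return sum(token.lower() in STOPWORDS for token in tokens) >= limit


def aggregate(relations: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Sum pair frequencies over all contexts, stored in both directions."""
    freq: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for left, right, count in relations:
        freq[left][right] += count
        freq[right][left] += count
    return freq


def relative_frequencies(freq: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Directed edge weights: count(term, other) / sum of counts of term."""
    graph = {}
    for term, related in freq.items():
        total = sum(related.values())
        graph[term] = {other: count / total for other, count in related.items()}
    return graph


relations = [
    ("doctor", "pharmacist", 9271),  # your doctor or pharmacist .
    ("doctor", "patient", 5908),     # between doctor and patient .
    ("doctor", "hospital", 1362),    # the doctor and the hospital
    ("hospital", "doctor", 1017),    # the hospital and the doctor
]
freq = aggregate(relations)
graph = relative_frequencies(freq)
```

Normalizing per source term is what makes the edges directed: in this toy input "hospital" points to "doctor" with weight 1.0, while "doctor" spreads its weight over three related terms.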

Page 13: Henning agt   talk-caise-semnet

Statistics

■ Properties of SemNet
□ 268,937 distinct single-word terms
□ 2,115,494 distinct double-word terms
□ 355,689 distinct triple-word terms
□ 2.7 million terms and 37.5 million relations
□ 2.2 GB disk space

■ Lessons learned from the analysis process

[Figure: N-Gram Information Content. Categories: 4 or 5 stopwords; only 1 term; no pattern match; n-grams with a semantic relationship. Values shown: 41.6%, 15.7%, 32.6%, 10.1% (assignment of values to categories not recoverable from the transcript)]

[Figure: Semantic relatedness: Zipf's law. Axes: rank, degree of relatedness]

Page 15: Henning agt   talk-caise-semnet

Querying SemNet

■ Query Interfaces
□ SQL: Query the relational database
□ Cypher: Query the Neo4J database
□ Java: Use SemNet in your applications
□ PHP: Explore the data in a web interface

■ Examples of top 10 automatically identified related terms
(f – absolute term frequency in the original text corpus, #r – number of related terms)
[Example table not preserved in the transcript]

SQL example:

select * from nouncooccurrences
where termw1 = 5824331 and termw2 is null and termw3 is null
order by relfreq desc limit 20;

Java example:

public ArrayList<String> getRelatedStringTerms(ArrayList<String> inputTerms) { … }
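The SQL query above can be tried against a small in-memory SQLite table. This sketch is not the SemNet schema: only the table name and the columns `termw1`, `termw2`, `termw3` and `relfreq` appear on the slide, while the `related` column, the helper `top_related` and all row values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nouncooccurrences (
        termw1  INTEGER,  -- word id, presumably the first word of the term
        termw2  INTEGER,  -- NULL for single-word terms
        termw3  INTEGER,  -- NULL for single- and double-word terms
        related TEXT,     -- hypothetical column, not shown on the slide
        relfreq REAL
    )
    """
)
# Invented rows for illustration only.
rows = [
    (5824331, None, None, "alpha", 0.05),
    (5824331, None, None, "beta", 0.20),
    (5824331, 17, None, "gamma", 0.90),  # two-word term, must be filtered out
    (42, None, None, "delta", 0.70),     # different term
]
conn.executemany("INSERT INTO nouncooccurrences VALUES (?, ?, ?, ?, ?)", rows)


def top_related(connection: sqlite3.Connection, word_id: int, limit: int = 20):
    """Related terms of a single-word term, highest relative frequency first."""
    cursor = connection.execute(
        "SELECT related, relfreq FROM nouncooccurrences "
        "WHERE termw1 = ? AND termw2 IS NULL AND termw3 IS NULL "
        "ORDER BY relfreq DESC LIMIT ?",
        (word_id, limit),
    )
    return cursor.fetchall()


result = top_related(conn, 5824331)
```

The `IS NULL` conditions restrict the lookup to the single-word term, mirroring the slide's query; parameter binding replaces the literal id.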

Page 16: Henning agt   talk-caise-semnet

Ranking Results of Multiple Input Terms

■ Challenge: Methods based on matrices and vectors are too slow

■ Strategy: Related term sets intersection + relative frequency multiplication

Related terms and relative frequencies for the input terms "table" and "database", and the combined ranking for "table+database" (intersection, frequencies multiplied):

table               database                    table+database
chair     0.0441    data             0.0735     data         0.001155
contents  0.0359    information      0.0569     contents     0.000359
end       0.0221    record           0.0376     information  0.000190
front     0.0194    table            0.0334     record       0.000091
figure    0.0189    access           0.0310     use          0.000077
head      0.0189    spreadsheet      0.0252     end          0.000060
side      0.0180    name             0.0201     example      0.000055
data      0.0157    object           0.0164     name         0.000050
hand      0.0132    retrieval system 0.0163     figure       0.000047
column    0.0131    file             0.0158     value        0.000045
page      0.0118    example          0.0153     result       0.000037
edge      0.0112    use              0.0150     list         0.000037
result    0.0100    connection       0.0146     column       0.000034
value     0.0099    structure        0.0139     row          0.000033
place     0.0087    field            0.0125     object       0.000024
row       0.0086    user             0.0124     field        0.000023
show      0.0082    change           0.0112     book         0.000016
elbow     0.0072    type             0.0107     order        0.000016
list      0.0071    size             0.0104     size         0.000014
bed       0.0071    transaction      0.0102     query        0.000012
…                   …

[The figure also carries a separate label "transaction"; no list is recoverable for it.]
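The strategy named on this slide, intersecting the related-term sets and multiplying the relative frequencies, can be sketched as below. The function name `rank_for_inputs` and the dictionary representation are assumptions; the sample values are excerpts of the "table" and "database" lists in this deck.

```python
from typing import Dict, List, Tuple


def rank_for_inputs(related: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank terms related to ALL input terms by the product of their weights.

    `related` holds one {related term: relative frequency} mapping per input term.
    """
    if not related:
        return []
    common = set(related[0])
    for mapping in related[1:]:
        common &= set(mapping)  # keep only terms related to every input term
    scores = {}
    for term in common:
        score = 1.0
        for mapping in related:
            score *= mapping[term]
        scores[term] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Excerpts of the two lists shown in this deck.
table = {"chair": 0.0441, "data": 0.0157, "column": 0.0131}
database = {"data": 0.0735, "record": 0.0376, "table": 0.0334}
ranking = rank_for_inputs([table, database])
```

With these excerpts only "data" survives the intersection, and its product 0.0157 × 0.0735 ≈ 0.001154 is close to the 0.001155 listed for "table+database"; the small gap is presumably due to rounding of the displayed inputs. Set intersection over per-term dictionaries avoids building the matrices and vectors that the slide describes as too slow.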

Page 17: Henning agt   talk-caise-semnet

Modeling With Semantic Autocompletion

■ Prototype: Ecore Diagram Editor with class name suggestions [15]

■ Automated adaptation of suggestions to the content of the model

Page 19: Henning agt   talk-caise-semnet

Evaluation Setup

■ Challenge
□ No gold standard available for many information extraction tasks

■ Our strategy: Compare SemNet to existing knowledge bases
□ Provide measurements on how much information of WordNet and ConceptNet is contained in SemNet

■ WordNet V3.0: Lexical database for the English language [16]
□ Synsets: Grouped terms that share the same sense
□ Relations: Mainly taxonomic, part-whole and synonyms

■ ConceptNet V5.1: Semantic graph for general human knowledge [17]
□ Nodes: Any natural language phrase that expresses a concept
□ Relations: Taxonomic, part-whole, related-to and several others

■ SemNet: Semantic Network of Related Terms
□ Nodes: Noun terminology
□ Relations: Probabilistic links

[Figure: Word sense "pregnancy" in WordNet (7 out of 32 relations). Nodes: maternity, morning sickness, physical condition, ectopic pregnancy, entopic pregnancy, parturiency. Relation labels: synonym, part meronym, hyponym, hypernym]

[Figure: Concept "pregnancy" in ConceptNet (7 out of 58 relations). Nodes: expect, morning sickness, physical condition, go to bed, ectopic pregnancy, stretch, start family. Relation labels: ConceptuallyRelatedTo, PartOf, IsA, RelatedTo, Causes, HasSubevent]

[Figure: Term "pregnancy" in SemNet (first 10 out of 4039 relations). Related terms: mother, termination, birth, woman, trimester, stage, week, childbirth, lactation, month. Edge weights in descending order: 0.036, 0.031, 0.030, 0.030, 0.026, 0.025, 0.020, 0.018, 0.017, 0.016]

Page 20: Henning agt   talk-caise-semnet

Noun terminology coverage

■ WordNet
□ Iterate through all noun synsets (72,994 synsets evaluated)
□ Check whether the nouns are contained in SemNet (98,681 nouns evaluated)
□ Example synsets: (doctor, doc, physician, MD, Dr., medico), (ear doctor, ear specialist, otologist), (sleep talking, somniloquy, somniloquism)

Results: 77.16% of WordNet's synsets are contained in SemNet and 62.17% of WordNet's nouns are contained in SemNet

■ ConceptNet
□ Problem: Concepts can be expressed using any natural language phrase
□ First determine noun terminology
□ Check whether the nouns are contained in SemNet (49,301 concepts evaluated)
□ Example concepts: doctor, go to bed, pregnancy, beautiful

Result: 82.40% of ConceptNet's nouns are contained in SemNet
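The two WordNet coverage figures can be read as a synset-level and a noun-level ratio. The sketch below assumes that a synset counts as contained when at least one of its nouns is a SemNet term and that matching is case-insensitive; neither detail is stated in the talk, the function name `coverage` is hypothetical, and the SemNet vocabulary in the example is invented.

```python
from typing import Iterable, Sequence, Set, Tuple


def coverage(synsets: Iterable[Sequence[str]], semnet_terms: Set[str]) -> Tuple[float, float]:
    """Return (share of covered synsets, share of covered nouns)."""
    synset_total = synset_hits = noun_total = noun_hits = 0
    for synset in synsets:
        found = [noun for noun in synset if noun.lower() in semnet_terms]
        synset_total += 1
        noun_total += len(synset)
        noun_hits += len(found)
        if found:  # assumption: one matching noun covers the synset
            synset_hits += 1
    if synset_total == 0 or noun_total == 0:
        return 0.0, 0.0
    return synset_hits / synset_total, noun_hits / noun_total


# Example synsets from the slide; the SemNet vocabulary below is invented.
synsets = [
    ("doctor", "doc", "physician", "MD", "Dr.", "medico"),
    ("ear doctor", "ear specialist", "otologist"),
    ("sleep talking", "somniloquy", "somniloquism"),
]
semnet_terms = {"doctor", "physician", "ear doctor"}
synset_share, noun_share = coverage(synsets, semnet_terms)
```

Under this reading the synset share is naturally higher than the noun share, which matches the direction of the reported 77.16% versus 62.17%.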

Page 21: Henning agt   talk-caise-semnet

Relation coverage

■ WordNet / ConceptNet
□ Iterate through all previously found noun synsets (56,321 synsets used) and concepts (40,625 concepts used)
□ Check whether the relations between synsets are contained in SemNet (61,931 WordNet relations evaluated and 256,213 ConceptNet relations evaluated)

■ Relation evaluation results
[Results not preserved in the transcript]

[Figure: example WordNet relations. The synset (doctor, doc, physician, MD, Dr., medico) has the hypernym (medical practitioner, medical man) and the hyponyms (surgeon) and (allergist)]
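Checking whether a knowledge-base relation is contained in SemNet can be sketched as an edge lookup between the members of two synsets. The matching rule used here (any member pair, either direction) is an assumption, the function names are hypothetical, and the SemNet edges in the example are invented; the synsets are the ones from the slide.

```python
from typing import Iterable, Sequence, Set, Tuple

Relation = Tuple[Sequence[str], str, Sequence[str]]  # (source synset, label, target synset)


def relation_covered(source: Sequence[str], target: Sequence[str],
                     edges: Set[Tuple[str, str]]) -> bool:
    """True if any member pair of the two synsets is linked, in either direction."""
    return any((a, b) in edges or (b, a) in edges for a in source for b in target)


def relation_coverage(relations: Iterable[Relation], edges: Set[Tuple[str, str]]) -> float:
    """Share of relations whose synsets are linked by at least one edge."""
    relations = list(relations)
    if not relations:
        return 0.0
    hits = sum(relation_covered(src, dst, edges) for src, _label, dst in relations)
    return hits / len(relations)


doctor = ("doctor", "doc", "physician", "MD", "Dr.", "medico")
relations = [
    (doctor, "hypernym", ("medical practitioner", "medical man")),
    (doctor, "hyponym", ("surgeon",)),
    (doctor, "hyponym", ("allergist",)),
]
# Invented SemNet edges for illustration.
edges = {("doctor", "surgeon"), ("medical practitioner", "physician")}
share = relation_coverage(relations, edges)
```

Because SemNet links are unlabeled, a check like this can only confirm that two terms are related, not that the relation type (hypernym, hyponym, ...) agrees.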

Page 23: Henning agt   talk-caise-semnet

Conclusions and Future Work

■ Summary
□ Input: 710 million 5-grams and 20 part-of-speech patterns
□ Hierarchical pattern matching, distributional semantics
□ Output: 2.7M multi-word terms and 37.5M weighted relations
□ Only a window of 5 words can be analyzed to detect relations
□ Applications: Domain-specific modeling, keyword expansion, background knowledge for NLP tasks

■ Current and future work
□ Support additional languages
□ Improve ranking functions (pointwise mutual information)
□ Relax the 3-word limitation, derive own n-gram datasets
□ Combine probabilistic information with specific relations
□ Domain clustering in the semantic network
□ Additional modeling support: relations/associations, attributes
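Pointwise mutual information, mentioned above as a possible ranking improvement, compares how often two terms co-occur with how often they would co-occur if they were independent. The sketch uses the standard definition with a base-2 logarithm; how it would be integrated into SemNet is not described in the talk, and the counts in the usage note are invented.

```python
import math


def pmi(pair_count: int, count_x: int, count_y: int, total: int) -> float:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.

    With p(x, y) = pair_count / total, p(x) = count_x / total and
    p(y) = count_y / total, the ratio simplifies to
    pair_count * total / (count_x * count_y).
    """
    return math.log2(pair_count * total / (count_x * count_y))
```

For example, `pmi(10, 100, 100, 1000)` is 0.0 (the pair occurs exactly as often as independence would predict), while `pmi(50, 100, 100, 1000)` is positive. Unlike a plain relative frequency, PMI discounts related terms that are simply frequent everywhere.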

Page 24: Henning agt   talk-caise-semnet

[5] Agt, H.: Supporting Software Language Engineering by Automated Domain Knowledge Acquisition. In: MODELS 2011 Workshops, LNCS 7167, Springer (2012)

[8] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of the NAACL 2003, pp. 173–180.

[9] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)

[11] Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Team, T.G.B., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014), 176–182 (2011)

[12] Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, COLING 1992, vol. 2 (1992)

[13] Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954)

[15] Agt, H.: SemAcom: A System for Modeling with Semantic Autocompletion. In: Model Driven Engineering Languages and Systems - 15th International Conference, MODELS 2012, Demo Track, Innsbruck, Austria (2012)

[16] Fellbaum, C.: WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)

[17] Speer, R., Havasi, C.: Representing General Relational Knowledge in ConceptNet 5. In: LREC 2012

[19] Agt, H., Kutsche, R.D., Wegeler, T.: Guidance for Domain Specific Modeling in Small and Medium Enterprises. In: SPLASH 2011 Workshops. DSM 2011, Portland, OR, USA (2011)

[22] Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)

Thank You For Your Attention!

MODELS?

Try out SemNet: http://www.bizware.tu-berlin.de/semnet/

Contact: henning.agt@tu-berlin.de