learning dictionaries from unannotated data

Learning dictionaries from unannotated data. Hristo Tanev OPTIMA action GlobeSec unit, IPSC

Upload: htanev

Post on 05-Jul-2015


DESCRIPTION

In this presentation I show how Ontopopulis, my weakly supervised system for learning semantic classes, works.

TRANSCRIPT

Page 1: Learning Dictionaries From Unannotated Data

 Learning dictionaries from unannotated data.  

Hristo Tanev

OPTIMA action

GlobeSec unit, IPSC

Page 2: Learning Dictionaries From Unannotated Data

Outline of the talk

What are semantic dictionaries
Ontopopulis – a system for learning of semantic dictionaries
Ontopopulis in use
Conclusions

Page 3: Learning Dictionaries From Unannotated Data

NLP and dictionaries

Natural Language Processing (NLP) systems map a natural language text into a structured representation that is related to the human understanding of language

  President Obama is meeting tonight with Apple CEO Steve Jobs
  {Obama:PER; Apple:ORG; Steve Jobs:PER}

This process is often multi-level and complex, and it requires knowledge about language and the world:
  Dictionaries
  Grammars
  Ontologies
  ….

Page 4: Learning Dictionaries From Unannotated Data

Semantic dictionaries

Semantic dictionaries map words or phrases into domain-specific semantic classes
  boat : VEHICLE
  gun : WEAPON
  engineer : PERSON
  swine flu : DISEASE
  nice : POSITIVE_ADJECTIVE

Many NLP systems use semantic dictionaries
  Information extraction: [PEOPLE] in a [VEHICLE] (two people in a boat)
  Opinion mining: list of positive and negative words and phrases

Semantic dictionaries are one of the simplest ways to represent knowledge

Page 5: Learning Dictionaries From Unannotated Data

Semantic dictionaries

Semantic dictionaries are the most language- and domain-specific resources in NLP systems

They can be very large
Expensive to create in terms of time and resources
Require domain and linguistic expertise

Page 6: Learning Dictionaries From Unannotated Data

Ontopopulis – an automatic system for learning of semantic dictionaries

The system is based on a modification of a weakly supervised method described in [Tanev and Magnini, Weakly Supervised Approaches for Ontology Population, 2008, in Ontology Learning and Population: Bridging the Gap between Text and Knowledge]

The system is multilingual and knowledge-poor: it uses just an unannotated corpus and a list of stop words

In contrast with state-of-the-art systems for learning semantic classes, Ontopopulis does not use any language-specific processing

It is written in Java and takes about 10-20 minutes to learn a set of two or three classes

Page 7: Learning Dictionaries From Unannotated Data

System architecture

Extraction of contextual features

Seed: train, bus, truck, car

Text collection

Contextual features:
  driver of the X : 2.6
  X plowed : 2.2
  X was parked : 2.2
  stopped a X : 2.2
  collided with another X : 2.1
  …

New term extraction

New terms: vehicle, van, lorry, taxi, minibus

Stopwords

Page 8: Learning Dictionaries From Unannotated Data

Ontopopulis – basic steps

Ontopopulis takes as input a small set of seed keywords for each semantic class that we want to learn

The system learns contextual features (n-grams that occur immediately before or after the seed terms) and chooses the most reliable ones for each class

Optionally, the user can validate the contextual features

New terms are learnt for each semantic class with the validated features

The current version of the system is tuned not to learn named entities
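The feature-learning step above can be illustrated with a minimal Python sketch. It is not the Ontopopulis code (which is written in Java); the function name, the single-word seeds, the maximum n-gram length and the rule of discarding n-grams made only of stop words are all assumptions introduced for illustration.

```python
from collections import Counter


def extract_context_features(tokens, seeds, stopwords, max_n=3):
    """Count n-grams found immediately before or after a seed term.

    Hypothetical helper: seeds are assumed to be single tokens, and
    n-grams consisting only of stop words are skipped (an assumption
    about how the stop-word list might be used).
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in seeds:
            continue
        for n in range(1, max_n + 1):
            left = tokens[max(0, i - n):i]
            if len(left) == n and not all(w in stopwords for w in left):
                counts[(" ".join(left) + " X", token)] += 1
            right = tokens[i + 1:i + 1 + n]
            if len(right) == n and not all(w in stopwords for w in right):
                counts[("X " + " ".join(right), token)] += 1
    return counts


tokens = "the driver of the bus was parked".split()
features = extract_context_features(tokens, {"bus"}, {"the", "of", "was"})
```

Each key pairs a feature pattern, with X marking the slot, with the seed it was seen next to. These co-occurrence counts are the kind of input a feature-scoring step needs.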

Page 9: Learning Dictionaries From Unannotated Data

Ontopopulis – an example. Learning types of vehicles

Input: three seed sets of words that refer to vehicles: watercraft, aircraft and land vehicles

Watercraft: ferryboat, ship, boat, yacht
Aircraft: helicopter, airplane, Airbus
Land vehicles: train, bus, truck, car

Page 10: Learning Dictionaries From Unannotated Data

Contextual features

The system searches for the seed keywords in a corpus, finds contextual features and scores them

Watercraft top contextual features
  X capsized 1.9388203783671003
  seizure of the X 1.5338767554180062
  X and its crew 1.5219650106107294
  missing after a X 1.4236053259793582
  X was intercepted 1.3941924012474063
  X ran aground 1.3306324796381248
  born on a X 1.3147244396126636
  ….

Aircraft top contextual features
  crash of the X 2.02
  X that crashed 1.5359148669094496
  wreckage of the X 1.4008144141101402
  pieces of the X 1.1963353346631014
  aboard the X 1.0836274437276754
  X has crashed 0.9853940362678498
  X pilot 0.8890513946133263
  ….

Land vehicles top contextual features
  driver of the X 2.6832564679889153
  X plowed 2.287497321320604
  X was parked 2.2407143635036704
  stopped a X 2.200628452539843
  collided with another X 2.1468282693841436
  travel by X 2.0890015358446328
  X was travelling 2.008981636231196
  …..

Page 11: Learning Dictionaries From Unannotated Data

Scoring the contextual features

$$weight_1(f, class) = \sum_{s \in seeds(class)} \frac{freq(f,s)}{freq(f,s)+3} \cdot PMI(f,s)$$

seeds(watercraft) = {boat, ferryboat, ship, yacht}
PMI(f, s) – Pointwise Mutual Information of f and s

$$weight_N(f, class) = \frac{weight_1(f, class)}{\max_{f' \in features(class)} weight_1(f', class)}$$

$$weight(f, class) = weight_N(f, class) \cdot \frac{weight_N(f, class)}{\sum_{class_1 \in classes} weight_N(f, class_1)}$$
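The scoring steps on this slide can be sketched in Python as follows. This is a sketch under stated assumptions rather than the system's implementation: PMI is taken to be the usual log ratio of joint to independent probabilities estimated from counts, the cross-class step is assumed to multiply the normalised weight by its share across classes, and every function and parameter name is hypothetical.

```python
import math


def pmi(joint, f_count, s_count, total):
    """Pointwise mutual information from raw counts (standard definition)."""
    return math.log((joint * total) / (f_count * s_count))


def weight1(feature, seeds, pair_freq, feat_freq, term_freq, total):
    """Sum of PMI over the class seeds, discounted by freq / (freq + 3)."""
    score = 0.0
    for s in seeds:
        joint = pair_freq.get((feature, s), 0)
        if joint == 0:
            continue  # the feature never occurs next to this seed
        score += joint / (joint + 3) * pmi(joint, feat_freq[feature], term_freq[s], total)
    return score


def normalize(raw):
    """Divide every raw weight by the highest raw weight of the class."""
    top = max(raw.values())
    if top <= 0:
        return dict.fromkeys(raw, 0.0)
    return {f: w / top for f, w in raw.items()}


def final_weights(norm_by_class):
    """Assumed cross-class step: normalised weight times its share across classes."""
    result = {}
    for cls, feats in norm_by_class.items():
        result[cls] = {}
        for f, w in feats.items():
            denom = sum(other.get(f, 0.0) for other in norm_by_class.values())
            result[cls][f] = w * w / denom if denom > 0 else 0.0
    return result


raw = weight1("f", ["s"], {("f", "s"): 3}, {"f": 3}, {"s": 10}, 100)
normalized = normalize({"a": 2.0, "b": 1.0})
weights = final_weights({
    "land": {"driver of the X": 1.0, "aboard the X": 0.2},
    "air": {"aboard the X": 1.0},
})
```

In the toy input, "aboard the X" is shared by two classes, so its weight in each class is reduced in proportion to how much of its total weight lies elsewhere, while "driver of the X" keeps its full weight.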

Page 12: Learning Dictionaries From Unannotated Data

Extracting new terms

Text collection is scanned for contextual features

The n-grams which appear in the feature slots are considered term candidates

Weighting term candidates:

$$weight(t, class) = |features(class) \cap features(t)| \cdot \sum_{f \in features(class)} weight(f) \cdot \frac{freq(f,t)}{freq(f,t)+3} \cdot PMI(f,t)$$

Term candidates are sorted by decreasing weight
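A minimal Python sketch of candidate weighting and ranking follows. It assumes that a candidate's weight is a discounted-PMI sum over the class features it fills, scaled by the number of class features it shares. That scaling is one reading of the slide, and the function names and the precomputed `pair_pmi` table are hypothetical.

```python
def term_weight(term, feature_weights, pair_freq, pair_pmi):
    """Weight one candidate term against the features of one class."""
    # Class features in whose slot the candidate was actually observed.
    shared = [f for f in feature_weights if pair_freq.get((f, term), 0) > 0]
    total = 0.0
    for f in shared:
        freq = pair_freq[(f, term)]
        total += feature_weights[f] * freq / (freq + 3) * pair_pmi[(f, term)]
    return len(shared) * total


def rank_terms(candidates, feature_weights, pair_freq, pair_pmi):
    """Return the candidates sorted by decreasing weight."""
    return sorted(
        candidates,
        key=lambda t: term_weight(t, feature_weights, pair_freq, pair_pmi),
        reverse=True,
    )


feature_weights = {"driver of the X": 1.0, "X was parked": 0.5}
pair_freq = {
    ("driver of the X", "van"): 3,
    ("X was parked", "van"): 1,
    ("driver of the X", "idea"): 1,
}
pair_pmi = {
    ("driver of the X", "van"): 2.0,
    ("X was parked", "van"): 2.0,
    ("driver of the X", "idea"): 1.0,
}
ranking = rank_terms(["idea", "van"], feature_weights, pair_freq, pair_pmi)
```

With these toy counts, "van" fills two class features and outranks "idea", which fills only one, and rarely.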

Page 13: Learning Dictionaries From Unannotated Data

Extracting new terms

Top 20 terms for watercraft (75% accuracy)
  vessel 392.08530101453465
  ferry 130.92071859241187
  arctic sea 111.51926214919027
  boats 70.09673806960807
  fishing boat 51.91800928040082
  flight 51.54860533011118
  ships 45.064249579966756
  freighter 38.4793792989174
  vessels 37.94665196333265
  shuttle 37.84138500667754
  tanker 33.973493404331116
  cargo ship 30.92735210060045
  craft 30.576926785957
  cargo 24.773583958775333
  submarine 22.62225313680197
  trawler 21.744727204334037
  princess ashika 20.788755092164358
  liner 20.16456735187679
  fishing vessel 20.103099674619276
  cruise ship 19.564950730766093

Page 14: Learning Dictionaries From Unannotated Data

Extracting new terms

Top 20 terms for aircraft (70% accuracy)
  plane 386.08632995744114
  aircraft 214.51664885690826
  jet 116.9117713587897
  airbus a330 110.65796968774065
  air france 107.73115170156977
  airliner 65.72192132602771
  chopper 65.07326856476149
  flight 63.81155233947375
  yemenia 58.864411455717715
  a330 51.35667865678823
  shuttle 34.74771258444413
  jetliner 33.00461319890622
  airbus a310 30.019477774417997
  a310 8.78145228767769
  planes 26.203456787916377
  passenger plane 25.637328558549058
  passenger jet 24.670909891236946
  france plane 24.02055464618129
  caspian airlines 22.123307028006952
  france jet 20.787176394109114

Page 15: Learning Dictionaries From Unannotated Data

Extracting new terms

Top 20 terms for land vehicles (80% accuracy)
  vehicle 379.6998519301858
  van 172.63373740700783
  lorry 153.9337673760267
  taxi 116.566997111338
  minibus 99.674172750452
  motorcycle 83.45691130257896
  trailer 75.79549750687403
  minivan 72.83622251294283
  tractor 63.04147460775497
  pickup truck 56.137849074735094
  jeep 47.47715623723825
  pickup 44.18340053505094
  suv 43.51148949931748
  cars 36.60232043125164
  tanker 35.883024767183514
  motorbike 35.73857198901322
  driver 34.64342424860143
  bakkie 31.44693588923759
  passenger 29.58595652982613
  passenger bus 27.710992943240772

Page 16: Learning Dictionaries From Unannotated Data

Ontopopulis vs. Google Sets

Given the same seed set for land vehicles, Google Sets extracted 20 new terms and reached 30% accuracy (vs. 80% for Ontopopulis):

  boat airplane taxi helicopter plane airport air bicycle buggy aircraft coach suv ferry motorcycle robot transport tips time travel tank planes rail

Page 17: Learning Dictionaries From Unannotated Data

Ontopopulis in multilingual environment

Italian: learning a list of dangerous and potentially dangerous substances

Input: sostanze pericolose, rifiuti pericolosi, uranio, scorie nucleari

Output (top 20, 70% accuracy)
  rifiuti speciali 41.64611927865504
  materiale 20.614977545240276
  amianto 17.715371213671204
  rifiuti tossici 13.7472503464293
  spazzatura 13.113535554767779
  esplosivo 11.406271876468216
  cocaina 11.238345999686327
  gpl 10.000438516204888
  immondizia 9.8760929040176
  sigarette 9.4065323172882
  carburante 9.070216390158857
  rifiuti provenienti 8.852697311731404
  rifiuti radioattivi 8.616686260486738
  prodotti 8.511069145547099
  sostanze chimiche 8.340998333033808
  materiali 7.940812933415176
  scorie radioattive 7.934097715435894
  alimenti 7.796882609916327
  rifiuti solidi 7.541486005721908
  prodotti caseari 7.467839989843934

Page 18: Learning Dictionaries From Unannotated Data

Using Ontopopulis for event extraction

We use Ontopopulis to learn terms, which we then add to the domain-specific dictionaries of our event extraction system NEXUS

Some rules that refer to semantic classes:
  Rules for parsing noun phrases that refer to persons, such as "two engineers"
  Rules that detect the weapons used: ucciso con (una | un) [WEAPON] (ucciso con una pistola, "killed with a pistol")
  Detection of the vehicles used: [PEOPLE] in (un | una) [VEHICLE] (due persone in una imbarcazione, "two people in a boat")
  Drug trafficking: traffico di [DRUGS] (traffico di ketamina, "ketamine trafficking")
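To show how a learnt dictionary could plug into rules of this shape, here is a hedged Python sketch that compiles a rule string with a class slot into a regular expression. It does not reproduce the NEXUS rule formalism. The `compile_rule` helper, the slot syntax handling and the toy WEAPON dictionary are assumptions, and each class is assumed to appear at most once per rule.

```python
import re


def compile_rule(rule, dictionaries):
    """Compile a rule such as 'ucciso con (una | un) [WEAPON]' into a regex.

    Hypothetical helper: every [CLASS] slot becomes a named group that
    matches any term of that class in `dictionaries`.
    """
    # Turn "(una | un)" into "(una|un)" so the alternation survives the next step.
    rule = re.sub(r"\s*\|\s*", "|", rule.strip())
    # Any whitespace in the rule matches any run of whitespace in the text.
    rule = re.sub(r"\s+", lambda m: r"\s+", rule)

    def expand(match):
        class_name = match.group(1)
        # Longest terms first, so multiword terms win over their prefixes.
        terms = sorted(dictionaries[class_name], key=len, reverse=True)
        alternatives = "|".join(re.escape(t) for t in terms)
        return "(?P<%s>%s)" % (class_name, alternatives)

    pattern = re.sub(r"\[([A-Z_]+)\]", expand, rule)
    return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)


weapon_rule = compile_rule(
    "ucciso con (una | un) [WEAPON]",
    {"WEAPON": {"pistola", "coltello"}},
)
hit = weapon_rule.search("un uomo ucciso con una pistola")
miss = weapon_rule.search("un uomo ucciso con una penna")
```

Because the dictionary is only consulted when the rule is compiled, adding newly learnt terms to a class extends the coverage of every rule that mentions it without editing the rules themselves.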

Page 19: Learning Dictionaries From Unannotated Data

Using Ontopopulis for event classification

NEXUS uses combinations of classes of words to recognize event types. For example:
  Words of class Crime near words like "arrest" trigger the Arrest event type
  Words of class Political person near words like "kill" trigger the Assassination event type

We learned different crisis-related semantic classes for English, French, Italian, Spanish, Portuguese and Arabic

Some of these classes were:
  Disasters
  Humanitarian crises
  Law-enforcement authorities
  Political person
  Infrastructure
  Crimes
  Vehicles
  Heavy weapons
  Drugs
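The idea of triggering an event type when a class word occurs near a trigger word can be sketched in Python as below. This is an illustration only: the window size, the rule triples and the toy dictionaries are assumptions, and the slides do not say how NEXUS defines "near".

```python
def classify_events(tokens, dictionaries, rules, window=5):
    """Return event types triggered by class words near trigger words.

    Hypothetical sketch: `rules` holds (class name, trigger words, event type)
    triples, and "near" is assumed to mean within `window` tokens.
    """
    events = []
    for i, token in enumerate(tokens):
        context = set(tokens[max(0, i - window):i + window + 1])
        for class_name, triggers, event_type in rules:
            if token in dictionaries[class_name] and context & triggers:
                events.append(event_type)
    return events


dictionaries = {"Crime": {"robbery"}, "Political person": {"minister"}}
rules = [
    ("Crime", {"arrest", "arrested"}, "Arrest"),
    ("Political person", {"kill", "killed"}, "Assassination"),
]
found = classify_events(
    "police arrested two men after the robbery".split(), dictionaries, rules
)
nothing = classify_events(
    "the minister visited the school".split(), dictionaries, rules
)
```

The second sentence contains a Political person term but no trigger word, so no event is reported.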

Page 20: Learning Dictionaries From Unannotated Data

Learning event-related classes for Spanish and Portuguese. Evaluation

Accuracy (%) top 20 | Person | Weapon | Politician | Vehicle | Watercraft | Edged weapon | Crime | Building
Portuguese          | 90     | 60     | 75         | 85      | 70         | 20           | 85    | 75
Spanish             | 95     | 60     | -          | -       | -          | -            | -     | -

Page 21: Learning Dictionaries From Unannotated Data

Using Ontopopulis for summarization

TAC’10: aspect-driven summarization, i.e. a summary plus aspects such as Damage, Countermeasures, etc.

With Ontopopulis, we automatically created lists of damages, disaster and military countermeasures, crime charges and resources

Using the damages and countermeasures dictionaries improved the average aspect-based Pyramid score by 0.12; the crime charges and resources dictionaries decreased it by 0.09

Page 22: Learning Dictionaries From Unannotated Data

Using Ontopopulis for opinion mining

Dictionaries of positive and negative words and phrases play a central role in opinion mining systems

Such dictionaries are difficult to find, especially for languages other than English

With Ontopopulis, we learned subjective words for English and Spanish

After manual cleaning, these words were plugged into our opinion mining system

Page 23: Learning Dictionaries From Unannotated Data

Using Ontopopulis for opinion mining

Learning positive and negative words

Positive (seed set: nice, pleasant, convenient, beautiful)

Learnt positive words (top 33, accuracy 97%):

fun, wonderful, lovely, comfortable, safe, interesting, simple, easy, unique, enjoyable, reliable, friendly, exciting, affordable, accessible, *difficult, happy, decent, efficient, funny, healthy, warm, productive, clean, attractive, helpful, perfect, great, secure, intuitive, gentle, cool, sustainable

Page 24: Learning Dictionaries From Unannotated Data

Using Ontopopulis for opinion mining

Negative words (seed set: unpleasant, ugly, inconvenient)

Learnt negative words (top 33, accuracy 88%):

uncomfortable, *simple, sad, difficult, disturbing, painful, terrible, shocking, emotional, embarrassing, horrible, frightening, awful, fundamental, harsh, unfortunate, unpalatable, complicated, *historical, cruel, *universal, hard, *honest, scary, brutal, dangerous, obvious, ugly head, bizarre, awkward, eternal, bitter, *absolute

Page 25: Learning Dictionaries From Unannotated Data

A tasty conclusion

Input: risotto, crepes, ratatouille, roasted chicken

Output:
  soup 9.424864485808794
  pasta 8.503403365978578
  salad 4.138940899343471
  sauce 3.9255334464290845
  juice 3.5978760396055662
  seafood 3.5493529534271904
  syrup 3.3213803233051684
  barbecue 3.0409478219630994
  pizza 2.9854125681933934
  cooked 2.9262742039838177

Ontopopulis is nearly unsupervised: it requires just a small input seed set

Language- and domain-independent

Results vary between semantic classes; accuracy is typically > 70% in the top 20 acquired terms

Manual supervision is necessary; however, we found it easier to clean an already acquired dictionary than to create one manually

Efficient: takes about 10 minutes per class on a state-of-the-art PC

Multiplatform: written entirely in Java

Application potential

Page 26: Learning Dictionaries From Unannotated Data

Thank you!