![Page 1: Information Extraction Language Technology (A Machine Learning Approach) 24 March 2005 Antal van den Bosch and Walter Daelemans antalb/ltua](https://reader035.vdocument.in/reader035/viewer/2022070307/551a6afa5503463e778b5dd1/html5/thumbnails/1.jpg)
Information Extraction
Language Technology (A Machine Learning Approach)
24 March 2005
Antal van den Bosch and
Walter Daelemans
http://ilk.uvt.nl/~antalb/ltua
What is Information Extraction?
• Input: unstructured text
• Output: structured information, fills pre-existing template (find salient information)
• Most often stored in a database for further processing (e.g. data mining)
What isn’t information extraction?
• Information retrieval (we need to extract info, not only find relevant documents)
• Text understanding (only specific parts of the text are interesting)
  – large corpora can be used
  – possible to score objectively
Applications of IE
• Can make information retrieval more precise
• Summarization of documents in well-defined subject areas
• Automatic generation of databases from text
Overview
• Named entity recognition
  – Recognizing relevant entities in text
• Relation extraction
  – Linking recognized entities having particular relevant relations
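Named Entity Recognition
Named Entity Recognition (NER) is a combination of concept chunking and labeling those chunks: we wish to identify textual information units that represent people, places, organizations, companies, bands, etc.
Example:
De door het Amerikaanse National Hurricane Center als 'zeer gevaarlijk' omschreven orkaan Ivan nadert Cuba. Een overzicht over wat Ivan op de Kaaimaneilanden heeft aangericht, is er nog niet. Gouverneur Bruce Dinwiddy zei maandag dat duizenden mensen dakloos zijn geworden en dat ook belangrijke regeringsgebouwen zijn getroffen.
(“Hurricane Ivan, described by the American National Hurricane Center as 'very dangerous', is approaching Cuba. There is no overview yet of the damage Ivan has done on the Cayman Islands. Governor Bruce Dinwiddy said on Monday that thousands of people have become homeless and that important government buildings have also been hit.”)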
Why NER?
NER has many applications:
• prerequisite for information extraction
• improving information retrieval (indexing, querying), e.g. “belvedere”
Intuitively simple?
What’s the problem? NER seems intuitively simple for humans. How do we determine whether or not a (string of) word(s) represents a name?
• does the word start with a capital letter? (orthographic characteristics)
• have we seen it before? (lists of names)
• contextual clues
How do we teach this to a computer?
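Some problems…
• not every word that starts with a capital letter is a name
  ex: “Soms is dat niet mogelijk ...” (“Sometimes that is not possible ...”)
• context can be misleading
  ex: “Er was geen land met Henk te bezeilen.” (idiom: “Henk was impossible to deal with”; literally “there was no land to sail to with Henk”)
• no list can ever be complete
  ex: “Antbeard en zijn bemanning voeren ...” (“Antbeard and his crew sailed ...”)
  ex: “Wil je wat te drinken?” (“Do you want something to drink?”)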
Feature extraction
A lot of different features can be extracted for use in (inductively) learning to classify NEs. Every word can be represented with a lot of different features:
“… bedrijf dat Floralux inhuurde . In ‘81 …” (“… company that hired Floralux . In ‘81 …”; focus word: Floralux)
• starts with a capital letter? YES
• first word of the sentence? NO
• contains punctuation? NO
• string length? 8
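The four features listed for “Floralux” can be computed with a few lines of Python. This is only a minimal sketch: the function name `word_features`, the feature names, and the rule used to decide what counts as sentence-initial are assumptions made for illustration, not the feature extractor used in the lecture.

```python
import string

def word_features(tokens, index):
    """Return simple orthographic features for tokens[index]."""
    word = tokens[index]
    # Assumption: a token is sentence-initial if it is the very first token
    # or directly follows a sentence-final punctuation token.
    sentence_initial = index == 0 or tokens[index - 1] in {".", "!", "?"}
    return {
        "starts_with_capital": word[:1].isupper(),
        "first_in_sentence": sentence_initial,
        "contains_punctuation": any(ch in string.punctuation for ch in word),
        "length": len(word),
    }

# Tokenised fragment from the slide (ASCII apostrophe used for simplicity).
tokens = ["bedrijf", "dat", "Floralux", "inhuurde", ".", "In", "'81"]
features = word_features(tokens, 2)  # focus word: "Floralux"
print(features)
```

For “Floralux” this reproduces the values on the slide (capital: yes, sentence-initial: no, punctuation: no, length 8). Note that “In” would also start with a capital, which is exactly why the sentence-initial feature is useful.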
Feature extraction (2)
We represent the context by sliding a ‘window’ over the data, anchored on the focus word.
“… het bedrijf dat Floralux inhuurde . In ‘81 bestond…” (“… the company that hired Floralux . In ‘81 there existed …”)
[Figure: the sentence is shown repeatedly, with the window (left context, focus word, right context) shifted along it.]
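The windowing idea can be sketched as a small generator that turns a token sequence into one instance per focus word. The window size of two tokens on each side and the padding symbol `_` are assumptions chosen for illustration; the slide does not state the size actually used.

```python
def windows(tokens, size=2, pad="_"):
    """Yield (left context, focus word, right context) for every token."""
    # Pad both ends so that words near the edges still get a full window.
    padded = [pad] * size + list(tokens) + [pad] * size
    for i in range(size, len(padded) - size):
        yield padded[i - size:i], padded[i], padded[i + 1:i + size + 1]

tokens = ["het", "bedrijf", "dat", "Floralux", "inhuurde", ".", "In", "'81", "bestond"]
instances = list(windows(tokens, size=2))

left, focus, right = instances[3]  # the window anchored on "Floralux"
print(left, focus, right)
```

Each instance can then be combined with per-word features (capitalisation, length, and so on) for the focus word and its neighbours before it is handed to a classifier.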
To split or not to split?
Wolff , op het moment een journalist in Argentinië , speelde met Del Bosque bij Real Madrid in de jaren 70 .
(“Wolff , currently a journalist in Argentina , played with Del Bosque at Real Madrid in the 1970s .”)
• One step: determine boundaries + types at once
• Two steps: first determine boundaries, then determine types
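One common way to make the two options concrete is to encode entities as one tag per token. The sketch below uses B/I/O tags, which the slide does not mention; this encoding is swapped in here purely for illustration. The entity types assigned to the example (PER, LOC, ORG) are our reading of the sentence, not taken from the slide.

```python
def to_tags(n_tokens, spans, with_types=True):
    """Encode (start, end, type) spans as one tag per token (end is exclusive)."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        suffix = "-" + etype if with_types else ""
        tags[start] = "B" + suffix          # first token of an entity
        for i in range(start + 1, end):
            tags[i] = "I" + suffix          # remaining tokens of the entity
    return tags

sentence = ("Wolff , op het moment een journalist in Argentinië , speelde met "
            "Del Bosque bij Real Madrid in de jaren 70 .")
tokens = sentence.split()
# Assumed annotation: (start, end, type) over token positions.
spans = [(0, 1, "PER"), (8, 9, "LOC"), (12, 14, "PER"), (15, 17, "ORG")]

joint = to_tags(len(tokens), spans)                         # boundaries + types
boundaries = to_tags(len(tokens), spans, with_types=False)  # boundaries only
print(list(zip(tokens, joint))[12:17])
print(boundaries[12:17])
```

In the one-step setting a classifier predicts the `joint` tags directly. In the two-step setting a first classifier predicts `boundaries`, and a second one assigns a type to each chunk it found.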
State of the art
Language    F-score
English     ~93%
Dutch       ~77%
German      ~72%
Spanish     ~81%
Human performance: 96-98%
Many other languages have been targeted as well: Chinese, French, Japanese, Portuguese, Greek, Hindi, Romanian, Turkish, Norwegian, and so on…
Information extraction
• Named-entity recognition
• Relation extraction
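• Coreference resolution
  – PvdA-leider Wouter Bos gaat alleen voor minister-president. Vice-premier onder CDA-leider Balkenende is geen gedachte waar hij warm van wordt. "Hou het er maar op dat ik daar nee tegen zeg", aldus Bos woensdag voor RTL Nieuws.
  – (“PvdA leader Wouter Bos is only going for prime minister. Deputy prime minister under CDA leader Balkenende is not an idea he warms to. "Let's just say that I say no to that," Bos said on Wednesday on RTL Nieuws.”)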
Information extraction
• Named-entity recognition has received a lot of attention in IE
• Relation extraction is taking over as focal point of attention
Relation extraction
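Example:
Eric   Schmidt   is   directeur   van   Google   .
N      N         WW   N           VZ    N        .
|---- PER ----|                         |- ORG -|
(“Eric Schmidt is director of Google.” POS tags: N = noun, WW = verb, VZ = preposition. The relation between the PER and the ORG entity is signalled by “directeur”.)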
Why relation extraction?
• Named entities can be useful to enhance information retrieval
• Not enough to answer certain types of information-seeking questions
• For example
  – Wie is de directeur van Google? (“Who is the director of Google?”)
Why relation extraction?
• Naïve strategy
  – Find documents in which [PER <unknown> ] and [ORG Google ] are within each other’s vicinity
  – Can produce nice results, but does not always work
  – Also, the user still has to find the answer
  – It would be better if the system produced the answer 'Eric Schmidt'.
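The naïve strategy can be sketched as a proximity search over named entities that have already been recognised. The data layout (dictionaries with `text`, `type` and `position`), the function name `nearby_persons` and the distance threshold of 10 tokens are all assumptions made for this illustration.

```python
def nearby_persons(entities, org_name, max_distance=10):
    """Return PER entities that occur within max_distance tokens of the given ORG."""
    orgs = [e for e in entities if e["type"] == "ORG" and e["text"] == org_name]
    hits = []
    for org in orgs:
        for ent in entities:
            close = abs(ent["position"] - org["position"]) <= max_distance
            if ent["type"] == "PER" and close:
                hits.append(ent["text"])
    return hits

# Hypothetical NER output for "Eric Schmidt is directeur van Google ."
doc1 = [{"text": "Eric Schmidt", "type": "PER", "position": 0},
        {"text": "Google", "type": "ORG", "position": 5}]
# Hypothetical NER output for "Jan de Vries is fan van PSV ."
doc2 = [{"text": "Jan de Vries", "type": "PER", "position": 0},
        {"text": "PSV", "type": "ORG", "position": 6}]

print(nearby_persons(doc1, "Google"))  # a useful hit
print(nearby_persons(doc2, "PSV"))     # also a hit, although he is only a fan
```

The second call illustrates why proximity alone "does not always work": it finds a person near the organisation without knowing which relation, if any, holds between them.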
Examples
• Some application areas are
  – News domain
    • Relations among the most typical named entities: Person, Organisation, Location, Misc
    • E.g. located in, parent of, part of
  – Biomedical domain
    • Relations among biomedical entities, such as DNA, proteins, diseases, etc.
    • Protein-protein interaction
    • Gene-disease relation
• Every domain-specific application needs its own set of entities
Relation extraction
• Difficult
  – Automatic systems still perform poorly
  – But there are a few reasonable solutions
• Often only works in restricted domains
  – Techniques operating in the news domain are lousy in other domains, e.g. biomedical texts
Relations: implicit / explicit
• Explicit relations are spelled out
  – Joe Cummings, Chairman of Sybase, spoke for four hours.
• Implicit relations require understanding the text
  – Sybase was scheduled to testify, and Chairman Joe Cummings spoke for four hours.
• Most current research involves explicit relations
Difficulties
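• A relation can be phrased in many ways
  – Eric Schmidt is de directeur van Google. (“Eric Schmidt is the director of Google.”)
  – Eric Schmidt, de nieuwe directeur van Google, verklaart ... (“Eric Schmidt, the new director of Google, states ...”)
  – Eric Schmidt zet een volgende stap in zijn carrière. Sinds kort is hij de directeur van Google. (“Eric Schmidt takes a next step in his career. He recently became the director of Google.”)
  – ...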
Assumptions
• Delimit the task
  – Relations always connect two named entities
    • More complex relations between >2 entities are harder
  – Both entities are in the same sentence
    • Strong simplification
    • See next week (Veronique Hoste guest lecture)
Relation extraction
• Relation detection
  – Is there a relation between two entities?
• Relation classification
  – Which type does the relation between two entities have?
MUC
• Message Understanding Conference
• Has organized many information extraction competitions
• Relation extraction became a MUC competition task in 1998
ACE
• Automatic Content Extraction
• More recent than MUC
• The ACE data is the most popular data set for relation extraction research
ACE
• Types/subtypes of relations
  – ROLE
    • Relates a person to an organisation or geopolitical entity
    • member, owner, affiliate, client, citizen
  – PART
    • Generalised containment
    • subsidiary, physical part-of, set membership
  – AT
    • permanent and transient locations
    • located, based-in, residence
  – SOC
    • social relations among persons
    • parent, sibling, spouse, grandparent, associate
Automatic RE: Pipeline
• Relation extraction finds relations among pairs of named entities
• Assuming that named entities have already been identified
• Simple case of a pipeline, a heavily used architecture in language technology
Pipeline
Tokeniser → Sentence splitter → Part-of-speech tagger → Syntactic parser → Named-entity recogniser → Relation extractor → Information
Pipeline
• Parts of a pipeline are dependent on what is done before them
• A weak point of the pipeline architecture is that errors tend to snowball as they propagate
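A pipeline can be sketched as a list of functions applied in order, each one reading what the previous stages produced. The two toy stages below (a whitespace tokeniser and a capitalisation-based entity recogniser) are deliberately crude stand-ins invented for this example, not the tools used in the lecture; they are only meant to show how an early error resurfaces later.

```python
from functools import reduce

def tokenise(doc):
    # Toy tokeniser: split on whitespace only.
    doc["tokens"] = doc["text"].split()
    return doc

def recognise_entities(doc):
    # Toy rule: every capitalised token is treated as (part of) a name.
    doc["entities"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

def run_pipeline(doc, stages):
    """Apply each stage in order; every stage sees the output of the previous ones."""
    return reduce(lambda d, stage: stage(d), stages, doc)

stages = [tokenise, recognise_entities]
good = run_pipeline({"text": "Eric Schmidt is directeur van Google ."}, stages)
bad = run_pipeline({"text": "Eric Schmidt is directeur van Google."}, stages)
print(good["entities"])
print(bad["entities"])
```

In the second run the tokeniser fails to separate the final full stop, so the entity recogniser returns “Google.” instead of “Google”, and any relation extractor that follows would inherit that mistake.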
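• Eric Schmidt is directeur van Google. (“Eric Schmidt is director of Google.”)
  ● [PER Eric Schmidt ] werkzaam-bij [ORG Google ] (werkzaam-bij = “employed at”)
• Jan de Vries is vakkenvuller bij Albert Heijn. (“Jan de Vries is a shelf stacker at Albert Heijn.”)
  ● [PER Jan de Vries ] werkzaam-bij [ORG Albert Heijn ]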
• PER is directeur van ORG.
• PER is vakkenvuller bij ORG.
• PER is directeur PREP ORG.
• PER is vakkenvuller PREP ORG.
• Jan de Vries is fan van PSV. (“Jan de Vries is a fan of PSV.”)
  – PER is fan PREP ORG. (!)
  – The generalised pattern also matches this sentence, although being a fan is not an employment relation.
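The generalisation steps on these slides (entity mentions replaced by their type, prepositions replaced by PREP) can be sketched as follows. The preposition list, the function name `generalise` and the regular expression are assumptions made for this illustration; the entity annotations are supplied by hand rather than produced by a recogniser.

```python
import re

PREPOSITIONS = {"van", "bij"}  # assumed, just enough for these examples

def generalise(sentence, entities):
    """Replace entity mentions by their type and prepositions by PREP."""
    for text, etype in entities:
        sentence = sentence.replace(text, etype)
    return " ".join("PREP" if w in PREPOSITIONS else w for w in sentence.split())

# The generalised pattern, with the middle word left open.
pattern = re.compile(r"^PER is (\w+) PREP ORG \.$")

examples = [
    ("Eric Schmidt is directeur van Google .", [("Eric Schmidt", "PER"), ("Google", "ORG")]),
    ("Jan de Vries is vakkenvuller bij Albert Heijn .", [("Jan de Vries", "PER"), ("Albert Heijn", "ORG")]),
    ("Jan de Vries is fan van PSV .", [("Jan de Vries", "PER"), ("PSV", "ORG")]),
]
generalised = [generalise(s, ents) for s, ents in examples]
matched = [pattern.match(g).group(1) for g in generalised]
print(generalised)
print(matched)
```

All three sentences match, including the “fan” sentence. The surface pattern alone therefore cannot decide whether the relation is werkzaam-bij; the word in the open slot has to be interpreted as well.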
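Semantic lexicon (e.g. WordNet)
[Figure: the words directeur (director), vakkenvuller (shelf stacker) and fan are looked up in a semantic lexicon, next to related entries such as portier (doorman), accountant, liefhebber (enthusiast), bewonderaar (admirer), ...]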
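A semantic lexicon can be used to check whether the word in the open slot of a pattern belongs to a class that signals employment. The sketch below uses a hand-made dictionary as a stand-in for a resource such as WordNet. The class labels and the grouping of the words are our reading of the diagram and should be treated as assumptions.

```python
# Hypothetical toy lexicon; a real system would query a resource such as WordNet.
LEXICON = {
    "directeur": "occupation",
    "vakkenvuller": "occupation",
    "portier": "occupation",
    "accountant": "occupation",
    "fan": "admirer",
    "liefhebber": "admirer",
    "bewonderaar": "admirer",
}

def signals_employment(word):
    """True if the lexicon places the word in the 'occupation' class."""
    return LEXICON.get(word) == "occupation"

for word in ["directeur", "vakkenvuller", "fan", "portier"]:
    print(word, signals_employment(word))
```

With such a check, “PER is fan PREP ORG” is no longer taken as evidence for werkzaam-bij, while an unseen sentence with “portier” or “accountant” in the same slot still is.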
[PER Eric Schmidt ] werkzaam-bij [ORG Google ]
[PER Jan de Vries ] werkzaam-bij [ORG Albert Heijn ]
Similar?
PER ↑smain-su ↓smain-predc np-mod pp-obj1 ORG
Evaluation
• Comparable to text classification and named entity recognition
  – Precision
    • Number of correctly predicted relations / Total number of predicted relations
  – Recall
    • Number of correctly predicted relations / Total number of relations in the text
  – F-score
    • 2 * precision * recall / (precision + recall)
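The three measures can be computed directly from the set of predicted relations and the set of relations annotated in the text. This is a minimal sketch: the representation of a relation as a (person, relation, organisation) triple and the example predictions are assumptions made for illustration.

```python
def precision_recall_f(predicted, gold):
    """Compute precision, recall and F-score over sets of relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Relations annotated in the text (gold standard).
gold = {
    ("Eric Schmidt", "werkzaam-bij", "Google"),
    ("Jan de Vries", "werkzaam-bij", "Albert Heijn"),
}
# Hypothetical system output: one correct relation, one wrong one.
predicted = {
    ("Eric Schmidt", "werkzaam-bij", "Google"),
    ("Jan de Vries", "werkzaam-bij", "PSV"),
}

p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)
```

Here one of the two predicted relations is correct and one of the two gold relations is found, so precision, recall and F-score are all 0.5.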