-
Detection of MultiWord Expression and Name Entity Recognition
Multilingual Multiword Expression
by Dhirendra Pratap Singh, Rudra Murthy & Dr. Pushpak Bhattacharyya
CSE Dept.
Indian Institute of Technology Bombay
-
Outline
Introduction
Research Problem
Motivation
Classification and Characteristics of MWEs
MWEs Detection Approach
Experiment and Results
Reference
-
Introduction
MultiWord Expressions (MWEs):
• A group of two or more words that come together and act as a single semantic unit
• Can be characterized from various linguistic perspectives: lexical, syntactic, semantic, or purely statistical
Some MWEs in English:
• Cloud nine
• Kick the bucket
• Swimming pool
Some MWEs in Hindi:
• धन दौलत (dhana daulat, wealth)
• चाय पानी (chai paani, snacks)
• नौ दो ग्यारह होना (nau do gyarah hona, run away)
-
Research Problems
Identification
- How can we locate the tokens that correspond to MWEs?
- Unfortunately, he X the bucket
- X = located: not an MWE
- X = kicked: an MWE
Disambiguation
- Is it really an MWE in the current context?
- India and Pakistan broke bridges over the Mumbai blast issue (idiomatic: ties were broken)
- India and Pakistan broke bridges over the Wagah border (literal: physical bridges)
-
Motivation
Many NLP applications face problems due to MWEs:
• Machine Translation
• Information Retrieval
-
Motivation contd.
Machine Translation
मैंने धोखा खाया [Hindi → English]
Google: I cheat eat
Correct: I was cheated
She kicked the bucket [English → Hindi]
Google: वह बाल्टी लात मारी
Correct: वह मर गयी
-
Motivation contd.
Information Retrieval
Query: “burned bridges”
Google: Incidents of burning bridges
Actual: Incidents of broken ties
Detection of MultiWord Expression using Word Embeddings and WordNet-based Features
-
MWEs Characteristics
Compositionality
• Partially compositional: the meaning can be partly derived from the meanings of the constituent words. E.g. तरण ताल (swimming pool), धन लक्ष्मी (wealth), चाय पानी (snacks), etc.
• Non-compositional: the meaning cannot be completely determined from the meanings of the constituent words. E.g. काली जुबान, दम तोड़ना (pass away), Cloud Nine, etc.
Idioms
• Decomposable: Spill the beans
• Non-decomposable: Kick the bucket
Collocation
• Fixed expressions that appear very frequently in running text. E.g. कड़क चाय (strong tea), काला धन (black money), etc.
Non-Substitutability
• The words cannot be substituted by their synonyms. E.g. अंक पत्र, क्षय-तिथि, स्वेच्छा-मृत्यु
-
MWEs Classification (Sag et al., 2002)
-
MWE Detection Approach
• Hindi WordNet-based feature approach
• Word embeddings approach
• WordNet and word embeddings with exact match
-
Hindi WordNet-based approach
• Hindi WordNet is one of the most useful lexical resources for Indian languages
• It is a lexical structure composed of synsets and the semantic and lexical relations between them
• It can be used in various NLP applications
http://www.cfilt.iitb.ac.in/wordnet/hwn/
-
Hindi WordNet-based approach contd.
Using WordNet-based features:
Consider a word pair $w_1 w_2$
• $BOW(w_1) = \{ W' \mid W' \in \text{WordNetBasedFeatures}(w_1) \}$
• $BOW(w_2) = \{ W' \mid W' \in \text{WordNetBasedFeatures}(w_2) \}$
where $\text{WordNetBasedFeatures}(w_i)$ contains all content words from the synonyms, gloss, example(s), hypernyms, hyponyms, meronyms and antonyms of the word $w_i$. Only one level of the hierarchy is considered when extracting these semantic features.
If $w_1 \in BOW(w_2)$, then $w_1 w_2$ is an MWE
If $w_2 \in BOW(w_1)$, then $w_1 w_2$ is an MWE
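The membership rule above can be sketched in a few lines of Python. This is only an illustration: `TOY_WORDNET` is a hypothetical in-memory stand-in for Hindi WordNet with invented entries, and the function names are not from the original work.

```python
# Hypothetical stand-in for Hindi WordNet: word -> relation -> content words.
# The entries are invented for illustration and are NOT real WordNet data.
TOY_WORDNET = {
    "धन": {"synonyms": ["दौलत", "संपत्ति"], "hypernyms": ["वस्तु"]},
    "दौलत": {"synonyms": ["संपत्ति"], "gloss": ["रुपया", "पैसा"]},
    "पानी": {"synonyms": ["जल", "नीर"], "hypernyms": ["द्रव"]},
}

# One level of each relation, as described on the slide.
RELATIONS = ("synonyms", "gloss", "example", "hypernyms",
             "hyponyms", "meronyms", "antonyms")


def wordnet_bow(word, wordnet=TOY_WORDNET):
    """Bag of words built from all one-level WordNet features of `word`."""
    entry = wordnet.get(word, {})
    return {w for rel in RELATIONS for w in entry.get(rel, [])}


def is_mwe_wordnet(w1, w2, wordnet=TOY_WORDNET):
    """Apply the rule: w1 in BOW(w2) or w2 in BOW(w1)."""
    return w1 in wordnet_bow(w2, wordnet) or w2 in wordnet_bow(w1, wordnet)


print(is_mwe_wordnet("धन", "दौलत"))  # pair linked through a synonym entry
print(is_mwe_wordnet("धन", "पानी"))  # pair with no shared feature
```

With a real wordnet, only `wordnet_bow` would need to change; the decision rule stays the same.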
-
Detection of CjVs using IndoWordNet-based features approach
We detect light verb constructions using ontological features from IndoWordNet:
Consider a word pair w1 w2
Detection of MultiWord Expression using WordNet-based Features
-
Experiment and result
Compound Nouns (CNs)
Table 1: Results of compound noun detection
Languages   Total pairs (N+N)   F-score
Hindi       1000                0.58
Marathi     1000                0.72
Bengali     1000                0.53
Punjabi     1000                0.43
Konkani     1000                0.53
Odia        1000                0.38
Assamese    1000                0.40
-
Experiment and result
Conjunct Verbs (CjVs)
Table 1: Results of conjunct verb detection
Languages   Total pairs (N+V)   F-score   Total pairs (Adj+V)   F-score
Hindi       457                 0.87      577                   0.89
Marathi     404                 0.86      502                   0.88
Bengali     797                 0.87      303                   0.92
Punjabi     1017                0.80      307                   0.90
Konkani     879                 0.84      269                   0.95
Odia        832                 0.85      368                   0.91
Assamese    703                 0.84      259                   0.94
Table 2: Results of compound verb detection
Languages   Total pairs (V+V)   F-score
Hindi       399                 0.99
Marathi     504                 0.88
-
Word Embedding approach
• Word embeddings are based on the Distributional Hypothesis, which assumes that similar words occur in similar contexts (Harris, 1954)
• They represent each word as a low-dimensional real-valued vector, with similar words lying close together in that space
• The word2vec tool is used to obtain the word embeddings
• It captures many linguistic regularities among words, for example:
Vector(‘king’) - Vector(‘man’) + Vector(‘woman’) ≈ Vector(‘queen’)
• Word embeddings for Hindi: trained on Bojar’s (2014) corpus (44 M sentences) with the Skip-gram model, 200 dimensions, and a window size of 7
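The vector-offset regularity can be illustrated with a tiny, hand-made example. The 3-dimensional vectors below are hypothetical values chosen so that the analogy works; real word2vec embeddings (200 dimensions in this work) are learned from a corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT trained embeddings.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))
```

Excluding the three query words from the candidates follows common practice for analogy queries; otherwise one of the inputs is often its own nearest neighbour.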
-
Word Embedding approach
The closest words to the word फल in the corpus, obtained using the word2vec tool (higher values mean closer words):
Word             Cosine distance
फ़ल              0.840545
केला             0.705185
सीताफल           0.685993
पपीता            0.682171
सौन्दर्यवर्धक       0.677420
कन्दमूल           0.672466
अननास           0.655930
भाजियाँ           0.650811
आडू              0.650100
-
Word Embedding approach contd.
Using word embeddings:
Consider a word pair $w_1 w_2$
• $BOW(w_1) = \{ W' \mid W' \in \text{IsANeighbour}(w_1) \}$
• $BOW(w_2) = \{ W' \mid W' \in \text{IsANeighbour}(w_2) \}$
where $\text{IsANeighbour}(w_i)$ returns the top 20 neighbours of $w_i$ (according to the cosine similarity of the corresponding vectors).
If $w_1 \in BOW(w_2)$, then $w_1 w_2$ is an MWE
If $w_2 \in BOW(w_1)$, then $w_1 w_2$ is an MWE
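A minimal sketch of the neighbour-based rule follows. The 2-dimensional vectors are hypothetical toy values, and `top_neighbours` is an illustrative helper standing in for IsANeighbour; the slide uses the top 20 neighbours, which is kept as the default `k`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def top_neighbours(word, vectors, k=20):
    """The k words most similar to `word` (stand-in for IsANeighbour)."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])


def is_mwe_embedding(w1, w2, vectors, k=20):
    """Apply the rule: w1 in BOW(w2) or w2 in BOW(w1)."""
    if w1 not in vectors or w2 not in vectors:
        return False
    return (w1 in top_neighbours(w2, vectors, k)
            or w2 in top_neighbours(w1, vectors, k))


# Hypothetical toy vectors, NOT trained embeddings.
TOY_VECTORS = {
    "धन": [0.9, 0.1],
    "दौलत": [0.85, 0.2],
    "पानी": [0.1, 0.9],
    "चाय": [0.2, 0.8],
}

# k=1 because the toy vocabulary has only four words.
print(is_mwe_embedding("धन", "दौलत", TOY_VECTORS, k=1))
print(is_mwe_embedding("धन", "पानी", TOY_VECTORS, k=1))
```

Sorting the whole vocabulary is fine for a sketch; for a realistic vocabulary size an approximate nearest-neighbour index would be a more practical choice.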
-
Using WordNet and Word Embeddings with Exact match
• $WNBow_1 = \{ W' \mid W' \in \text{WordNetBasedFeatures}(w_1) \}$
• $WEBow_2 = \{ W' \mid W' \in \text{IsANeighbour}(w_2) \}$
• $WNBow_1 \cap WEBow_2 \neq \emptyset \Rightarrow w_1 w_2$ is an MWE
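The exact-match rule reduces to a set intersection. In the sketch below, the two lookup tables are hypothetical placeholders for the WordNet feature extractor and the embedding neighbour search; their contents are invented for illustration.

```python
# Hypothetical precomputed bags of words; invented for illustration only.
TOY_WORDNET_FEATURES = {
    "धन": {"दौलत", "संपत्ति", "वस्तु"},
    "चाय": {"पेय", "पत्ती"},
}
TOY_EMBEDDING_NEIGHBOURS = {
    "दौलत": {"संपत्ति", "पैसा", "धन"},
    "पानी": {"जल", "नीर"},
}


def is_mwe_combined(w1, w2,
                    wn_features=TOY_WORDNET_FEATURES,
                    we_neighbours=TOY_EMBEDDING_NEIGHBOURS):
    """w1 w2 is an MWE if WNBow(w1) and WEBow(w2) share at least one word."""
    wn_bow_1 = wn_features.get(w1, set())
    we_bow_2 = we_neighbours.get(w2, set())
    return bool(wn_bow_1 & we_bow_2)


print(is_mwe_combined("धन", "दौलत"))  # the two bags share "संपत्ति"
print(is_mwe_combined("चाय", "पानी"))  # no shared word in the toy data
```

Note that the rule is asymmetric as written on the slide: WordNet features are taken for the first word and embedding neighbours for the second.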
-
Experiment and result
Evaluation of quality of word embeddings:
• The word embeddings trained on the Bojar corpus are evaluated on a word-pair similarity dataset
• Agreement among the human annotators was approx. 0.73
• Agreement between the word embeddings (word2vec tool) and the human annotators was approx. 0.61
Table 1: Agreement of different entities on the translated similarity dataset for Hindi
Entities          Agreement
Human1/Human2     0.74
Human1/Human3     0.68
Human2/Human3     0.77
Word2vec/Human1   0.65
Word2vec/Human2   0.54
Word2vec/Human3   0.63
-
Experiment and result
Evaluation of our approaches for MWE detection:
• As is evident from Table 2 and Table 3, the WordNet-based approach performs best (F-score 0.78 on both datasets)
• The word2vec approach comes relatively close (F-scores 0.69 and 0.64)
Table 2: Results of Noun+Noun compounds on Hindi dataset
Techniques   Resources used     P      R      F-score
Approach 1   WordNet            0.79   0.77   0.78
Approach 2   Word2Vec           0.75   0.64   0.69
Approach 3   Word2Vec+WordNet   0.76   0.68   0.72
Table 3: Results of Noun+Verb compounds on Hindi dataset
Techniques   Resources used     P      R      F-score
Approach 1   WordNet            0.75   0.82   0.78
Approach 2   Word2Vec           0.56   0.75   0.64
Approach 3   Word2Vec+WordNet   0.57   0.58   0.58
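The slides do not state how the F-score is computed. Assuming it is the balanced F-measure (the harmonic mean of precision P and recall R), the Approach 1 (WordNet) row with P = 0.79 and R = 0.77 works out as follows, and the other rows agree with this formula to within rounding of the reported P and R:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.79 \times 0.77}{0.79 + 0.77}
  = \frac{1.2166}{1.56}
  \approx 0.78
```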
-
Summary and Future Work
• Our work suggests that wordnets help in the identification of MWEs.
• Our investigation also shows that word-embedding-based approaches perform well too.
• This is especially helpful for languages whose wordnets are incomplete.
Future work:
• Survey the behaviour of MWEs across languages
• Study linguistic features that can assist the identification of MWEs
-
Publications
• Dhirendra P. Singh, Sudha Bhingardive and Pushpak Bhattacharyya, “Detection of Light Verb Constructions Using WordNet”, Global WordNet Conference (GWC 2016), Romania, 2016.
• Sudha Bhingardive, Hanumant Redkar, Prateek Sappadla, Dhirendra P. Singh and Pushpak Bhattacharyya, “IndoWordNet-based Semantic Similarity Measurement”, Global WordNet Conference (GWC 2016), Romania, 2016.
• Dhirendra P. Singh, Sudha Bhingardive and Pushpak Bhattacharyya, “Detection of MultiWord Expression Using Word Embeddings and WordNet based features”, International Conference on Natural Language Processing (ICON 2015), India.
• Sudha Bhingardive, Dhirendra P. Singh, Rudramurthy R and Pushpak Bhattacharyya, “Using Word Embeddings for Bilingual Unsupervised WSD”, International Conference on Natural Language Processing (ICON 2015), India.
• Sudha Bhingardive, Dhirendra P. Singh, Rudramurthy V, Hanumant Redkar and Pushpak Bhattacharyya, “Unsupervised Most Frequent Sense Detection using Word Embeddings”, North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2015), Denver, Colorado, USA.
-
Publications contd.
• Sudha Bhingardive, Ratish Puduppully, Dhirendra P. Singh and Pushpak Bhattacharyya, “Merging Senses of Hindi WordNet using Word Embeddings”, International Conference on Natural Language Processing (ICON 2014), Goa, India.
• Dhirendra P. Singh, Sudha Bhingardive, Kevin Patel and Pushpak Bhattacharyya, “Using Word Embeddings and WordNet features for MultiWord Expression Extraction”, Linguistic Society of India (LSI 2015), JNU, Delhi, India.
• Dhirendra P. Singh, “Linguistics Behavior of ‘Hindi Verb Collator’ in the context of Machine Translation” (Research Journal), 2013.
-
References
Christiane Fellbaum, “WordNet: An Electronic Lexical Database”, Cambridge, MA: MIT Press, 1998.
Pushpak Bhattacharyya, “IndoWordNet”, LREC, 2010.
Darren Pearce, “Using conceptual similarity for collocation extraction”, Proceedings of the Fourth Annual CLUK Colloquium, 2001.
Frank Smadja, “Retrieving collocations from text: Xtract”, Computational Linguistics, 1993.
Tanmoy Chakraborty and Sivaji Bandyopadhyay, “Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach”, Proceedings of the Workshop on Multiword Expressions: From Theory to Applications, 2010.
Carlos Ramisch, Aline Villavicencio, and Christian Boitet, “mwetoolkit: a Framework for Multiword Expression Identification”, Proceedings of the Seventh LREC (LREC 2010).
Carlos Ramisch, Aline Villavicencio, and Christian Boitet, “Multiword Expressions in the wild? mwetoolkit comes in handy”, COLING, 2010.
Veronika Vincze, Istvan Nagy T., and Gabor Berend, “Detecting noun compounds and light verb constructions: a contrastive study”, Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011).
-
Introduction to Named Entity Recognition (NER)
Rudra Murthy, Dhirendra and Dr. Pushpak Bhattacharyya
CSE Dept.
Indian Institute of Technology Bombay
6/22/2016 Indian Institute of Technology Bombay 27
-
Goal of this talk
• Focus on NER
• Give the fundamental pointers you need to develop your own NER system using various approaches
• Understand how an NER system fits into different NLP applications such as
- Question Answering
- Information Retrieval
- Information Extraction
- Machine Translation
- ……
-
Roadmap
• Introduction
• Motivation
• Modelling
• Conclusion
-
Introduction
Named Entity Recognition (NER)?
• The task: identify lexical and phrasal expressions in text that refer to named entities (NEs)
• A very important sub-task: find and classify names in running text
• Used in NLP layers
• Requires a large amount of labeled training data, which is costly and time-consuming to create
• Upshot: for many widely spoken languages, e.g. Indian languages, no NER systems are freely available
-
NER classes?
The named entity hierarchy is divided into four major classes: Name, Time, Numerical expression, and Miscellaneous.
NER classes            Sample categories
Name                   Person, Organization, Location, Facilities, Artifact, Entertainment, Organisms, Plants, Diseases, Cuisines, Locomotives
Time                   Time, Year, Month, Date, Day, Period and Special day
Numerical expression   Distance, Money, Quantity and Count
Miscellaneous
-
NER is the task of finding names in running text and classifying them as person, location, organization, or other miscellaneous entities.
Examples:
- Sachin Tendulkar is the star batsman for India
- Mohammed Amir granted UK visa
- I am a student of the Indian Institute of Technology
-
Why NER?
• Question Answering (QA)
• Machine Translation (MT)
• Information Retrieval (IR)
• ……
-
NER in QA system?
• What is a QA system?
• A QA system automatically answers questions posed by humans in natural language.
• QA system:
- Contains NER as a core component.
- With NER, the task of finding some of the answers is simplified considerably.
-
[Figure: QA system backed by knowledge bases]
Question: Where did Sachin Tendulkar play his first Test match?
Answer: Pakistan
-
NER in MT system ?
-
Motivation
• Dictionary Lookup
-
Dictionary Lookup?
Have a dictionary of all person names, location names, organization names, and miscellaneous entities such as sports teams and political parties.
Given a sentence, search it for any phrases that appear in the dictionary.
Example:
Greenland witnesses hottest June on record
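Dictionary lookup can be sketched as a longest-match search over the tokens of a sentence. The gazetteer below is a hypothetical two-entry dictionary made up for illustration; a real system would load much larger name lists.

```python
# Hypothetical gazetteer: lower-cased token tuples -> entity label.
GAZETTEER = {
    ("greenland",): "LOCATION",
    ("mamata", "banerjee"): "PERSON",
}


def dictionary_lookup(sentence, gazetteer=GAZETTEER):
    """Return (phrase, label) pairs found by longest-match lookup."""
    # Split on whitespace and drop punctuation attached to tokens.
    tokens = [t.strip(".,:;") for t in sentence.split()]
    max_len = max(len(key) for key in gazetteer)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                matches.append((" ".join(tokens[i:i + n]), gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts at this token
    return matches


print(dictionary_lookup("Greenland witnesses hottest June on record"))
```

Because each dictionary entry carries exactly one label, this approach always assigns the same label to a phrase, whatever the context.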
-
Dictionary Lookup?
Example:
Mamata Banerjee eyes Tata booster shot despite Singur fight
I was prosecuted to shield Tata: 2G accused Balwa
What should be the entity label for Tata?
-
Dictionary Lookup?
The same word/phrase can take different entity labels.
Example:
Mamata Banerjee eyes Tata booster shot despite Singur fight
I was prosecuted to shield Tata: 2G accused Balwa
It is difficult to collect a list of all named entities, as new named entities keep appearing.
-
Modelling for NER
-
Most Frequent tag
-
Most Frequent tag…..
-
Greedy Inference
-
Summary
• We began with an introduction to NER
• We gave a brief overview of the Maximum Entropy approach for NER
• Currently, much of the community is looking towards deep learning based approaches
-
Thank you
Questions ?