Bay Area NLP Reading Group - 7.12.16
TRANSCRIPT
Bay Area NLP Reading Group
July 12, 2016
Announcements
Join our Slack channel!
https://bay-area-nlp-reading.slack.com/
To join, message me (Katie Bauer) on Meetup, talk to me after the meeting or email [email protected]
Want to help out?
Present a paper you love
Demo your favorite NLP tool or library
Host a future meetup
Participate!
What is NER?
Extracting proper nouns and classifying them into categories
- Universal categories: person, location, organization
- Also: date/time, currencies, domain-specific categories
Traditional approaches:
- Gazetteers (list lookup)
- Shallow parsing, e.g. 'based in San Francisco'
Difficulties:
- Reconciling different versions of names: Noam Chomsky vs. Professor Chomsky
- Washington: person, place, or collective name for the US government
- May: person or month?
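The gazetteer (list-lookup) approach can be sketched as a greedy longest-match over token spans. The gazetteer entries, example sentence, and the `gazetteer_tag` helper below are all made up for illustration:

```python
# Hypothetical gazetteer: a plain lookup table of known entity names.
GAZETTEER = {
    "san francisco": "LOCATION",
    "noam chomsky": "PERSON",
    "google": "ORGANIZATION",
}

def gazetteer_tag(tokens, max_len=3):
    """Greedy longest-match lookup of token spans against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in GAZETTEER:
                for j in range(i, i + n):
                    tags[j] = GAZETTEER[span]
                i += n
                break
        else:
            i += 1
    return tags

print(gazetteer_tag("Google is based in San Francisco".split()))
# -> ['ORGANIZATION', 'O', 'O', 'O', 'LOCATION', 'LOCATION']
```

Pure lookup also shows the difficulty with name variants: 'Professor Chomsky' is missed here because only 'Noam Chomsky' is listed.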
What are Convolutional Neural Nets?
1. Divide the input into windows
2. Calculate some sort of summary of each window
3. Feed that summary to the next layer
4. Divide the summary into windows
5. Summarize the summary
And so on and so forth
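The window-then-summarize idea can be sketched as a toy 1-D convolution with a single made-up filter, summarized by max pooling (all numbers here are assumptions for illustration):

```python
import numpy as np

# Toy "divide into windows, summarize" step: score each sliding window
# of x with filter w, then summarize all window scores with a max.
def conv1d_max(x, w, width=3):
    """Single-filter 1-D convolution followed by max pooling."""
    scores = [float(np.dot(w, x[i:i + width])) for i in range(len(x) - width + 1)]
    return max(scores)

x = np.array([0.1, 0.5, -0.2, 0.7, 0.3])  # toy input sequence
w = np.array([1.0, 0.0, -1.0])            # toy filter weights (assumed)
print(conv1d_max(x, w))
```

A real CNN layer learns many such filters and stacks these window-summary steps, as the list above describes.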
What does that look like for language?
Windows are word contexts.
If w_i = 'movie', then [w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}] = [like, this, movie, very, much]
Each w_i is a column vector.
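Extracting that context window can be sketched as follows; the `<PAD>` boundary token is an assumed convention for handling words near the sentence edges:

```python
# Sketch of building the size-5 context window around word i.
PAD = "<PAD>"  # assumed padding token for sentence boundaries

def context_window(words, i, half=2):
    """Return [w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}], padded at the edges."""
    padded = [PAD] * half + list(words) + [PAD] * half
    return padded[i:i + 2 * half + 1]

words = "I like this movie very much".split()
print(context_window(words, words.index("movie")))
# -> ['like', 'this', 'movie', 'very', 'much']
```

In the model itself, each token in the window is then replaced by its column vector.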
Model
Task: Given a sentence, score the likelihood of each named entity class for each word
Input:
Sentence of n words: {w_1, w_2, …, w_{n-1}, w_n}
Words: w_n = [w^wrd, w^wch] (word-level embedding concatenated with character-level embedding)
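The concatenation of word-level and character-level embeddings can be sketched with stand-in vectors; the dimensions and random values below are assumed, not taken from the paper:

```python
import numpy as np

# Stand-in embedding sizes (assumed for illustration).
d_wrd, d_wch = 4, 3
rng = np.random.default_rng(0)

w_wrd = rng.normal(size=d_wrd)        # stand-in word-level embedding
w_wch = rng.normal(size=d_wch)        # stand-in character-level embedding

# Each word's full representation is the two vectors concatenated.
w_n = np.concatenate([w_wrd, w_wch])
print(w_n.shape)  # -> (7,)
```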
Model
Scoring:
1. Concatenate all word vectors centered around word n to get vector r
2. Pass r through two layers of the neural network
3. Check the transition score A_{t,u} for the likelihood of each tag given the previous tag
4. Store all possible tag sequences
5. Pick the most likely sequence at the end of the sentence
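Storing the possible tag sequences and picking the most likely one at the end amounts to Viterbi decoding over the tag-transition scores. A minimal sketch, with made-up per-word scores and a made-up transition matrix A:

```python
import numpy as np

def viterbi(scores, A):
    """scores: (n_words, n_tags) network outputs; A: (n_tags, n_tags) transitions.
    Returns the highest-scoring tag sequence as a list of tag indices."""
    n, T = scores.shape
    delta = scores[0].copy()               # best score ending in each tag so far
    back = np.zeros((n, T), dtype=int)     # backpointers to recover the path
    for t in range(1, n):
        # cand[p, q]: best score ending in tag p, transitioning to tag q
        cand = delta[:, None] + A + scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):          # walk the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

scores = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]])  # toy tag scores
A = np.array([[0.5, -1.0], [-1.0, 0.5]])                 # toy transitions
print(viterbi(scores, A))
# -> [0, 0, 0]
```

Note that the favorable same-tag transitions in A override the second word's preference for tag 1.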
Optimization
The sentence score is a conditional probability, so minimize the negative log likelihood
Backpropagated stochastic gradient descent
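The sentence-level negative log likelihood can be sketched with the forward (log-sum-exp) recursion over all tag sequences; the scores, transitions, and gold tags below are made-up numbers:

```python
import numpy as np

def logsumexp(x, axis=None):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def sentence_nll(scores, A, tags):
    """Negative log likelihood of the gold tag sequence.
    scores: (n_words, n_tags); A: (n_tags, n_tags) transitions."""
    n, _ = scores.shape
    # Score of the gold tag sequence.
    gold = scores[0, tags[0]] + sum(
        A[tags[t - 1], tags[t]] + scores[t, tags[t]] for t in range(1, n)
    )
    # Log of the summed exponentiated scores over ALL tag sequences.
    alpha = scores[0].copy()
    for t in range(1, n):
        alpha = scores[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha) - gold

scores = np.array([[2.0, 0.1], [0.2, 1.5]])   # toy tag scores
A = np.array([[0.5, -1.0], [-1.0, 0.5]])      # toy transitions
print(sentence_nll(scores, A, [0, 1]))
```

Minimizing this quantity with backpropagated SGD, as the slide says, pushes the gold sequence's score up relative to all alternatives.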
Corpora
Portuguese
- Word embeddings initialized with three corpora
- Trained and tested on HAREM: HAREM I for training, MiniHAREM for test
Spanish
- Word embeddings initialized with Spanish Wikipedia
- Trained and tested on SPA CoNLL-2002, which has predivided training, development and test sets
Experiments
Comparable architectures:
- CharWNN
- WNN
- CharNN
- WNN + capitalization feature + suffix feature
Experiments
State of the art:
- AdaBoost for Spanish
- ETL_CMT for Portuguese
Experiments
Portuguese results by entity type
Experiments
Pretrained word embeddings vs. randomly initialized word embeddings
Takeaways
Different types of information are captured at the word and character levels
Prior knowledge (pretrained word embeddings) improves performance
With no prior knowledge, a bigger data set is better
Additional Resources
Introduction to Named Entity Recognition
https://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
Understanding Convolutional Neural Networks for NLP
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Implementing a CNN for Text Classification in Tensorflow
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Thank you!
Bay Area NLP Reading Group
July 12, 2016