Bay Area NLP Reading Group - 7.12.16
TRANSCRIPT
Bay Area NLP Reading Group
July 12, 2016
Announcements
Join our Slack channel!
https://bay-area-nlp-reading.slack.com/
To join, message me (Katie Bauer) on Meetup, talk to me after the meeting or email [email protected]
Want to help out?
Present a paper you love
Demo your favorite NLP tool or library
Host a future meetup
Participate!
What is NER?
Extracting proper nouns and classifying them into categories
- Universal categories: person, location, organization
- Also: date/time, currencies, domain-specific categories
Traditional approaches:
- Gazetteers (list lookup)
- Shallow parsing, e.g. 'based in San Francisco'
Difficulties:
- Reconciling different versions of names: Noam Chomsky vs. Professor Chomsky
- Washington: person, place, or collective name for the US government
- May: person or month?
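The gazetteer (list-lookup) approach can be sketched as a greedy longest-match over token spans. The gazetteer entries, example sentence, and the `gazetteer_tag` helper below are all made up for illustration:

```python
# Hypothetical gazetteer: a plain lookup table of known entity names.
GAZETTEER = {
    "san francisco": "LOCATION",
    "noam chomsky": "PERSON",
    "google": "ORGANIZATION",
}

def gazetteer_tag(tokens, max_len=3):
    """Greedy longest-match lookup of token spans against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in GAZETTEER:
                for j in range(i, i + n):
                    tags[j] = GAZETTEER[span]
                i += n
                break
        else:
            i += 1
    return tags

print(gazetteer_tag("Google is based in San Francisco".split()))
# -> ['ORGANIZATION', 'O', 'O', 'O', 'LOCATION', 'LOCATION']
```

Pure lookup also shows the difficulty with name variants: 'Professor Chomsky' is missed here because only 'Noam Chomsky' is listed.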
What are Convolutional Neural Nets?
1. Divide the input into windows
2. Calculate some sort of summary of each window
3. Feed that summary to the next layer
4. Divide the summary into windows
5. Summarize the summary
And so on and so forth
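The window-then-summarize idea can be sketched as a toy 1-D convolution with a single made-up filter, summarized by max pooling (all numbers here are assumptions for illustration):

```python
import numpy as np

# Toy "divide into windows, summarize" step: score each sliding window
# of x with filter w, then summarize all window scores with a max.
def conv1d_max(x, w, width=3):
    """Single-filter 1-D convolution followed by max pooling."""
    scores = [float(np.dot(w, x[i:i + width])) for i in range(len(x) - width + 1)]
    return max(scores)

x = np.array([0.1, 0.5, -0.2, 0.7, 0.3])  # toy input sequence
w = np.array([1.0, 0.0, -1.0])            # toy filter weights (assumed)
print(conv1d_max(x, w))
```

A real CNN layer learns many such filters and stacks these window-summary steps, as the list above describes.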
What does that look like for language?
Windows are word contexts.
If w_i = 'movie', then [w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}] = [like, this, movie, very, much]
Each w_i is a column vector.
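Extracting that context window can be sketched as follows; the `<PAD>` boundary token is an assumed convention for handling words near the sentence edges:

```python
# Sketch of building the size-5 context window around word i.
PAD = "<PAD>"  # assumed padding token for sentence boundaries

def context_window(words, i, half=2):
    """Return [w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}], padded at the edges."""
    padded = [PAD] * half + list(words) + [PAD] * half
    return padded[i:i + 2 * half + 1]

words = "I like this movie very much".split()
print(context_window(words, words.index("movie")))
# -> ['like', 'this', 'movie', 'very', 'much']
```

In the model itself, each token in the window is then replaced by its column vector.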
Model
Task: Given a sentence, score the likelihood of each named entity class for each word
Input:
Sentence of n words: {w_1, w_2, …, w_{n-1}, w_n}
Words: w_n = [w^wrd, w^wch] (word-level embedding concatenated with character-level embedding)
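The concatenation of word-level and character-level embeddings can be sketched with stand-in vectors; the dimensions and random values below are assumed, not taken from the paper:

```python
import numpy as np

# Stand-in embedding sizes (assumed for illustration).
d_wrd, d_wch = 4, 3
rng = np.random.default_rng(0)

w_wrd = rng.normal(size=d_wrd)        # stand-in word-level embedding
w_wch = rng.normal(size=d_wch)        # stand-in character-level embedding

# Each word's full representation is the two vectors concatenated.
w_n = np.concatenate([w_wrd, w_wch])
print(w_n.shape)  # -> (7,)
```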
Model
Scoring:
1. Concatenate all word vectors centered around word n to get vector r
2. Pass r through two layers of the neural network
3. Check the transition score A_{t,u} for the likelihood of each tag given the previous tag
4. Store all possible tag sequences
5. Pick the most likely sequence at the end of the sentence
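Storing the possible tag sequences and picking the most likely one at the end amounts to Viterbi decoding over the tag-transition scores. A minimal sketch, with made-up per-word scores and a made-up transition matrix A:

```python
import numpy as np

def viterbi(scores, A):
    """scores: (n_words, n_tags) network outputs; A: (n_tags, n_tags) transitions.
    Returns the highest-scoring tag sequence as a list of tag indices."""
    n, T = scores.shape
    delta = scores[0].copy()               # best score ending in each tag so far
    back = np.zeros((n, T), dtype=int)     # backpointers to recover the path
    for t in range(1, n):
        # cand[p, q]: best score ending in tag p, transitioning to tag q
        cand = delta[:, None] + A + scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):          # walk the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

scores = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]])  # toy tag scores
A = np.array([[0.5, -1.0], [-1.0, 0.5]])                 # toy transitions
print(viterbi(scores, A))
# -> [0, 0, 0]
```

Note that the favorable same-tag transitions in A override the second word's preference for tag 1.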
Optimization
The sentence score is a conditional probability, so minimize the negative log likelihood
Backpropagated stochastic gradient descent
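The sentence-level negative log likelihood can be sketched with the forward (log-sum-exp) recursion over all tag sequences; the scores, transitions, and gold tags below are made-up numbers:

```python
import numpy as np

def logsumexp(x, axis=None):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis) if axis is not None else s.item()

def sentence_nll(scores, A, tags):
    """Negative log likelihood of the gold tag sequence.
    scores: (n_words, n_tags); A: (n_tags, n_tags) transitions."""
    n, _ = scores.shape
    # Score of the gold tag sequence.
    gold = scores[0, tags[0]] + sum(
        A[tags[t - 1], tags[t]] + scores[t, tags[t]] for t in range(1, n)
    )
    # Log of the summed exponentiated scores over ALL tag sequences.
    alpha = scores[0].copy()
    for t in range(1, n):
        alpha = scores[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha) - gold

scores = np.array([[2.0, 0.1], [0.2, 1.5]])   # toy tag scores
A = np.array([[0.5, -1.0], [-1.0, 0.5]])      # toy transitions
print(sentence_nll(scores, A, [0, 1]))
```

Minimizing this quantity with backpropagated SGD, as the slide says, pushes the gold sequence's score up relative to all alternatives.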
Corpora
Portuguese
- Word embeddings initialized with three corpora
- Trained and tested on HAREM: HAREM I for training, MiniHAREM for test
Spanish
- Word embeddings initialized with Spanish Wikipedia
- Trained and tested on SPA CoNLL-2002, which has predivided training, development and test sets
Experiments
Comparable architectures:
- CharWNN
- WNN
- CharNN
- WNN + capitalization feature + suffix feature
Experiments
State of the art:
- AdaBoost for Spanish
- ETL_CMT for Portuguese
Experiments
Portuguese results by entity type
Experiments
Pretrained word embeddings vs. randomly initialized word embeddings
Takeaways
Different types of information are captured at the word and character levels
Prior knowledge (pretrained word embeddings) improves performance
With no prior knowledge, a bigger data set is better
Additional Resources
Introduction to Named Entity Recognition
https://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
Understanding Convolutional Neural Networks for NLP
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Implementing a CNN for Text Classification in Tensorflow
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Thank you!
Bay Area NLP Reading Group
July 12, 2016