text classification 1 - sameer singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... ·...

27
Text Classification 1 Prof. Sameer Singh CS 295: STATISTICAL NLP WINTER 2017 January 12, 2017 Based on slides from Nathan Schneider, Noah Smith, Dan Klein and everyone else they copied from.

Upload: phunglien

Post on 29-Jul-2018

234 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

TextClassification1

Prof.SameerSinghCS295:STATISTICALNLP

WINTER2017

January12,2017

BasedonslidesfromNathanSchneider,NoahSmith,DanKleinandeveryoneelsetheycopiedfrom.

Page 2: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

TextClassification1

CS295:STATISTICALNLP(WINTER2017) 2

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

Page 3: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

TextClassification

CS295:STATISTICALNLP(WINTER2017) 3

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

Page 4: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

SentimentAnalysis

CS295:STATISTICALNLP(WINTER2017) 4

Filledwithhorrificdialogue,laughablecharacters,alaughableplot,adreallynointerestingstakesduringthisfilm,"StarWarsEpisodeI:ThePhantomMenace"isnotatallwhatIwantedfromafilmthatissupposedtobethehugeopeningtothesegueintothefantasticOriginalTrilogy.Thepositivesincludethescore,thesound…

Page 5: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

OtherExamples

CS295:STATISTICALNLP(WINTER2017) 5

• Reviewsoffilms,restaurants,products:positivevs.negative• Amazonreviewsdata,IMDBreviewsdata

• Library-likesubjects(e.g.,theDeweydecimalsystem)• Newsstories:politicsvs.sportsvs.businessvs.technology...

• 20newsgroupdata• Authorattributes:identity,politicalstance,gender,age,...• Email:spamvs.not

• Gmail:important,promotion,updates,socialmedia,…• Whatisthereadinglevelofapieceoftext?

• Automaticgraders?• Howinfluentialwillascientificpaperbe?• Advertisementrecommendations…• Willapieceofproposedlegislationpass?

• Identifythepresidentialcandidatefromspeeches• Postrecommendations/Fakenewsdetection

• Canmajorlyinfluencetheworld!

Page 6: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

FormalSetup

CS295:STATISTICALNLP(WINTER2017) 6

Classification

SupervisedLearning

TrainingAlgorithm

Page 7: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

Evaluation:ContingencyTable

CS295:STATISTICALNLP(WINTER2017) 7

Page 8: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

Accuracy

CS295:STATISTICALNLP(WINTER2017) 8

Problem

• Classimbalancehurts..• Gettingoneclassrightmattersmorethantheother(retrieval)

Page 9: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

PrecisionandRecall

CS295:STATISTICALNLP(WINTER2017) 9

Page 10: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

>2Classes?

CS295:STATISTICALNLP(WINTER2017) 10

Macro-averagedMeasures

Micro-averagedMeasures

Page 11: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

StatisticalSignificance

CS295:STATISTICALNLP(WINTER2017) 11

McNemar’s Test,Psychometrika, (1947)MoretestsinSmithbook,appendixB

Page 12: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

TextClassification

CS295:STATISTICALNLP(WINTER2017) 12

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

Page 13: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

ClassificationusingJointProb

CS295:STATISTICALNLP(WINTER2017) 13

Page 14: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

NaïveBayesClassifier

CS295:STATISTICALNLP(WINTER2017) 14

Twoassumptions

• Wordorderingdoesnotmatter(BagofWords)

Page 15: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

NaïveBayesClassifier

CS295:STATISTICALNLP(WINTER2017) 15

Twoassumptions

• Wordorderingdoesnotmatter (BagofWords)• Wordsareindependentgivencategory

Page 16: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

EstimationofParameters

CS295:STATISTICALNLP(WINTER2017) 16

Page 17: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

ProblemwithNaïveBayes

CS295:STATISTICALNLP(WINTER2017) 17

Page 18: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

LinearModels

CS295:STATISTICALNLP(WINTER2017) 18

Page 19: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

NaïveBayesasaLinearModel

CS295:STATISTICALNLP(WINTER2017) 19

Page 20: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

TextClassification

CS295:STATISTICALNLP(WINTER2017) 20

IntroductiontoTextClassification

NaiveBayesClassification

CourseProjects

Page 21: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

GroupProjects

CS295:STATISTICALNLP(WINTER2017) 21

• Idealteamsizeis3• Absolutemaximumof4• <3ifIapprove(ongoingwork)

GroupsfortheProject

• Firsttworeportsareveryshort(1page)• Finalreportmattersthemost

SubmitFourReports

• Outputisanyphraseorsentence,definitely!• Inputisanyphraseorsentence

• Outputisasequenceorstructure(yes!)• Classification:onlyifoverwordsorphrases

• Outputislinguisticclasses/structures(yes!)

HowdoIknowit’sNLP?

Page 22: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

ScopeofWork

CS295:STATISTICALNLP(WINTER2017) 22

• NewTask/Data• NewMethod/Models• NewApplicationofExistingMethodtoExistingTask

Novelty

• Youdonothavemuchtime!• Aimtohavethewholepipelinedonesoon• Keepthe“scale”ofthedatasmall,sub-sampleifneeded• Bettertohaveacompletefinishedreport

• thangrandideasthatdidnotwork

Butnottoomuch!

• Youdonothavetocodeeverything• Exploitexistingcode,datasets,libraries,webservices• Donotreinventallthewheels!

Reuse

Page 23: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

Example1:What’stheword..

CS295:STATISTICALNLP(WINTER2017) 23

What’sthewordforsomeoneusingpretentiouswords? lexiphanic

MachineLearning(LSTM)

definitionofawordfromthedictionary theworditself

ThiscanbeacoolTwitterbot!

• Accuracyofguessingtheword,usingdefinitionsfromdifferentdictionary?

• Baselines:Google,reversedictionary.org,…Evaluation

Page 24: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

Example2:SQuAD

CS295:STATISTICALNLP(WINTER2017) 24

https://rajpurkar.github.io/SQuAD-explorer/

Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospić, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."

How many siblings did Tesla have?fourWhat was Tesla’s brother’s name?DaneWhat happened to Dane?killed in a horse-riding accident

Page 25: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

DatasetsandPapers

CS295:STATISTICALNLP(WINTER2017) 25

• SearchKaggle,Quora,etc forlargetextdatasets• SeerecentpapersinNLPforreleaseddatasets• Lookfor“sharedtasks”,“challenges”,workshops• Linkstosomeexistingdatasetscomingtowebsitesoon

Data

• NLPConferences:ACL,EMNLP,NAACL• MLConferences:NIPS,ICML,ICLR,AAAI• Datafocusedvenues:TREC/TAC,SemEval,CONLL• Workshopsattheseconferences:interestingdirections• Morepaperscomingsoontothewebsite

Papers

Page 26: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

WritingthePitch

CS295:STATISTICALNLP(WINTER2017) 26

• Teamnameandmembers• Singlesentencedescriptionforeachmember

• (approximately)whattheywilldo• Singlesentenceonwhatmakesyourteamdiverse

Team

• MotivationandProblemDescription• Plannedapproach:tentative• Evaluation:usually,mostimportant

Project

• If1or2,meetmebefore/on January17(o.w.noneed)• Everygrouphastomeetafterwardstodiscusstheproject

Appointment

Page 27: Text Classification 1 - Sameer Singhsameersingh.org/courses/statnlp/wi17/slides/lecture-0112... · Text Classification 1 Prof. Sameer Singh ... Introduction to Text Classification

Upcoming…

CS295:STATISTICALNLP(WINTER2017) 27

• Homework1isup!• Nextlectureswillcontinuewithmoredetails• SignupfortheKaggle account(@uci.edu email)• Due:January26,2017

Homework

• ProjectpitchisdueJanuary23,2017!• Startassemblingteamsnow!(usePiazza)• Startlookingatpapers,data,etc.forideas

Project