nova data science meetup 1/19/2017 - presentation 2

Statistical NLP Mona Diab George Washington University

Upload: nova-datascience

Post on 08-Feb-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Statistical NLP

Mona Diab George Washington University

Page 2: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Who am I?
•  Prof in CS department working on issues of big data, data science, natural language processing
•  [email protected]
•  Check out my research @
   –  www.seas.gwu.edu/~mtdiab
•  NLP lab @gw
   –  Care4lang1.seas.gwu.edu

Page 3: NOVA Data Science Meetup 1/19/2017 - Presentation 2

“Every 2 days we produce as much information as we did from the beginning of time till 2003”

“Big Data refers to our ability to make use of the ever-increasing volumes of data.”

“…everything we do is increasingly leaving a digital trace (or data), which we (and others) can use and analyze.”

Bernard Marr

Page 4: NOVA Data Science Meetup 1/19/2017 - Presentation 2

The Dream
•  It’d be great if machines could
   –  Process our email (usefully)
   –  Translate languages accurately
   –  Help us manage, summarize, and aggregate information
   –  Use speech as a UI (when needed)
   –  Talk to us / listen to us
•  But they can’t:
   –  Language is complex, ambiguous, flexible, and subtle
   –  Good solutions need linguistics and machine learning knowledge

Slide courtesy of Heng Ji

Page 5: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Heterogeneous Big Data

Lockheed Martin to furlough 3,000 workers amid #USGovernmentShutdown

The Patient Protection and Affordable Care Act (PPACA),[1] commonly called the Affordable Care Act (ACA) or Obamacare, is a United States federal statute signed into law by President Barack Obama on March 23, 2010.

The U.S. Congress, still in partisan deadlock over Republican efforts to halt President Barack Obama's healthcare reforms, was on the verge of shutting down most of the U.S. government starting on Tuesday morning.

NSF and NIST are temporarily closed because the Government entered a period of partial shutdown.

President Obama's 70-minute White House meeting late Wednesday afternoon with congressional leaders including House Speaker John Boehner, did nothing to help end the impasse.

Page 6: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Mystery • What’s now impossible for computers (and any other species) to do is effortless for humans

✕ ✕ ✓

Page 7: NOVA Data Science Meetup 1/19/2017 - Presentation 2

NLP to the rescue!

Page 8: NOVA Data Science Meetup 1/19/2017 - Presentation 2

What is NLP?

•  Fundamental goal: deep understanding of broad language use
•  not just string processing or keyword matching!

Page 9: NOVA Data Science Meetup 1/19/2017 - Presentation 2

What is NLP/CL?
•  NLP: Natural Language Processing
   –  Is the field of making computers process natural language
      •  Does process entail understand?
•  CL: Computational Linguistics
   –  Is the field of using computers to understand (natural) language
•  Natural Language?
   –  Refers to the language spoken by people, e.g. English, Japanese, Swahili, as opposed to artificial languages, like C++, Java, etc.

Page 10: NOVA Data Science Meetup 1/19/2017 - Presentation 2

What is NLP?
•  Computers using and processing natural language input (data) and producing useful information, which could be natural language output or structured data
•  Software that can recognize, analyze and generate text and speech
•  Typically NLP refers to processing unstructured data: text in free form (unstructured text)
•  In contrast, structured data refers to information in “tables”
   –  Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith should return Turner, Ian

Employee        Manager          Salary
Smith, John     David, Richard   $80,000
Turner, Ian     Smith, John      $59,000
Huang, Chang    Smith, John      $69,000
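The structured query on this slide can be sketched in a few lines of Python. This is only an illustration: the list-of-dicts layout, the `query` helper, and the choice to match on the manager's surname are assumptions made for the example, not part of the talk.

```python
# Toy employee table copied from the slide; salaries stored as plain integers.
employees = [
    {"employee": "Smith, John", "manager": "David, Richard", "salary": 80000},
    {"employee": "Turner, Ian", "manager": "Smith, John", "salary": 59000},
    {"employee": "Huang, Chang", "manager": "Smith, John", "salary": 69000},
]


def query(rows, max_salary, manager_surname):
    """Numerical range test on salary plus exact match on the manager's surname."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < max_salary
        and row["manager"].split(",")[0].strip() == manager_surname
    ]


# Salary < 60000 AND Manager = Smith
result = query(employees, 60000, "Smith")
print(result)
```

Both conditions are exact and unambiguous, which is what makes structured data easy to query; free text offers no such guarantees.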

Page 11: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Unstructured (text) vs. structured (database) data in 1996

[Bar chart: Unstructured vs. Structured, compared on Data volume and Market Cap; vertical axis from 0 to 160]

Page 12: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Unstructured (text) vs. structured (database) data

[Bar chart: Unstructured vs. Structured, compared on Data volume and Market Cap; vertical axis from 0 to 160]

Page 13: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Goals of NLP/CL

•  Model Human Language Processing

•  Analyze Human Language

•  Facilitate Human Language Communication via Automated Tools

Page 14: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why NLP?
•  kJfmmfj mmmvvv nnnffn333
•  Uj iheale eleee mnster vensi credur
•  Baboi oi cestnitze
•  Coovoel2^ ekk; ldsllk lkdf vnnjfj?
•  Fgmflmllk mlfm kfre xnnn!

•  Can you READ this? You, yes you!

Page 15: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Computers Lack Knowledge!
•  Computers “see” text in English/Arabic/French the same way you saw the previous slide!
•  People have no trouble understanding language
   –  Common sense knowledge
   –  Reasoning capacity
   –  Experience
•  However, computers have
   –  No common sense knowledge
   –  No reasoning capacity

Unless we teach them!

Page 16: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why Should You Care?
•  An enormous amount of knowledge is now available in machine readable form as natural language text
•  Conversational agents are becoming an important form of human-computer communication
•  Much of human-human communication is now mediated by computers
•  Very cool stuff! And with lots of commercial interest.

Adapted from Speech and Language Processing - Jurafsky and Martin

Page 17: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why NLP?
•  Applications for processing large amounts of texts (BIG DATA) require NLP expertise
•  Classify text into categories
•  Index and search large texts
•  Automatic machine translation
•  Speech understanding
   –  Understand phone conversations
•  Information extraction
   –  Extract useful information from resumes
•  Automatic summarization
   –  Condense 1 book into 1 page
•  Question answering
•  Knowledge acquisition
•  Text generation / dialogs

Page 18: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Who uses NLP Commercial World

Page 19: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why is NLP intriguing?

•  NLP has an AI aspect to it
   –  We’re often dealing with ill-defined problems
   –  We don’t often come up with exact solutions/algorithms
   –  We can’t let either of those facts get in the way of making progress

Page 20: NOVA Data Science Meetup 1/19/2017 - Presentation 2

NLP in CS taxonomy

Computers
   Artificial Intelligence
      Robotics
      Search
      Natural Language Processing
         Information Retrieval
         Machine Translation
         Language Analysis
            Semantics
            Parsing
   Algorithms
   Databases
   Networking

Page 21: NOVA Data Science Meetup 1/19/2017 - Presentation 2

The Challenge
•  Language is complex with infinite possible constructions
•  Good news is that there are patterns as the symbol set is finite, but the patterns are latent
•  Abundance of raw data

Page 22: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why is NLP hard? Some Headlines…

•  Police Begin Campaign To Run Down Jaywalkers
•  Iraqi Head Seeks Arms
•  Enraged Cow Injures Farmer With Ax
•  Teacher Strikes Idle Kids
•  Squad Helps Dog Bite Victim
•  Red Tape Holds Up New Bridges
•  Hospitals Are Sued by 7 Foot Doctors
•  Court to Try Shooting Defendant
•  Local High School Dropouts Cut in Half

Page 23: NOVA Data Science Meetup 1/19/2017 - Presentation 2

How can a machine understand these differences?

•  Get the cat with the gloves.

Page 24: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Ambiguous Spoken Example

I made her duck
•  I cooked waterfowl for her
•  I cooked the waterfowl that belongs to her
•  I created the ceramic duck she owns
•  I caused her to quickly lower her head
•  And more….

Page 25: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Example … continued!

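I made her duck

[Diagram annotating the sentence with the processing steps needed to resolve it]
•  Speech recognition: “I” vs. “Eye”, “made” vs. “maid”
•  Word Sense Disambiguation: “made” as cook vs. create
•  Part of Speech Tagging: “duck” as verb vs. noun
•  Syntactic parsing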

Page 26: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Linguistics
•  It is the scientific study of human language
•  How the mind comes up with language

Page 27: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Levels of Language Description
•  6 basic levels (more or less explicitly present in most theories):
   –  and beyond (pragmatics/logic/...)
   –  meaning (semantics)
   –  (surface) syntax
   –  morphology
   –  phonology
   –  phonetics/orthography
•  Each level has an input and output representation
   –  output from one level is the input to the next (upper) level
   –  sometimes levels might be skipped (merged) or split

Page 28: NOVA Data Science Meetup 1/19/2017 - Presentation 2

The Steps in NLP

Discourse
Pragmatics
Semantics
Syntax
Morphology

** we can go up, down and up and down and combine steps too!!
** every step is equally complex

Page 29: NOVA Data Science Meetup 1/19/2017 - Presentation 2

The View: Ambiguity

•  All 6 levels of linguistic knowledge require resolving ambiguity

•  Ambiguity results from the existence of multiple possibilities for each level

Page 30: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Ambiguity
•  Computational linguists are obsessed with ambiguity
•  Ambiguity is a fundamental problem of computational linguistics
•  Resolving ambiguity is a crucial goal

Page 31: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why else is natural language understanding difficult?

non-standard English
Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥

Adapted from Speech and Language Processing - Jurafsky and Martin

Page 37: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Why else is natural language understanding difficult?

non-standard English
Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥

segmentation issues
the New York-New Haven Railroad

idioms
dark horse, get cold feet, lose face, throw in the towel

neologisms
unfriend, retweet, bromance

tricky entity names
Where is A Bug’s Life playing…
Let It Be sold millions…
…a mutation on the for gene…

world knowledge
Mary and Sue are sisters.
Mary and Sue are mothers.

But that’s what makes it fun!

Adapted from Speech and Language Processing - Jurafsky and Martin

Page 38: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Making progress on this problem…

•  The task is difficult! What tools do we need?
   –  Knowledge about language
   –  Knowledge about the world
   –  A way to combine knowledge sources
•  How we generally do this:
   –  probabilistic models built from language data
      •  P(“maison” → “house”) high
      •  P(“L’avocat général” → “the general avocado”) low
   –  Luckily, rough text features can often do half the job.
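One rough way to picture "probabilistic models built from language data" is a relative-frequency estimate over word-aligned translation pairs. The sketch below is an assumption-laden toy: the aligned pairs are invented for illustration, `translation_table` is a hypothetical helper, and real statistical MT systems learn alignments and use context rather than single-word counts.

```python
from collections import Counter, defaultdict

# Invented word-aligned pairs (source word, target word); not real corpus data.
aligned_pairs = [
    ("maison", "house"), ("maison", "house"), ("maison", "house"),
    ("maison", "home"),
    ("avocat", "lawyer"), ("avocat", "lawyer"), ("avocat", "lawyer"),
    ("avocat", "avocado"),
]


def translation_table(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {
        src: {tgt: n / sum(tgts.values()) for tgt, n in tgts.items()}
        for src, tgts in counts.items()
    }


table = translation_table(aligned_pairs)
print(table["maison"]["house"])    # 0.75
print(table["avocat"]["avocado"])  # 0.25
```

With counts like these, "house" is a high-probability translation of "maison", while the "avocado" reading of "avocat" gets a low score, which is the intuition behind the slide's two examples.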

Page 39: NOVA Data Science Meetup 1/19/2017 - Presentation 2

CL Toolkit
•  Knowledge of Linguistics, i.e. NLPers call them features!!
•  State Machines
   –  Finite state automata, transducers
•  Formal Rule Systems
   –  Regular Grammars, Context Free Grammars
•  Logic
   –  First order logic, predicate calculus
•  Probability Theory
   –  Associating probabilities with the previous machinery
•  Machine Learning Tools
   –  Learning automatically from representations; these play a very important role in cases where we don’t have good explanations of why things happen the way they do
•  Performance Metrics
   –  Well defined evaluation metrics for different tasks

Page 40: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Major Topics
1.  Words
2.  Syntax
3.  Meaning
4.  Discourse
5.  Applications exploiting each

Page 41: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Models and Algorithms

•  By models we mean the formalisms that are used to capture the various kinds of linguistic knowledge we need.

•  Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.

Page 42: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Models
•  Finite state machines
•  Linguistic Rules
•  Markov models
•  Alignment
•  Vector space model of word and document meaning
•  Logical formalisms
•  Network models

Page 43: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Algorithms
•  Rule-based
   –  Symbolic Parsers and morphological analyzers
   –  Finite state automata
•  Probabilistic/statistical
   –  Learned from observation of (labeled) data
   –  Predicting new data based on old
   –  Machine learning

Page 44: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Algorithms
•  Many of the algorithms that we’ll study will turn out to be transducers: algorithms that take one kind of structure as input and output another
•  Unfortunately, ambiguity makes this process difficult
•  This leads us to employ algorithms that are designed to handle ambiguity of various kinds
•  State-space search paradigm: to manage the problem of making choices during processing when we lack the information needed to make the right choice

Page 45: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Machine Learning

Machine learning based classifiers that are trained to make decisions based on (implicitly or explicitly modeled) features from context

Simple Classifiers:
   Naïve Bayes
   Logistic Regression (MaxEnt)
   Decision Trees
   Neural Networks

Sequence Models:
   Hidden Markov Models
   Maximum Entropy Markov Models
   Conditional Random Fields
   Recurrent Neural Networks (RNNs, LSTMs)
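As a concrete instance of the first simple classifier listed, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The training sentences, the spam/ham labels, and the function names are invented for illustration; a real system would use far more data and a tested library.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(tokens, model):
    """Pick the label maximizing log P(label) + sum of log P(token | label)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
training = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills cheap offer".split(), "spam"),
    ("meeting agenda for tomorrow".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_naive_bayes(training)
print(classify("cheap pills".split(), model))
print(classify("agenda for lunch".split(), model))
```

The features here are just the words themselves; the sequence models on the slide additionally condition each decision on neighboring labels.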

Page 46: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Approaching the challenge •  Divide & Conquer

– Break the problem into smaller problems

•  Throw state of the art techniques at the smaller problems

•  Keep your fingers crossed!!

Page 47: NOVA Data Science Meetup 1/19/2017 - Presentation 2

NLP Categories
•  Applications
   –  Word counters (wc in UNIX)
   –  Spell Checkers, grammar checkers
   –  Predictive Text on mobile handsets
   –  Machine Translation (MT)
   –  Information Retrieval (IR)
   –  Automatic Speech Recognition (ASR)
   –  Optical Character Recognition (OCR)
   –  Automatic Summarization, Speech Synthesis, etc.
•  Enabling Technologies
   –  Tokenization
   –  Part-of-Speech Tagging
   –  Syntactic Parsing
   –  Lemmatization
   –  Word Sense Disambiguation, etc.
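The simplest items on both lists, tokenization and word counting, can be sketched together. The regular expression below is one assumed tokenization rule among many (lowercase, alphanumeric runs, internal apostrophes kept); it is not a standard and will mishandle plenty of real text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


text = "I made her duck. She made me duck, too!"
tokens = tokenize(text)
counts = Counter(tokens)

print(len(tokens))             # 9 tokens, roughly what `wc -w` reports
print(counts.most_common(2))
```

Note that counting treats both occurrences of "duck" as the same item; deciding whether each is a noun or a verb is the job of the later enabling technologies.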

Page 48: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Turing Test
•  Alan Turing was a pioneering British computer scientist, mathematician, logician, and cryptanalyst. He is widely considered the Father of Computer Science.
•  The movie The Imitation Game is about him.
•  The Turing test is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine that is designed to generate human-like responses.

Courtesy of Nizar Habash

Page 49: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Current Real-World Applications
•  Search: very large corpora, e.g. Google
•  Information Extraction: relevant information to a task
•  Sentiment analysis: restaurant or movie reviews
•  Summarizing very large amounts of text or speech: e.g. your email, the news, voicemail
•  Translating between one language and another: e.g. Google Translate, Babelfish
•  Dialogue systems: e.g. chatbots, Amtrak’s ‘Julie’
•  Question answering: e.g. IBM’s Watson Jeopardy!, DARPA who/what/where…, Ask Jeeves
•  Even more: speech processing, common sense knowledge, text categorization, web monitoring, etc.

Page 50: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Recommendation Engines

Page 51: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Personal Assistants

Page 52: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Machine Translation
•  Basic types of Machine Translation
   –  Text to Text Machine Translation
   –  Speech to Speech Machine Translation
•  To date, the majority of approaches have targeted rich language pairs (with lots of automated resources)
   –  No Swahili-German systems
•  Current approaches are statistical, learning from existing translations (parallel data collections)
•  Reasonable performance due to significant funding

Page 53: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Google Translate

Adapted from Speech and Language Processing - Jurafsky and Martin

Page 54: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Google Translate

Adapted from Speech and Language Processing - Jurafsky and Martin

Page 55: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Information Extraction

10TH DEGREE is a full service advertising agency specializing in direct and interactive marketing. Located in Irvine CA, 10TH DEGREE is looking for an Assistant Account Manager to help manage and coordinate interactive marketing initiatives for a marquee automative account. Experience in online marketing, automative and/or the advertising field is a plus. Assistant Account Manager Responsibilities: Ensures smooth implementation of programs and initiatives. Helps manage the delivery of projects and key client deliverables . . . Compensation: $50,000-$80,000 Hiring Organization: 10TH DEGREE

⇓

INDUSTRY   Advertising
POSITION   Assistant Account Manager
LOCATION   Irvine, CA
COMPANY    10TH DEGREE
SALARY     $50,000-$80,000

Information Extraction
•  Goal: Map a document collection to a structured database
•  Motivation:
   –  Complex searches (“Find me all the jobs in advertising paying at least $50,000 in Boston”)
   –  Statistical queries (“How has the number of jobs in accounting changed over the years?”)

Text Summarization

Dialogue Systems
User: I need a flight from Boston to Washington, arriving by 10 pm.
System: What day are you flying on?
User: Tomorrow
System: Returns a list of flights

Page 56: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Blog Analytics
•  Data-mining of blogs, discussion forums, message boards, user groups, and other forms of user generated media
   –  Product marketing information
   –  Political opinion tracking
   –  Social network analysis
   –  Buzz analysis (what’s hot, what topics are people talking about right now).

Page 57: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Livejournal.com: I, me, my on or after Sep 11, 2001

[Line chart: horizontal axis runs from a baseline (B) through Sep 12 to Oct 30-Nov 5; vertical axis from 5.8 to 7.2]

Graph from Pennebaker slides

Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693.

Page 58: NOVA Data Science Meetup 1/19/2017 - Presentation 2

September 11 LiveJournal.com study: We, us, our

[Line chart: horizontal axis runs from a baseline (B) through Sep 12 to Oct 30-Nov 5; vertical axis from .5 to 1.1]

Graph from Pennebaker slides

Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693.

Page 59: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Sentiment Analysis
•  Movie Review Mining
   –  User1: The Matrix rocked, I simply loved it….
   –  User2: Really, that Keanu Reeves gets on my nerves, he is too robotic
   –  User1: it was way deep, it obviously went over your head!
   –  User2: I think it GOT INTO ur head :)
•  What do you think User1 and User2’s sentiments are toward the movie?
   –  User1
   –  User2
•  What do you think the sentiment of User2 toward User1 is?

Page 61: NOVA Data Science Meetup 1/19/2017 - Presentation 2

What about positive restaurant reviews?

Sex, Drugs, and Dessert
•  sexy food
•  seductively seared foie gras
•  addicted to pepper shooters
•  garlic noodles… my drug of choice
•  the fries are like crack

Page 62: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Let’s Zoom in
•  Beyond Sentiment
   –  Perspective Identification
•  Why would you need that?
   –  Polling
   –  Detection of bias
•  How would you go about doing it?
   –  Sentiment analysis
   –  Opinion mining
   –  Cognitive state (belief)

Page 63: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Dialogue Systems
•  User: Can you book me a flight from New York City to Washington?
•  System: Is your destination Washington state?
•  User: No I mean Washington DC
•  System: When would you like to fly?

•  Example: Amtrak’s Julie system

Page 64: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Tell me what's bothering you.

http://www.masswerk.at/elizabot/eliza.html

Page 65: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Eliza

•  Eliza is a mock Rogerian psychotherapist
   –  Rogerian psychotherapy, or person-centered therapy, is a form of talk-psychotherapy developed by psychologist Carl Rogers in the 1940s & 1950s.
•  The original program was described by Joseph Weizenbaum in 1966.
•  Primitive natural language processing using simple scripts and regular expressions
•  Chatterbots or chatbots
•  Eliza effect
   –  the tendency to unconsciously assume computer behaviors are analogous to human behaviors.
•  Turing test

Page 66: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Original Chatting Program Eliza

•  Men are all alike.
•  IN WHAT WAY?
•  They're always bugging us about something or other.
•  CAN YOU THINK OF A SPECIFIC EXAMPLE?
•  Well, my boyfriend made me come here.
•  YOUR BOYFRIEND MADE YOU COME HERE
•  He says I'm depressed much of the time.
•  I AM SORRY TO HEAR YOU ARE DEPRESSED
•  It's true. I am unhappy
•  DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
•  I need some help, that much seems certain.
•  WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
•  Perhaps I could learn to get along with my mother.
•  TELL ME MORE ABOUT YOUR FAMILY
•  My mother takes care of me.
•  WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
•  My father.
•  YOUR FATHER
•  You are like my father in some ways.

Page 67: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Eliza-style regular expressions

Step 1: replace first person with second person references
s/\bI('m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g

Step 2: use additional regular expressions to generate replies
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 3: use scores to rank possible transformations
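The three steps can be sketched in Python using the same substitution rules. This is a loose illustration, not Weizenbaum's program: the numeric scores, the `PLEASE GO ON` fallback, and the use of word boundaries in place of the literal spaces in the patterns are assumptions added for the example.

```python
import re

# Step 1: first person -> second person (rules taken from the slide).
PRONOUN_RULES = [
    (re.compile(r"\bI(?:'m| am)\b", re.IGNORECASE), "YOU ARE"),
    (re.compile(r"\bmy\b", re.IGNORECASE), "YOUR"),
    (re.compile(r"\bmine\b", re.IGNORECASE), "YOURS"),
]

# Step 2: reply patterns from the slide. The scores (first field) are
# hypothetical; the slide only says that scores are used for ranking.
REPLY_RULES = [
    (4, re.compile(r".*\bYOU ARE (depressed|sad)\b.*", re.IGNORECASE),
     r"I AM SORRY TO HEAR YOU ARE \1"),
    (3, re.compile(r".*\bYOU ARE (depressed|sad)\b.*", re.IGNORECASE),
     r"WHY DO YOU THINK YOU ARE \1"),
    (2, re.compile(r".*\balways\b.*", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (1, re.compile(r".*\ball\b.*", re.IGNORECASE), "IN WHAT WAY"),
]


def eliza_reply(utterance):
    text = utterance
    for pattern, replacement in PRONOUN_RULES:
        text = pattern.sub(replacement, text)

    candidates = []
    for score, pattern, template in REPLY_RULES:
        match = pattern.fullmatch(text)
        if match:
            candidates.append((score, match.expand(template).upper()))

    if not candidates:
        return "PLEASE GO ON"  # assumed fallback, not from the slide
    # Step 3: the highest-scoring transformation wins.
    return max(candidates)[1]


print(eliza_reply("Men are all alike."))
print(eliza_reply("He says I'm depressed much of the time."))
```

No understanding is involved at any point: the replies come entirely from surface pattern matching, which is what makes the Eliza effect notable.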

Page 68: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Mitsuku
•  Let’s chat with Mitsuku!
•  http://www.mitsuku.com
•  Loebner prize winner 2013, runner up 2015
   –  Modern form of the Turing test for Artificial Intelligence

Slide courtesy of Nizar Habash

Page 69: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Question Answering: IBM’s Watson

Page 70: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Question Answering: IBM’s Watson

•  Won Jeopardy on February 16, 2011!

WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Bram Stoker

Page 71: NOVA Data Science Meetup 1/19/2017 - Presentation 2

A Grand Challenge Opportunity
•  Capture the imagination
   –  The Next Deep Blue
•  Engage the scientific community
   –  Envision new ways for computers to impact society & science
   –  Drive important and measurable scientific advances
•  Be Relevant to IBM Customers
   –  Enable better, faster decision making over unstructured and structured content
   –  Business Intelligence, Knowledge Discovery and Management, Government, Compliance, Publishing, Legal, Healthcare, Business Integrity, Customer Relationship Management, Web Self-Service, Product Support, etc.

©2011 IBM Corporation

Page 72: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Real Language is Real Hard

Chess
   –  A finite, mathematically well-defined search space
   –  Limited number of moves and states
   –  Grounded in explicit, unambiguous mathematical rules

Human Language
   –  Ambiguous, contextual and implicit
   –  Grounded only in human cognition
   –  Seemingly infinite number of ways to express the same meaning

©2011 IBM Corporation

Page 73: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Easy Questions?

ln((12,546,798 * π)) ^ 2 / 34,567.46 = 0.00885

Select Payment where Owner=“David Jones” and Type(Product)=“Laptop”

Owner         Serial Number
David Jones   45322190-AK

Serial Number   Type     Invoice #
45322190-AK     LapTop   INV10895

Invoice #   Vendor   Payment
INV10895    MyBuy    $104.56

David Jones = David Jones
Dave Jones ≠ David Jones

©2011 IBM Corporation
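The arithmetic on this slide is an example of a question that is easy for a computer because it is fully explicit. A quick check in Python, reading the expression as the square of the natural logarithm divided by 34,567.46 (an assumed reading of the slide's notation):

```python
import math

# (ln(12,546,798 * pi)) squared, divided by 34,567.46
value = math.log(12_546_798 * math.pi) ** 2 / 34_567.46
print(round(value, 5))  # 0.00885
```

The string comparison is just as mechanical: "David Jones" equals "David Jones" exactly, while recognizing that "Dave Jones" may be the same person requires knowledge that an exact match does not have.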

Page 74: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Hard Questions?

Computer programs are natively explicit, fast and exacting in their calculation over numbers and symbols… But natural language is implicit, highly contextual, ambiguous and often imprecise.

•  Where was X born?
   Unstructured: One day, from among his city views of Ulm, Otto chose a water color to send to Albert Einstein as a remembrance of Einstein´s birthplace.
   Structured:
   Person        Birth Place
   A. Einstein   ULM

•  X ran this?
   Unstructured: If leadership is an art then surely Jack Welch has proved himself a master painter during his tenure at GE.
   Structured:
   Person     Organization
   J. Welch   GE

©2011 IBM Corporation

Page 75: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Automatic Open-Domain Question Answering
A Long-Standing Challenge in Artificial Intelligence to emulate human expertise

•  Given
   –  Rich Natural Language Questions
   –  Over a Broad Domain of Knowledge
•  Deliver
   –  Precise Answers: Determine what is being asked & give precise response
   –  Accurate Confidences: Determine likelihood answer is correct
   –  Consumable Justifications: Explain why the answer is right
   –  Fast Response Time: Precision & Confidence in <3 seconds

©2011 IBM Corporation

Page 76: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Information Retrieval
•  Very successful enterprise: Google, Bing, Yahoo, Altavista
•  General model: given a huge collection of texts (document collection), given a query
   –  Task: find specific documents that are relevant to the given query
   –  How: Create an index, like the index in a book, to look up the information; predominant approaches include vector space models
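A vector space model can be sketched as follows: each document and the query become TF-IDF weighted term vectors, and documents are ranked by cosine similarity to the query. The three documents, the smoothed IDF formula, and the function names are assumptions for illustration; a production engine would also build an inverted index rather than scoring every document.

```python
import math
from collections import Counter

# Invented toy document collection.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "machine translation of natural language",
    "d3": "statistical natural language processing",
}


def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for every document."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # One common IDF variant; the +1 keeps terms found in every document.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf):
    """Return ids of documents with non-zero similarity, best match first."""
    q_vec = {t: tf * idf[t] for t, tf in Counter(query.split()).items() if t in idf}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


vectors, idf = tf_idf_vectors(documents)
ranking = search("statistical language processing", vectors, idf)
print(ranking)
```

Here `d3` shares three query terms and ranks first, `d2` shares only "language", and `d1` shares nothing and is dropped.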

Page 77: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Information Extraction

Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky

Hi Dan, we’ve now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris

Create new Calendar entry
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
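The mapping from this email to a calendar entry can be approximated with a few hand-written patterns. The sketch below is deliberately brittle: the regular expressions, the `extract_event` helper, and the assumption that the room looks like a capitalized word followed by a number are all made up for this one message. Resolving "tomorrow" against the email's date is the step that needs context beyond the sentence itself.

```python
import re
from datetime import datetime, timedelta

email = """Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan, we've now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris"""


def extract_event(message):
    """Pull event fields out of one email using hand-written patterns."""
    sent_text = re.search(r"^Date: (.+)$", message, re.MULTILINE).group(1)
    sent = datetime.strptime(sent_text, "%B %d, %Y")
    subject = re.search(r"^Subject: (.+)$", message, re.MULTILINE).group(1)
    body = re.search(
        r"in (?P<where>[A-Z]\w+ \d+) (?P<day>today|tomorrow) "
        r"from (?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})",
        message,
    )
    # "tomorrow" is relative to the date the email was sent.
    offset = 1 if body.group("day") == "tomorrow" else 0
    return {
        "event": subject,
        "date": (sent + timedelta(days=offset)).date().isoformat(),
        "start": body.group("start"),
        "end": body.group("end"),
        "where": body.group("where"),
    }


event = extract_event(email)
print(event)
```

A statistical extractor would learn such patterns from labeled examples instead of relying on a rule per phrasing.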

Page 78: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Information Extraction

•  nice and compact to carry!
•  since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
•  the camera feels flimsy, is plastic and very light in weight you have to be very delicate in the handling of this camera

Size and weight

Attributes: zoom, affordability, size and weight, flash, ease of use

Page 79: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Language Technology

mostly solved
•  Spam detection
   –  Let’s go to Agra!
   –  Buy V1AGRA …
•  Part-of-speech (POS) tagging
   –  Colorless green ideas sleep furiously.
   –  ADJ ADJ NOUN VERB ADV
•  Named entity recognition (NER)
   –  Einstein met with UN officials in Princeton
   –  PERSON ORG LOC

making good progress
•  Sentiment analysis
   –  Best roast chicken in San Francisco!
   –  The waiter ignored us for 20 minutes.
•  Coreference resolution
   –  Carter told Mubarak he shouldn’t run again.
•  Word sense disambiguation (WSD)
   –  I need new batteries for my mouse.
•  Parsing
   –  I can see Alcatraz from the window!
•  Machine translation (MT)
   –  第13届上海国际电影节开幕…
   –  The 13th Shanghai International Film Festival…
•  Information extraction (IE)
   –  You’re invited to our dinner party, Friday May 27 at 8:30
   –  Party May 27 add

still really hard
•  Question answering (QA)
   –  Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
•  Paraphrase
   –  XYZ acquired ABC yesterday
   –  ABC has been taken over by XYZ
•  Summarization
   –  The Dow Jones is up
   –  The S&P500 jumped
   –  Housing prices rose
   –  Economy is good
•  Dialog
   –  Where is Citizen Kane playing in SF?
   –  Castro Theatre at 7:30. Do you want a ticket?

Page 80: NOVA Data Science Meetup 1/19/2017 - Presentation 2

•  Thanks for listening!!
•  Questions?

Page 81: NOVA Data Science Meetup 1/19/2017 - Presentation 2

Reminder of who I am :)
•  Prof in CS department working on issues of big data, data science, natural language processing
•  [email protected]
•  Check out my research @
   –  www.seas.gwu.edu/~mtdiab
•  NLP lab @gw
   –  Care4lang1.seas.gwu.edu