Page 1

Natural Language Processing: Challenges, Solutions, and Applications

Bonnie Dorr

May, 2013

Page 2

Themes

•  Language understanding, translation, and summarization require more than “just statistics”
    •  Moving from high-resolution (low-noise) media to unrestricted and degraded or noisy media
•  Linguistically motivated approaches can benefit from the robustness of statistical/ML techniques
    •  Moving from problems with general characteristics to problems applicable to real-world data

Page 3

DARPA’s Language Research Programs

[Figure: diagram of DARPA language programs. Programs shown: MADCAT (OCR), RATS (speech recognition), BOLT (translation, interaction, retrieval), DEFT (deep language understanding). Inputs shown: hardcopy foreign documents, digital foreign text, foreign speech, conversation, SMS, email. Other labels: digital English text; military commander or warfighter; English-speaking decision maker; computer performing analysis.]

Approved for Public Release, Distribution Unlimited

Dorr, AMTA-2012 Keynote Speech, October 2012

Page 4

What are the challenges to language research?

•  Linguistic variety
    •  Dialects
    •  Cross-language divergences
•  Technological shifts in forms of communication
    •  Casual, verbal communication
    •  Grammatical correctness in structure has disappeared.
    •  Global societal change
•  Volume of information
    •  Astronomical increases in volume
    •  Demands on human translators vastly exceed resources
    •  Far surpasses human assessment capability

Automated Foreign Language Exploitation is the Key

Page 5

Multilingual Automatic Document Classification, Analysis, and Translation (MADCAT)

•  Objective: Extract actionable information from foreign-language text images
•  Technical Approach:
    •  Detect and recognize text in images, extract relevant metadata, and translate recognized text into English
    •  Joint optimization of MADCAT component technologies and GALE machine translation (MT)

[Figure: pipeline diagram with the labels Input Image, Pre-Processing, Page Zoning, Text Recognition, GALE MT, English Text, Hypotheses, and Joint Optimization.]

Page 6

MADCAT Technical Progress

[Figure: two plots of Translation Accuracy (100-HTER), axis range 20 to 100, against Percentage of Documents, axis range 0 to 100. Panel titles: "MADCAT-Generated Data" and "Data collected in Iraq in 1991". Accuracy bands are labeled Editable, Gistable, and Triageable. Curve labels: P4, P3, P2, P1 and P4, P3, P2.]
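The accuracy axis on this slide is 100-HTER. As a rough illustration, the sketch below shows how such a score could be computed, assuming HTER is approximated by the word-level edit distance between the MT output and a human post-edited version, divided by the length of the post-edited version. Full TER also counts block shifts as single edits, which this sketch leaves out. The function names are hypothetical and are not taken from the talk.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def accuracy_100_minus_hter(mt_output, post_edited):
    """Approximate 100-HTER; assumes whitespace tokens and no shift edits."""
    hyp = mt_output.lower().split()
    ref = post_edited.lower().split()
    if not ref:
        raise ValueError("post-edited reference must not be empty")
    hter = 100.0 * word_edit_distance(hyp, ref) / len(ref)
    return 100.0 - hter


# Three of six words need substitution, so HTER is 50 and accuracy is 50.
print(accuracy_100_minus_hter(
    "Does not have electricity, what happened?",
    "There is no electricity, what happened?",
))  # 50.0
```

Reporting 100-HTER rather than HTER itself keeps "higher is better", which fits the editable, gistable, and triageable bands used on the slide.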

Page 7

Robust Automatic Transcription (RATS)

Improve Capability to Find and Make Use of Foreign Language Speech Signals

Goal: Create technologies for exploitation of potentially speech-containing signals received over extremely noisy and/or distorted communication channels

•  SAD: Speech Activity Detection
•  LID: Language Identification
•  SID: Speaker Identification
•  KWS: Key-Word Spotting

[Figure labels: Find Speech, Categorize, Distribute; 0-10 dB SNR (signal-to-noise ratio).]
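To make the 0-10 dB range concrete: SNR in decibels is ten times the base-10 logarithm of the ratio of signal power to noise power. The short sketch below (the function name is hypothetical) shows that 0 dB means the noise is as strong as the speech, and 10 dB means the speech carries only ten times the noise power.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels, from average power values."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


print(snr_db(1.0, 1.0))   # 0.0: signal and noise have equal power
print(snr_db(10.0, 1.0))  # 10.0: signal has ten times the noise power
```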

Page 8

Broad Operational Language Translation (BOLT)

Program Goal: Informal Language and Multi-Turn Conversations

BOLT is developing natural language processing capability to enable:
1. Translation and information retrieval for informal language
2. Bilingual, multi-turn informal conversation using text or speech

Informal language is characterized by:
●  Use of dialects
●  Sloppy or garbled speech or text
●  Incomplete and ungrammatical sentences
●  Frequent references by use of pronouns
●  Frequent changes in topic
●  Interjection of disfluencies (restarts, interjections such as “uh”, and filler words such as “you know”)

Baseline MT System Error Rate for Arabic → English Text

Formality | Material Type | Dialect | Accuracy
Formal | newswire | MSA | 95%
Semi formal | blogs & news groups | MSA | 87%
Semi formal | various web media | Dialectal Arabic | 67%
Informal* | messaging | Dialectal Arabic | <40%

*Estimated degradation

Distribution authorized to U.S. Government Agencies and their contractors

Page 9

BOLT Target Applications: Examples

Handling Dialects

Arabic Variant | Arabic Source Text | Pre-BOLT MT
Modern Standard Arabic | لايوجد كهرباء، ماذا حدث؟ | Does not have electricity, what happened?
Egyptian Regional Dialect | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electrical wires, Why are Posted?
Levantine Regional Dialect | شكلو مفيش كهربا، ليش هيك؟ | Cklo Mafeesh كهربا, Lech heck?
Iraqi Regional Dialect | شو ماكو كهرباء، خير؟ | Xu MACON electricity, good?

Reference: There is no electricity, what happened?

Handling dialects is crucial for automated processing of informal Arabic web material

ت يقوتل ا ع تيق وتل ا ع صب تيوتي ر لم عت ام لب ق للوووولل

Reference: before you retweet, check the Time lol

Pre-BOLT MT: T. Ikutl AZ AZ Tel Tik casting Tioti t not signed or the core of S to Wall

Page 10

BOLT Target Applications: Chinese Examples

Pronoun and null subject/object resolution

大家心里都能猜出几分,越是这样控制舆论越让人们群众心里起疑。

Reference: Everyone can guess in their hearts, and the more they try to control public opinion this way, the more the people become suspicious in their hearts.

Literal: Everyone heart in can guess some, more is thus control public opinion more cause masses heart in arise suspicion.

Pre-BOLT MT: We all know can guess a bit, the more people that control the mass media more suspicious mind.

Page 11

BOLT Evaluation Results - Machine Translation

[Figure: evaluation chart with bands labeled Editable, Gistable, and Triageable. Averages shown: 95, 87, 80.]

Page 12

Evaluation Results - Machine Translation

[Figure: evaluation chart with bands labeled Editable, Gistable, and Triageable. Averages shown: 88, 82, 73.]

Page 13

Clarification Dialogue Example

Script: The <wines> will not be allowed on the sites.

ASR of User: The wind this will not be allowed on the sites

Clarification 1: Please give me a different word or phrase for this: (USER SPEECH)

ASR of User: the alcohol

Clarification 2: Did you mean allowed as in permitted, or aloud as in using one’s voice? Please say one or two

ASR of User: one

Transcript: the alcohol will not be allowed on the sites

Translation: [not preserved in transcript]
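The second clarification step in this dialogue (asking the user to choose between homophones) can be illustrated with a small sketch. It assumes a hand-written homophone table and simple whitespace tokenization; the table, the function names, and the prompt wording are hypothetical and only mirror the example on the slide, not the actual BOLT system.

```python
# Toy homophone table: each entry lists (spelling, gloss) options in prompt order.
HOMOPHONES = {
    "allowed": (("allowed", "permitted"), ("aloud", "using one's voice")),
}
HOMOPHONES["aloud"] = HOMOPHONES["allowed"]


def clarification_question(transcript):
    """Return a question for the first ambiguous word, or None if there is none."""
    for word in transcript.lower().split():
        if word in HOMOPHONES:
            (w1, g1), (w2, g2) = HOMOPHONES[word]
            return (f"Did you mean {w1} as in {g1}, or {w2} as in {g2}? "
                    "Please say one or two")
    return None


def apply_answer(transcript, answer):
    """Rewrite the first ambiguous word according to the answer 'one' or 'two'."""
    choice = {"one": 0, "two": 1}[answer.strip().lower()]
    words = transcript.split()
    for i, word in enumerate(words):
        if word.lower() in HOMOPHONES:
            words[i] = HOMOPHONES[word.lower()][choice][0]
            break
    return " ".join(words)


asr = "the alcohol will not be allowed on the sites"
print(clarification_question(asr))
print(apply_answer(asr, "one"))
```

Resolving the ambiguity before translation matters because the two spellings would translate to unrelated words in the target language.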

Page 14

Clarification Dialogue – Before and After

Page 15

The MT Challenge: Linguistic Divergences

•  Divergence: expressing the underlying concept of a set of words in one language using a different structure in another language
•  Experiments indicate that divergences occur in 1/3 of sentences in certain language pairs (e.g., English-Spanish).
•  Proper handling of linguistic divergences:
    •  enriches translation mappings for statistical extraction
    •  improves the quality of word alignment for statistical MT.

Ah-hah! Back to our theme.

Page 16

Divergence Categories

•  Light Verb Construction: To butter → poner mantequilla (put butter)
•  Manner Conflation: To float → ir flotando (go floating)
•  Head Swapping: Swim across → atravesar nadando (cross swimming)
•  Thematic Divergence: I like grapes → me gustan uvas (to-me please grapes)
•  Categorial Divergence: To be hungry → tener hambre (have hunger)
•  Structural Divergence: To enter the house → entrar en la casa (enter in the house)

Page 17

Generation-Heavy Hybrid MT (GHMT)

•  Motivating Question: Can we inject statistical techniques into linguistically motivated MT?
•  Using “approximate Interlingua” for MT
    •  Tap into richness of deep target-language resources
        •  Linguistic Verb Database (LVD)
        •  CatVar database (CATVAR)
•  Constrained overgeneration
    •  Generate multiple linguistically motivated sentences
    •  Statistically pare down results

[Work with Nizar Habash and Christof Monz, 2009]

Page 18

GHMT Example

Yo puse mantequilla en el pan

[Figure: parse of the Spanish sentence (Sentence, Subj “Yo”, Pred “puse”, Obj1 “mantequilla”, PP “en”, Obj2 “el pan”) alongside a thematic representation with the labels PUT (V), [CAUSE GO], Agent, Theme, Goal, I, BUTTER (N), and BREAD.]

Page 19

GHMT Example (continued)

Knowledge Resources in English only (LVD; CATVAR – Dorr, Habash, Monz, 2003, 2006, 2009)

[Figure: thematic structures for the example. Labels include PUT (V) with Agent, Theme, Goal, and [CAUSE GO]; BUTTER (V) with Agent, Goal, and [CAUSE GO]; I, BUTTER (N), and BREAD.]

Page 20

GHMT: Statistical Extraction after Linguistic Generation (Language Model induced from ML)

X puse mantequilla en Y (X put butter on Y)

X buttered Y

Output Ranking:

Rank | Hypothesis
1 | I buttered the bread
2 | I butter the bread
3 | I breaded the butter
4 | I bread the butter
5 | I buttered the loaf
6 | I butter the loaf
7 | I put the butter on bread
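The generate-then-rank idea on this slide can be sketched in a few lines. The example below is only an illustration under stated assumptions: `NOUN_TO_VERB` is a toy stand-in for a CatVar-style lookup, `CORPUS` is a made-up four-sentence training set, and the ranker is a plain bigram model with add-one smoothing rather than the language model used in GHMT. All names are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for a category-variation lookup (noun -> past-tense verb).
NOUN_TO_VERB = {"butter": "buttered", "bread": "breaded"}

# Made-up training sentences for the toy language model.
CORPUS = [
    "i buttered the bread",
    "she buttered the toast",
    "he put the butter on the bread",
    "i ate the bread",
]


def overgenerate(agent, theme, goal):
    """Produce several candidate realizations of (agent, theme, goal)."""
    hyps = [f"{agent} put the {theme} on {goal}"]
    # Conflate either noun into a verb; the language model decides which reads best.
    for noun, other in ((theme, goal), (goal, theme)):
        verb = NOUN_TO_VERB.get(noun)
        if verb:
            hyps.append(f"{agent} {verb} the {other}")
    return hyps


def train_bigram(corpus):
    """Count bigrams and their left contexts; return counts plus vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    contexts, bigrams, v = model
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(toks, toks[1:]))


def rank(hyps, model):
    """Sort hypotheses from most to least probable under the model."""
    return sorted(hyps, key=lambda h: log_prob(h, model), reverse=True)


model = train_bigram(CORPUS)
for position, hyp in enumerate(rank(overgenerate("I", "butter", "bread"), model), 1):
    print(position, hyp)
```

The linguistic side is allowed to overgenerate, including implausible candidates such as "I breaded the butter", because the statistical ranker is expected to push those down the list.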

Page 21

MT System Combination Findings

•  Combination of approaches (“Hybrid MT” and “Linguistically informed Stats MT”) achieves better MT results than either approach alone (Habash & Dorr, 2006)
•  Best paper award, NAACL 2007: "Combining Outputs from Multiple Machine Translation Systems" (Ayan & Dorr @ University of Maryland and Rosti & Schwartz @ BBN)
•  Hybrid approaches are now the standard for large-scale MT systems.
•  Jacob Devlin, former UMD student, exploring richer combination approaches. (Best paper, NAACL-2012)

Page 22

What types of linguistic knowledge (beyond syntactic parsing) are useful?

Page 23

Linguistic Annotations and Inferred Knowledge Units

First Order Knowledge | Second Order Knowledge
Modality: X __ happen (can <permissive>, must <obligative>, might <possible>) | Confidence value regarding the occurrence of event X
Dialog Acts: Send me Y <request-action>; Did Y get sent <request info>; I sent Y <inform> | Inferred relationships and intentions: requestee is subordinate of requester; “send Y” is an intended activity
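The first-order annotations on this slide can be illustrated with a toy rule-based tagger. The sketch below is an assumption-laden illustration, not the annotation system described in the talk: the modal table copies the three examples from the slide, while the question-starter and request-verb lists and the function names are hypothetical.

```python
import re

# Modal words and labels taken from the three examples on the slide.
MODALITY = {"can": "permissive", "must": "obligative", "might": "possible"}

# Hypothetical word lists for a crude dialog-act guess.
QUESTION_STARTERS = {"did", "do", "does", "is", "are", "was", "were"}
REQUEST_VERBS = {"send", "give", "bring", "tell", "call"}


def tag_modality(sentence):
    """Return (word, label) pairs for each known modal in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(w, MODALITY[w]) for w in words if w in MODALITY]


def tag_dialog_act(utterance):
    """Guess a dialog act from the first word and final punctuation."""
    words = re.findall(r"[a-z]+", utterance.lower())
    if not words:
        return "unknown"
    if utterance.strip().endswith("?") or words[0] in QUESTION_STARTERS:
        return "request-info"
    if words[0] in REQUEST_VERBS:
        return "request-action"
    if words[0] in {"i", "we"}:
        return "inform"
    return "unknown"


print(tag_modality("X must happen"))
for utterance in ("Send me Y", "Did Y get sent", "I sent Y"):
    print(utterance, "->", tag_dialog_act(utterance))
```

Second-order knowledge, such as inferring that the requestee is subordinate to the requester, would be built on top of labels like these and is not attempted here.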

Page 24

Inferring relationships and intentions from informal communication

•  Patterns of interaction reflect social situation (who has power, who has status, etc.) [Passonneau & Rambow, 2009]
•  Patterns of interaction (taken together with modality/confidence) reveal implicit relationships and underlying intentions [Discussions with IHMC 2011]
•  Use of Modality and Negation for Semantically Informed Machine Translation [Dorr et al., 2012 Computational Linguistics, 38:2]
•  Opinion Analysis for detecting Intensity, not just positive vs. negative. [U.S. Patent 8,296,168, October 23, 2012, with Subrahmanian, Reforgiato, and Sagoff].

Page 25

Deep Exploration and Filtering of Text (DEFT)

Example Information Communication

A: Where were you? We waited all day for you and you never came.
B: I couldn’t make it through, there was no way. They…they were everywhere. Not even a mouse could have gotten through.
A: You should have found a way. You know we need the stuff for the…the party tomorrow. We need a new place to meet…tonight. How about the…uh…uh…the house? You know, the one where we met last time.
B: You mean your uncle’s house?
A: Yes, the same as last time. Don’t forget anything. We need all of the stuff. I already paid you, so you had better deliver. You had better not $%*! this up again.

Callouts:
•  Co-referring Locations
•  Inter-related Events
•  Anomaly, Novelty, Emerging trends
•  Causal Relations (why, how)

Page 26

What would DEFT produce?

Informal Communication Input: the dialogue between A and B shown on page 25.

DEFT Summary Bullets
•  People
    •  Person A – Superior; planner.
    •  Person B – Inferior; courier.
    •  Unspecified group “we”
•  Associations
    •  B has previously received a call from a person of interest
•  Activities
    •  Meeting between A and B
    •  One failed attempt today
    •  Planning another attempt for tonight. Time unspecified.
    •  Planned delivery of “stuff” by B to A
•  Causal Links
    •  Attempted meeting did not happen due to security in area of meeting.
•  Geospatio-Temporal Links
    •  The planned event is going to happen tomorrow
    •  Location: “the house”, point of call origination
•  Entity-Event Linking
    •  Replanned meeting associated with “your uncle’s house”
    •  Previous meetings at “your uncle’s house”
    •  B was paid an unspecified amount for services.

Information from other Sources

Page 27

The Future

•  Global shift to new forms of communication
    •  Informal communication, dialects, and implicitly conveyed info
•  Focus on problems and data with real-world applicability
    •  Express technical progress in terms understandable to end users (e.g., editable, gistable, triageable)
•  Generation-Heavy Hybrid MT
    •  More robust across genre
    •  System combination produces best results
•  Inferring Relations and Intentions from informal communication
    •  Requires deeper linguistic knowledge
    •  Potential for hybrid linguistic/statistical approach to inference

Page 28

Questions?

[email protected]