NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia Ming Zhou Manager of Natural Language Group Microsoft Research Asia


Post on 14-Dec-2015


Page 1: NLP Research at Internet Age An Overview of NLP at Microsoft Research Asia Ming Zhou Manager of Natural Language Group Microsoft Research Asia

NLP Research at Internet Age
An Overview of NLP at Microsoft Research Asia

Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia

Page 2

Trends of Internet Services

• Ecosystem for working with third-party apps
  – Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
• Real-time content collection and search
  – Twitter, Facebook, Del.icio.us, NYT, YouTube
• Mobile search
  – Contextual intent understanding
  – Towards decision making and action taking
• Social power
  – Social tags (likes) for general search engines
  – Search engines in SNS
  – Social QA

Page 3

Impact and Challenge to NLP Research

• Impact
  – Biggest database ever: connects data
  – Biggest social network: connects people
  – Harnessing collective intelligence
  – Contextual information processing: user, user's social network, location, time
  – Real-time information processing: collection, indexing, operation without delay
• Challenge
  – How to leverage data, people, and contextual information to achieve real-time information processing?

Page 4

Problems of Traditional NLP Approaches (NLP 1.0)

• Deep work on individual component technologies, which have reached their upper bounds
• Little consideration of scenarios, user needs, and market needs
• Serious data sparseness when relying on human annotation
• Evaluation bottleneck
• Slow deployment
• Lack of an effective framework for incorporating users' feedback

Page 5

New Strategy of NLP (NLP 2.0)

• Data collection from the web
• Domain-specific and open IE
• Contextual NLP
• Maximize at the system level, not at the level of individual components
• Earlier deployment on the Internet
• Make the best use of social factors

Page 6

Our Vision and Task

• Advanced NLP technologies
  – Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
  – Chinese, Japanese, English
• Multi-language information access
  – Statistical machine translation
  – Multi-language search
• Semantic computing
  – Sentiment analysis, event extraction, ontology learning
  – Understanding query intent and documents
  – Contextual NLP

Understand users and documents in any language, for any device and any application

Page 7

MSRA NLP Research Overview

Applications
• Chinese IME, Query speller, English writing wizard, News Search, Twitter Search, Pocket translator, Japanese IME, Meta data extraction, Couplet generation, Resume Routing, General web search, Chatbot, Comparison Shopping

Component techs
• Text analysis: Skeleton parser, Named entity identification, POS tagging, SLM
• Machine Translation: Translation evaluation, Tran. know. acquisition, Web mining for MT, SMT
• Information Extraction: Annotation tool, Machine learning, Term extraction
• Information Retrieval: Paraphrasing, Vertical search, Cross-language IR, NLP-enriched indexing and search, Query-doc relevance, Text mining

Data
• Columns: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)
• Resources: MRD, Translation lexicon, Bilingual corpus, Bilingual tagged corpus, Parsing lexicon, Tagged corpus, Balanced corpus

Page 8

Research Accomplishments

• Awards
  – MSRA Best Research Team (2010)
  – Finalist, WSJ Asian Innovation Awards (2010)
  – MS ARD Best Project (Engkoo)
  – MSRA Best Innovation (1998-2008): IME and Chinese couplets
• Academic impact
  – Best result in NIST 2008 SMT, CWMT 2008, and CWMT 2009
  – Best result in the SIGHAN 2006 bake-off on Chinese word segmentation
  – Best result in cross-language information retrieval in TREC-9 and NTCIR-III
  – 40 ACL papers, 9 SIGIR papers, 17 COLING papers (2000-2010)
  – PC chair and area chair of ACL
• Collaboration with universities
  – HIT joint lab on NLP, Speech and Search; Tsinghua joint lab on Media and Network
  – 400 interns in 12 years
  – Summer schools since 2001
  – PhD supervisors at universities

Page 9

Summer School on Information Extraction (Harbin, June 2005)

• Cheng Niu: Information extraction
• Frank Seide: Speech information extraction and search
• Hwee Tou Ng: Advanced topics of information extraction
• Chin-Yew Lin: Information extraction for automatic summarization

Page 10

Projects based on NLP 2.0

• Engkoo: Web-based English learning service
  – Data mining from the web
• Chinese couplets
  – Incorporate users' power into the system's evolution
• Semantic analysis and search of micro-blogging
  – Move to SNS, mobile

Page 11

Engkoo

Parallel data mining from the web

Video: http://video.sina.com.cn/v/b/37417609-1286528122.html

Page 12

Rapidly Changing Language

• Approximately 1.5 billion people speak English as a primary, secondary, or business language
• China: the largest “English-speaking” country, with 250 million English learners and USD 60 billion in annual spending
• Problem: a living language keeps acquiring new words and new meanings

Key insight: With billions of translated web pages and shareable repositories of language data growing every day, the Internet holds the sum of human language knowledge

Page 13

www.engkoo.com

Major Features:
• Endless Lexicon with Native Definitions
• State-of-the-Art Machine Translation (NIST OpenMT Winner)
• Real-time Interactive Alignment
• Human-Like TTS & Phonetic Search

Microsoft Products:
• Bing
• Office
• MSN

Page 14

Massive Dictionary Mined from the Web

Page 15

Fresh and Diverse Examples

Page 16

Advanced Search with Sentence Analysis

Page 18

Sentence Classification

Page 21

Learn Contextual Usage with Word Alignment

Page 22

Learn Contextual Usage with Word Alignment

Page 23

Learn Contextual Usage with Word Alignment

Page 24

Hints for Easily Confused Words
Page 26

Knowledge Mining Pipeline

Stages: Web Mining, Linguistic Parsing, Knowledge Mining, Multi-level Indexing
Data products: Mined Data, Parsed Data, Linguistic Knowledge, Indexed Data
Models: Machine Translation Model, Paraphrasing Model

Parallel sentence:
  He could hardly afford to waste that golden time.
  他无法浪费那样的好时光。

Linguistic parsing
• Tokenizing: he could hardly afford to waste that golden time. 他 无法 浪费 那样 的 好 时光。
• Skeleton parsing:
  (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
  (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
• Alignment: he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)

Knowledge mining
1. Word's idiomatic usage
   • Verb~Noun (decline~offer)
   • Verb~Adv (greatly~improve)
   • Adj~Noun (arduous~task)
   • Adv~Adj (extremely~bad)
2. Paraphrasing
   • turn_on~light, switch_on~light
   • laborious~task, hard~task
   • deeply~moved, deeply~touched
3. Collocation translations
   • 订~计划, make~plan
   • 订~旅馆, book~room
   • 订~杂志, subscribe to~magazine

Multi-level indexing
1. Single word
   “he”, “could”, “hardly”, “afford” etc.
   “他”, “无法”, “浪费” etc.
2. Single word with its POS
   “he_Pron”, “could_Verb”, “hardly_Adv” etc.
   “他_Pron”, “无法_Adv”, “浪费_Verb” etc.
3. Collocation
   “Tsub~he~afford”, “Tobj~time~waste” etc.
   “Tsub~他~浪费”, “ModAdv~无法~浪费” etc.
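The three index levels listed on this slide (single word, word with POS, collocation) can be illustrated with a small inverted index. This is only a sketch under assumptions: the function names `index_keys` and `build_index` are hypothetical, and the tokens and triples are copied by hand from the slide rather than produced by a real parser.

```python
from collections import defaultdict


def index_keys(tokens, triples):
    """Build the three levels of keys for one analyzed sentence."""
    words = [word for word, _ in tokens]                      # level 1: single word
    tagged = [f"{word}_{pos}" for word, pos in tokens]        # level 2: word with POS
    collocations = [f"{rel}~{dep}~{head}" for rel, dep, head in triples]  # level 3
    return words + tagged + collocations


def build_index(analyzed_sentences):
    """Map every key to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sent_id, (tokens, triples) in enumerate(analyzed_sentences):
        for key in index_keys(tokens, triples):
            index[key].add(sent_id)
    return index


# Analysis of the English example sentence, copied from the slide.
tokens = [("he", "Pron"), ("could", "Verb"), ("hardly", "Adv")]
triples = [
    ("Tsub", "he", "afford"),
    ("ModAdv", "hardly", "afford"),
    ("Tobj", "waste", "afford"),
    ("Tobj", "time", "waste"),
    ("AdjAttrib", "golden", "time"),
]
index = build_index([(tokens, triples)])
print(sorted(index["Tobj~time~waste"]))
```

A query such as `Tobj~time~waste` then retrieves sentences by syntactic relation rather than by surface word order, which is presumably what makes the collocation level useful for example search.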

Page 27

Chinese Couplets

Incorporate users' power into the system's evolution

Page 28

Chinese Couplets (http://duilian.msra.cn)

http://video.sina.com.cn/v/b/10937201-1452530713.html

Page 29

FS and SS Share the Same Style (FS = first sentence, SS = second sentence)

Repetition of pronunciations (音韵联)

风 (wind) ---- 水 (water)
吹 (blow) ---- 使 (make)
荞 (buckwheat) ---- 舟 (ship)
动 (wave) ---- 流 (go)
桥 (bridge) ---- 洲 (island)
未 (not) ---- 不 (not)
动 (wave) ---- 流 (go)

Page 30

FS and SS Share the Same Style

Decomposition of characters (拆字联)

有 (have) ---- 缺 (lack)
子 (son) ---- 鱼 (fish)
有 (have) ---- 缺 (lack)
女 (daughter) ---- 羊 (mutton)
方 (so) ---- 敢 (dare)
称 (call) ---- 叫 (call)
好 (good) ---- 鲜 (fresh)

好 = 女 + 子; 鲜 = 鱼 + 羊

Page 31

FS and SS Share the Same Style

Person name (人名联) and palindrome (回文联)

板桥 (Banqiao) ---- 东坡 (Dongpo)
造 (produce) ---- 居 (live)
桥 (bridge) ---- 坡 (mountain)
板 (board) ---- 东 (east)

• Banqiao (板桥) and Dongpo (东坡) are famous literary figures
• Reading from top to bottom is identical to reading from bottom to top
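The style constraints shown on Pages 29 to 31 can be expressed as simple checks on a candidate pair. The sketch below is illustrative only: the function names are hypothetical, the decomposition table holds just the two characters given on the slide, and the repetition check works on characters, so matching pronunciations (荞/桥, 舟/洲) would additionally need a pronunciation lookup that is not included here.

```python
# Decompositions taken from the slide: 好 = 女 + 子, 鲜 = 鱼 + 羊.
DECOMPOSITION = {"好": ("女", "子"), "鲜": ("鱼", "羊")}


def repetition_pattern(sentence):
    """Replace each character by the index of its first occurrence."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(sentence))


def same_repetition(fs, ss):
    """True if SS repeats characters at the same positions as FS."""
    return len(fs) == len(ss) and repetition_pattern(fs) == repetition_pattern(ss)


def is_palindrome(sentence):
    """True if the sentence reads the same in both directions."""
    return sentence == sentence[::-1]


def ends_with_composed_character(sentence):
    """True if the last character is built from characters used earlier."""
    parts = DECOMPOSITION.get(sentence[-1])
    return parts is not None and all(p in sentence[:-1] for p in parts)


print(same_repetition("风吹荞动桥未动", "水使舟流洲不流"))
print(is_palindrome("板桥造桥板"), is_palindrome("东坡居坡东"))
print(ends_with_composed_character("有子有女方称好"))
```

Checks of this kind are one plausible reading of what a linguistic filter for couplets has to enforce: whatever device the first sentence uses, the second sentence must mirror it position by position.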

Page 32

SS Generation Process

FS input: 海 阔 凭 鱼 跃 (sea wide allow fish jump)

Translation candidates
• Characters: 山 (hill), 天 (sky), 高 (high), 深 (deep), 任 (permit), 倚 (depend), 虫 (insect), 鸟 (bird), 虎 (tiger), 飞 (fly), 舞 (dance), 鸣 (tweedle)
• Phrases: 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)

1. SMT decoding
   山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
2. Linguistic filtering
   山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
3. Reranking
   天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……

Page 33

SS Generation Approach

• A multi-phase SMT approach
  – Phase 1: a phrase-based log-linear model
  – Phase 2: some linguistic filters
  – Phase 3: a Ranking SVM

Flow: FS input → Phrase-based log-linear model → N-best candidates → Linguistic filters → Ranking SVM model → SS output
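The three phases can be sketched end to end on the example from the previous slide. Everything numeric here is an assumption made for illustration: the translation table, the bigram scores, and both weight sets are invented; the model is character-based rather than phrase-based; the filter is one simple example rule; and the last phase uses a fixed linear scorer as a stand-in for a trained Ranking SVM.

```python
import itertools
import math

# Hypothetical translation table p(ss_char | fs_char); values are invented.
TRANSLATION = {
    "海": {"天": 0.6, "山": 0.4},
    "阔": {"高": 0.7, "深": 0.3},
    "凭": {"任": 0.8, "倚": 0.2},
    "鱼": {"鸟": 0.7, "虎": 0.3},
    "跃": {"飞": 0.6, "鸣": 0.4},
}
# Hypothetical bigram scores standing in for a language model.
BIGRAM = {"天高": 0.5, "山高": 0.4, "鸟飞": 0.6, "鸟鸣": 0.3}
WEIGHTS = {"tm": 1.0, "lm": 0.5}  # assumed log-linear feature weights


def features(fs, ss):
    """Feature functions h_k(fs, ss): translation and language model log-scores."""
    tm = sum(math.log(TRANSLATION[f][s]) for f, s in zip(fs, ss))
    lm = sum(math.log(BIGRAM.get(ss[i:i + 2], 0.05)) for i in range(len(ss) - 1))
    return {"tm": tm, "lm": lm}


def linear_score(fs, ss, weights):
    """Weighted sum of feature values, the core of a log-linear model."""
    h = features(fs, ss)
    return sum(weights[k] * h[k] for k in weights)


def decode(fs, n_best=8):
    """Phase 1: enumerate candidates and keep the N best by log-linear score."""
    options = [TRANSLATION[ch] for ch in fs]
    candidates = ("".join(chars) for chars in itertools.product(*options))
    return sorted(candidates, key=lambda ss: linear_score(fs, ss, WEIGHTS),
                  reverse=True)[:n_best]


def passes_filters(fs, ss):
    """Phase 2: same length, and no character shared between FS and SS."""
    return len(fs) == len(ss) and not set(fs) & set(ss)


def generate(fs):
    """Run decoding, filtering, and reranking in sequence."""
    kept = [ss for ss in decode(fs) if passes_filters(fs, ss)]
    # Phase 3: a fixed linear scorer standing in for a trained Ranking SVM.
    rank_weights = {"tm": 0.8, "lm": 1.2}
    return sorted(kept, key=lambda ss: linear_score(fs, ss, rank_weights),
                  reverse=True)


print(generate("海阔凭鱼跃")[:3])
```

Separating the phases this way keeps hard constraints (the filters) out of the statistical model, so the scorer only has to rank candidates that are already well-formed.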

Page 34

Great Examples

• FS: 月落乌啼霜满天
  SS: 风吹雁过雨连宵
• FS: 千江有水千江月
  SS: 万里无云万里星
• FS: 秦淮河桨声灯影
  SS: 松花江水色月光
• FS: 此木为柴山山出 (此 + 木 = 柴; 山 + 山 = 出)
  SS: 白水作泉日日昌 (白 + 水 = 泉; 日 + 日 = 昌)

Page 37

User Log for Model Enhancement

• Motivation
  – Training data is not adequate
  – User logs, by contrast, are large (60k/m), growing, and diverse
• What logs we record
  – User inputs
  – User-finalized couplets
    • Second sentences selected from the candidates provided by our system
    • User-modified second sentences

Page 38

User's Log Analysis

Data source: log from http://couplet.msra.cn
Time period: Aug. 31 to Oct. 9, 2006

Item                                         Count
Number of input sentences                    12,322
Number of unique input sentences              6,698
Users directly select from system output      3,459
Users manually modify system output             606
Saved as favorite couplets                      109
Invalid user input                              618
No second sentence generated                  2,211
Banner generation                             2,687
Generated banner selected as favorite           428
No banner output                                265
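To put the counts in proportion, the short calculation below converts a few of them into shares. The counts are copied from the table; the dictionary keys and the choice of denominators (all input sentences, or all banner generations) are assumptions made for this sketch, since the slide does not say how the categories overlap.

```python
# Counts copied from the log analysis table.
LOG_COUNTS = {
    "input_sentences": 12322,
    "unique_input_sentences": 6698,
    "selected_from_output": 3459,
    "manually_modified": 606,
    "saved_as_favorite": 109,
    "invalid_input": 618,
    "no_second_sentence": 2211,
    "banner_generation": 2687,
    "banner_favorite": 428,
    "no_banner_output": 265,
}


def share(part, whole):
    """Fraction of `whole` accounted for by `part`."""
    return LOG_COUNTS[part] / LOG_COUNTS[whole]


for part, whole in [
    ("unique_input_sentences", "input_sentences"),
    ("selected_from_output", "input_sentences"),
    ("manually_modified", "input_sentences"),
    ("no_second_sentence", "input_sentences"),
    ("banner_favorite", "banner_generation"),
]:
    print(f"{part} / {whole}: {share(part, whole):.1%}")
```

Read this way, roughly 28% of inputs ended with the user accepting a system candidate unchanged and roughly 5% with a manual edit, which are the two kinds of record the log-based enhancement relies on.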

Page 39

New Framework with Log Data

Flow: First sentence input → Source-channel model → N-best candidates → Re-ranking → Second sentence output

Data sources: Training data, Log data
Features shown: Translation model, Language model, Mutual information (listed twice in the diagram), User operation
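One way log data could feed a re-ranker is by turning each finalized couplet into preference pairs, where the sentence the user kept is preferred over the candidates the user passed over. The sketch below assumes this pairwise formulation; the `LogRecord` structure, the function names, and the two sample records are hypothetical and not taken from the actual logs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogRecord:
    first_sentence: str
    candidates: list  # second sentences proposed by the system
    final: str        # second sentence the user selected or edited


def preference_pairs(record):
    """Return (preferred, rejected) pairs for one finalized couplet."""
    return [(record.final, c) for c in record.candidates if c != record.final]


def user_operation_counts(records):
    """Count how often each (FS, SS) pair was finalized by users."""
    return Counter((r.first_sentence, r.final) for r in records)


# Hypothetical log records built from the example candidates.
records = [
    LogRecord("海阔凭鱼跃", ["山高任鸟飞", "天高任鸟飞", "天高任鸟鸣"], "天高任鸟飞"),
    LogRecord("海阔凭鱼跃", ["山高任鸟飞", "天高任鸟飞"], "天高任鸟飞"),
]
print(preference_pairs(records[0]))
print(user_operation_counts(records))
```

Pairs of this shape are the usual training input for pairwise rankers, and the counts could serve as a user-operation feature alongside the translation model, language model, and mutual information scores.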

Page 40

Twitter Search

Move to social internet and mobile

Page 41

Raw data
• Tweets
• Noise Filtering

Individual tweet
• Text Normalization, Classification, Sentence Boundary Detection
• NE Recognition, Dependency Parsing, Co-reference
• Semantic Role Labeling, Sentiment Analysis

A collection of tweets
• Tweets Cluster, Statistical Relationship Learning
• News & Images Link Extraction
• Community Extraction, User Influence Measure
• Hot tag, topic Extraction; Popular Tweet Extraction
• Top video, music, artists Extraction

Multi-level Indexing

Semantic Search
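A few of the stages in this diagram (noise filtering, text normalization, hot tag extraction) are simple enough to sketch directly. The rules below are assumptions chosen for illustration, not the rules of the actual system: URLs are stripped, runs of three or more identical characters are shortened to two, very short tweets are treated as noise, and hot tags are the most frequent hashtags.

```python
import re
from collections import Counter


def normalize(tweet):
    """Text normalization: drop URLs, squeeze repeated characters, lowercase."""
    text = re.sub(r"https?://\S+", "", tweet)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # e.g. "soooo" becomes "soo"
    return " ".join(text.split()).lower()


def is_noise(tweet, min_tokens=3):
    """Noise filtering: treat tweets with very little text as noise."""
    return len(normalize(tweet).split()) < min_tokens


def hot_tags(tweets, top_k=3):
    """Hot tag extraction over a collection: most frequent hashtags."""
    tags = (tag.lower() for t in tweets for tag in re.findall(r"#(\w+)", t))
    return Counter(tags).most_common(top_k)


tweets = [
    "Loving the new #NLP demo!!! http://t.co/x",
    "#nlp is soooo cool",
    "ok",
    "#Twitter search with #NLP",
]
kept = [t for t in tweets if not is_noise(t)]
print([normalize(t) for t in kept])
print(hot_tags(kept, top_k=2))
```

The split mirrors the diagram: `normalize` and `is_noise` act on an individual tweet, while `hot_tags` only makes sense over a collection.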

Page 42

Conclusion

• Internet trends and their impact on NLP
• NLP 2.0 strategy
• Web data mining: Engkoo
• Users' power: Couplets
• SNS and mobile: Twitter search