Big Data and Natural Language Processing

www.decideo.fr/bruley Natural Language Processing, June 2013, Michel Bruley

Post on 20-Oct-2014

DESCRIPTION

Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.

TRANSCRIPT

Page 1: Big Data and Natural Language Processing

www.decideo.fr/bruley

Natural Language Processing

June 2013

Michel Bruley

Page 2: Big Data and Natural Language Processing

Natural Language Processing (NLP)

NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.

NLP is considered a subfield of artificial intelligence and overlaps significantly with computational linguistics. It is concerned with the interactions between computers and human (natural) languages.

– Natural language generation systems convert information from computer databases into readable human language.

– Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate.

NLP encompasses both text and speech, but work on speech processing has evolved into a separate field.

Page 3: Big Data and Natural Language Processing

Where does it fit in the CS* taxonomy?

Computers
– Databases
– Artificial Intelligence
  – Robotics
  – Natural Language Processing
    – Information Retrieval
    – Machine Translation
    – Language Analysis
      – Semantics
      – Parsing
  – Search
– Algorithms
– Networking

* CS = Computer Science

Page 4: Big Data and Natural Language Processing

Why Natural Language Processing?

Applications for processing large amounts of texts require NLP expertise

– Classify text into categories, index and search large texts: classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative)
– Extracting data from text: converting unstructured text into structured data
– Information extraction: discover names of people and events they participate in, from a document, …
– Automatic summarization: condense 1 book into 1 page, …
– Speech processing, artificial voice: get flight information or book a hotel over the phone, …
– Question answering: find answers to natural language questions in a text collection or database
– Spelling & grammar corrections
– Plagiarism detection
– Automatic translation
– Etc.

Page 5: Big Data and Natural Language Processing

The problem

When people see text, they understand its meaning (by and large)

According to research, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe.

When computers see text, they get only character strings (and perhaps HTML tags)

We'd like computer agents to see meanings and be able to intelligently process text

These desires have led to many proposals for structured, semantically marked up formats

But often human beings still resolutely make use of text in human languages

This problem isn’t likely to just go away
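The scrambled passage on this slide can be reproduced with a short sketch. It shuffles only the interior letters of each word and then counts how many words a plain string comparison no longer recognizes, which is all a program sees without further processing. The function name `scramble_interior`, the fixed seed, and the sample sentence are assumptions made for illustration.

```python
import random
import re

def scramble_interior(text, seed=0):
    """Shuffle the interior letters of each word; first and last stay put."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable

    def scramble_word(match):
        word = match.group(0)
        if len(word) <= 3:
            return word  # at most one interior letter: nothing to shuffle
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    # Only runs of ASCII letters count as words; spaces and punctuation are kept.
    return re.sub(r"[A-Za-z]+", scramble_word, text)

original = "the first and last letter are in the right place"
scrambled = scramble_interior(original)
print(scrambled)

# To a program these are just character strings: count exact-match failures.
mismatches = sum(a != b for a, b in zip(original.split(), scrambled.split()))
print(mismatches, "of", len(original.split()), "words differ as strings")
```

A human reader recovers the sentence easily, while exact string matching treats every shuffled word as an unknown token.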

Page 6: Big Data and Natural Language Processing

Example: Natural language understanding

Raw speech signal

• Speech recognition

Sequence of words spoken

• Syntactic analysis using knowledge of the grammar

Structure of the sentence

• Semantic analysis using info. about meaning of words

Partial representation of meaning of sentence

• Pragmatic analysis using info. about context

Final representation of meaning of sentence

Natural language understanding process – Prof. Carolina Ruiz
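The four stages above form a chain in which each stage consumes the previous stage's output. A minimal sketch of that data flow follows. Every function here is a hypothetical stub that only wraps its input in a label; a real system would put a speech recognizer, a parser, and semantic and pragmatic models behind these names.

```python
# Hypothetical stubs: each one only labels its input to show the data flow.
def speech_recognition(signal):
    return f"words({signal})"              # sequence of words spoken

def syntactic_analysis(words):
    return f"structure({words})"           # structure of the sentence

def semantic_analysis(structure):
    return f"partial_meaning({structure})" # partial representation of meaning

def pragmatic_analysis(partial):
    return f"final_meaning({partial})"     # final representation of meaning

PIPELINE = [
    speech_recognition,
    syntactic_analysis,
    semantic_analysis,
    pragmatic_analysis,
]

def understand(raw_signal):
    """Pass the raw signal through every stage in order."""
    result = raw_signal
    for stage in PIPELINE:
        result = stage(result)
    return result

print(understand("raw speech signal"))
```

Keeping the stages in a list makes the ordering explicit and lets a stage be swapped out without touching the others.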

Page 7: Big Data and Natural Language Processing

Example detail: Syntactic Analysis

• Syntactic analysis breaks a sentence into phrases arranged in a hierarchical structure, allowing the study of its constituents.

• For example, the sentence “the big cat is drinking milk” can be broken up into the following constituents:

[Parse tree of “The big cat is drinking milk”. Top level: Noun Phrase, Verb Phrase. Lower levels: Determiner, Adjective Phrase, Noun, Auxiliary, Verb, Noun Phrase.]
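A hierarchical structure like this is commonly held as a nested data structure. The sketch below stores the example sentence as nested tuples and walks the tree. The exact bracketing (which word hangs under which label) is one plausible reading of the slide's diagram and is an assumption, as are the root label `Sentence` and the helper names `leaves` and `render`.

```python
# A node is (label, child, child, ...); a leaf is (label, word).
# The bracketing below is an assumed reading of the slide's diagram.
TREE = (
    "Sentence",
    ("Noun Phrase",
        ("Determiner", "The"),
        ("Adjective Phrase", "big"),
        ("Noun", "cat")),
    ("Verb Phrase",
        ("Auxiliary", "is"),
        ("Verb", "drinking"),
        ("Noun Phrase", "milk")),
)

def leaves(node):
    """Return the words under a node, left to right."""
    _label, *children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)          # a word at the bottom of the tree
        else:
            words.extend(leaves(child))  # recurse into a sub-constituent
    return words

def render(node, depth=0):
    """Return the tree as indented lines, one constituent per line."""
    label, *children = node
    indent = "  " * depth
    if len(children) == 1 and isinstance(children[0], str):
        return [f"{indent}{label}: {children[0]}"]
    lines = [f"{indent}{label}"]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(TREE)))
print(leaves(TREE[2]))  # words covered by the verb phrase
```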

Page 8: Big Data and Natural Language Processing

Why NLP is difficult

Language is flexible
– New words, new meanings
– Different meanings in different contexts

Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture

Language is complex!

Page 9: Big Data and Natural Language Processing

Why NLP is difficult

MANY hidden variables

– Knowledge about the world

– Knowledge about the context

– Knowledge about human communication techniques

• Can you tell me the time?

Problem of scale

– Many (infinite?) possible words, meanings, context

Problem of sparsity

– Statistical analysis is very difficult because most things (words, concepts) have never been seen before

Long range correlations
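The sparsity problem on this slide can be made concrete with a toy count. The sketch below builds word counts from a few "seen" sentences and then measures how many words of a new sentence were never observed. The tiny corpus reuses example sentences from the earlier slide and is far too small to say anything about real text; it only illustrates the mechanism.

```python
from collections import Counter

# Toy "training" text and a new sentence (illustrative data only).
seen_text = "he arrived at the lecture . he chuckled at the lecture".split()
new_text = "he chuckled his way through the lecture".split()

counts = Counter(seen_text)

# Words in the new sentence that have a count of zero in the seen text.
unseen = [word for word in new_text if word not in counts]
unseen_rate = len(unseen) / len(new_text)

print("unseen words:", unseen)
print(f"unseen rate: {unseen_rate:.2f}")
```

Any model that relies purely on counts has nothing to say about the unseen words, which is why smoothing and generalization matter in statistical NLP.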

Page 10: Big Data and Natural Language Processing

Why NLP is difficult

Key problems:

– Representation of meaning

– Language presupposes knowledge about the world

– Language only reflects the surface of meaning

– Language presupposes communication between people

Page 11: Big Data and Natural Language Processing

Patented Natural Language Processing (NLP) “Reads” Every Communication

Each data feed is parsed through one or more of the 7 NLP engines

…it is then deconstructed to provide context, subject, and other information regarding the customer (gender, name etc.)

Finally each identified customer is matched back to the Discovery platform data to gain a full view

Natural language processing (NLP) is the study of the interactions between computers and natural languages (e.g., English, Polish). The crucial challenge that NLP addresses is in deriving meaning from human or natural language input and allowing consumers to analyze parsed meanings in large volumes.

Page 12: Big Data and Natural Language Processing

For Example…

I bought an iPad2 for my mom last week. She loves the weight, but doesn’t like the color. She wishes it came in blue. She says if it came in blue, then she’d buy one for all her friends

– Entities (brands, people, locations, times, products…)
– Events and relationships (purchasing event, my mom…)
– Sentiment (product specifications)
– Suggestions (feature specifications)
– Intent (to purchase, to leave)
– Geo/Temporal

QUESTION: Why is this a big deal?

NLP takes a simple English statement, parses it into the categories above (and more categories) and VOILA… we get STRUCTURED DATA
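To make "structured data" concrete, here is what a record for the iPad2 statement might look like once it is sorted into the categories on this slide. The record is filled in by hand for illustration; it is not the output of any NLP engine, and the field names and value formats are assumptions.

```python
import json

# Hand-written illustration: field names and values are assumptions,
# chosen to mirror the categories listed on the slide.
record = {
    "entities": {
        "product": "iPad2",
        "people": ["I", "my mom"],
        "time": "last week",
    },
    "events": [
        {"type": "purchase", "buyer": "I", "recipient": "my mom"},
    ],
    "sentiment": [
        {"aspect": "weight", "polarity": "positive"},
        {"aspect": "color", "polarity": "negative"},
    ],
    "suggestions": ["offer the product in blue"],
    "intent": [
        {"type": "purchase", "condition": "available in blue",
         "for": "all her friends"},
    ],
}

print(json.dumps(record, indent=2))

# Once the text is a record, ordinary queries work on it.
negatives = [s["aspect"] for s in record["sentiment"]
             if s["polarity"] == "negative"]
print(negatives)
```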

Page 13: Big Data and Natural Language Processing

Architecture

[Architecture diagram of the Aster Discovery Platform. Labeled components:
– Attensity Pipeline: real-time annotated social media data feed, 150+ million social and online sources
– Other unstructured data: emails, surveys, CRM notes, …
– Pipeline Connector
– ASAS Wrapper (SQL MR)
– NLP
– ETL
– “Now-structured” data
– Customers / Sales / Other data
– Churn Score (SQL MR)
– Visualization (e.g., Tableau, MSTR)
– Predictive]

Page 14: Big Data and Natural Language Processing

Aster + Attensity = Competitive Advantage

This integration provides types, subtypes, super types (“Savings”, “Checking”, “Investment”)

Anaphora resolution: connects pronouns (“He”, “Him”) to the subject they refer to (George Harrison) without the full name being repeated

Supports languages other than English

Attensity’s Semantic Annotation Server (ASAS) capabilities

– Entity Extraction: Automatic detection and extraction of more than 35 entities such as Name, Place
– Uses Attensity Triples to create context on entities and identify verbs, relationships, actions
– Auto Classification: Uses custom classification rules to classify articles by content, sort by relevance, and discover repeated information
– Exhaustive Extraction: Application of linguistic principles to extract context, entities, and relationships similar to how the human mind would
– Voice Tags: identify types of statements and auto-classify them (Question, Intent, Conditional)
– Creates a unique identifier for each entity for cross reference

Page 15: Big Data and Natural Language Processing

Structuring Unstructured Data: Process Flow

The flight was delayed and flight attendant would not give us any new information.

Page 16: Big Data and Natural Language Processing

How Triples are Extracted & Structured

Database Record from a Customer Survey

date      region  rec?  source
10-02-06  0006    4     telephone

Why would you recommend/not recommend?
The flight was delayed and flight attendant would not give us any new information.

Extract: Extract relational facts & Triples from Notes field

Who/What  Behavior  Fact/Triple
flight    delay     flight : delay

New Table: Customer Reactions
Same Record with Relational Facts Extracted from Notes Field

Original Structured Data          Newly Structured Data Provided by Attensity
date     region  source     rec?  who-what     Behavior     Fact/Triple
10-2-12  0006    telephone  4     flight       delay        flight : delay
10-2-12  0006    telephone  4     information  give [not]   information : give [not]
1-1-13   0007    e-mail     8     i            happy [not]  i : happy [not]
1-1-13   0007    e-mail     8     rep          rude         rep : rude
1-1-13   0007    e-mail     8     flight       cancel       flight : cancel

Then Fuse: Populate new table with attribute values and fuse with structured data.
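The extract-then-fuse step on this slide amounts to a join between the original survey records and the triples pulled from the notes field. The sketch below uses the values shown in the Customer Reactions table. The `record_id` key, the `fuse` function, and the column names are hypothetical; the slide does not say how Attensity or Aster link a triple back to its source record.

```python
# Original structured data, keyed by a hypothetical record_id.
survey_records = {
    1: {"date": "10-2-12", "region": "0006", "source": "telephone", "rec": 4},
    2: {"date": "1-1-13", "region": "0007", "source": "e-mail", "rec": 8},
}

# Extracted facts as (record_id, who_what, behavior), copied from the slide.
triples = [
    (1, "flight", "delay"),
    (1, "information", "give [not]"),
    (2, "i", "happy [not]"),
    (2, "rep", "rude"),
    (2, "flight", "cancel"),
]

def fuse(records, extracted):
    """Build one output row per triple, carrying the source record's columns."""
    rows = []
    for record_id, who_what, behavior in extracted:
        row = dict(records[record_id])  # copy the original structured columns
        row["who_what"] = who_what
        row["behavior"] = behavior
        row["fact"] = f"{who_what} : {behavior}"
        rows.append(row)
    return rows

customer_reactions = fuse(survey_records, triples)
for row in customer_reactions:
    print(row)
```

One survey record fans out into several rows, one per extracted fact, which is what lets the free-text notes be filtered and aggregated alongside the original columns.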

Page 17: Big Data and Natural Language Processing

Team Power