Natural Language Processing: L01 Introduction


Natural Language Processing: Unit 1 – Introduction

Anantharaman Narayana Iyer

narayana dot anantharaman at gmail dot com

7th Aug 2015

Topics

• Motivation: Why NLP?

• Course Outline

• Grading Policy

What are the opportunities for NLP?

NLP is a hugely important topic for both industry and academia

Trends that accelerate NLP research

• Availability of web and social data

• Mobile devices as a source of data

• Need for natural language based I/O for new devices

• ML techniques, e.g. deep learning

• Increasing availability of datasets on the open web, e.g. Freebase, DBpedia

Motivation

• Google Search Engine
• Intelligently responding to the query: e.g., Where is India Gate?

• Predicting next word for autocompletion

• Ability to do spelling corrections (see the sketch after this list)

• Segmenting words that may be joined without space

• Ranking the search results

• Google translate

• Gmail
• E.g., understand the contents of an e-mail through NLP and alert the user
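A minimal sketch of the edit-distance-based spelling correction mentioned above. The tiny vocabulary and frequency counts are made-up assumptions for illustration; a real search engine would use massive query logs and a noisy-channel model rather than this toy scoring.

```python
# Toy spelling corrector: pick the most frequent vocabulary word
# within a small edit distance of the misspelled query term.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])
            cur[j] = min(prev[j] + 1,     # deletion
                         cur[j - 1] + 1,  # insertion
                         sub)             # substitution / match
        prev = cur
    return prev[n]

# Hypothetical unigram counts standing in for a real frequency model.
VOCAB = {"india": 500, "gate": 200, "indiana": 50, "grate": 20, "gaze": 10}

def correct(word: str, max_dist: int = 2) -> str:
    """Return the most frequent vocabulary word within max_dist edits."""
    candidates = [(edit_distance(word.lower(), v), -count, v)
                  for v, count in VOCAB.items()]
    dist, _, best = min(candidates)
    return best if dist <= max_dist else word

print(correct("gte"))    # -> gate
print(correct("indai"))  # -> india
```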

Speech/NLP

• What technologies are involved here?

- Continuous speech recognition
- Keyword spotting
- Text to speech
- Speech-in, speech-out systems
- Speaker identification
- Novel applications (to be explained on the board)

Disambiguation

• Consider the example below.
• We would like to collect tweets on a subject (say, Rahul Gandhi) and analyse the sentiment

• We can do a search on Twitter with the Search API using the keywords: “Rahul Gandhi”

• This might miss tweets that have only the term Rahul and not Gandhi.

• If we just search for the search terms: [“Rahul”, “Gandhi”], we may get results that match any Rahul (e.g. Rahul Dravid or KL Rahul)

• We can do an intelligent tweet search using NLP techniques, as sketched below
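A minimal sketch of the "intelligent tweet search" idea, assuming hand-picked context keyword sets (POLITICS_CONTEXT and CRICKET_CONTEXT are illustrative inventions); a real system might instead use named-entity disambiguation or a trained classifier.

```python
# Disambiguate first-name-only matches by their surrounding context words.
import re

POLITICS_CONTEXT = {"congress", "election", "parliament", "rally", "minister"}
CRICKET_CONTEXT = {"wicket", "innings", "ipl", "batting", "runs"}

def is_about_rahul_gandhi(tweet: str) -> bool:
    words = set(re.findall(r"[a-z]+", tweet.lower()))
    if "rahul" not in words:
        return False
    if "gandhi" in words:        # full name present: accept directly
        return True
    # First name only: decide from the surrounding context words.
    return bool(words & POLITICS_CONTEXT) and not (words & CRICKET_CONTEXT)

tweets = [
    "Rahul Gandhi addressed the rally today",
    "Rahul spoke in parliament about the election",
    "What an innings by Rahul in the IPL!",
]
for t in tweets:
    print(is_about_rahul_gandhi(t), "-", t)   # True / True / False
```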

Summarization

• The challenge we face is not the lack of information but the overload.

• Summarization is a core technology that can help address information overload

• Related problems:
• How to validate the quality and correctness of information?

• Summarizing multimedia

• How do we summarize social data, where:

• Data may have less signal, more noise!

• Data may be biased

• Data may not be factual

• Repetitive

• Can we autogenerate a (set of) Tweet(s) from a news article?
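As one concrete instance of the core technology above, here is a minimal frequency-based extractive summarizer: score each sentence by the average frequency of its content words and keep the top-k. The stopword list is a small illustrative assumption; this is a classic baseline, not a full summarization system.

```python
# Frequency-baseline extractive summarization.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "that"}

def summarize(text: str, k: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(s: str) -> float:
        toks = [w for w in re.findall(r"[a-z]+", s.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    # Keep the k highest-scoring sentences, restored to original order.
    top = sorted(sentences, key=score, reverse=True)[:k]
    return sorted(top, key=sentences.index)

text = ("NLP is growing fast. Summarization helps address information "
        "overload. Social data is noisy. Summarizing social data is hard.")
print(summarize(text, k=2))
```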

Answer Evaluation

• Answer evaluation is a core challenge for online education systems.

• Wouldn’t it be nice if questions could be both descriptive and objective?

• Can there be an automated answer evaluation system that doesn’t require peer evaluation?

Sentiment Analysis

• Measuring the pulse of the people from social media

• Can measure sentiment towards a brand, product, or event.

• A crowded space, but not a fully solved problem due to inherent challenges in natural language processing

• Can we build a sentiment analyser using RNNs and evaluate the performance?
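The RNN analyser proposed above is a course project in itself; as a point of comparison, here is a minimal lexicon-based baseline with naive negation handling. The lexicon and negator sets are tiny illustrative assumptions; real lexicons contain thousands of scored entries.

```python
# Lexicon-based sentiment baseline (not the RNN approach the slide proposes).
import re

LEXICON = {"good": 1, "great": 2, "love": 2, "excellent": 2,
           "bad": -1, "terrible": -2, "hate": -2, "poor": -1}
NEGATORS = {"not", "never", "no"}

def sentiment(text: str) -> int:
    """Sum lexicon scores, flipping the sign after a negator."""
    score, flip = 0, 1
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in NEGATORS:
            flip = -1                    # negate the next scored word
        elif tok in LEXICON:
            score += flip * LEXICON[tok]
            flip = 1
    return score

print(sentiment("I love this brand"))    # positive (2)
print(sentiment("not a good product"))   # negative (-1)
```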

Plagiarism Detection

Dialog Systems

• Dialog systems that can be deployed commercially?
• Natural language processing
• Natural language generation

Can we build an NLG library and make it open source?

Demo

• http://www.manifestation.com/neurotoys/eliza.php3

Course Structure

• Foundational

• Emerging

• Applications

Course Positioning

• Classical NLP techniques (such as language models, MaxEnt classifiers, HMM, CRF, etc.) have proven effective in addressing problems like part-of-speech tagging, text classification, and information retrieval. However, they are inadequate when dealing with problems that involve more semantics.

• Modern approaches (such as deep learning) hold a lot of promise in addressing problems involving semantics. They have also been shown to produce results better than or equal to classical techniques for typical NLP tasks.

• Internationally acclaimed courses, like those offered by Dan Jurafsky, Christopher Manning, and Michael Collins on Coursera, and also those offered at Stanford, are strong on traditional topics and somewhat light on emerging topics.

• The recent course by Socher at Stanford is heavy on recurrent-network-based approaches but assumes that the student is already fairly familiar with traditional NLP.

• Our course takes the best of both worlds and backs it up with intense hands-on work.

Key Topics

• Foundational

• Words, sentences: Tokenization, regular expressions, challenges of ambiguity, edit distance, spelling corrections, string similarity, tf, tf-idf

• Stemming, Lemmatization

• Language models, smoothing, applications to speech, metrics

• Tagging problems: Viterbi algorithm (HMM), POS, NER tagging, SRL (see the sketch after this list)

• Parsing: PCFG, CKY algorithm

• Information Retrieval, Information Extraction, Word Sense disambiguation, Summarization, Q&A systems, Dialogue Systems

• Natural Language Generation

• Emerging approaches:
• Deep learning and vector space approaches to: word representation, sentence and text compositionality, LM, parsing, Q&A systems
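To make the tagging topic concrete, here is a minimal Viterbi decoder for a toy two-tag HMM. The hand-set start, transition, and emission probabilities are illustrative assumptions; a real tagger estimates them from a labeled corpus such as a treebank.

```python
# Viterbi decoding for a toy noun/verb HMM tagger.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under the HMM."""
    # V[i][t] = (best probability of tagging words[:i+1] ending in t, path)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-8), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            # Choose the best previous tag to transition from.
            prob, path = max(
                (V[-1][pt][0] * trans_p[pt][t] * emit_p[t].get(w, 1e-8),
                 V[-1][pt][1]) for pt in tags)
            row[t] = (prob, path + [t])
        V.append(row)
    return max(V[-1].values())[1]

tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"time": 0.4, "flies": 0.2, "arrow": 0.4},
          "V": {"time": 0.1, "flies": 0.6, "like": 0.3}}

print(viterbi(["time", "flies"], tags, start_p, trans_p, emit_p))
# -> ['N', 'V']
```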

• Applications:
• Modern approaches to many exciting applications, including speech

Course Grading Policy

• Unit Evaluations (3 out of 5): 30%

• Lab sessions (2 out of 5): 10%

• T1: 15%

• Final Exam: 3 days, 6 to 8 hours per day of product development (run like a hackathon, with a 90-minute objective-type written test on day 1): 15% (for the test) + 25% (for hands-on)

• Attendance: 5%

Challenges: Why is NLP hard?

The central challenge of natural language processing is ambiguity, and it exists at every level or stage of NLP.

Poets and writers thrive on ambiguity in language semantics, while most of us abhor ambiguity!

Can NLP understand poetry, or better still, can it generate a poem?

That seems to be the ultimate goal!

Another challenge is representation: how do we represent words? Sentences? Large texts? How do we model real-world knowledge?

One prayer, 25 interpretations! (Ref: Raghuvamsha by Kalidasa)

Vagarthaviva sampriktau vagarthah pratipattaye | Jagatah pitarau vande parvathi parameshwarau || – Raghuvamsha 1.1

• Common meaning: I pray to the parents of the world, Lord Shiva and Mother Parvathi, who are as inseparable as speech and its meaning, to gain knowledge of speech and its meaning.

Ambiguity – some examples

• Homophones: words with the same pronunciation but different meanings
• Peace, piece: a spoken sentence like “The PM attended the peace summit” is ambiguous at the term “peace”, as a speech-to-text system might transcribe it as “piece”

• Knew, new

• Weak, week

• Word boundary
• It’s all ready, looking great!
• It’s already looking great!

• Syntactic ambiguity: arises due to different parse trees for the same input (see the parser sketch after this list)
• Phrase boundary
• Ananth created the presentation with video from the web: ‘with video’ can attach to the verb, as in “Ananth created the presentation, ‘with video’”, or to the noun, as in “Ananth created the ‘presentation with video’”

• Semantic-level ambiguity: many ways to interpret a sentence
• John and Susan are married (to each other? Separately?)

• Ram had a smooth sailing.

• Prices have gone through the roof

• India says it can’t accept the proposal
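The phrase-boundary ambiguity above can be reproduced with a toy grammar: NLTK's chart parser returns both parse trees, one attaching the PP "with video" to the verb phrase and one to the noun phrase. This sketch assumes nltk is installed (no corpus downloads are needed), and the grammar is a deliberately tiny illustration, not a realistic one.

```python
# Two parse trees for the same sentence: classic PP-attachment ambiguity.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
VP  -> V NP | VP PP
NP  -> 'Ananth' | N | Det N | NP PP
PP  -> P NP
Det -> 'the'
N   -> 'presentation' | 'video'
V   -> 'created'
P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "Ananth created the presentation with video".split()
for tree in parser.parse(sentence):
    print(tree)   # prints two trees: PP under VP, and PP under NP
```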

Representation: Text, Images, Audio, Video

• What are the distinguishing characteristics of text data, and what are the unique challenges?

• Text is made of words; images, of pixels; audio, of a sampled and digitized signal; video, of image frames in motion

• How do we represent a piece of text in the computer?

• Let’s do a simple exercise: what thoughts and emotions cross your mind when you hear the following words?
• Kalam

• Brilliant

• Pleasant

• Destruction

• Perfume

• Code

• Test

• Run

• Signal

• Words can be used in different contexts, and the context is key to interpreting the meaning of the word (see the sketch below)
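A minimal sketch of the representation question on a toy corpus: a one-hot vector treats each word as an isolated symbol, while a neighbour-count (co-occurrence) vector lets the surrounding contexts reveal how a word like "run" is used. The corpus is an illustrative assumption; this anticipates the vector-space approaches listed under emerging topics.

```python
# One-hot versus context (co-occurrence) representations of words.
from collections import Counter

corpus = [
    "run the test code",
    "run the signal test",
    "athletes run fast",
]

# One-hot: each word is a basis vector with no notion of context.
vocab = sorted({w for line in corpus for w in line.split()})
one_hot = {w: [int(i == j) for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Context representation: count neighbours within a +/-1 window.
contexts = {w: Counter() for w in vocab}
for line in corpus:
    toks = line.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                contexts[w][toks[j]] += 1

print(one_hot["run"])    # e.g. [0, 0, 0, 1, 0, 0, 0] – context-free symbol
print(contexts["run"])   # the neighbours hint at the senses of "run"
```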