Natural Language Processing using Java

Sang Venkatraman

April 21, 2015

Agenda

• Text Retrieval and Search

• Implementing Search

• Evaluating Search Results

• NLP - Document Level Analysis

• Parsing and Part of Speech Tagging

• Entity Extraction

• Word Sense Disambiguation

• Concept Extraction

• Concept Polarity

• NLP - Sentence Level Analysis

• Document Summarization

• Dependency Analysis and Coreference

• Example Question Parsing System

• Sentiment Analysis

• Final Thoughts/Questions

Text Retrieval and Search

• A collection of text documents exists in a system. This is called the corpus.

• The documents are preprocessed and indexed before query time.

• The user performs a query. The query defines one or more concepts that the user is interested in, for example “Thai restaurant in Atlanta”.

• The search engine is expected to retrieve the most relevant documents based on a ranking function.

• The search engine can also apply some heuristics based on user feedback (such as always ignoring a specific document) to further prune the results.

Search - Vector Space Model

• Term: a word or a sequence of words (n-grams)

• Each term defines one dimension

• Query Vector: q = (X1,…,Xn)

• Document Vector: d = (Y1,…,Yn), over the same n term dimensions as the query

• relevance (q,d) ~ similarity(q,d)
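
One common way to turn "similarity(q,d)" into a number is cosine similarity between the two term vectors. The sketch below is only an illustration of that idea: the class name, the tiny vocabulary and the use of raw term counts as vector weights are assumptions, not something the slides prescribe.

```java
import java.util.Arrays;
import java.util.List;

class CosineSimilarity {

    // Builds a term-count vector over a fixed vocabulary (one dimension per term).
    static double[] toVector(List<String> vocabulary, List<String> tokens) {
        double[] v = new double[vocabulary.size()];
        for (String token : tokens) {
            int i = vocabulary.indexOf(token);
            if (i >= 0) {
                v[i] += 1.0;
            }
        }
        return v;
    }

    // cosine(q, d) = (q . d) / (|q| * |d|); returns 0 when either vector is all zeros.
    static double cosine(double[] q, double[] d) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        if (qNorm == 0.0 || dNorm == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary and documents, for illustration only.
        List<String> vocabulary = Arrays.asList("thai", "restaurant", "atlanta", "pizza");
        double[] q = toVector(vocabulary, Arrays.asList("thai", "restaurant", "atlanta"));
        double[] d1 = toVector(vocabulary, Arrays.asList("thai", "restaurant"));
        double[] d2 = toVector(vocabulary, Arrays.asList("pizza", "restaurant", "atlanta"));
        System.out.println("d1: " + cosine(q, d1)); // roughly 0.82
        System.out.println("d2: " + cosine(q, d2)); // roughly 0.67
    }
}
```

Here d1 ranks above d2 because it shares the same number of terms with the query but contains nothing unrelated relative to its length.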

Preparing Text for Search

• Tokenization: For each document, we split it into paragraphs, split paragraphs into sentences and sentences into words.

• Word Normalization:

• Index text and query terms must have the same form, e.g. U.S.A and USA should match

• Usually lower cased

• Stop word Removal: An optional step where a predefined list of stop words is removed. More important for small corpora

• Stemming - Reduce terms to their stems

• Language dependent - in English, a word has 2 parts, the stem and the affix

• automate(s), automatic, automation => automat, plural forms like cats => cat

• The “stem” may not be an actual word, for example consolidating => consolid

http://snowball.tartarus.org/algorithms/english/stemmer.html
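
The preparation steps above (tokenization, normalization, stop word removal) can be sketched with plain Java string handling. This is a minimal sketch under stated assumptions: the class name and the stop word list are made up for illustration, periods are simply dropped so that U.S.A and USA normalize to the same term, and stemming is left out because it would normally come from a library such as the Snowball stemmer linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextPreparation {

    // Hypothetical stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "an", "in", "of"));

    // Lower-cases the text, drops periods (U.S.A -> usa), splits on
    // non-alphanumeric characters and removes stop words.
    static List<String> prepare(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).replace(".", "");
        List<String> terms = new ArrayList<>();
        for (String token : normalized.split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(prepare("The rose is red"));              // prints [rose, red]
        System.out.println(prepare("U.S.A").equals(prepare("USA"))); // prints true
    }
}
```

The same `prepare` step has to be applied to both the indexed text and the query, otherwise the terms will not match at query time.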

The inverted index part of the image is taken from http://butchiso.com/assets/posts/mysql-full-text-search-p3/inverted_index.png

Search Example

• For any given term in the query:

• Term Frequency (TF) - The number of times a term occurs in a document. Normalize this by the total number of terms in a document.

• Document Frequency (DF) - The number of documents that the term occurs in

• Inverse Document Frequency (IDF) - The inverse of the document frequency (often log-scaled). It is high for rare terms and low for common terms.

• Simple ranking of documents for a query

• For all the terms in the query, sum up the product of TF and IDF. This can be used to rank the results with the documents with the highest tf-idf on top.

• Example:

• Document 1 = “The rose is red”

• Document 2 = “Red shoe”

• Query 1 = “Red” => both Document 1 and Document 2 are returned with the same score, because after removing stop words each document contains “red” once out of two terms
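
The ranking described above can be sketched in a few lines. The class and method names are hypothetical, the documents are assumed to be already tokenized with stop words removed, and the IDF uses a smoothed form log(1 + N/DF), which is one of several common variants and is an assumption here (the slides only say "inverse").

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TfIdfRanking {

    // TF: occurrences of the term in the document, normalized by document length.
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        return (double) Collections.frequency(doc, term) / doc.size();
    }

    // IDF with smoothing log(1 + N/DF), an assumed variant; 0 when no document has the term.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : Math.log(1.0 + (double) corpus.size() / df);
    }

    // Score of a document for a query: sum over the query terms of TF * IDF.
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double total = 0.0;
        for (String term : query) {
            total += tf(term, doc) * idf(term, corpus);
        }
        return total;
    }

    public static void main(String[] args) {
        // Documents from the example, after lower-casing and stop word removal.
        List<String> doc1 = Arrays.asList("rose", "red");
        List<String> doc2 = Arrays.asList("red", "shoe");
        List<List<String>> corpus = Arrays.asList(doc1, doc2);
        List<String> query = Arrays.asList("red");
        System.out.println("Document 1: " + score(query, doc1, corpus));
        System.out.println("Document 2: " + score(query, doc2, corpus));
    }
}
```

Both documents get the same score for the query “red”, which matches the example: the term occurs once in each and both have two terms after stop word removal.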

Evaluating Search Results

• Search results can be evaluated by 2 metrics that encourage two kinds of algorithm behavior:

• High Precision - Very few false positives. Critical for systems that cannot make a wrong recommendation.

• High Recall - Very few misses. Critical for systems where every missed opportunity needs to be minimized but there is a low cost associated with a false positive.

• FMeasure - The harmonic mean of precision and recall. It tries to balance out the explorative nature of search with the preciseness of the results.

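• Let a = relevant documents that were retrieved, b = relevant documents that were not retrieved, c = non-relevant documents that were retrieved

• precision = a/(a + c)

• recall = a/(a + b)

• fMeasure = 2 * precision * recall/(precision + recall)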

• Example:

• retrieved documents = 5, relevant documents = 10

• relevant documents within the 5 retrieved results = 4

• precision = 4/5 = 0.8, recall = 4/10 = 0.4, FMeasure = 0.53
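
The arithmetic in this example can be checked with a small helper. This is a sketch only; the class and method names are made up for illustration, and the zero guards are an added assumption for the case where nothing is retrieved or nothing is relevant.

```java
import java.util.Locale;

class SearchMetrics {

    // precision = relevant retrieved / all retrieved
    static double precision(int relevantRetrieved, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) relevantRetrieved / retrieved;
    }

    // recall = relevant retrieved / all relevant
    static double recall(int relevantRetrieved, int relevant) {
        return relevant == 0 ? 0.0 : (double) relevantRetrieved / relevant;
    }

    // Harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Numbers from the example: 5 retrieved, 10 relevant, 4 relevant among the retrieved.
        double p = precision(4, 5);
        double r = recall(4, 10);
        System.out.println(String.format(Locale.ROOT, "precision=%.2f recall=%.2f fMeasure=%.2f",
                p, r, fMeasure(p, r)));
        // prints precision=0.80 recall=0.40 fMeasure=0.53
    }
}
```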

Table from https://www.coursera.org/course/textretrieval

Section Summary

• In this section, we applied NLP techniques across an entire corpus. This is where frameworks like MapReduce play an important role.

• The NLP techniques by themselves were shallow but were able to implicitly handle compound words and stop words.

• Introduced a simple formula for ranking and retrieving search results. Real-world systems involve more complex probabilistic models like BM25 that follow the same principles.

• Reviewed some techniques for evaluating search algorithms. These simple approaches can also be used for other NLP and machine learning problems.

http://en.wikipedia.org/wiki/Okapi_BM25

Big Data is for Losers. I’m into Small Data now.

Extracting Concepts From Text

• We apply various NLP techniques to analyze the contents of a document. Some examples are:

• Mentions of people, places, locations etc.

• Central Themes or concepts in the document

• This is different from search

• Search follows a pull model where the users take initiative in querying the system for relevant documents.

• In concept extraction, we can infer abstract concepts from text and push them to interested users. We may also be able to infer the concepts a user is interested in based on the content they consume.

Concept Extraction - Motivation

Sentence Segmentation

• Periods are ambiguous - Abbreviations, decimals etc.

• !, ? - Less ambiguous

• Classifier - rules (using case, punctuation rules etc.), ML etc.

• StanfordNLP sentence detection and tokenizer

• Trained on the Penn Treebank dataset and is hence better suited to more formal English.

• OpenNLP has a sentence detector and tokenizer as well.

• Both these libraries perform pretty well for English and there is not much to choose between them. They can also be retrained.

http://nlp.stanford.edu/software/tokenizer.shtml

https://opennlp.apache.org

https://github.com/dpdearing/nlp

Part of Speech Tagging using StanfordNLP

• The StanfordNLP part-of-speech tagger is quite accurate (~90%) and is based on a maximum entropy model.

TAG          POS            TAG   POS
DT           Determiner     PRP   Pronoun
JJ,JJR,JJS   Adjective      VB    Verbs
NN,NNS       Noun           IN    Preposition
NNP,NNPS     Proper Noun    CC    Conjunction

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Named Entity Recognition

• Named Entity Recognition is the NLP task of recognizing proper nouns in a document.

• Named Entity Recognition consists of three steps:

• Spotting: Statistical models pre-trained on well-known corpora help us “spot” entities in the text.

• Disambiguation: Once spots are found, we may need to disambiguate them (for example, there may be multiple entities with the same name and the correct URL needs to be retrieved)

• Filtering: Remove named entities whose types we are not interested in or entities that have very few links pointing to them.

• At the end of NER, we get back a set of URLs of the resources that were referenced in the text.

Spotting is the process of identifying and assigning classes to named entities.

StanfordNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, which is located in <LOCATION>California</LOCATION>.
OpenNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION> which is located in <LOCATION>California</LOCATION>

StanfordNLP: Schooled in the <LOCATION>Philippines</LOCATION>
OpenNLP: Schooled in the <LOCATION>Philippines</LOCATION>

StanfordNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?
OpenNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?

StanfordNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?
OpenNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?

StanfordNLP: Is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.
OpenNLP: is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.

StanfordNLP: I work at <ORGANIZATION>Chevy</ORGANIZATION>.
OpenNLP: I work at Chevy.

StanfordNLP: I work at <ORGANIZATION>chevy</ORGANIZATION>.
OpenNLP: I work at chevy.

StanfordNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car
OpenNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car

StanfordNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
OpenNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>

Dbpedia Spotlight

• Dbpedia Spotlight is an API that can be used to perform all 3 steps of NER

• Spots - It identifies spots using a statistically backed model.

• Spots are disambiguated based on other references in the document

• URIs are retrieved for each of the identified named entities. These are usually Dbpedia URLs with references to Freebase and other ontologies.

• Provides APIs to perform the steps of NER separately as well

• Spotting - Identifies only the spots

• Disambiguate - Performs disambiguation based on different options provided

• Annotate - Performs all 3 steps of NER and provides results

• Candidates - Provides a ranked list of candidates for each spot

https://github.com/dbpedia-spotlight/dbpedia-spotlight

Dbpedia Spotlight Results

ID, SONG, EXPECTED, ACTUAL, PRECISION, RECALL, FMEASURE

1. Here We Stand (Talking Heads)
Expected: http://dbpedia.org/resource/Pizza_Hut http://dbpedia.org/resource/7-Eleven http://dbpedia.org/resource/Dairy_Queen
Actual: http://dbpedia.org/resource/7-Eleven
Precision 1.0, Recall 0.33, FMeasure 0.5

2. Kodachrome (Paul Simon)
Expected: http://dbpedia.org/resource/Kodachrome http://dbpedia.org/resource/Nikon
Actual: http://dbpedia.org/resource/Nikon
Precision 1.0, Recall 0.5, FMeasure 0.66

3. Brand New Cadillac (The Clash)
Expected: http://dbpedia.org/resource/Cadillac
Actual: http://dbpedia.org/resource/Cadillac
Precision 1.0, Recall 1.0, FMeasure 1.0

4. A Certain Romance (Arctic Monkeys)
Expected: http://dbpedia.org/resource/Reebok http://dbpedia.org/resource/Converse_(shoe_company)
Actual: http://dbpedia.org/resource/Reebok http://dbpedia.org/resource/Converse_(shoe_company)
Precision 1.0, Recall 1.0, FMeasure 1.0

5. My Humps (Black Eyed Peas)
Expected: http://dbpedia.org/resource/Prada http://dbpedia.org/resource/Gucci http://dbpedia.org/resource/Fendi http://dbpedia.org/resource/Dolce_&_Gabbana http://dbpedia.org/resource/True_Religion
Actual: http://dbpedia.org/resource/Prada http://dbpedia.org/resource/Gucci
Precision 1.0, Recall 0.33, FMeasure 0.5

Mean: Precision 1.0, Recall 0.63, FMeasure 0.73

Querying the Semantic Web

• SPARQL is a query language to interact with the semantic web.

• SPARQL is the equivalent of SQL for RDF stores.

• Ontologies provide knowledge about different entities usually in the form of a subject-predicate-object triple.

• The English version of Dbpedia contains 4.58 million things with 584 million facts.

PREFIX dbprop: <http://dbpedia.org/property/>
SELECT ?industry WHERE { <http://dbpedia.org/resource/Fendi> dbprop:industry ?industry }

http://dbpedia.org/sparql

http://wiki.dbpedia.org/Datasets#h434-9

Named Entity Recognition Demo

• http://dbpedia-spotlight.github.io/demo/

Extracting Concepts using Word Senses

http://www.picgifs.com/clip-art/activities/sweating/clip-art-sweating-328953.jpg

Word Sense Disambiguation

• For many words, multiple senses of the word exist depending on the context. For example, there are multiple senses for the word “bank” (even within the same part of speech).

• Extremely difficult for computers. A combination of context and common sense information makes this quite easy for humans.

• Word Sense Disambiguation can be useful for

• Machine translation between languages (surface form loses value during translation because the only thing that matters is the sense of the word)

• Information Retrieval - Correct interpretation of the query. However this can be overcome by providing enough terms to only retrieve relevant documents.

• Automatic annotation of text

• Measuring semantic relatedness between documents.

http://babelnet.org/

https://code.google.com/p/dkpro-wsd/wiki/LSRs

• Solving the Word Sense Disambiguation Problem

• Need an inventory of knowledge that can be used to disambiguate words. Usually a graph structure. Some examples are:

• WordNet

• Wikipedia

• Yago

• Freebase

• ConceptNet

• Algorithms to traverse the inventory to retrieve most likely disambiguation of a word. These are usually graph algorithms that work on a measure of centrality like degree centrality etc.

• Assumptions:

• The document has enough context to disambiguate the word correctly. If not, we would default to the most frequent sense of a word.

• Single sense per discourse

WordNet

• WordNet is a hierarchically organized lexical database widely used in NLP applications. Started at Princeton in 1985.

• Contains nouns, verbs, adjectives and adverbs

• Words are separated into senses and are represented as synsets.

• The noun “bank” can have multiple senses based on the context (for e.g. bank of a river, financial institution etc.)

• Synsets are connected by well defined semantic relationships

• Majority of WordNet relations connect words from same part of speech.

• Can be accessed in Java using the extJWNL library

PART OF SPEECH   UNIQUE STRINGS
Noun             117,798
Verb             11,529
Adjective        21,479
Adverb           4,481

http://extjwnl.sourceforge.net/


WordNet Synsets

Synset format => baseform#pos#index

bank#n#1 -> river bank

bank#n#2 -> Financial institution

bank#v#3 -> bank with a financial institution

http://wordnetweb.princeton.edu/perl/webwn

WordNet Relationships

• Hypernym - Defines a superordinate relationship.

• Motor vehicle is a hypernym of car

• Hyponym - Subordinate relationship

• Mango is a hyponym of fruit

• The root node of nouns is “entity”

• Other relationships: InstanceOf, Synonyms/Antonyms, Meronym (PartOf) etc.

http://www.ling.helsinki.fi/kit/2008s/clt231/nltk-0.9.5/doc/images/wordnet-hierarchy.png

Accessing WordNet using extJWNL

• Download WordNet 3.0 dataset

• Use the properties file to point to the location of WordNet on the file system or database

• Lemmatization - Needed to get the base form of a word (different from stemming) using the WordNet dictionary.

• cat and cats have same lemma

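import java.io.FileInputStream
import net.sf.extjwnl.data.POS
import net.sf.extjwnl.dictionary.Dictionary

val dictionary = Dictionary.getInstance(new FileInputStream("data/file_properties.xml"))

// lookupBaseForm returns an IndexWord (null if no base form is found); fall back to the input word
def getBaseForm(pos: POS, word: String): String =
  Option(dictionary.getMorphologicalProcessor.lookupBaseForm(pos, word.toLowerCase)).map(_.getLemma).getOrElse(word)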



WSD using WordNet

• Example 1 - “I am going to the bank”

• “bank” by itself usually just defaults to bank#n#1

• Example 2 - “What is the difference between a bank and a credit union?”

• Credit Union only has one sense - credit_union#n#1

• Because credit union is present, “bank” is disambiguated to “bank#n#2”

https://code.google.com/p/dkpro-wsd/wiki/LSRs

Concept Graph

• WordNet does not capture any common sense information. For e.g. bank (financial institution) and money do not have a close relationship in WordNet.

• It is possible to use other resources like ConceptNet that map common sense knowledge to WordNet (and to ontologies like Dbpedia). For example, we can download mappings for concepts like Money, Love, Sports, Family etc.

• Another option is to deploy a custom concept graph:

• Deploy WordNet onto a Graph database. That forms the base graph.

• Deploy custom concept mapping to the WordNet synsets.

• Add mappings for relevant wikipedia (dbpedia) categories

http://conceptnet5.media.mit.edu/data/5.3/c/en/family?limit=1000

Concept Extraction Architecture

Concept Analysis of over 500K songs

Concept Polarity

• SentiWordNet is a lexical resource for opinion mining and sentiment analysis

• SentiWordNet provides sentiment values for the different WordNet synsets. Each synset in WordNet is assigned scores on 3 dimensions - positivity, negativity and objectivity.

• Once the central concepts are found, we can extract the polarity of the concepts.

• Example:

• “They are really happy to be here” => happy#a#1 has a very positive polarity.

http://sentiwordnet.isti.cnr.it/

Section Summary

• Went beyond surface forms and analyzed the concepts contained in documents.

• The approach was still mostly bag-of-words, meaning that the structure of the individual sentences did not matter.

• The approaches in tandem with common sense knowledge sources help in extracting concepts from documents.

• It also allows documents to be compared based on semantic similarity measures.

http://www.smosh.com/smosh-pit/photos/funny-smartass-siri-responses

Document Summarization

• Objective - Reduce the document in order to create a summary that retains the most important points of the original document.

• Two Approaches:

• Extractive: Extract the sentences that are most representative of the content of the document.

• Generative: Generate a summary of the text using words that may not be part of the original text. This is a difficult task and is often not attempted.

• Evaluating summarization techniques:

• Somewhat subjective because humans sometimes cannot agree on the best summary

• Extractive Approaches

• Based on term frequency

• Based on sentence similarity

ID SENTENCE EXPECTED SCORE

1 Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. High

2 Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. High

3 Cassandra also places a high value on performance. Low

4 In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. Low

5 Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." High

6 Cassandra's data model is a partitioned row store with tunable consistency. Medium

7 Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Medium

8 Other columns may be indexed separately from the primary key. Low

9 Tables may be created, dropped, and altered at runtime without blocking updates and queries. Low

10 Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. Medium

11 Rather, Cassandra emphasizes denormalization through features like collections. Medium

http://en.wikipedia.org/wiki/Apache_Cassandra

TextRank

• A graph approach where each vertex is a sentence and each edge has a weight corresponding to the similarity between the two sentences. Every vertex is connected to every other vertex.

• For every sentence:

• Calculate its similarity to every other sentence. The similarity measure can be simple for e.g. normalized value of the number of common terms between the 2 sentences

• Sum the similarity of the sentence to every other sentence (that is, sum up its row in the similarity matrix). That is the score of the sentence.

• Sort the vertices based on the sum of the weights of their edges and return the top k sentences.
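
The scoring steps above can be sketched as follows. This follows the simplified row-sum scoring described on this slide rather than an iterative PageRank-style computation. The class name, the example sentences and the normalization (shared terms divided by the combined number of distinct terms in the two sentences) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentenceRanker {

    // Distinct lower-cased terms of a sentence.
    static Set<String> terms(String sentence) {
        Set<String> result = new HashSet<>();
        for (String t : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Similarity = shared terms / (|a| + |b|); an assumed normalization.
    static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / (a.size() + b.size());
    }

    // Score of sentence i = sum of its similarity to every other sentence (row sum).
    static double[] scores(List<String> sentences) {
        List<Set<String>> termSets = new ArrayList<>();
        for (String s : sentences) {
            termSets.add(terms(s));
        }
        double[] result = new double[sentences.size()];
        for (int i = 0; i < result.length; i++) {
            for (int j = 0; j < result.length; j++) {
                if (i != j) {
                    result[i] += similarity(termSets.get(i), termSets.get(j));
                }
            }
        }
        return result;
    }

    // Returns the k highest-scoring sentences, best first.
    static List<String> topK(List<String> sentences, int k) {
        double[] s = scores(sentences);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(s[y], s[x]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            top.add(sentences.get(order.get(i)));
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical input sentences, for illustration only.
        List<String> sentences = Arrays.asList(
                "Cassandra is a distributed database",
                "Cassandra is a database with tunable consistency",
                "Rows are organized into tables",
                "Cassandra offers tunable consistency");
        System.out.println(topK(sentences, 2));
    }
}
```

A sentence that shares no terms with any other sentence scores 0 and is never selected, which is the intended behaviour for an extractive summary.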

http://lit.csci.unt.edu/index.php/Graph-based_NLP

TOP SENTENCES SCORE

Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. 1.6

Cassandra also places a high value on performance. 1.125

Other columns may be indexed separately from the primary key. 0.999

• Can the similarity metric be improved?

Dependency Analysis in Sentences

• StanfordNLP can be used to analyze the grammatical structure of sentences and provide a dependency graph between the different elements of the sentence.

• LexicalizedParser can provide a graph where the vertices are the words and the edges are the grammatical relationships in a sentence.

http://nlp.stanford.edu/software/lex-parser.shtml

TAG         MEANING                    TAG    MEANING
advmod      Adverbial Modifier         dobj   Direct Object
neg         Negation Modifier          iobj   Indirect Object
nsubj       Nominal Subject            amod   Adjective Modifier
nsubjpass   Passive Nominal Subject    prep   Preposition

Example pairs shown on the slide: (she,gave), (gave,me)

Question Parsing

Dependency Analysis

• Works well for short sentences. It loses accuracy when the scope is increased to a document.

• May aid in text simplification by using the relationships between the entities.

• By analyzing the subject and the object, we can clearly establish a point of view (for e.g. direct address vs first person vs. second person etc.).

• Could potentially help in story extrapolation but does not generalize well. So this is a topic of research.

Sentiment Analysis

• StanfordNLP has a deep learning model for sentiment analysis.

• Takes a deep parsing approach to sentiment analysis - the structure of the sentence is constructed prior to the analysis.

• It was trained on movie review data and reported an accuracy about 5% higher than the closest competing model.

• Uses an annotated dataset called the Stanford Sentiment Treebank. Users are encouraged to add labels to improve the model further.

Sentiment Analysis Examples

• Taxonomy

• Very Negative

• Negative

• Neutral

• Positive

• Very Positive

Sentiment Analysis Demo

• http://nlp.stanford.edu:8080/sentiment/rntnDemo.html

StanfordNLP Sentiment Analysis

• Provides relatively good results for short sentences.

• Sentences that are similar to the training data (movie reviews) perform much better than other sentences.

• No good way to aggregate sentiments across a document. Future work would probably involve document level dependency parsing and sentiment analysis.

• Only provides overall sentiment. Does not provide an indication of the object of the sentiment.

Final Thoughts

• Shallow NLP is employed in text retrieval and search and provides good results for general search use cases.

• Deeper NLP involves semantic parsing, common sense interpolation (both local and global knowledge bases) and tends to be harder.

• Deeper NLP is more practical after picking a specific domain, for example medical records, legal documents etc.

• 2 cents on Intelligence - Memory based systems

• http://watson-um-demo.mybluemix.net/

http://en.wikipedia.org/wiki/On_Intelligence

Resources

• StanfordNLP Github: https://github.com/stanfordnlp/CoreNLP

• Own repository: https://github.com/sangv/swsd

• Dbpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight

• Opennlp repo: https://github.com/apache/opennlp

• ConceptNet: http://conceptnet5.media.mit.edu

• On Intelligence book: http://en.wikipedia.org/wiki/On_Intelligence

Thank You

@sang_v