NLP IN Q

Page 1

NLP IN Q

Page 2

NLP IN Q

▸ No existing NLP libraries

▸ Parsing is expensive, simple vector operations are cheap

▸ Focus on vector operations rather than named entity recognition, part-of-speech tagging, or co-reference resolution

Page 3

REPRESENTING DATA

Document 1
circumambulate| 4.997212
sail          | 0.9236821
cook          | 4.969805
whale         | 0
ishmael       | 3.722053
...

Document 2
fish   | 0.4790985
harpoon| 2.636207
jolly  | 2.898556
ishmael| 5.263778
inn    | 4.057829
...

Union of keys
`circumambulate`sail`cook`whale`ishmael`fish`harpoon`jolly`inn

Key-aligned vectors
4.997 0.923 4.969 0 3.722 0n    0n    2.049 0n    ...
0n    0.653 0n    0 5.263 0.479 2.636 2.898 4.057 ...
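A minimal q sketch of this alignment, with the dictionaries truncated to the entries shown above; indexing a dictionary with absent keys yields float nulls (0n), which is what makes the vectors key-aligned:

  d1:`circumambulate`sail`cook`whale`ishmael!4.997212 0.9236821 4.969805 0 3.722053
  d2:`fish`harpoon`jolly`ishmael`inn!0.4790985 2.636207 2.898556 5.263778 4.057829
  k:distinct key[d1],key d2    / union of keys across both documents
  v1:d1 k                      / 4.997212 0.9236821 4.969805 0 3.722053 0n 0n 0n 0n
  v2:d2 k                      / 0n 0n 0n 0n 5.263778 0.4790985 2.636207 2.898556 4.057829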

Page 4

COSINE SIMILARITY

▸ Used to calculate the cosine of the angle between two vectors

▸ Dot product is the sum of the pair-wise products

▸ Given two vectors aligned such that each index i refers to the same element in each vector, the q code is:
  0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x; y)
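Wrapped as a function, a direct transcription of the expression above; the numeric check is illustrative:

  cosSim:{0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x;y)}
  cosSim[1 2 3f;4 5 6f]   / 0.9746319
  cosSim[v1;v2]           / works on key-aligned vectors: q aggregates ignore 0n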

Page 5

TF-IDF

▸ Significance depends on the document and the corpus

▸ A word is significant if it is common in a document, but uncommon in a corpus

▸ Term Frequency * Inverse Document Frequency

▸ In q:
  IDF: log count[corpora] % sum containsTerm
  TF: (1 + log occurrences) | 0
  significance: TF * IDF
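A minimal end-to-end sketch of these formulas, assuming each document is already a list of symbol tokens (docs, terms, df, and tf are illustrative names):

  docs:(`whale`whale`sea`ahab;`sea`inn`fish;`whale`captain`ahab`ahab)
  terms:distinct raze docs                           / corpus vocabulary
  df:sum terms in/: docs                             / documents containing each term
  idf:log count[docs] % df                           / inverse document frequency
  tf:{0 | 1 + log (count each group y) x}[terms;] each docs  / term frequency per document
  tfidf:terms!/:tf *\: idf                           / one significance dictionary per document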

Page 6

LEVEL STATISTICS

▸ Within a single document, clustered terms are more significant than uniformly distributed ones

▸ Compares the standard deviation of distances between words to the distribution that would be predicted by a geometric distribution.

▸ Where distances is a vector of the distances between successive occurrences of a word (a q sketch follows the reference below):
  σ: (dev each distances) % avg each distances
  σnor: σ % sqrt 1 - p
  sd(σnor): (1 % sqrt n) * 1 + 2.8 * n xexp -0.865   / factors in # of occurrences
  ⟨σnor⟩: ((2 * n) - 1) % (2 * n) + 2
  significance: (σnor - ⟨σnor⟩) % sd(σnor)

▸ Carpena, P., et al. "Level statistics of words: Finding keywords in literary texts and symbolic sequences." Physical Review E 79.3 (2009): 035102.
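A q sketch of the statistic for a single word, assuming pos is the vector of token positions at which the word occurs and N is the document's total token count:

  levelSig:{[pos;N]
    n:count pos;
    p:n % N;                                    / the word's overall rate
    d:1 _ deltas pos;                           / distances between successive occurrences
    s:dev[d] % avg d;                           / σ: normalized standard deviation
    snor:s % sqrt 1 - p;                        / σnor
    esd:(1 % sqrt n) * 1 + 2.8 * n xexp -0.865; / sd(σnor), factoring in # of occurrences
    m:((2 * n) - 1) % (2 * n) + 2;              / ⟨σnor⟩
    (snor - m) % esd }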

Page 7

CLUSTERING

▸ Find groupings of entities

▸ Cluster documents, terms, proper nouns

▸ Find natural divisions of text

▸ Can be random or deterministic

▸ Can take as parameters: similarity of documents, number of clusters, time spent clustering

▸ Cluster centroids can be represented as feature vectors

Page 8

CLUSTERING ALGORITHMS

▸ K means

▸ Pick k random documents and cluster around these, then use the centroids as the new clustering points, and repeat until convergence (a q sketch follows this list)

▸ Buckshot clustering

▸ Cluster sqrt(n) of the documents with an O(n^2) algorithm, then match the remaining documents to the centroids

▸ Group Average

▸ Starting with buckshot clusters, cluster each cluster into sub-clusters, merge any similar sub-clusters, and repeat as long as you want
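A compact k-means sketch, not the presenter's implementation: cosSim is the cosine-similarity function from the earlier slide, vecs is a list of key-aligned float vectors, and a fixed iteration count stands in for a convergence test:

  kmeans:{[k;vecs]
    cent:neg[k]?vecs;                        / k random documents as starting points
    do[20;
      a:{[c;v] {x?max x} cosSim[v;] each c}[cent;] each vecs;  / nearest centroid per document
      cent:value avg each vecs group a];     / centroids move to the cluster averages
    group a }                                / cluster number -> document indexes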

Page 9

THE MARKOV CLUSTERING ALGORITHM

▸ “… random walks on the graph will infrequently go from one natural cluster to another.” - Stijn van Dongen

▸ Multiply the matrix by itself, square every element, normalize the columns, and repeat this process until it converges.

▸ Rows with multiple non-zero values give the clusters (sketched below)

▸ http://micans.org/mcl/
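A minimal sketch of that loop, assuming m is a square similarity matrix given as a list of float rows (mmu is q's matrix multiply; the iteration cap and the 1e-8 threshold are illustrative):

  mcl:{[m]
    norm:{flip {x % sum x} each flip x};     / make every column sum to 1
    m:norm m;
    do[20; m:norm t * t:m mmu m];            / expand (m mmu m), inflate (square), renormalize
    m }
  clusters:{distinct r where 1 < count each r:where each x > 1e-8}  / rows with several non-zero entries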

Page 10

MARKOV CLUSTERING FOR DOCUMENTS

What clusters contain at each minimum similarity threshold:

.80+           Form letters, similar versions, updated articles
.60 to .80     Articles translated from English, then back
.50 to .60     Articles about the same events
.25 to .50     Articles about the same topics
.10 to .25     Articles about the same, more general, topics
less than .10  Several very large clusters; outliers become obvious

Minimum similarity is passed in as a parameter

Page 11

MARKOV CLUSTERING THE BIBLE (OR KJB IN KDB)

▸ At .49, you get Matthew, Mark and Luke in a cluster

▸ At .11, you get the New Testament, the Old Testament, and the Epistles, clustered by author

▸ At .05, you get the Epistles of John in one cluster, and everything else in another

Page 12

Similarity of .06

Page 13

Similarity of .08

Page 14

EXPLAINING SIMILARITY

▸ Useful for explaining why:

  ▸ a document is in a cluster

  ▸ a document matches a query

  ▸ two documents are similar

▸ In q:
  product: (terms1 % magnitude terms1) * (terms2 % magnitude terms2)
  desc alignedKeys ! product % sum product
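Spelled out as a function, assuming alignedKeys, terms1, and terms2 come from the alignment step shown earlier (nulls filled with 0) and magnitude is the Euclidean norm:

  magnitude:{sqrt sum x xexp 2}
  explain:{[alignedKeys;terms1;terms2]
    product:(terms1 % magnitude terms1) * terms2 % magnitude terms2;
    desc alignedKeys ! product % sum product }   / each term's share of the similarity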

Page 15

Diogenes sitting in his tub, Jean-Léon Gérôme

Page 16

EXPLAINING SIMILARITY

▸ Given the cluster containing the three gospels Matthew, Mark and Luke, described by the keywords: disciples, pharisees, john, peter, herod, mary, answering, scribes, simon, pilate

▸ The Gospel According to Saint Matthew (Relevance 0.84): disciples 0.310, pharisees 0.146, peter 0.087, herod 0.085, john 0.082, scribes 0.062, mary 0.061, hour 0.057, publicans 0.056, simon 0.053

▸ The Gospel According to Saint Mark (Relevance 0.78): disciples 0.291, pharisees 0.115, john 0.093, peter 0.090, herod 0.081, immediately 0.072, scribes 0.069, answering 0.066, pilate 0.062, mary 0.061

▸ The Gospel According to Saint Luke (Relevance 0.82): disciples 0.244, pharisees 0.133, john 0.093, answering 0.091, herod 0.088, peter 0.086, mary 0.076, simon 0.067, pilate 0.062, immediately 0.061

Page 17

COMPARE CORPORA

▸ Find, for each term, the difference in relative frequency via log-likelihood

▸ In q (a fuller sketch follows the reference below):
  totalFreq: (termCountA + termCountB) % (totalWordCountA + totalWordCountB)
  expectedA: totalWordCountA * totalFreq
  desc (termCountA * log[termCountA % expectedA])

▸ Rayson, Paul, and Roger Garside. "Comparing corpora using frequency profiling." Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 2000.
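Expanded into a runnable sketch, assuming corpusA and corpusB are flat lists of symbol tokens (all names illustrative):

  compareCorpora:{[corpusA;corpusB]
    cntA:count each group corpusA;              / term counts in A
    cntB:count each group corpusB;
    terms:distinct key[cntA],key cntB;
    a:0 ^ cntA terms;                           / aligned counts; absent terms count 0
    b:0 ^ cntB terms;
    totalFreq:(a + b) % count[corpusA] + count corpusB;
    expectedA:count[corpusA] * totalFreq;       / expected count of each term in A
    desc terms ! a * log a % expectedA }        / terms overrepresented in A sort first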

Page 18

COMPARE CORPORA - KJB

▸ Old Testament - Lord, shall, thy, Israel, king, thee, thou, land, shalt, children, house

▸ New Testament - Jesus, ye, Christ, things, unto, god, faith, disciples, man, world, say

Page 19

COMPARE CORPORA - JEFF SKILLING’S EMAILS

▸ Business emails - enron, please, jeff, energy, information, market, business

▸ Fraternity emails - yahoo, beta, betas, reunion, kai, ewooglin

Page 20

WORDS AS VECTORS

▸ Words can be described as vectors

▸ All previously mentioned operations become available on individual words

▸ Vectors are based on co-occurrence

▸ word2vec uses machine learning to find which co-occurring words are most predictive

Page 21

CALCULATING VECTORS FOR WORDS

▸ Finding the significance of “captain” to “ahab”

▸ Of the 272 sentences containing “captain”, 78 contain “ahab”

▸ “captain” occurs in 2.7% of sentences, but occurs in 16% of sentences also containing “ahab”

▸ The likelihood of a sentence that contains “ahab” also containing “captain” is a binomially distributed random variable, as it is the product of a Bernoulli process

▸ The deviation of this random variable is √(np(1-p)), where p is the overall probability of “captain” being in a sentence

▸ Significance is (cooccurrenceRate - overallFrequency) % deviation:
  (.16 - .027) % .162 ≈ 0.84 (the rates shown are rounded)
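A q sketch of this calculation, assuming sents is the corpus as a list of symbol-token sentences; the example's 0.162 is the per-sentence deviation √(p(1-p)), which the sketch uses:

  assoc:{[sents;w;v]               / significance of word w to word v
    hasW:w in/: sents;
    hasV:v in/: sents;
    p:avg hasW;                    / overall rate of w across sentences
    co:avg hasW where hasV;        / rate of w among sentences containing v
    (co - p) % sqrt p * 1 - p }
  / assoc[sents;`captain;`ahab] reproduces (.16 - .027) % .162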

Page 22

WORD VECTOR EXAMPLES

Moby

stem      relevance tokens
----------------------------------------------------------------
dick      11.3      `dick`dick's
whale     7.75      `whaling`whale`whales`whale's`whaled
white     7.04      `white`whiteness`whitenesses`whites
ahab      6.1       `ahab`ahab's`ahabs
boat      4.95      `boat`boats`boat's
encounter 4.52      `encounter`encountered`encountering`encounters
seem      4.31      `seemed`seem`seems`seeming
sea       4.13      `sea`seas`sea's

Page 23

WORD VECTOR EXAMPLES

harpoon

stem     relevance tokens
-------------------------------------------------------
whale    3.918473  `whaling`whale`whales`whale's`whaled
boat     2.902082  `boat`boats`boat's
line     2.235111  `line`lines`lined`lining
sea      1.991354  `sea`seas`sea's
iron     1.973497  `iron`irons`ironical
dart     1.964671  `dart`darted`darts`darting
ship     1.888228  `ship`ships`shipped`ship's`shipping
queequeg 1.825947  `queequeg`queequeg's

Page 24

WORD VECTOR EXAMPLES - PROPER NOUNS ONLY

Moses

stem      relevance
-------------------
aaron     2.76
israel    2.26
pharaoh   1.39
egypt     1.31
egyptians 1.23
levites   1.19
eleazar   1.07
sinai     1.06
joshua    1
jordan    0.921
god       0.9

Jesus

stem      relevance
-------------------
galilee   1.85
god       1.85
son       1.73
lord      1.68
john      1.57
peter     1.5
jerusalem 1.47
jews      1.45
pilate    1.37
david     1.33
pharisees 1.24

Pharaoh

stem      relevance
-------------------
egypt     5.53
moses     3.9
joseph    3.46
egyptians 3.26
goshen    2.21
aaron     2.03
israel    1.58
god       1.16
red sea   1.16
canaan    1.13
hebrews   1.13

Page 25

WORDS AS VECTORS

▸ Clustering words becomes possible

▸ Given the names: pharaoh jude simon noah lamech judas ham methuselah aaron levi moses shem japeth jesus

▸ Cluster 1: noah lamech ham methuselah shem
  Cluster 2: pharaoh aaron levi moses
  Cluster 3: simon judas jesus

Page 26

ANSWER QUERIES

▸ Find the harmonic mean of each token’s relevance to each search term (a q sketch follows the examples below)

▸ Drop any terms with above-average significance to the anti-search terms

Search Terms: captain, pequod
captain ahab   | 0.672187
captain peleg  | 0.4844358
captains       | 0.4797662
captain bildad | 0.4429896

Search Terms: captain, !pequod
captain sleet    | 0.5986764
captain scoresby | 0.5184432
captain pollard  | 0.5184432
captain mayhew   | 0.5184432
captain boomer   | 0.5184432
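A sketch of the scoring, assuming wordVecs maps each token to a key-aligned relevance dictionary as built on the earlier slides; the anti-term filtering step is omitted and all names are illustrative:

  hmean:{count[x] % sum 1 % x}                   / harmonic mean over conforming dictionaries
  answer:{[wordVecs;searchTerms] desc hmean wordVecs searchTerms}
  / answer[wordVecs;`captain`pequod] ranks "captain ahab" first, as above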

Page 27

ANSWER QUERIES

How are Captain Bildad and Captain Peleg related?

captain | 0.0341
hand    | 0.00935
old     | 0.00896
ship    | 0.00894
owner   | 0.00883

Page 28

EXPAND SETS

Summing the vectors for a set of words will give an expanded set

Expanding simon, andrew, james, and john gives: bartholomew, alphaeus, matthew, thaddaeus, canaanite, zelotes, thomas, brother, iscariot, zebedee, james, peter, lebbaeus, boanerges, traitor, andrew, philip, simon, judas, and john

Expanding bread, fish, milk, and beans gives: butter, honey, lentiles, cheese, millet, kine (cows), fitches (spelt), parched, shobi, bason, earthen, pulse, wheat, and barley
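As a one-liner under the same assumed wordVecs structure: summing the per-word dictionaries merges relevance by key, and the strongest keys form the expanded set:

  expandSet:{[wordVecs;words;n] n sublist key desc sum wordVecs words}
  / expandSet[wordVecs;`simon`andrew`james`john;20]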

Page 29

STEMMING

▸ Stemming removes what it guesses are inflections:
  antidisestablishmentarianism -> establish
  programmer -> program
  brother -> broth

▸ Produces a root word, which may not be a real word: happiness -> happi

▸ Stemmers can be compared by aggressiveness

▸ Stemming is rule-based and does not require extensive datasets

Page 30

STEMMING

▸ Moby Dick has 16950 distinct words and 10466 distinct stems; nearly 700 words have 4 or more inflected forms

▸ general generally generous generic generously generalizing generations generated

▸ admirer admire admirals admiral admirable admirably admiral’s admirers

Page 31

TOKENIZING

▸ Tokens are individual words, names, numbers, etc.

▸ Proper names are counted as a single token

▸ Rule based, for simplicity

▸ To tokenize, all characters not in [a-zA-Z0-9’\u00C0-\u017F] are replaced with whitespace; the text is then split on whitespace, terminal apostrophes are removed, and consecutive proper nouns are joined (a simplified q sketch follows)
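A simplified sketch of these rules, restricted to the ASCII part of the character class and omitting the proper-noun join (.Q.a and .Q.A are q's built-in lower- and upper-case alphabets):

  tokenize:{[txt]
    ok:.Q.a,.Q.A,"0123456789'";                   / characters to keep
    t:" " vs @[txt;where not txt in ok;:;" "];    / everything else becomes whitespace, then split
    t:t where 0 < count each t;                   / drop empty fields
    {$["'" = last x; -1 _ x; x]} each t }         / strip terminal apostrophes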

Page 32

PROPER NOUNS

▸ Any run of title-cased words not at the start of a sentence is treated as a proper noun

▸ A title-cased word at the start of a sentence is treated as a proper noun only if it occurs as a proper noun elsewhere

Page 33

SENTENCE DETECTION

▸ Rule based, for simplicity

▸ Break on ‘?’, ‘!’, or any period that isn’t part of an ellipsis (…), used as an infix (P.E.I.), preceding a number (.1416), or following a title (Ms.)

▸ Sentence boundaries get modified to include quotes or multiple punctuation marks

▸ Not as accurate as machine-learning algorithms, which can detect sentence breaks such as the one in “My uncle is from P.E.I. He’s in the potato business”

Page 34

EXPLORATORY NLP UI IN ANALYST FOR KX

Searching, managing document collections, and common NLP operations are available through the UI

Page 35

VISUALIZING NLP DATA IN ANALYST FOR KX

Data from prose text can be visualized; shown here is a discontinuity across chapters 32-45, which are about zoology rather than part of the narrative

Page 36

CLOSING COMMENTS

▸ Feature vectors and the vector space model work for things other than documents

▸ They have been used in dialectology, phonetics, music classification, and recommender systems

▸ Other operations possible with the same techniques include summarization, phrase detection, and natural language generation