TRANSCRIPT
NLP IN Q
TEXT
NLP IN Q
▸ No existing NLP libraries
▸ Parsing is expensive, simple vector operations are cheap
▸ Focus on vector operations, rather than named entity recognition, part-of-speech tagging, co-reference resolution
TEXT
REPRESENTING DATA
Document 1
circumambulate| 4.997212
sail          | 0.9236821
cook          | 4.969805
whale         | 0
ishmael       | 3.722053
...
Document 2
fish   | 0.4790985
harpoon| 2.636207
jolly  | 2.898556
ishmael| 5.263778
inn    | 4.057829
...
Union of keys
`circumambulate`sail`cook`whale`ishmael`fish`harpoon`jolly`ishmael`inn
Key-aligned vectors (0n marks a null for a term absent from that document)
4.997 0.923 4.969 0 3.722 0n 0n 2.049 3.722 0n ...
0n 0.653 0n 0 5.263 0.479 2.636 2.898 5.263 4.057 ...
TEXT
COSINE SIMILARITY
▸ Used to calculate the cosine of the angle between two vectors
▸ Dot product is the sum of the pair-wise products
▸ Given two vectors aligned so that each index i refers to the same term in each, the q expression is: 0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x;y)
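For readers unfamiliar with q, the one-liner above can be sketched in Python; q's `0 ^` fills a null result with zero, mirrored here by returning 0 when either magnitude is zero (the function name is illustrative):

```python
import math

def cosine(x, y):
    # Dot product: the sum of the pair-wise products of the aligned vectors
    dot = sum(a * b for a, b in zip(x, y))
    # Product of the two magnitudes (q: (*) . {sqrt sum x xexp 2} each (x;y))
    mag = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # q's `0 ^` fills the null from a zero division with 0
    return 0.0 if mag == 0 else dot / mag
```

Identical directions give 1, orthogonal vectors give 0, and an empty document safely scores 0.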
TEXT
TF-IDF
▸ Significance depends on the document and the corpus
▸ A word is significant if it is common in a document, but uncommon in a corpus
▸ Term Frequency * Inverse Document Frequency
▸ IDF: log count[corpora] % sum containsTerm
▸ TF: (1 + log occurrences) | 0
▸ significance: TF * IDF
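The three formulas above translate directly to Python; q's `|` is max, so TF is floored at zero (natural logs and the parameter names are assumptions):

```python
import math

def tfidf(occurrences, n_docs, n_docs_containing):
    # TF: (1 + log occurrences) | 0  -- floored at 0, and 0 for an absent term
    tf = max(0.0, 1 + math.log(occurrences)) if occurrences > 0 else 0.0
    # IDF: log of total documents over documents containing the term
    idf = math.log(n_docs / n_docs_containing)
    return tf * idf
```

A term appearing in every document gets IDF 0, so however frequent it is, it is never significant.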
TEXT
LEVEL STATISTICS
▸ Within a single document, clustered terms are more significant than uniformly distributed ones
▸ Compares the standard deviation of distances between words to the distribution that would be predicted by a geometric distribution.
▸ Where “distances” is a vector of the distances between occurrences of a word:
σ : (dev each distances) % avg each distances
σnor : σ % sqrt 1 - p
sd(σnor) : 1 % (sqrt n) * (1 + 2.8 * n xexp -0.865) // factors in # of occurrences
⟨σnor⟩ : ((2 * n) - 1) % ((2 * n) + 2)
significance : (σnor - ⟨σnor⟩) % sd(σnor)
▸ Carpena, P., et al. "Level statistics of words: Finding keywords in literary texts and symbolic sequences." Physical Review E 79.3 (2009): 035102.
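A Python sketch of Carpena et al.'s score as laid out above, for a single word: `positions` holds the word's offsets in the document, `total_words` is the document length, and `pstdev` plays the role of q's `dev` (population standard deviation); names are illustrative:

```python
import math
import statistics

def level_significance(positions, total_words):
    n = len(positions)
    # distances between consecutive occurrences of the word
    distances = [b - a for a, b in zip(positions, positions[1:])]
    p = n / total_words                       # overall probability of the word
    # sigma: dev of the distances over their mean
    sigma = statistics.pstdev(distances) / statistics.mean(distances)
    sigma_nor = sigma / math.sqrt(1 - p)      # normalized sigma
    # expected mean and deviation of sigma_nor under a geometric distribution
    expected = (2 * n - 1) / (2 * n + 2)
    sd = 1 / (math.sqrt(n) * (1 + 2.8 * n ** -0.865))
    return (sigma_nor - expected) / sd
```

Clustered occurrences score above zero; evenly spaced occurrences score below, matching the intuition that clustered terms are the significant ones.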
TEXT
CLUSTERING
▸ Find groupings of entities
▸ Cluster documents, terms, proper nouns
▸ Find natural divisions of text
▸ Can be random or deterministic
▸ Can take as parameters: similarity of documents, number of clusters, time spent clustering
▸ Cluster centroids can be represented as feature vectors
TEXT
CLUSTERING ALGORITHMS
▸ K means
▸ Pick k random documents and cluster around them; then use the centroids as the new clustering points, and repeat until convergence
▸ Buckshot clustering
▸ Cluster sqrt(n) of the documents with an O(n^2) algorithm, then match the remaining documents to the centroids
▸ Group Average
▸ Starting with buckshot clusters, cluster each cluster into sub-clusters, merge any similar sub-clusters, and repeat as long as you want
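The k-means loop described above can be sketched with NumPy, clustering feature vectors around k randomly chosen starting documents (function and parameter names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random documents as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each document to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):       # converged
            break
        centroids = new
    return labels, centroids
```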
TEXT
THE MARKOV CLUSTERING ALGORITHM
▸ “… random walks on the graph will infrequently go from one natural cluster to another.” - Stijn van Dongen
▸ Multiply the matrix by itself, square every element, normalize the columns, and repeat this process until it converges.
▸ Rows with multiple non-zero values give the clusters
▸ http://micans.org/mcl/
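The expansion/inflation loop just described, sketched with NumPy using inflation exponent 2 (squaring every element); the adjacency matrix is assumed to include self-loops, and all names are illustrative:

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=100, tol=1e-6):
    M = adj.astype(float)
    M = M / M.sum(axis=0, keepdims=True)       # make columns stochastic
    for _ in range(iters):
        prev = M
        M = M @ M                               # expansion: multiply the matrix by itself
        M = M ** inflation                      # inflation: square every element
        M = M / M.sum(axis=0, keepdims=True)    # re-normalize the columns
        if np.abs(M - prev).max() < tol:        # converged
            break
    # rows with non-zero values give the clusters
    return {frozenset(np.flatnonzero(r > tol)) for r in M if (r > tol).any()}
```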
TEXT
MARKOV CLUSTERING FOR DOCUMENTS
.80+          Form letters, similar versions, updated articles
.60 to .80    Articles translated from English, then back
.50 to .60    Articles about the same events
.25 to .50    Articles about the same topics
.10 to .25    Articles about the same, more general, topics
less than .10 Several very large clusters; outliers become obvious
Minimum similarity is passed in as a parameter
TEXT
MARKOV CLUSTERING THE BIBLE (OR KJB IN KDB)
▸ At .49, you get Matthew, Mark and Luke in a cluster
▸ At .11, you get the New Testament, the Old Testament, and the Epistles, clustered by author
▸ At .05, you get the Epistles of John in one cluster, and everything else in another
(Figures: cluster graphs at similarity .06 and .08)
TEXT
EXPLAINING SIMILARITY
▸ Useful for explaining why
▸ a document is in a cluster
▸ a document matches a query
▸ two documents are similar
▸ product : (terms1 % magnitude terms1) * (terms2 % magnitude terms2)
▸ desc alignedKeys ! product % sum product
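The same computation in Python: each aligned term's pair-wise product of normalized weights, divided by the total, is that term's share of the cosine similarity (a sketch; names are illustrative):

```python
import numpy as np

def explain(keys, terms1, terms2):
    a = terms1 / np.linalg.norm(terms1)   # terms1 % magnitude terms1
    b = terms2 / np.linalg.norm(terms2)
    product = a * b
    share = product / product.sum()       # each term's share of the similarity
    # q: desc alignedKeys ! product % sum product
    return sorted(zip(keys, share), key=lambda kv: -kv[1])
```

The top entries of the result are exactly the terms quoted in the gospel examples that follow.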
Diogenes sitting in his tub, Jean-Léon Gerôme
TEXT
EXPLAINING SIMILARITY 3
▸ Given the cluster containing the three gospels Matthew, Mark and Luke, described by the keywords: disciples, pharisees, john, peter, herod, mary, answering, scribes, simon, pilate
▸ The Gospel According to Saint Matthew (relevance 0.84): disciples 0.310, pharisees 0.146, peter 0.087, herod 0.085, john 0.082, scribes 0.062, mary 0.061, hour 0.057, publicans 0.056, simon 0.053
▸ The Gospel According to Saint Mark (relevance 0.78): disciples 0.291, pharisees 0.115, john 0.093, peter 0.090, herod 0.081, immediately 0.072, scribes 0.069, answering 0.066, pilate 0.062, mary 0.061
▸ The Gospel According to Saint Luke (relevance 0.82): disciples 0.244, pharisees 0.133, john 0.093, answering 0.091, herod 0.088, peter 0.086, mary 0.076, simon 0.067, pilate 0.062, immediately 0.061
TEXT
COMPARE CORPORA
▸ Find, for each term, the difference in relative frequency via log-likelihood
▸ totalFreq : (termCountA + termCountB) % (totalWordCountA + totalWordCountB)
▸ expectedA : totalWordCountA * totalFreq
▸ desc (termCountA * log[termCountA % expectedA])
▸ Rayson, Paul, and Roger Garside. "Comparing corpora using frequency profiling." Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 2000.
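A Python sketch of the comparison above for a single term, following the slide's formula: the term's observed count in corpus A times the log of observed over expected (parameter names are illustrative):

```python
import math

def keyness(term_count_a, term_count_b, total_words_a, total_words_b):
    # relative frequency of the term across both corpora combined
    total_freq = (term_count_a + term_count_b) / (total_words_a + total_words_b)
    expected_a = total_words_a * total_freq   # expected count in corpus A
    return term_count_a * math.log(term_count_a / expected_a)
```

Ranking every term by this score descending surfaces corpus A's characteristic vocabulary, as in the Old/New Testament lists that follow.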
TEXT
COMPARE CORPORA - KJB
▸ Old Testament - Lord, shall, thy, Israel, king, thee, thou, land, shalt, children, house
▸ New Testament - Jesus, ye, Christ, things, unto, god, faith, disciples, man, world, say
TEXT
COMPARE CORPORA - JEFF SKILLING’S EMAILS
▸ Business emails - enron, please, jeff, energy, information, market, business
▸ Fraternity emails - yahoo, beta, betas, reunion, kai, ewooglin
TEXT
WORDS AS VECTORS
▸ Words can be described as vectors
▸ All previously mentioned operations become available on individual words
▸ Vectors are based on co-occurrence
▸ word2vec uses machine learning to find which co-occurring words are most predictive
TEXT
CALCULATING VECTORS FOR WORDS
▸ Finding the significance of “captain” to “ahab”
▸ Of the 272 sentences containing “captain”, 78 contain “ahab”
▸ “captain” occurs in 2.7% of sentences overall, but in 16% of the sentences that also contain “ahab”
▸ The likelihood of a sentence that contains “ahab” also containing “captain” is a binomially distributed random variable, as it is the product of a Bernoulli process
▸ The deviation of this random variable is √(np(1-p)), where p is the overall probability of “captain” being in a sentence
▸ Significance is (cooccurrenceRate - overallFrequency) % deviation: (.16 - .027) % .162 ≈ .84
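A Python sketch of the worked example; the slide's deviation of .162 equals √(p(1−p)) with p = .027, the per-sentence rate, so that form of the deviation is used here (names are illustrative):

```python
import math

def significance(cooccurrence_rate, overall_frequency):
    # deviation of the Bernoulli variable with success probability p
    deviation = math.sqrt(overall_frequency * (1 - overall_frequency))
    return (cooccurrence_rate - overall_frequency) / deviation
```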
TEXT
WORD VECTOR EXAMPLES
Moby
stem      relevance tokens
---------------------------------------------------------------
dick      11.3      `dick`dick's
whale     7.75      `whaling`whale`whales`whale's`whaled
white     7.04      `white`whiteness`whitenesses`whites
ahab      6.1       `ahab`ahab's`ahabs
boat      4.95      `boat`boats`boat's
encounter 4.52      `encounter`encountered`encountering`encounters
seem      4.31      `seemed`seem`seems`seeming
sea       4.13      `sea`seas`sea's
TEXT
WORD VECTOR EXAMPLES
harpoon
stem     relevance tokens
--------------------------------------------------------
whale    3.918473  `whaling`whale`whales`whale's`whaled
boat     2.902082  `boat`boats`boat's
line     2.235111  `line`lines`lined`lining
sea      1.991354  `sea`seas`sea's
iron     1.973497  `iron`irons`ironical
dart     1.964671  `dart`darted`darts`darting
ship     1.888228  `ship`ships`shipped`ship's`shipping
queequeg 1.825947  `queequeg`queequeg's
TEXT
WORD VECTOR EXAMPLES - PROPER NOUNS ONLY
Moses
stem      relevance
-------------------
aaron     2.76
israel    2.26
pharaoh   1.39
egypt     1.31
egyptians 1.23
levites   1.19
eleazar   1.07
sinai     1.06
joshua    1
jordan    0.921
god       0.9

Jesus
stem      relevance
-------------------
galilee   1.85
god       1.85
son       1.73
lord      1.68
john      1.57
peter     1.5
jerusalem 1.47
jews      1.45
pilate    1.37
david     1.33
pharisees 1.24

Pharaoh
stem      relevance
-------------------
egypt     5.53
moses     3.9
joseph    3.46
egyptians 3.26
goshen    2.21
aaron     2.03
israel    1.58
god       1.16
red sea   1.16
canaan    1.13
hebrews   1.13
TEXT
WORDS AS VECTORS
▸ Clustering words becomes possible
▸ Given the names: pharaoh jude simon noah lamech judas ham methuselah aaron levi moses shem japeth jesus
▸ Cluster 1: noah lamech ham methuselah shem
▸ Cluster 2: pharaoh aaron levi moses
▸ Cluster 3: simon judas jesus
TEXT
ANSWER QUERIES
▸ Find the harmonic mean of each token's relevance for each search term
▸ Drop any terms with above average significance to the anti-search terms
Search Terms: captain, pequod
captain ahab  | 0.672187
captain peleg | 0.4844358
captains      | 0.4797662
captain bildad| 0.4429896

Search Terms: captain, !pequod
captain sleet   | 0.5986764
captain scoresby| 0.5184432
captain pollard | 0.5184432
captain mayhew  | 0.5184432
captain boomer  | 0.5184432
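The query scoring can be sketched in Python: each candidate token gets the harmonic mean of its relevance to every search term, which strongly penalizes tokens irrelevant to any single term (the `relevance` mapping and all names are illustrative; the anti-term filtering step is omitted):

```python
from statistics import harmonic_mean

def answer_query(relevance, search_terms):
    # only tokens relevant to every search term can score at all
    tokens = set.intersection(*(set(relevance[t]) for t in search_terms))
    scores = {tok: harmonic_mean([relevance[t][tok] for t in search_terms])
              for tok in tokens}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For a negated term such as !pequod, tokens whose significance to the anti-term is above average would instead be dropped.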
TEXT
ANSWER QUERIES
How are Captain Bildad and Captain Peleg related?
captain| 0.0341 hand | 0.00935 old | 0.00896 ship | 0.00894 owner | 0.00883
TEXT
EXPAND SETS
Summing the vectors for a set of words will give an expanded set
expanding simon, andrew, james, and john gives: bartholomew, alphaeus, matthew, thaddaeus, canaanite, zelotes, thomas, brother, iscariot, zebedee, james, peter, lebbaeus, boanerges, traitor, andrew, philip, simon, judas, and john
expanding bread, fish, milk, and beans gives: butter, honey, lentiles, cheese, millet, kine (cows), fitches (spelt), parched, shobi, bason, earthen, pulse, wheat, and barley
TEXT
STEMMING
▸ Stemming removes what it guesses are inflections:
antidisestablishmentarianism -> establish
programmer -> program
brother -> broth
▸ Produces a root word, which may not be a real word: happiness -> happi
▸ Stemmers can be compared by aggressiveness
▸ Stemming is rule based, does not require extensive datasets
TEXT
STEMMING
▸ Moby Dick has 16,950 distinct words and 10,466 distinct stems; nearly 700 words have 4 or more inflected forms
▸ general generally generous generic generously generalizing generations generated
▸ admirer admire admirals admiral admirable admirably admiral’s admirers
TEXT
TOKENIZING
▸ Tokens are individual words, names, numbers, etc.
▸ Proper names are counted as a single token
▸ Rule based, for simplicity
▸ To tokenize, all characters not in [a-zA-Z0-9’\u00C0-\u017F] get replaced with whitespace, then split on whitespace, remove terminal apostrophes, then join consecutive proper nouns
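The rule above reads almost directly as a regular expression. This Python sketch covers everything except the final proper-noun joining step; it strips apostrophes from both ends of a token and accepts both straight and curly apostrophes (both assumptions beyond what the slide states):

```python
import re

def tokenize(text):
    # replace every character not in [a-zA-Z0-9'\u00C0-\u017F] with whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9'’\u00C0-\u017F]", " ", text)
    # split on whitespace and strip stray apostrophes from token ends
    tokens = [t.strip("'’") for t in cleaned.split()]
    return [t for t in tokens if t]
```

Internal apostrophes survive ("whale's"), quoting apostrophes do not, and accented Latin letters in U+00C0–U+017F are kept as word characters.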
TEXT
PROPER NOUNS
▸ Any run of title-cased words not at the start of a sentence is treated as a proper noun
▸ A title-cased word at the start of a sentence is treated as a proper noun if it is found as a proper noun elsewhere
TEXT
SENTENCE DETECTION
▸ Rule based, for simplicity
▸ Break on ‘?’, ‘!’ or any period which isn’t in an ellipsis (…), used as an infix (P.E.I.), preceding a number (.1416), following a title (Ms.)
▸ Sentence boundaries get modified to include quotes or multiple punctuation marks
▸ Not as accurate as machine learning algorithms which can detect sentence breaks such as “My uncle is from P.E.I. He’s in the potato business”
EXPLORATORY NLP UI IN ANALYST FOR KX
Searching, managing document collections, and common NLP operations are available through the UI
VISUALIZING NLP DATA IN ANALYST FOR KX
Data from prose text can be visualized; here, a discontinuity shows at chapters 32-45, which are on zoology rather than part of the narrative
TEXT
CLOSING COMMENTS
▸ Feature vectors and the vector space model work for things other than documents
▸ Used with dialectology, phonetics, music classification, recommender systems
▸ Other operations that can be done with the same techniques include summarization, phrase detection, and natural language generation