TRANSCRIPT
NLP IN Q
TEXT
NLP IN Q
▸ No existing NLP libraries
▸ Parsing is expensive, simple vector operations are cheap
▸ Focus on vector operations, rather than named entity recognition, part-of-speech tagging, co-reference resolution
TEXT
REPRESENTING DATA
Document 1
circumambulate| 4.997212
sail          | 0.9236821
cook          | 4.969805
whale         | 0
ishmael       | 3.722053
...
Document 2
fish   | 0.4790985
harpoon| 2.636207
jolly  | 2.898556
ishmael| 5.263778
inn    | 4.057829
...
Union of keys
`circumambulate`sail`cook`whale`ishmael`fish`harpoon`jolly`ishmael`inn
Key-aligned vectors (0n marks a null for a term absent from that document)
4.997 0.923 4.969 0 3.722 0n 0n 2.049 3.722 0n ...
0n 0.653 0n 0 5.263 0.479 2.636 2.898 5.263 4.057 ...
TEXT
COSINE SIMILARITY
▸ Used to calculate the cosine of the angle between two vectors
▸ Dot product is the sum of the pair-wise products
▸ Given two vectors aligned so that each index i refers to the same term in each, the q expression is: 0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x;y)
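For readers unfamiliar with q, the one-liner above can be sketched in Python; q's `0 ^` fills a null result with zero, mirrored here by returning 0 when either magnitude is zero (the function name is illustrative):

```python
import math

def cosine(x, y):
    # Dot product: the sum of the pair-wise products of the aligned vectors
    dot = sum(a * b for a, b in zip(x, y))
    # Product of the two magnitudes (q: (*) . {sqrt sum x xexp 2} each (x;y))
    mag = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # q's `0 ^` fills the null from a zero division with 0
    return 0.0 if mag == 0 else dot / mag
```

Identical directions give 1, orthogonal vectors give 0, and an empty document safely scores 0.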
TEXT
TF-IDF
▸ Significance depends on the document and the corpus
▸ A word is significant if it is common in a document, but uncommon in a corpus
▸ Term Frequency * Inverse Document Frequency
▸ IDF: log count[corpora] % sum containsTerm
▸ TF: (1 + log occurrences) | 0
▸ significance: TF * IDF
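The three formulas above translate directly to Python; q's `|` is max, so TF is floored at zero (natural logs and the parameter names are assumptions):

```python
import math

def tfidf(occurrences, n_docs, n_docs_containing):
    # TF: (1 + log occurrences) | 0  -- floored at 0, and 0 for an absent term
    tf = max(0.0, 1 + math.log(occurrences)) if occurrences > 0 else 0.0
    # IDF: log of total documents over documents containing the term
    idf = math.log(n_docs / n_docs_containing)
    return tf * idf
```

A term appearing in every document gets IDF 0, so however frequent it is, it is never significant.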
TEXT
LEVEL STATISTICS
▸ Within a single document, clustered terms are more significant than uniformly distributed ones
▸ Compares the standard deviation of distances between words to the distribution that would be predicted by a geometric distribution.
▸ Where “distances” is a vector of the distances between occurrences of a word:
σ : (dev each distances) % avg each distances
σnor : σ % sqrt 1 - p
sd(σnor) : 1 % (sqrt n) * (1 + 2.8 * n xexp -0.865) // factors in # of occurrences
⟨σnor⟩ : ((2 * n) - 1) % ((2 * n) + 2)
significance : (σnor - ⟨σnor⟩) % sd(σnor)
▸ Carpena, P., et al. "Level statistics of words: Finding keywords in literary texts and symbolic sequences." Physical Review E 79.3 (2009): 035102.
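A Python sketch of Carpena et al.'s score as laid out above, for a single word: `positions` holds the word's offsets in the document, `total_words` is the document length, and `pstdev` plays the role of q's `dev` (population standard deviation); names are illustrative:

```python
import math
import statistics

def level_significance(positions, total_words):
    n = len(positions)
    # distances between consecutive occurrences of the word
    distances = [b - a for a, b in zip(positions, positions[1:])]
    p = n / total_words                       # overall probability of the word
    # sigma: dev of the distances over their mean
    sigma = statistics.pstdev(distances) / statistics.mean(distances)
    sigma_nor = sigma / math.sqrt(1 - p)      # normalized sigma
    # expected mean and deviation of sigma_nor under a geometric distribution
    expected = (2 * n - 1) / (2 * n + 2)
    sd = 1 / (math.sqrt(n) * (1 + 2.8 * n ** -0.865))
    return (sigma_nor - expected) / sd
```

Clustered occurrences score above zero; evenly spaced occurrences score below, matching the intuition that clustered terms are the significant ones.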
TEXT
CLUSTERING
▸ Find groupings of entities
▸ Cluster documents, terms, proper nouns
▸ Find natural divisions of text
▸ Can be random or deterministic
▸ Can take as parameters: similarity of documents, number of clusters, time spent clustering
▸ Cluster centroids can be represented as feature vectors
TEXT
CLUSTERING ALGORITHMS
▸ K means
▸ Pick k random documents and cluster around them; then use the centroids as the new clustering points, and repeat until convergence
▸ Buckshot clustering
▸ Cluster sqrt(n) of the documents with an O(n^2) algorithm, then match the remaining documents to the centroids
▸ Group Average
▸ Starting with buckshot clusters, cluster each cluster into sub-clusters, merge any similar sub-clusters, and repeat as long as you want
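The k-means loop described above can be sketched with NumPy, clustering feature vectors around k randomly chosen starting documents (function and parameter names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random documents as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each document to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):       # converged
            break
        centroids = new
    return labels, centroids
```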
TEXT
THE MARKOV CLUSTERING ALGORITHM
▸ “… random walks on the graph will infrequently go from one natural cluster to another.” - Stijn van Dongen
▸ Multiply the matrix by itself, square every element, normalize the columns, and repeat this process until it converges.
▸ Rows with multiple non-zero values give the clusters
▸ http://micans.org/mcl/
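The expansion/inflation loop just described, sketched with NumPy using inflation exponent 2 (squaring every element); the adjacency matrix is assumed to include self-loops, and all names are illustrative:

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=100, tol=1e-6):
    M = adj.astype(float)
    M = M / M.sum(axis=0, keepdims=True)       # make columns stochastic
    for _ in range(iters):
        prev = M
        M = M @ M                               # expansion: multiply the matrix by itself
        M = M ** inflation                      # inflation: square every element
        M = M / M.sum(axis=0, keepdims=True)    # re-normalize the columns
        if np.abs(M - prev).max() < tol:        # converged
            break
    # rows with non-zero values give the clusters
    return {frozenset(np.flatnonzero(r > tol)) for r in M if (r > tol).any()}
```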
TEXT
MARKOV CLUSTERING FOR DOCUMENTS
.80+          Form letters, similar versions, updated articles
.60 to .80    Articles translated from English, then back
.50 to .60    Articles about the same events
.25 to .50    Articles about the same topics
.10 to .25    Articles about the same, more general, topics
less than .10 Several very large clusters; outliers become obvious
Minimum similarity is passed in as a parameter
TEXT
MARKOV CLUSTERING THE BIBLE (OR KJB IN KDB)
▸ At .49, you get Matthew, Mark and Luke in a cluster
▸ At .11, you get the New Testament, the Old Testament, and the Epistles, clustered by author
▸ At .05, you get the Epistles of John in one cluster, and everything else in another
(Figures: cluster graphs at similarity .06 and .08)
TEXT
EXPLAINING SIMILARITY
▸ Useful for explaining why
▸ a document is in a cluster
▸ a document matches a query
▸ two documents are similar
▸ product : (terms1 % magnitude terms1) * (terms2 % magnitude terms2)
▸ desc alignedKeys ! product % sum product
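The same computation in Python: each aligned term's pair-wise product of normalized weights, divided by the total, is that term's share of the cosine similarity (a sketch; names are illustrative):

```python
import numpy as np

def explain(keys, terms1, terms2):
    a = terms1 / np.linalg.norm(terms1)   # terms1 % magnitude terms1
    b = terms2 / np.linalg.norm(terms2)
    product = a * b
    share = product / product.sum()       # each term's share of the similarity
    # q: desc alignedKeys ! product % sum product
    return sorted(zip(keys, share), key=lambda kv: -kv[1])
```

The top entries of the result are exactly the terms quoted in the gospel examples that follow.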
Diogenes sitting in his tub, Jean-Léon Gerôme
TEXT
EXPLAINING SIMILARITY 3
▸ Given the cluster containing the three gospels Matthew, Mark and Luke, described by the keywords: disciples, pharisees, john, peter, herod, mary, answering, scribes, simon, pilate
▸ The Gospel According to Saint Matthew (relevance 0.84): disciples 0.310, pharisees 0.146, peter 0.087, herod 0.085, john 0.082, scribes 0.062, mary 0.061, hour 0.057, publicans 0.056, simon 0.053
▸ The Gospel According to Saint Mark (relevance 0.78): disciples 0.291, pharisees 0.115, john 0.093, peter 0.090, herod 0.081, immediately 0.072, scribes 0.069, answering 0.066, pilate 0.062, mary 0.061
▸ The Gospel According to Saint Luke (relevance 0.82): disciples 0.244, pharisees 0.133, john 0.093, answering 0.091, herod 0.088, peter 0.086, mary 0.076, simon 0.067, pilate 0.062, immediately 0.061
TEXT
COMPARE CORPORA
▸ Find, for each term, the difference in relative frequency via log-likelihood
▸ totalFreq : (termCountA + termCountB) % (totalWordCountA + totalWordCountB)
▸ expectedA : totalWordCountA * totalFreq
▸ desc (termCountA * log[termCountA % expectedA])
▸ Rayson, Paul, and Roger Garside. "Comparing corpora using frequency profiling." Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 2000.
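A Python sketch of the comparison above for a single term, following the slide's formula: the term's observed count in corpus A times the log of observed over expected (parameter names are illustrative):

```python
import math

def keyness(term_count_a, term_count_b, total_words_a, total_words_b):
    # relative frequency of the term across both corpora combined
    total_freq = (term_count_a + term_count_b) / (total_words_a + total_words_b)
    expected_a = total_words_a * total_freq   # expected count in corpus A
    return term_count_a * math.log(term_count_a / expected_a)
```

Ranking every term by this score descending surfaces corpus A's characteristic vocabulary, as in the Old/New Testament lists that follow.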
TEXT
COMPARE CORPORA - KJB
▸ Old Testament - Lord, shall, thy, Israel, king, thee, thou, land, shalt, children, house
▸ New Testament - Jesus, ye, Christ, things, unto, god, faith, disciples, man, world, say
TEXT
COMPARE CORPORA - JEFF SKILLING’S EMAILS
▸ Business emails - enron, please, jeff, energy, information, market, business
▸ Fraternity emails - yahoo, beta, betas, reunion, kai, ewooglin
TEXT
WORDS AS VECTORS
▸ Words can be described as vectors
▸ All previously mentioned operations become available on individual words
▸ Vectors are based on co-occurrence
▸ word2vec uses machine learning to find which co-occurring words are most predictive
TEXT
CALCULATING VECTORS FOR WORDS
▸ Finding the significance of “captain” to “ahab”
▸ Of the 272 sentences containing “captain”, 78 contain “ahab”
▸ “captain” occurs in 2.7% of sentences overall, but in 16% of the sentences that also contain “ahab”
▸ The likelihood of a sentence that contains “ahab” also containing “captain” is a binomially distributed random variable, as it is the product of a Bernoulli process
▸ The deviation of this random variable is √(np(1-p)), where p is the overall probability of “captain” being in a sentence
▸ Significance is (cooccurrenceRate - overallFrequency) % deviation: (.16 - .027) % .162 ≈ .84
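A Python sketch of the worked example; the slide's deviation of .162 equals √(p(1−p)) with p = .027, the per-sentence rate, so that form of the deviation is used here (names are illustrative):

```python
import math

def significance(cooccurrence_rate, overall_frequency):
    # deviation of the Bernoulli variable with success probability p
    deviation = math.sqrt(overall_frequency * (1 - overall_frequency))
    return (cooccurrence_rate - overall_frequency) / deviation
```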
TEXT
WORD VECTOR EXAMPLES
Moby
stem      relevance tokens
---------------------------------------------------------------
dick      11.3      `dick`dick's
whale     7.75      `whaling`whale`whales`whale's`whaled
white     7.04      `white`whiteness`whitenesses`whites
ahab      6.1       `ahab`ahab's`ahabs
boat      4.95      `boat`boats`boat's
encounter 4.52      `encounter`encountered`encountering`encounters
seem      4.31      `seemed`seem`seems`seeming
sea       4.13      `sea`seas`sea's
TEXT
WORD VECTOR EXAMPLES
harpoon
stem     relevance tokens
--------------------------------------------------------
whale    3.918473  `whaling`whale`whales`whale's`whaled
boat     2.902082  `boat`boats`boat's
line     2.235111  `line`lines`lined`lining
sea      1.991354  `sea`seas`sea's
iron     1.973497  `iron`irons`ironical
dart     1.964671  `dart`darted`darts`darting
ship     1.888228  `ship`ships`shipped`ship's`shipping
queequeg 1.825947  `queequeg`queequeg's
TEXT
WORD VECTOR EXAMPLES - PROPER NOUNS ONLY
Moses
stem      relevance
-------------------
aaron     2.76
israel    2.26
pharaoh   1.39
egypt     1.31
egyptians 1.23
levites   1.19
eleazar   1.07
sinai     1.06
joshua    1
jordan    0.921
god       0.9

Jesus
stem      relevance
-------------------
galilee   1.85
god       1.85
son       1.73
lord      1.68
john      1.57
peter     1.5
jerusalem 1.47
jews      1.45
pilate    1.37
david     1.33
pharisees 1.24

Pharaoh
stem      relevance
-------------------
egypt     5.53
moses     3.9
joseph    3.46
egyptians 3.26
goshen    2.21
aaron     2.03
israel    1.58
god       1.16
red sea   1.16
canaan    1.13
hebrews   1.13
TEXT
WORDS AS VECTORS
▸ Clustering words becomes possible
▸ Given the names: pharaoh jude simon noah lamech judas ham methuselah aaron levi moses shem japeth jesus
▸ Cluster 1: noah lamech ham methuselah shem
▸ Cluster 2: pharaoh aaron levi moses
▸ Cluster 3: simon judas jesus
TEXT
ANSWER QUERIES
▸ Find the harmonic mean of each token's relevance for each search term
▸ Drop any terms with above average significance to the anti-search terms
Search Terms: captain, pequod
captain ahab  | 0.672187
captain peleg | 0.4844358
captains      | 0.4797662
captain bildad| 0.4429896

Search Terms: captain, !pequod
captain sleet   | 0.5986764
captain scoresby| 0.5184432
captain pollard | 0.5184432
captain mayhew  | 0.5184432
captain boomer  | 0.5184432
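The query scoring can be sketched in Python: each candidate token gets the harmonic mean of its relevance to every search term, which strongly penalizes tokens irrelevant to any single term (the `relevance` mapping and all names are illustrative; the anti-term filtering step is omitted):

```python
from statistics import harmonic_mean

def answer_query(relevance, search_terms):
    # only tokens relevant to every search term can score at all
    tokens = set.intersection(*(set(relevance[t]) for t in search_terms))
    scores = {tok: harmonic_mean([relevance[t][tok] for t in search_terms])
              for tok in tokens}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For a negated term such as !pequod, tokens whose significance to the anti-term is above average would instead be dropped.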
TEXT
ANSWER QUERIES
How are Captain Bildad and Captain Peleg related?
captain| 0.0341 hand | 0.00935 old | 0.00896 ship | 0.00894 owner | 0.00883
TEXT
EXPAND SETS
Summing the vectors for a set of words will give an expanded set
expanding simon, andrew, james, and john gives: bartholomew, alphaeus, matthew, thaddaeus, canaanite, zelotes, thomas, brother, iscariot, zebedee, james, peter, lebbaeus, boanerges, traitor, andrew, philip, simon, judas, and john
expanding bread, fish, milk, and beans gives: butter, honey, lentiles, cheese, millet, kine (cows), fitches (spelt), parched, shobi, bason, earthen, pulse, wheat, and barley
TEXT
STEMMING
▸ Stemming removes what it guesses are inflections:
antidisestablishmentarianism -> establish
programmer -> program
brother -> broth
▸ Produces a root word, which may not be a real word: happiness -> happi
▸ Stemmers can be compared by aggressiveness
▸ Stemming is rule based, does not require extensive datasets
TEXT
STEMMING
▸ Moby Dick has 16,950 distinct words and 10,466 distinct stems; nearly 700 words have 4 or more inflected forms
▸ general generally generous generic generously generalizing generations generated
▸ admirer admire admirals admiral admirable admirably admiral’s admirers
TEXT
TOKENIZING
▸ Tokens are individual words, names, numbers, etc.
▸ Proper names are counted as a single token
▸ Rule based, for simplicity
▸ To tokenize, all characters not in [a-zA-Z0-9’\u00C0-\u017F] get replaced with whitespace, then split on whitespace, remove terminal apostrophes, then join consecutive proper nouns
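The rule above reads almost directly as a regular expression. This Python sketch covers everything except the final proper-noun joining step; it strips apostrophes from both ends of a token and accepts both straight and curly apostrophes (both assumptions beyond what the slide states):

```python
import re

def tokenize(text):
    # replace every character not in [a-zA-Z0-9'\u00C0-\u017F] with whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9'’\u00C0-\u017F]", " ", text)
    # split on whitespace and strip stray apostrophes from token ends
    tokens = [t.strip("'’") for t in cleaned.split()]
    return [t for t in tokens if t]
```

Internal apostrophes survive ("whale's"), quoting apostrophes do not, and accented Latin letters in U+00C0–U+017F are kept as word characters.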
TEXT
PROPER NOUNS
▸ Any run of title-cased words not at the start of a sentence is treated as a proper noun
▸ A title-cased word at the start of a sentence is treated as a proper noun if it is found as a proper noun elsewhere
TEXT
SENTENCE DETECTION
▸ Rule based, for simplicity
▸ Break on ‘?’, ‘!’ or any period which isn’t in an ellipsis (…), used as an infix (P.E.I.), preceding a number (.1416), following a title (Ms.)
▸ Sentence boundaries get modified to include quotes or multiple punctuation marks
▸ Not as accurate as machine learning algorithms which can detect sentence breaks such as “My uncle is from P.E.I. He’s in the potato business”
EXPLORATORY NLP UI IN ANALYST FOR KX
Searching, managing document collections, and common NLP operations are available through the UI
VISUALIZING NLP DATA IN ANALYST FOR KX
Data from prose text can be visualized; here, a discontinuity shows at chapters 32-45, which are on zoology rather than part of the narrative
TEXT
CLOSING COMMENTS
▸ Feature vectors and the vector space model work for things other than documents
▸ Used with dialectology, phonetics, music classification, recommender systems
▸ Other operations that can be done with the same techniques include summarization, phrase detection, and natural language generation