
NLP IN Q


▸ No existing NLP libraries

▸ Parsing is expensive, simple vector operations are cheap

▸ Focus on vector operations rather than named entity recognition, part-of-speech tagging, or co-reference resolution


REPRESENTING DATA

Document 1
circumambulate| 4.997212
sail          | 0.9236821
cook          | 4.969805
whale         | 0
ishmael       | 3.722053
...

Document 2
fish   | 0.4790985
harpoon| 2.636207
jolly  | 2.898556
ishmael| 5.263778
inn    | 4.057829
...

Union of keys
`circumambulate`sail`cook`whale`ishmael`fish`harpoons`jolly`ishmael`inn

Key-aligned vectors
Document 1: 4.997 0.923 4.969 0 3.722 0n 0n 2.049 3.722 0n ...
Document 2: 0n 0.653 0n 0 5.263 0.479 2.636 2.898 5.263 4.057 ...
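A minimal sketch of this representation in q (dictionaries abridged from the slide; indexing a dictionary by absent keys returns nulls):

/ term-weight dictionaries for two documents (values abridged from the slide)
d1:`circumambulate`sail`cook`whale`ishmael!4.997212 0.9236821 4.969805 0 3.722053
d2:`fish`harpoon`jolly`ishmael`inn!0.4790985 2.636207 2.898556 5.263778 4.057829
/ union of keys, in order of first appearance
k:distinct key[d1],key d2
/ key-aligned vectors; terms absent from a document come back as null (0n)
v1:d1 k
v2:d2 k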


COSINE SIMILARITY

▸ Measures the cosine of the angle between two vectors

▸ Dot product is the sum of the pair-wise products

▸ Given two vectors aligned such that each index i refers to the same element in each vector, the q is:

0 ^ (sum x * y) % (*) . {sqrt sum x xexp 2} each (x; y)
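Wrapped as a function (a sketch; the name cosSim is an assumption), the leading 0^ fills the null produced when either magnitude is zero:

/ cosine similarity of two aligned vectors, per the expression above
cosSim:{[x;y] 0 ^ (sum x*y) % (*) . {sqrt sum x xexp 2} each (x;y)}
cosSim[0^v1;0^v2]      / compare the key-aligned document vectors, nulls filled with 0
cosSim[1 2 3f;1 2 3f]  / 1f: identical direction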


TF-IDF

▸ Significance depends on the document and the corpus

▸ A word is significant if it is common in a document, but uncommon in a corpus

▸ Term Frequency * Inverse Document Frequency

▸ IDF: log count[corpora] % sum containsTerm;
  TF: (1 + log occurrences) | 0;
  significance: TF * IDF;
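A runnable sketch of these formulas (argument names follow the slide; containsTerm is assumed to be one boolean per document in the corpus):

/ TF-IDF significance of one term in one document
/ occurrences: times the term appears in the document
/ containsTerm: flag per corpus document
tfidf:{[occurrences;containsTerm]
  idf:log count[containsTerm] % sum containsTerm;
  tf:(1 + log occurrences) | 0;
  tf * idf }
tfidf[12;100110b]  / a term seen 12 times here, in 3 of 6 corpus documents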


LEVEL STATISTICS

▸ Within a single document, clustered terms are more significant than uniformly distributed ones

▸ Compares the standard deviation of distances between words to the distribution that would be predicted by a geometric distribution.

▸ Where “distances” is a vector of the distances between successive occurrences of a word (a sketch follows the reference):

σ : (dev each distances) % avg each distances;
σnor : σ % sqrt 1 - p;
sd(σnor) : 1 % (sqrt n) * (1 + 2.8 * n xexp -0.865);  / factors in # of occurrences
⟨σnor⟩ : ((2 * n) - 1) % ((2 * n) + 2);
significance : (σnor - ⟨σnor⟩) % sd(σnor);

▸ Carpena, P., et al. "Level statistics of words: Finding keywords in literary texts and symbolic sequences." Physical Review E 79.3 (2009): 035102.
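A sketch for a single word, assuming pos holds its token positions in the document and N the document's total token count (p is taken as n%N; the expressions mirror the slide):

/ level-statistics significance of one word (Carpena et al. 2009)
levelSig:{[pos;N]
  n:count pos;
  p:n % N;                                        / overall rate of the word
  distances:1 _ deltas pos;                       / gaps between successive occurrences
  sigma:dev[distances] % avg distances;
  signor:sigma % sqrt 1 - p;
  sdnor:1 % (sqrt n) * 1 + 2.8 * n xexp -0.865;   / factors in # of occurrences
  mnor:((2 * n) - 1) % (2 * n) + 2;
  (signor - mnor) % sdnor }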


CLUSTERING

▸ Find groupings of entities

▸ Cluster documents, terms, proper nouns

▸ Find natural divisions of text

▸ Can be random or deterministic

▸ Can take as parameters: similarity of documents, number of clusters, time spent clustering

▸ Cluster centroids can be represented as feature vectors


CLUSTERING ALGORITHMS

▸ K means

▸ Pick k random documents and cluster around these, then use the centroids as the new clustering points, and repeat until convergence (see the sketch after this list)

▸ Buckshot clustering

▸ Cluster sqrt(n) of the documents with an O(n^2) algorithm, then match the remaining documents to the centroids

▸ Group Average

▸ Starting with buckshot clusters, cluster each cluster into sub-clusters, merge any similar sub-clusters, and repeat as long as you want
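A minimal k-means sketch over document feature vectors (Euclidean distance and a fixed iteration count are assumptions; the talk does not show an implementation):

/ m: list of equal-length float vectors; k: cluster count; returns a cluster id per document
dist:{sqrt sum (x-y) xexp 2}
kmeans:{[m;k]
  c:m neg[k]?count m;                                 / k distinct random documents as initial centroids
  do[20;                                              / fixed iterations stand in for a convergence test
    g:{[c;v] first iasc dist[;v] each c}[c] each m;   / index of the nearest centroid
    c:{[m;g;i] avg m where g = i}[m;g] each til k];   / recompute centroids (an empty cluster goes null)
  g }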


THE MARKOV CLUSTERING ALGORITHM

▸ “… random walks on the graph will infrequently go from one natural cluster to another.” - Stijn van Dongen

▸ Multiply the matrix by itself, square every element, normalize the columns, and repeat this process until it converges (see the sketch after this list)

▸ Rows with multiple non-zero values give the clusters

▸ http://micans.org/mcl/
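A compact sketch of that process (a fixed iteration count substitutes for the convergence test, and the pruning of small entries that real MCL performs is omitted):

/ m: square float similarity matrix with self-loops, as a list of rows
norm:{x %\: sum x}                       / scale so each column sums to 1
mcl:{[m]
  m:20 {norm x*x:x mmu x}/ norm m;       / expand (matrix square), inflate (elementwise square), renormalize
  where each 0 < m }                     / non-zero entries per row give the clusters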


MARKOV CLUSTERING FOR DOCUMENTS

Similarity      Cluster contents
-------------   ------------------------------------------------
.80+            Form letters, similar versions, updated articles
.60 to .80      Articles translated from English, then back
.50 to .60      Articles about the same events
.25 to .50      Articles about the same topics
.10 to .25      Articles about the same, more general, topics
less than .10   Several very large clusters; outliers become obvious

Minimum similarity is passed in as a parameter


MARKOV CLUSTERING THE BIBLE (OR KJB IN KDB)

▸ At .49, you get Matthew, Mark and Luke in a cluster

▸ At .11, you get the New Testament, the Old Testament, and the Epistles, clustered by author

▸ At .05, you get the Epistles of John in one cluster, and everything else in another

(Figures: cluster visualizations at similarity .06 and similarity .08)


EXPLAINING SIMILARITY

▸ Useful for explaining why

▸ a document is in a cluster

▸ a document matches a query

▸ two documents are similar

▸ product : (terms1 % magnitude terms1) * (terms2 % magnitude terms2);
  desc alignedKeys ! product % sum product;
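Made self-contained (a sketch; magnitude is assumed to be the Euclidean norm, and null weights should be filled with 0 first):

/ per-term contribution to the similarity of two aligned term-weight vectors
magnitude:{sqrt sum x xexp 2}
explain:{[alignedKeys;terms1;terms2]
  product:(terms1 % magnitude terms1) * (terms2 % magnitude terms2);
  desc alignedKeys ! product % sum product }   / the top entries explain the match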

(Image: Diogenes sitting in his tub, Jean-Léon Gérôme)


EXPLAINING SIMILARITY

▸ Given the cluster containing the three gospels Matthew, Mark and Luke, described by the keywords: disciples, pharisees, john, peter, herod, mary, answering, scribes, simon, pilate

▸ The Gospel According to Saint Matthew (Relevance 0.84)
  disciples 0.310, pharisees 0.146, peter 0.087, herod 0.085, john 0.082, scribes 0.062, mary 0.061, hour 0.057, publicans 0.056, simon 0.053

▸ The Gospel According to Saint Mark (Relevance 0.78)
  disciples 0.291, pharisees 0.115, john 0.093, peter 0.090, herod 0.081, immediately 0.072, scribes 0.069, answering 0.066, pilate 0.062, mary 0.061

▸ The Gospel According to Saint Luke (Relevance 0.82)
  disciples 0.244, pharisees 0.133, john 0.093, answering 0.091, herod 0.088, peter 0.086, mary 0.076, simon 0.067, pilate 0.062, immediately 0.061


COMPARE CORPORA

▸ Find, for each term, the difference in relative frequency via log-likelihood

▸ totalFreq : (termCountA + termCountB) % (totalWordCountA + totalWordCountB);
  expectedA : totalWordCountA * totalFreq;
  desc (termCountA * log[termCountA % expectedA]);

▸ Rayson, Paul, and Roger Garside. "Comparing corpora using frequency profiling." Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 2000.
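Assembled into a function (a sketch; termCountA and termCountB are assumed to be per-term counts aligned over a shared vocabulary terms):

/ log-likelihood comparison of two corpora, per the fragment above
compareCorpora:{[terms;termCountA;termCountB]
  totalFreq:(termCountA + termCountB) % sum[termCountA] + sum termCountB;
  expectedA:sum[termCountA] * totalFreq;
  desc terms ! termCountA * log termCountA % expectedA }  / most over-represented in A first; terms absent from A yield nulls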


COMPARE CORPORA - KJB

▸ Old Testament - Lord, shall, thy, Israel, king, thee, thou, land, shalt, children, house

▸ New Testament - Jesus, ye, Christ, things, unto, god, faith, disciples, man, world, say


COMPARE CORPORA - JEFF SKILLING’S EMAILS

▸ Business emails - enron, please, jeff, energy, information, market, business

▸ Fraternity emails - yahoo, beta, betas, reunion, kai, ewooglin


WORDS AS VECTORS

▸ Words can be described as vectors

▸ All previously mentioned operations become available on individual words

▸ Vectors are based on co-occurrence

▸ word2vec uses machine learning to find which co-occurring words are most predictive


CALCULATING VECTORS FOR WORDS

▸ Finding the significance of “captain” to “ahab”

▸ Of the 272 sentences containing “captain”, 78 contain “ahab”

▸ “captain” occurs in 2.7% of sentences, but occurs in 16% of sentences also containing “ahab”

▸ The likelihood of a sentence that contains “ahab” also containing “captain” is a binomially distributed random variable, as it is the product of a Bernoulli process

▸ The deviation of this random variable is √(np(1-p)), where p is the overall probability of “captain” being in a sentence

▸ Significance is (cooccurrenceRate - overallFrequency) % deviation:

  (.16 - .027) % .162 → .84 (inputs shown rounded)
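The same calculation as a q sketch (representing sents as a list of tokenized sentences is an assumption):

/ significance of word w to word v from sentence co-occurrence, per the above
assoc:{[sents;v;w]
  hasW:w in/: sents;                     / flag per sentence: contains w?
  p:avg hasW;                            / overall rate of w
  coRate:avg hasW where v in/: sents;    / rate of w among sentences containing v
  (coRate - p) % sqrt p * 1 - p }        / deviations above the base rate
assoc[sents;`ahab;`captain]              / ≈ .84 for Moby Dick, per the slide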


WORD VECTOR EXAMPLES

Moby

stem      relevance tokens
-----------------------------------------------------------------
dick      11.3      `dick`dick's
whale     7.75      `whaling`whale`whales`whale's`whaled
white     7.04      `white`whiteness`whitenesses`whites
ahab      6.1       `ahab`ahab's`ahabs
boat      4.95      `boat`boats`boat's
encounter 4.52      `encounter`encountered`encountering`encounters
seem      4.31      `seemed`seem`seems`seeming
sea       4.13      `sea`seas`sea's


WORD VECTOR EXAMPLES

harpoon

stem     relevance tokens
-------------------------------------------------------
whale    3.918473  `whaling`whale`whales`whale's`whaled
boat     2.902082  `boat`boats`boat's
line     2.235111  `line`lines`lined`lining
sea      1.991354  `sea`seas`sea's
iron     1.973497  `iron`irons`ironical
dart     1.964671  `dart`darted`darts`darting
ship     1.888228  `ship`ships`shipped`ship's`shipping
queequeg 1.825947  `queequeg`queequeg's


WORD VECTOR EXAMPLES - PROPER NOUNS ONLY

Moses
stem      relevance
-------------------
aaron     2.76
israel    2.26
pharaoh   1.39
egypt     1.31
egyptians 1.23
levites   1.19
eleazar   1.07
sinai     1.06
joshua    1
jordan    0.921
god       0.9

Jesus
stem      relevance
-------------------
galilee   1.85
god       1.85
son       1.73
lord      1.68
john      1.57
peter     1.5
jerusalem 1.47
jews      1.45
pilate    1.37
david     1.33
pharisees 1.24

Pharaoh
stem      relevance
-------------------
egypt     5.53
moses     3.9
joseph    3.46
egyptians 3.26
goshen    2.21
aaron     2.03
israel    1.58
god       1.16
red sea   1.16
canaan    1.13
hebrews   1.13


WORDS AS VECTORS

▸ Clustering words becomes possible

▸ Given the names: pharaoh jude simon noah lamech judas ham methuselah aaron levi moses shem japeth jesus

▸ Cluster 1: noah lamech ham methuselah shem
  Cluster 2: pharaoh aaron levi moses
  Cluster 3: simon judas jesus


ANSWER QUERIES

▸ Find the harmonic mean of each token's relevance for each search term (a sketch follows the examples below)

▸ Drop any terms with above average significance to the anti-search terms

Search terms: captain, pequod
captain ahab  | 0.672187
captain peleg | 0.4844358
captains      | 0.4797662
captain bildad| 0.4429896

Search terms: captain, !pequod
captain sleet   | 0.5986764
captain scoresby| 0.5184432
captain pollard | 0.5184432
captain mayhew  | 0.5184432
captain boomer  | 0.5184432
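A sketch of the scoring step (the shape of the relevance data is an assumption, and the anti-search-term filtering is omitted):

/ score candidate tokens by the harmonic mean of their relevance to each search term
hmean:{count[x] % sum 1 % x}
/ cands: candidate tokens; relMat: one relevance vector per search term, aligned with cands
score:{[cands;relMat] desc cands ! hmean each flip relMat}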


ANSWER QUERIES

How are Captain Bildad and Captain Peleg related?

captain| 0.0341
hand   | 0.00935
old    | 0.00896
ship   | 0.00894
owner  | 0.00883


EXPAND SETS

Summing the vectors for a set of words will give an expanded set

expanding simon, andrew, james, and john gives bartholomew, alphaeus, matthew, thaddaeus, canaanite, zelotes, thomas, brother, iscariot, zebedee, james, peter, lebbaeus, boanerges, traitor, andrew, philip, simon, judas, and john

expanding bread, fish, milk, and beans gives butter, honey, lentiles, cheese, millet, kine (cows), fitches (spelt), parched, shobi, bason, earthen, pulse, wheat, and barley
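As a sketch (representing wordVecs as a dictionary of word to feature vector, aligned over a vocabulary vocab, is an assumption):

/ expand a word set: sum the members' vectors, return the n strongest dimensions
expand:{[vocab;wordVecs;words;n] n # key desc vocab ! sum wordVecs words}
expand[vocab;wordVecs;`simon`andrew`james`john;20]  / the apostles example above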


STEMMING

▸ Stemming removes what it guesses are inflections:
  antidisestablishmentarianism -> establish
  programmer -> program
  brother -> broth

▸ Produces a root word, which may not be a real word: happiness -> happi

▸ Stemmers can be compared by aggressiveness

▸ Stemming is rule based, does not require extensive datasets


STEMMING

▸ Moby Dick has 16,950 distinct words and 10,466 distinct stems; nearly 700 words have 4 or more inflected forms

▸ general generally generous generic generously generalizing generations generated

▸ admirer admire admirals admiral admirable admirably admiral’s admirers


TOKENIZING

▸ Tokens are individual words, names, numbers, etc.

▸ Proper names are counted as a single token

▸ Rule based, for simplicity

▸ To tokenize, all characters not in [a-zA-Z0-9’\u00C0-\u017F] get replaced with whitespace; the text is then split on whitespace, terminal apostrophes are removed, and consecutive proper nouns are joined (a sketch follows)
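A minimal sketch of these rules (ASCII only; the accented range \u00C0-\u017F and the proper-noun joining step are omitted):

/ tokenize: keep letters, digits and apostrophes; everything else becomes whitespace
tokenize:{[text]
  keep:.Q.a,.Q.A,"0123456789'";
  toks:" " vs ?[text in keep;text;" "];          / replace, then split on whitespace
  toks:{$["'" ~ last x; -1 _ x; x]} each toks;   / strip terminal apostrophes
  toks where 0 < count each toks }               / drop empties from repeated separators
tokenize "Call me Ishmael. Some years ago..."    / ("Call";"me";"Ishmael";"Some";"years";"ago")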


PROPER NOUNS

▸ Any run of title-cased words not at the start of a sentence is treated as a proper noun

▸ Any title-cased word at the start of a sentence is treated as a proper noun if it is found as a proper noun elsewhere


SENTENCE DETECTION

▸ Rule based, for simplicity

▸ Break on ‘?’, ‘!’, or any period which isn’t part of an ellipsis (…), used as an infix (P.E.I.), preceding a number (.1416), or following a title (Ms.)

▸ Sentence boundaries get modified to include quotes or multiple punctuation marks

▸ Not as accurate as machine learning algorithms which can detect sentence breaks such as “My uncle is from P.E.I. He’s in the potato business”

EXPLORATORY NLP UI IN ANALYST FOR KX

Searching, managing document collections, and common NLP operations are available through the UI

VISUALIZING NLP DATA IN ANALYST FOR KX

Data from prose text can be visualized, here showing a discontinuity over chapters 32-45, which are on zoology instead of part of the narrative


CLOSING COMMENTS

▸ Feature vectors and the vector space model work for things other than documents

▸ Used with dialectology, phonetics, music classification, recommender systems

▸ Other operations that can be done with the same techniques include summarization, phrase detection, and natural language generation


CONTACT INFO

bjeffery@kx.com
