SEARCH4SIMILARS
at scale
What do you mean by similar?
■ Jaccard distance
■ Cosine distance
■ Lots of others
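The two measures above can be sketched in a few lines of Python. The documents and tokens here are toy examples, not from the slides; note that Jaccard/cosine *distance* is one minus the similarity computed below.

```python
# Toy similarity measures: Jaccard over token sets, cosine over count vectors.
from collections import Counter
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: similarity of two sets."""
    return len(a & b) / len(a | b)

def cosine(a: Counter, b: Counter) -> float:
    """Dot product of term-count vectors divided by their norms."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = "cool project about cool rooms".split()
d2 = "cool room project".split()
print(jaccard(set(d1), set(d2)))            # 0.4
print(cosine(Counter(d1), Counter(d2)))     # ≈ 0.655
```

Jaccard ignores repetition (it works on sets), while cosine keeps term counts, which is why the two scores differ on the same pair of documents.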
Deduplication / Plagiarism LSH
[Figure: pairwise comparison matrix of objects A–G]
The naive approach compares each object with every other one: O(n²).
The goal: compare only likely-similar items.
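A minimal sketch of that idea with MinHash plus banding: documents whose band signatures collide in a bucket become candidate pairs, so we only compare those instead of all O(n²) pairs. The parameters (2 rows × 25 bands), the crc32-based hash family, and the example documents are illustrative assumptions, not from the slides.

```python
# MinHash + banding LSH sketch (illustrative parameters and hash family).
import random
import zlib

random.seed(0)
ROWS, BANDS = 2, 25                      # r rows per band, b bands => r*b hashes
P = 2**31 - 1                            # Mersenne prime for the affine hashes

# Random affine hash functions h(x) = (a*x + b) mod P over crc32 token hashes.
coeffs = [(random.randrange(1, P), random.randrange(P))
          for _ in range(ROWS * BANDS)]

def minhash(tokens):
    """Signature: for each hash function, the minimum hash over the token set."""
    return [min((a * zlib.crc32(t.encode()) + b) % P for t in tokens)
            for a, b in coeffs]

def candidate_pairs(docs):
    """Bucket each band of each signature; docs sharing a bucket are candidates."""
    buckets = {}
    for doc_id, tokens in docs.items():
        sig = minhash(tokens)
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets.setdefault(key, set()).add(doc_id)
    return {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}

docs = {
    "a": set("the quick brown fox jumps".split()),
    "b": set("the quick brown fox leaps".split()),   # near-duplicate of "a"
    "c": set("completely different words here now".split()),
}
print(candidate_pairs(docs))   # near-duplicates "a"/"b" should share a bucket
```

The probability two documents share a bucket is roughly 1 − (1 − J^r)^b for Jaccard similarity J, which is why LSH finds very similar pairs reliably but misses moderately similar ones.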
LSH Applications
■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting
LSH is a dimensionality reduction technique
■ Batch algorithm
■ The word “the” is not the same as the word “bozo” when we compare two documents
– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time
– some related online work for restricted cases (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)
■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”
■ Well suited to finding very similar objects; not optimal for moderately similar ones.
Search 4 sense
■ Bayes theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA
Bayes' theorem
P(A | B) = P(B | A) P(A) / P(B)
where A and B are events.
■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
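A short worked example of the theorem; the prior, sensitivity, and false-positive numbers are made up for illustration.

```python
# Worked Bayes' theorem example: P(A|B) = P(B|A) * P(A) / P(B).
p_a = 0.01                      # prior P(A): a rare condition (assumed)
p_b_given_a = 0.9               # likelihood P(B|A): test sensitivity (assumed)
p_b_given_not_a = 0.05          # false positive rate P(B|¬A) (assumed)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))    # 0.154
```

Even with a 90% sensitive test, the posterior stays low because the prior P(A) is small; this is exactly the base-rate effect Bayes' theorem captures.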
Bayesian vs Frequentist statistics
■ Coin tossing: the coin landed heads 4 times out of 5
■ Conjugate prior
■ Exponential family
■ Sufficient statistic
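The coin example shows what a conjugate prior buys you: a Beta(α, β) prior with binomial data gives a closed-form Beta(α + heads, β + tails) posterior. The uniform Beta(1, 1) prior below is an assumption chosen for illustration.

```python
# Beta-Binomial conjugate update for the slide's coin data (4 heads of 5).
heads, tails = 4, 1
alpha, beta = 1, 1                                 # uniform Beta(1,1) prior (assumed)

alpha_post, beta_post = alpha + heads, beta + tails  # conjugacy: posterior is Beta
post_mean = alpha_post / (alpha_post + beta_post)    # Bayesian point estimate
mle = heads / (heads + tails)                        # frequentist point estimate

print(mle)        # 0.8
print(post_mean)  # 5/7 ≈ 0.714
```

The frequentist MLE says 0.8; the Bayesian posterior mean is pulled toward 0.5 by the prior, and the gap shrinks as more tosses are observed.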
Probabilistic Graphical Models
Topic modeling
Topic modeling assumptions
■ Word order within a document does not matter (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as document-word pairs
■ Each topic can be described via an unknown distribution
■ Independence assumption
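The bag-of-words assumption above can be made concrete in a few lines: word order is dropped, each document becomes word counts, and the collection becomes a document-term matrix. The two documents are a toy example.

```python
# Bag-of-words: documents -> term counts -> document-term matrix (toy data).
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})   # shared vocabulary
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

This matrix of document-word counts is exactly the input that pLSA and LDA factor into document-topic and topic-word distributions.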
probabilistic Latent Semantic Analysis
LDA
■ Almost the same as pLSA, but with a Dirichlet distribution as the prior
Links
Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/
K. Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf
D. Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/
D. Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm
■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
Repository & Chats announcements
■ Github: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine
■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine