search4similars

Posted on 22-Mar-2017


Category: Engineering


SEARCH4SIMILARS

at scale

What do you mean by similar?

■ Jaccard distance

■ Cosine distance

■ Lots of others
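As an illustration of the two distances above, here is a minimal plain-Python sketch (not from the deck): Jaccard over token sets, cosine over bag-of-words counts. The example documents are made up.

```python
from collections import Counter
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity of two token sequences: |A ∩ B| / |A ∪ B| over sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine similarity of two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

d1 = "cool project".split()
d2 = "cool room".split()
print(jaccard(d1, d2))  # 1 shared token of 3 total -> 1/3
print(cosine(d1, d2))   # dot = 1, norms sqrt(2)·sqrt(2) -> 0.5
```

Note the two measures disagree in general: Jaccard ignores term counts, cosine weights them.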

Deduplication / Plagiarism LSH

[Slide figure: an n × n pairwise comparison matrix over objects A–G]

The naive approach compares each object with every other one: O(n²).

Your goal: compare only items that are likely to be similar.
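A minimal MinHash-plus-banding sketch of that idea (illustrative Python, not the deck's code; the documents, salted hash scheme, and band counts are assumptions for the demo). Documents whose signatures agree on any whole band become candidate pairs, so only those need an exact comparison instead of all O(n²) pairs.

```python
import hashlib

def shingles(text, k=3):
    """All k-character shingles (substrings) of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hash_funcs(n):
    """n deterministic salted hash functions (illustrative, not tuned)."""
    def h(i):
        return lambda s: int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
    return [h(i) for i in range(n)]

def minhash_signature(shingle_set, hash_funcs):
    """MinHash signature: the minimum hash of the set under each function."""
    return tuple(min(h(s) for s in shingle_set) for h in hash_funcs)

def lsh_buckets(docs, num_hashes=20, bands=10):
    """Split each signature into bands; documents sharing any whole band
    land in the same bucket and become a candidate pair."""
    hfs = make_hash_funcs(num_hashes)
    rows = num_hashes // bands
    buckets = {}
    for name, text in docs.items():
        sig = minhash_signature(shingles(text), hfs)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(name)
    return [group for group in buckets.values() if len(group) > 1]

docs = {
    "A": "locality sensitive hashing at scale",
    "B": "locality sensitive hashing at scale!",   # near-duplicate of A
    "C": "bayesian topic modeling with lda",
}
print(lsh_buckets(docs))  # A and B should share a bucket; C almost surely stays apart
```

The band/row split tunes the similarity threshold: more rows per band means only very similar pairs collide, matching the slide's point that LSH fits near-duplicate search.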

LSH Applications

■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting

LSH is a dimensionality reduction technique
■ Batch algorithm
■ The word “the” should not carry the same weight as the word “bozo” when comparing two documents

– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time

– some online-related work for restricted cases (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)

■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”

■ Well suited to finding very similar objects; not optimal when the objects are only loosely similar.

Search 4 sense

■ Bayes theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA

Bayes' theorem

P(A | B) = P(B | A) · P(A) / P(B)

where A and B are events.
■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
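A worked numeric example of the theorem (the test's sensitivity, false-positive rate, and prevalence are hypothetical numbers chosen for illustration, not from the deck):

```python
# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
# A = "has the condition", B = "test is positive" (hypothetical numbers).
p_a = 0.01                # P(A): prevalence (prior)
p_b_given_a = 0.99        # P(B | A): sensitivity
p_b_given_not_a = 0.05    # P(B | not A): false-positive rate

# Total probability: P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ≈ 0.167: a positive test is still mostly a false alarm
```

The point of the example: because the prior P(A) is small, even an accurate test yields a modest posterior, which is exactly what the theorem's P(A) factor encodes.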

Bayesian vs Frequentist statistics

■ Coin tossing
– the coin landed heads on 4 out of 5 tosses

■ Conjugate prior
■ Exponential family
■ Sufficient statistic
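The coin example above can be worked through with the standard Beta–Binomial conjugate pair (a sketch; the uniform Beta(1, 1) prior is an assumption for the demo, not stated in the deck):

```python
# Conjugate prior sketch: Beta prior + Binomial likelihood -> Beta posterior.
# Coin from the slide: heads on 4 out of 5 tosses.
heads, tosses = 4, 5

# Frequentist point estimate (maximum likelihood):
mle = heads / tosses                      # 0.8

# Bayesian: start from a uniform Beta(1, 1) prior. Conjugacy means the
# posterior is again a Beta, with counts simply added to the parameters:
alpha = 1 + heads                         # 5
beta = 1 + (tosses - heads)               # 2
posterior_mean = alpha / (alpha + beta)   # 5/7, shrunk toward 0.5

print(mle, posterior_mean)
```

With only 5 tosses the posterior mean (≈0.714) sits between the MLE (0.8) and the prior mean (0.5); as data accumulates the two estimates converge.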

Probabilistic Graphical Models

Topic modeling

Topic modeling assumptions

■ Word order within a document does not matter (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as document–word pairs
■ Each topic can be described via an unknown distribution

■ Independency assumption
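The first two assumptions above can be shown in a few lines (a sketch; the stop-word list and example sentences are made up):

```python
from collections import Counter

# Bag-of-words sketch: word order is dropped and a document becomes a
# multiset of counts; the most common ("stop") words are removed because
# they do not characterize any topic.
STOP = {"the", "a", "is", "of"}

def bag_of_words(doc):
    """Lowercased word counts with stop words removed."""
    return Counter(w for w in doc.lower().split() if w not in STOP)

docs = ["The cat is the pet", "A pet of the cat"]
# Both sentences collapse to the same bag despite different word order.
print(bag_of_words(docs[0]) == bag_of_words(docs[1]))
```

This is exactly the document–word representation that pLSA and LDA take as input.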

probabilistic Latent Semantic Analysis

LDA
■ Almost the same as pLSA, but with a Dirichlet distribution as the prior
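To make that one difference concrete: in LDA the per-document topic mixture is drawn from a Dirichlet prior rather than fitted as a free parameter as in pLSA. A Dirichlet sample can be built from Gamma draws (a standard construction; the alpha values below are illustrative):

```python
import random
random.seed(0)

def sample_dirichlet(alpha):
    """Draw theta ~ Dirichlet(alpha) via normalized Gamma(a, 1) samples."""
    gammas = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

# A sparse prior (alpha < 1) favors documents dominated by few topics.
theta = sample_dirichlet([0.1, 0.1, 0.1])
print(theta)  # non-negative components summing to 1
```

The prior is what lets LDA assign topic mixtures to unseen documents in a principled way, addressing pLSA's overfitting on per-document parameters.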

Links

Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/

K. Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf

D. Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/

D. Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm

■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

Repository & Chats announcements

■ GitHub: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine

■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine
