search4similars

Posted on 22-Mar-2017


Category: Engineering


SEARCH4SIMILARS

at scale

What do you mean by similar?

■ Jaccard distance

■ Cosine distance

■ Lots of others
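As an illustration of the two distances above, here is a minimal plain-Python sketch (not from the deck): Jaccard over token sets, cosine over bag-of-words counts. The example documents are made up.

```python
from collections import Counter
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity of two token sequences: |A ∩ B| / |A ∪ B| over sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine similarity of two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

d1 = "cool project".split()
d2 = "cool room".split()
print(jaccard(d1, d2))  # 1 shared token of 3 total -> 1/3
print(cosine(d1, d2))   # dot = 1, norms sqrt(2)·sqrt(2) -> 0.5
```

Note the two measures disagree in general: Jaccard ignores term counts, cosine weights them.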

Deduplication / Plagiarism LSH

[Slide figure: an n × n pairwise comparison matrix over objects A–G]

The naive approach compares each object with every other one: O(n²).

Your goal: compare only items that are likely to be similar.
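A minimal MinHash-plus-banding sketch of that idea (illustrative Python, not the deck's code; the documents, salted hash scheme, and band counts are assumptions for the demo). Documents whose signatures agree on any whole band become candidate pairs, so only those need an exact comparison instead of all O(n²) pairs.

```python
import hashlib

def shingles(text, k=3):
    """All k-character shingles (substrings) of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hash_funcs(n):
    """n deterministic salted hash functions (illustrative, not tuned)."""
    def h(i):
        return lambda s: int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
    return [h(i) for i in range(n)]

def minhash_signature(shingle_set, hash_funcs):
    """MinHash signature: the minimum hash of the set under each function."""
    return tuple(min(h(s) for s in shingle_set) for h in hash_funcs)

def lsh_buckets(docs, num_hashes=20, bands=10):
    """Split each signature into bands; documents sharing any whole band
    land in the same bucket and become a candidate pair."""
    hfs = make_hash_funcs(num_hashes)
    rows = num_hashes // bands
    buckets = {}
    for name, text in docs.items():
        sig = minhash_signature(shingles(text), hfs)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(name)
    return [group for group in buckets.values() if len(group) > 1]

docs = {
    "A": "locality sensitive hashing at scale",
    "B": "locality sensitive hashing at scale!",   # near-duplicate of A
    "C": "bayesian topic modeling with lda",
}
print(lsh_buckets(docs))  # A and B should share a bucket; C almost surely stays apart
```

The band/row split tunes the similarity threshold: more rows per band means only very similar pairs collide, matching the slide's point that LSH fits near-duplicate search.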

LSH Applications

■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting

LSH is a dimensionality reduction technique
■ Batch algorithm
■ The word “the” should not carry the same weight as the word “bozo” when comparing two documents

– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time

– some online-related work for restricted cases (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)

■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”

■ Well suited to finding very similar objects; not optimal when the objects are only loosely similar.

Search 4 sense

■ Bayes theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA

Bayes' theorem

P(A | B) = P(B | A) · P(A) / P(B)

where A and B are events.
■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
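A worked numeric example of the theorem (the test's sensitivity, false-positive rate, and prevalence are hypothetical numbers chosen for illustration, not from the deck):

```python
# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
# A = "has the condition", B = "test is positive" (hypothetical numbers).
p_a = 0.01                # P(A): prevalence (prior)
p_b_given_a = 0.99        # P(B | A): sensitivity
p_b_given_not_a = 0.05    # P(B | not A): false-positive rate

# Total probability: P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ≈ 0.167: a positive test is still mostly a false alarm
```

The point of the example: because the prior P(A) is small, even an accurate test yields a modest posterior, which is exactly what the theorem's P(A) factor encodes.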

Bayesian vs Frequentist statistics

■ Coin tossing
– the coin landed heads on 4 out of 5 tosses

■ Conjugate prior
■ Exponential family
■ Sufficient statistic
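The coin example above can be worked through with the standard Beta–Binomial conjugate pair (a sketch; the uniform Beta(1, 1) prior is an assumption for the demo, not stated in the deck):

```python
# Conjugate prior sketch: Beta prior + Binomial likelihood -> Beta posterior.
# Coin from the slide: heads on 4 out of 5 tosses.
heads, tosses = 4, 5

# Frequentist point estimate (maximum likelihood):
mle = heads / tosses                      # 0.8

# Bayesian: start from a uniform Beta(1, 1) prior. Conjugacy means the
# posterior is again a Beta, with counts simply added to the parameters:
alpha = 1 + heads                         # 5
beta = 1 + (tosses - heads)               # 2
posterior_mean = alpha / (alpha + beta)   # 5/7, shrunk toward 0.5

print(mle, posterior_mean)
```

With only 5 tosses the posterior mean (≈0.714) sits between the MLE (0.8) and the prior mean (0.5); as data accumulates the two estimates converge.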

Probabilistic Graphical Models

Topic modeling

Topic modeling assumptions

■ Word order within a document does not matter (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as document–word pairs
■ Each topic can be described via an unknown distribution

■ Independency assumption
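The first two assumptions above can be shown in a few lines (a sketch; the stop-word list and example sentences are made up):

```python
from collections import Counter

# Bag-of-words sketch: word order is dropped and a document becomes a
# multiset of counts; the most common ("stop") words are removed because
# they do not characterize any topic.
STOP = {"the", "a", "is", "of"}

def bag_of_words(doc):
    """Lowercased word counts with stop words removed."""
    return Counter(w for w in doc.lower().split() if w not in STOP)

docs = ["The cat is the pet", "A pet of the cat"]
# Both sentences collapse to the same bag despite different word order.
print(bag_of_words(docs[0]) == bag_of_words(docs[1]))
```

This is exactly the document–word representation that pLSA and LDA take as input.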

probabilistic Latent Semantic Analysis

LDA
■ Almost the same as pLSA, but with a Dirichlet distribution as the prior
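To make that one difference concrete: in LDA the per-document topic mixture is drawn from a Dirichlet prior rather than fitted as a free parameter as in pLSA. A Dirichlet sample can be built from Gamma draws (a standard construction; the alpha values below are illustrative):

```python
import random
random.seed(0)

def sample_dirichlet(alpha):
    """Draw theta ~ Dirichlet(alpha) via normalized Gamma(a, 1) samples."""
    gammas = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

# A sparse prior (alpha < 1) favors documents dominated by few topics.
theta = sample_dirichlet([0.1, 0.1, 0.1])
print(theta)  # non-negative components summing to 1
```

The prior is what lets LDA assign topic mixtures to unseen documents in a principled way, addressing pLSA's overfitting on per-document parameters.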

Links

Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/

K. Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf

D. Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/

D. Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm

■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

Repository & Chats announcements

■ GitHub: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine

■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine
