Multi Feature Indexing Network MUFIN
Similarity Search Platform for many Applications
Pavel Zezula
Faculty of Informatics
Masaryk University, Brno
23.1.2012 1 MUFIN: Multi Feature Indexing Network
Outline of the talk
• Why similarity
• Principles of metric similarity searching
• The MUFIN approach
• Demo applications
• Future directions
Real-Life Motivation: The social psychology view
• Any event in the history of the organism is, in a sense, unique.
• Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.
• Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.
Contemporary Networked Media: The digital data view
• Almost everything that we see, read, hear, write, measure, or observe can be digital.
• Users autonomously contribute to the production of global media, and the growth is exponential.
• Sites like Flickr, YouTube, and Facebook host user-contributed content for a variety of events.
• The elements of networked media are related by numerous multi-faceted links of similarity.
Examples with Similarity
• Does the computer disk of a suspected criminal contain illegal multimedia material?
• What are the stocks with similar price histories?
• Which companies advertise their logos in the live TV broadcast of a football match?
• Is the situation on the web getting close to any of the network attacks that resulted in significant damage in the past?
Challenge
• Networked media is getting close to the human “fact-bases” – the gap between physical and digital has blurred.
• Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.
Why? Because it is the similarity in the world that is revealing.
Limitations: Data Types
We have
• Attributes
– Numbers, strings, etc.
• Text (text-based)
– Documents, annotations
We need
• Multimedia
– Image, video, audio
• Security
– Biometrics
• Medicine
– EKG, EEG, EMG, EMR, CT, etc.
• Scientific data
– Biology, chemistry, physics, life sciences, economics
• Others
– Motion, emotion, events, etc.
Limitations: Models of Similarity
We have
• Simple geometric models, typically vector spaces
We need
• More complex models
• Non-metric models
• Asymmetric similarity
• Subjective similarity
• Context-aware similarity
• Complex similarity
• Etc.
Limitations: Queries
We have
• Simple query
– Nearest neighbor
– Range
We need
• More query types
– Reverse NN, distinct NN, similarity join
• Other similarity-based operations
– Filtering, classification, event detection, clustering, etc.
• Similarity algebra
– May become the basis of a “Similarity Data Management System”
Limitations: Implementation Strategies
We have
• Centralized or parallel processing
We need
• Scalable and distributed architectures
• MapReduce-like approaches
• P2P architectures
• Cloud computing
• Self-organized architectures
• Etc.
Search Strategy Evolution
Scalability
● data volume - exponential
● number of users (queries)
● variety of data types
● multi-lingual, multi-feature, multi-modal queries
Determinism
exact match ► similarity
precise ► approximate
same answer ► good answer; recommendation
fixed query ► personalized; context aware
fixed infrastructure ► dynamic mapping; mobile devices
[Figure: grade (low to high) of centralized, parallel, distributed, peer-to-peer, and self-organized strategies, ranging from well established to cutting-edge research]
Similarity Data Management System
[Figure labels: similarity, effectiveness, efficiency, stimuli, algebra]
Metric Search Grows in Popularity
• Hanan Samet: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
• P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006.
The MUFIN Approach
MUFIN: MUlti-Feature Indexing Network
SEARCH
infrastructure
Scalability P2P structure
Extensibility metric space
Independence Infrastructure as a service
Extensibility: Metric Abstraction of Similarity
• Metric space: M = (D,d)
– D – domain
– d(x,y) – distance function
• For all x, y, z ∈ D:
– d(x,y) ≥ 0 (non-negativity)
– d(x,y) = 0 ⇔ x = y (identity)
– d(x,y) = d(y,x) (symmetry)
– d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
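The four postulates can be spot-checked on a finite sample of objects. The sketch below is only an illustration: the helper name `check_metric_postulates` and the sample data are assumptions, not part of MUFIN, and passing the check on a sample does not prove that a function is a metric.

```python
from itertools import product

def check_metric_postulates(objects, d, tol=1e-9):
    """Return the set of metric postulates violated by d on a finite sample."""
    violated = set()
    for x, y in product(objects, repeat=2):
        if d(x, y) < -tol:
            violated.add("non-negativity")
        if (abs(d(x, y)) <= tol) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(objects, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return violated

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    # Not a metric: it breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(check_metric_postulates(sample, euclidean))
print(check_metric_postulates(sample, squared_euclidean))
```

The squared Euclidean distance is included as a counter-example: it is non-negative and symmetric, yet it cannot be used directly where the triangle inequality is required.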
Examples of Distance Functions
• Lp Minkowski distance (for vectors)
– L1 – city-block distance: $L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
– L2 – Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
– L∞ – infinity (maximum) distance: $L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$
• Edit distance (for strings)
– minimal number of insertions, deletions and substitutions
– d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A, B): $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
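The distance functions listed above can be sketched in a few lines of Python. The function names are chosen here for illustration; the edit distance uses the usual dynamic-programming formulation with unit costs, which is an assumption consistent with the 'application'/'applet' example.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length vectors (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def l_inf(x, y):
    """L_infinity (maximum) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(minkowski((0, 0), (3, 4), 1), minkowski((0, 0), (3, 4), 2), l_inf((0, 0), (3, 4)))
print(edit_distance("application", "applet"))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```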
Examples of Distance Functions
• Mahalanobis distance
– for vectors with correlated dimensions
• Hausdorff distance
– for sets with elements related by another distance
• Earth mover’s distance
– primarily for histograms (sets of weighted features)
• and many others
Similarity Search Problem
• For X ⊆ D in metric space M, pre-process X so that the similarity queries are executed efficiently.
No total ordering exists!
Similarity Queries
• Range query
• Nearest neighbor query
• Similarity join
• Combined queries
• Complex queries
Similarity Range Query
• range query
– R(q,r) = { x ∈ X | d(q,x) ≤ r }
… all museums up to 2 km from my hotel …
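As a baseline, the range query can be evaluated by a sequential scan that applies the definition directly. This is a minimal sketch with made-up 2-D coordinates; an index structure aims to return the same answer with far fewer distance computations.

```python
from math import dist  # Euclidean distance, Python 3.8+

def range_query(X, d, q, r):
    """R(q, r): all objects of X within distance r of q (sequential scan)."""
    return [x for x in X if d(q, x) <= r]

# Hypothetical museum positions in km, with the hotel at the origin.
museums = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0), (0.0, 2.0)]
print(range_query(museums, dist, (0.0, 0.0), 2.0))
```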
Nearest Neighbor Query
• the nearest neighbor query
– NN(q) = x
– x ∈ X, ∀y ∈ X: d(q,x) ≤ d(q,y)
• k-nearest neighbor query
– k-NN(q,k) = A
– A ⊆ X, |A| = k
– ∀x ∈ A, ∀y ∈ X − A: d(q,x) ≤ d(q,y)
… five closest museums to my hotel …
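A sequential-scan sketch of the k-nearest neighbor query follows. It relies on the standard-library `heapq.nsmallest`; ties at the k-th distance are broken arbitrarily, which the definition above permits. The data are illustrative only.

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+

def knn_query(X, d, q, k):
    """k-NN(q, k): the k objects of X closest to q, nearest first."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

points = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(knn_query(points, dist, (0.0, 0.0), 2))
```

With k = 1 the same function answers the plain nearest neighbor query NN(q).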
Similarity Join Queries
• similarity join of two data sets $X \subseteq D$, $Y \subseteq D$, with threshold $\mu \ge 0$:
$J(X, Y, \mu) = \{ (x, y) \in X \times Y : d(x, y) \le \mu \}$
• similarity self join: X = Y
… pairs of hotels and museums which are a five-minute walk apart …
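The join definition translates into a nested-loop baseline. This is a minimal sketch with hypothetical 1-D positions; the threshold parameter is named `mu` here, and a self join is obtained by passing the same collection twice.

```python
def similarity_join(X, Y, d, mu):
    """All pairs (x, y) from X × Y with d(x, y) <= mu (nested-loop baseline)."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

def abs_dist(a, b):
    return abs(a - b)

# Hypothetical positions along one street, in walking minutes.
hotels = [0.0, 10.0]
museums = [1.0, 4.0, 9.5]
print(similarity_join(hotels, museums, abs_dist, 1.0))
```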
Combined Queries
• Range + Nearest neighbors
$kNN(q, r) = \{ R \subseteq X,\ |R| \le k \wedge \forall x \in R,\ y \in X - R : d(q, x) \le d(q, y) \wedge d(q, x) \le r \}$
• Nearest neighbor + similarity joins
– by analogy
Complex Queries
• Find the best matches of circular shape objects with red color
• The best match for circular shape or for red color need not be the best combined match
• A0 algorithm
• Threshold algorithm
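The idea of the threshold algorithm can be sketched as follows. The assumptions here are not from the slides: every object has a score in [0, 1] for every feature (higher is better), the per-feature score tables allow random access, and the combined score is the minimum of the feature scores. The function name and data are hypothetical.

```python
import heapq

def threshold_top_k(scores, k, agg=min):
    """Top-k objects by aggregated score, in the style of the threshold algorithm.

    scores: {feature: {object: score}}; every object is scored for every feature.
    """
    # One list per feature, sorted by descending score (sorted access).
    lists = {f: sorted(s.items(), key=lambda kv: kv[1], reverse=True)
             for f, s in scores.items()}
    depth = min(len(lst) for lst in lists.values())
    seen = {}
    for pos in range(depth):
        last = []
        for f, lst in lists.items():
            obj, sc = lst[pos]
            last.append(sc)
            if obj not in seen:
                # Random access: fetch the object's score for every feature.
                seen[obj] = agg(scores[g][obj] for g in scores)
        threshold = agg(last)  # no unseen object can score above this
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

shape = {"a": 0.9, "b": 0.8, "c": 0.2}   # how circular each object is
color = {"c": 1.0, "b": 0.7, "a": 0.1}   # how red each object is
print(threshold_top_k({"shape": shape, "color": color}, 1))
```

In this toy data set, object a is the best match for shape and c is the best match for color, yet b is the best combined match, which illustrates the second bullet above.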
Partitioning Principles
• Given a set X ⊆ D in M = (D,d), basic partitioning principles have been defined:
– Ball partitioning
– Generalized hyper-plane partitioning
– Excluded middle partitioning
– Clustering
Ball Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm }
• Outer set: { x ∈ X | d(p,x) > dm }
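Ball partitioning follows directly from the two set definitions. In this sketch the pivot p and the radius dm are passed in by the caller; choosing dm as the median distance to p, which yields balanced sets, is a common choice but an assumption here.

```python
def ball_partition(X, d, p, dm):
    """Split X into (inner, outer) by the ball of radius dm centered at pivot p."""
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer

def abs_dist(a, b):
    return abs(a - b)

data = [1, 2, 3, 4, 5, 6]
print(ball_partition(data, abs_dist, p=1, dm=2))
```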
Generalized Hyper-plane
• { x ∈ X | d(p1,x) ≤ d(p2,x) }
• { x ∈ X | d(p1,x) > d(p2,x) }
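Generalized hyper-plane partitioning assigns each object to the closer of two pivots, with ties going to p1 as in the definition above. A minimal sketch on illustrative 1-D data:

```python
def hyperplane_partition(X, d, p1, p2):
    """Split X into objects closer to p1 (ties included) and objects closer to p2."""
    first = [x for x in X if d(p1, x) <= d(p2, x)]
    second = [x for x in X if d(p1, x) > d(p2, x)]
    return first, second

def abs_dist(a, b):
    return abs(a - b)

data = [0, 1, 2, 3, 4, 5, 6]
print(hyperplane_partition(data, abs_dist, p1=1, p2=5))
```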
Excluded Middle Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm − ρ }
• Outer set: { x ∈ X | d(p,x) > dm + ρ }
• Excluded set: otherwise (a ring of width 2ρ around the ball boundary)
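Excluded middle partitioning can be sketched in the same style. The half-width of the excluded ring is called `rho` here, which is an assumed name for a symbol that the transcript appears to have lost.

```python
def excluded_middle_partition(X, d, p, dm, rho):
    """Split X into (inner, outer, excluded) around the ball of radius dm at pivot p."""
    inner, outer, excluded = [], [], []
    for x in X:
        dist_px = d(p, x)
        if dist_px <= dm - rho:
            inner.append(x)
        elif dist_px > dm + rho:
            outer.append(x)
        else:
            excluded.append(x)  # falls into the ring of width 2 * rho
    return inner, outer, excluded

def abs_dist(a, b):
    return abs(a - b)

data = list(range(11))
print(excluded_middle_partition(data, abs_dist, p=0, dm=5, rho=1))
```

By the triangle inequality, any inner object and any outer object are more than 2ρ apart, so a range query with radius r ≤ ρ never has to access both the inner and the outer set.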
Clustering
• Cluster data into sets
– bounded by a ball region
– { x ∈ X | d(p_i,x) ≤ r_i^c }
Scalability: Peer-to-Peer Indexing
• Local search: M-tree, D-Index, M-Index
• Native metric techniques: GHT*, VPT*
• Transformation techniques: M-CAN, M-Chord
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
1) Paged organization
2) Dynamic
3) Suitable for arbitrary metric spaces
4) I/O and CPU optimization - computing d can be time-consuming
The M-tree Idea
• Depending on the metric, the “shape” of index regions changes
• Metrics: L2 (Euclidean), L1 (city-block), L∞ (max-metric), weighted Euclidean, quadratic form
[Figure: index regions A to F under different metrics]
M-tree: Example
Root entries (object, covering radius, distance to parent):
– o1 4.5 -.-
– o2 6.9 -.-
Inner-node entries (object, covering radius, distance to parent):
– under o1: o1 1.4 0.0; o10 1.2 3.3; o7 1.3 3.8
– under o2: o2 2.9 0.0; o4 1.6 5.3
Leaf entries (object, distance to parent):
– o1 0.0; o6 1.4
– o10 0.0; o3 1.2
– o7 0.0; o5 1.3; o11 1.0
– o2 0.0; o8 2.9
– o4 0.0; o9 1.6
M-tree family
• Bulk loading
• Slim-tree
• Multi-way insertion
• PM-tree
• M2-tree
• etc.
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]
• 4 separable buckets at the first level
• 2 separable buckets at the second level
• exclusion bucket of the whole structure
D-index: Insertion
D-index: Range Search
Implementation Postulates of Distributed Indexes
• dynamism – nodes can be added and removed
• no hot-spots – no centralized nodes, no flooding by messages (transactions)
• update independence – network update at one site does not require an immediate change propagation to all the other sites
Distributed Similarity Search Structures
• Native metric structures:
– GHT* (Generalized Hyperplane Tree)
– VPT* (Vantage Point Tree)
• Transformation approaches:
– M-CAN (Metric Content Addressable Network)
– M-Chord (Metric Chord)
GHT* Address Search Tree
• Based on the Generalized Hyperplane Tree [Uhl91]
– two pivots for binary partitioning
[Figure: address search tree with root pivots (p1, p2) and child nodes (p5, p6) and (p3, p4)]
GHT* Address Search Tree
• Inner node
– two pivots (reference objects)
• Leaf node
– BID pointer to a bucket if data stored on the current peer
– NNID pointer to a peer if data stored on a different peer
[Figure: AST with inner nodes (p1, p2), (p5, p6), (p3, p4) and leaves BID1, BID2, BID3, and NNID2 (Peer 2)]
GHT* Address Search Tree
[Figure: Peers 1, 2, and 3, each holding its own AST]
GHT* Range Query
• Range query R(q,r)
– traverse peer’s own AST
– search buckets for all BIDs found
– forward query to all NNIDs found
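The three steps above can be sketched as a recursive traversal of a peer's local AST. The class names, the 1-D pivot values, and the example tree are hypothetical. The pruning conditions are the usual ones for generalized hyper-plane partitioning, derived from the triangle inequality; they are assumed here rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import Any, Union

@dataclass
class Leaf:
    kind: str   # "BID" (bucket on this peer) or "NNID" (pointer to another peer)
    ident: int

@dataclass
class Inner:
    p1: Any
    p2: Any
    left: Union["Inner", Leaf]    # objects with d(p1, x) <= d(p2, x)
    right: Union["Inner", Leaf]   # objects with d(p1, x) >  d(p2, x)

def ast_range_search(node, d, q, r):
    """Return (BIDs to search locally, NNIDs to forward R(q, r) to)."""
    if isinstance(node, Leaf):
        return ({node.ident}, set()) if node.kind == "BID" else (set(), {node.ident})
    bids, nnids = set(), set()
    d1, d2 = d(node.p1, q), d(node.p2, q)
    children = []
    if d1 - r <= d2 + r:   # the query ball may reach the p1 side
        children.append(node.left)
    if d1 + r >= d2 - r:   # the query ball may reach the p2 side
        children.append(node.right)
    for child in children:
        b, n = ast_range_search(child, d, q, r)
        bids |= b
        nnids |= n
    return bids, nnids

def abs_dist(a, b):
    return abs(a - b)

ast = Inner(2, 8,
            Inner(1, 4, Leaf("BID", 1), Leaf("BID", 2)),
            Inner(6, 9, Leaf("BID", 3), Leaf("NNID", 2)))
print(ast_range_search(ast, abs_dist, 7.5, 1))
```

A peer would then search the returned buckets itself and forward the query to each returned NNID, where the same traversal is repeated on that peer's AST.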
[Figure: range query R(q,r) among the regions of pivots p1 to p6]
AST: Logarithmic replication
• Full AST on every peer is space consuming
– replication of pivots grows in a linear way
• Store only a part of the AST:
– all paths to local buckets
• Deleted sub-trees:
– replaced by NNID of the leftmost peer
[Figure: full AST with pivots p1 to p14 and leaves NNID2, NNID3, BID1, NNID4 to NNID8, next to the reduced AST that keeps only the path (p1, p2), (p3, p4), (p7, p8) to the local bucket BID1]
AST: Logarithmic Replication (cont.)
• Resulting tree
– replication of pivots grows in a logarithmic way
[Figure: resulting AST with inner nodes (p1, p2), (p3, p4), (p7, p8) and leaves NNID2, NNID3, BID1, NNID5]
VPT* Structure
• Similar to the GHT*, but ball partitioning is used for the AST
• Based on the Vantage Point Tree [Yia93]
– inner nodes have one pivot and a radius
– different traversing conditions
[Figure: AST with root p1 (r1) and child nodes p2 (r2) and p3 (r3)]
M-Chord: The Metric Chord
• Transform metric space to one-dimensional domain
– Use M-Index - a generalized version of the iDistance
• Divide the domain into intervals
– assign each interval to a peer
• Use the Chord P2P protocol for navigation
• Alternatively, the Skip Graphs distributed protocol can be used
M-Chord: Indexing the Distance
• iDistance – indexing technique for vector domains
– cluster analysis = centers = reference points $\{p_0, \ldots, p_n\}$
– assign iDistance keys to objects $x \in C_i$: $iDist(x) = d(p_i, x) + i \cdot c$
– range query R(q,r): identify intervals of interest
• Generalization to metric spaces
– select pivots
– then partition: Voronoi-style
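The key assignment and the "intervals of interest" can be sketched as follows. The sketch assumes Voronoi-style clusters (each object belongs to its nearest pivot) and a constant c larger than any distance in the data, so that the key ranges of different clusters do not overlap. The function names and the 1-D data are hypothetical.

```python
def idistance_key(x, pivots, d, c):
    """iDist(x) = d(p_i, x) + i * c, where p_i is the pivot nearest to x."""
    i = min(range(len(pivots)), key=lambda j: d(pivots[j], x))
    return d(pivots[i], x) + i * c

def query_intervals(q, r, pivots, d, c):
    """Key intervals that can contain objects of R(q, r), one per cluster."""
    intervals = []
    for i, p in enumerate(pivots):
        # Triangle inequality: |d(p, x) - d(p, q)| <= d(q, x) <= r.
        lo = max(i * c, i * c + d(p, q) - r)
        hi = min((i + 1) * c, i * c + d(p, q) + r)
        if lo <= hi:
            intervals.append((lo, hi))
    return intervals

def abs_dist(a, b):
    return abs(a - b)

pivots = [0, 10]
print(idistance_key(3, pivots, abs_dist, c=100), idistance_key(8, pivots, abs_dist, c=100))
print(query_intervals(4, 2, pivots, abs_dist, c=100))
```

The intervals are conservative: every qualifying object has its key inside one of them, but objects found there still have to be verified with the original distance.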
M-Chord: Chord Protocol
• Peer-to-Peer navigation protocol
• Peers are responsible for intervals of keys
• $O(\log n)$ hops to localize a node storing a key
M-Chord
• set the iDistance domain
• make it uniform: function h
• use Chord on this domain: $mchord(x) = h(d(p_i, x) + i \cdot c)$
M-Chord: Range Query
• Node Nq initiates the search
• Determine intervals
– generalized iDistance
• Forward requests to peers on intervals
• Search in the nodes
– using local organization
• Merge the received partial answers
M-CAN: The Metric CAN
• Based on the Content-Addressable Network (CAN)
– a DHT navigating in an N-dimensional vector space
• The idea:
1. Map the metric space to a vector space
– given N pivots p1, p2, …, pN, transform every object o into a vector F(o)
2. Use CAN to
– distribute the vector space zones among the nodes
– navigate in the network
CAN: Principles & Navigation
• CAN – the principles
– the space is divided into zones
– each node “owns” a zone
– nodes know their neighbors
• CAN – the navigation
– greedy routing
– in every step, move to the neighbor closer to the target location
[Figure: 2-dimensional vector space divided into zones 1 to 6, with a target point (x,y)]
M-CAN: Contractiveness & Filtering
• Use the L∞ as a distance measure
– the mapping F is contractive: $L_\infty(F(x), F(y)) \le d(x, y)$
• More pivots, better filtering
– but CAN routing is better for fewer dimensions
• Additional filtering
– some pivots are only used for filtering data (inside the explored nodes)
– they are not used for mapping into the CAN vector space
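Contractiveness is what makes pivot-based filtering safe. The sketch below assumes that F(o) is the vector of distances from o to the pivots, which is the usual pivot mapping but is not spelled out on the slide. The function names and data are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

def pivot_map(o, pivots, d):
    """F(o) = (d(p1, o), ..., d(pN, o)); assumed form of the mapping."""
    return tuple(d(p, o) for p in pivots)

def l_inf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def filtered_range_query(X, pivots, d, q, r):
    """R(q, r) with pivot filtering: discard x when L_inf(F(x), F(q)) > r."""
    fq = pivot_map(q, pivots, d)
    # In practice F(x) would be precomputed at insertion time.
    candidates = [x for x in X if l_inf(pivot_map(x, pivots, d), fq) <= r]
    # Survivors are verified with the original distance.
    return [x for x in candidates if d(q, x) <= r]

X = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (0.0, 5.0)]
pivots = [(0.0, 0.0), (5.0, 0.0)]
print(filtered_range_query(X, pivots, dist, (1.0, 0.0), 1.5))
```

Because the mapped distance never exceeds the original one, the filter cannot discard a true answer; it can only let through false candidates, which the final verification removes.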
Infrastructure Independence: MESSIF (Metric Similarity Search Implementation Framework)
• Metric space (D,d)
– Vectors: Lp and quadratic form
– Strings: (weighted) edit and protein sequence
• Operations
– Insert, delete, range query, k-NN query, incremental k-NN
• Storage
– Volatile memory
– Persistent memory
• Centralized index structures
• Distributed index structures
• Communication
– Net
• Performance statistics
Applications: a Word Cloud
Concepts of the Image search
Image base
Images and their Descriptors
[Figure: image level (R, G, B) and descriptor level]
CoPhIR: Content-based Photo Image Retrieval
• Largest publicly available collection of high-quality image metadata: 106 million images
• Each image has:
– Five MPEG-7 visual descriptors (VDs): Scalable Color, Color Structure, Color Layout, Edge Histogram, Homogeneous Texture
– Other textual information: title, tags, comments, etc.
• Photos have been crawled from the Flickr photo-sharing site.
http://cophir.isti.cnr.it/
[Figure: 100M images + metadata + MPEG-7 VDs]
MUFIN SEARCH ENGINE
infrastructure
Scalability M-Chord + M-Index
Extensibility COPHIR
edge histogram
color structure
scalable color
homogeneous texture
color layout
6 x IBM server x3400 – 2 servers used
Image Search Demo http://mufin.fi.muni.cz/imgsearch/
MUFIN demos
• http://mufin.fi.muni.cz/imgsearch/similar
• http://www.pixmac.com/
• http://mufin.fi.muni.cz/twenga/random
• http://mufin.fi.muni.cz/fingerprints/random
• http://mufin.fi.muni.cz/subseq/random
• http://mufin.fi.muni.cz/plugins/annotation
MUFIN Future Research Directions
• MUFIN - a universal similarity search technology
• Research directions in: – Core technology
– Applications
– A style of computing
MUFIN Search Engine
infrastructure
Scalability P2P structures
Extensibility metric space
Performance Tuning
MUFIN Future Research Directions
October 28, 2011
MUFIN Search Engine
infrastructure New style of computing
Cloud Computing Similarity Search as Service
Major Applications
• Images:
– Sub-image retrieval
– Ranking
– Annotation
– Categorization
– Benchmarking
• Biometrics:
– Face recognition
– Fingerprint recognition
– Gait recognition
• Signals:
– Audio recognition
– Time series similarity
• Videos:
– Event detection
A New Style of Computing
• From the project-oriented approach towards a similarity cloud for multimedia findability through similarity searching
Advantages:
– The cloud makes similarity search accessible to common users
– Computational resources are shared: users don’t need to maintain any hardware infrastructure
– Users don’t need to care about the OS, security, software platform, etc.