Scalable high-dimensional indexing with Hadoop
TEXMEX team, INRIA Rennes, France
Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Denis Shestakov, Diana Moise, Gylfi Gudmundsson, Laurent Amsaleg
Outline
● Motivation
● Approach overview: scaling indexing & searching using Hadoop
● Experimental setup: datasets, resources, configuration
● Results
● Observations & implications
● Things to share
● Future directions
Motivation
● Big data is here
  ○ Lots of multimedia content
  ○ Even setting aside the 'big' companies, 1TB/day of multimedia is now common for many parties
● Solution: apply more computational power
  ○ Luckily, grid/cloud resources make such power easier to access
● Applications:
  ○ Large-scale image retrieval: e.g., detecting copyright violations in huge image repositories
  ○ Google Goggles-like systems: annotating the scene
Our approach
● Index & search a huge image collection using the MapReduce-based eCP algorithm
  ○ See our work at ICMR'13: Indexing and searching 100M images with MapReduce [7]
  ○ See Section II for a quick overview
● Use the Grid5000 platform
  ○ Distributed infrastructure available to French researchers & their partners
● Use the Hadoop framework
  ○ Most popular open-source implementation of the MapReduce model
  ○ Data stored in HDFS, which splits it into chunks (64MB or often bigger) and distributes them across nodes (see the configuration sketch below)
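As a rough illustration of how this chunking is controlled (a minimal sketch, not the configuration actually used in our runs; file names are hypothetical), Hadoop 1.x lets a client request a block size when writing files into HDFS:

```java
// Minimal sketch, assuming Hadoop 1.x: dfs.block.size is the standard
// property controlling the chunk size of files written to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Request 1024MB blocks for files created with this configuration.
    conf.setLong("dfs.block.size", 1024L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);
    // The copied file is split into 1024MB chunks, each replicated and
    // spread across the datanodes; one map task later reads one chunk.
    fs.copyFromLocalFile(new Path("descriptors.seq"),        // hypothetical local file
                         new Path("/data/descriptors.seq")); // hypothetical HDFS path
  }
}
```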
Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
  ■ Searching for a batch of images
    ● Thousands of images in one run
    ● Focus on throughput, not on response time for an individual image
  ■ Use case: copyright violation detection (see the mapper sketch below)
● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary
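To make the batch scenario concrete, here is a minimal sketch (hypothetical class and helper names, not our actual MapReduce-eCP code) of a map-side batch search in Hadoop 1.x: the whole query batch is shipped to every node via DistributedCache, loaded once per mapper, and matched against index data streamed from HDFS.

```java
// Hedged sketch of a batch-search mapper (structure only, hypothetical names).
// Assumes an input format that delivers (cluster id, descriptor) Text pairs.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchSearchMapper extends Mapper<Text, Text, Text, Text> {

  // In-memory lookup table holding the whole query batch; its size is why
  // bigger batches need more RAM per mapper (see "Results: searching 4TB").
  private final List<String> queryBatch = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // The driver is assumed to have shipped the query batch to every node
    // with DistributedCache.addCacheFile(...).
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      queryBatch.add(line);
    }
    in.close();
  }

  @Override
  protected void map(Text clusterId, Text descriptor, Context context)
      throws IOException, InterruptedException {
    // Placeholder match test: a real implementation compares SIFT descriptors
    // and emits votes per query image; here only the scan structure is shown.
    for (String query : queryBatch) {
      if (descriptor.toString().equals(query)) {
        context.write(new Text(query), clusterId);
      }
    }
  }
}
```

Votes emitted this way would be aggregated per query image in the reduce phase (or in a light single-machine pass), which is why throughput, rather than per-image latency, is the natural metric.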
Experimental setup
● Used the Grid5000 platform:
  ○ Nodes in the rennes site of Grid5000
    ■ Up to 110 nodes available
    ■ Node capacity/performance varied
      ● Heterogeneous, coming from three clusters
      ● From 8 to 24 cores per node
      ● From 24GB to 48GB RAM per node
● Hadoop ver. 1.0.1
  ○ (!) No changes in Hadoop internals
    ■ Pros: easy for others to migrate, try and compare
    ■ Cons: not top performance
Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
  ○ Collected from the Web and provided by one of the partners in the Quaero project
    ■ One of the largest datasets reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Worked with
    ■ The whole set (~4TB)
    ■ A 20-million-image subset (~1TB)
  ○ Used as the distracting dataset
Experimental setup
● For evaluation of indexing quality:
  ○ Added to the distracting datasets:
    ■ INRIA Copydays (127 images)
  ○ Queried for
    ■ Copydays batch (3055 images = 127 original images and their associated variants, incl. strong distortions, e.g., print-crumple-scan)
    ■ 12k batch (12081 images = 245 random images from the dataset and their variants)
  ○ Checked whether the original images were returned as the top-voted search results
Results: workflow overview
● An experiment on indexing & searching 1TB took 5-6 hours
Results: indexing 1TB
Results: indexing 4TB
● 4TB
● 100 nodes
● Used tuned parameters
  ○ Except a change in #mappers/#reducers per node
    ■ To fit the bigger index tree (for 4TB) into RAM
    ■ 4 mappers/2 reducers (see the settings sketch below)
● Time: 507 min
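For reference, the 4 mappers/2 reducers setting corresponds to the standard Hadoop 1.x per-node slot properties. A minimal sketch (note these are TaskTracker settings, so in practice they belong in mapred-site.xml before the cluster starts):

```java
// Hedged sketch: per-node task-slot settings matching the 4 mappers/2 reducers
// configuration above. mapred.tasktracker.*.tasks.maximum are standard
// Hadoop 1.x properties read by each TaskTracker at startup.
import org.apache.hadoop.conf.Configuration;

public class SlotSettingsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Fewer concurrent mappers per node leaves enough RAM for each mapper's
    // copy of the (bigger, 4TB-scale) index tree.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    conf.writeXml(System.out); // emit XML to paste into mapred-site.xml
  }
}
```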
Results: search quality
Results: search scalability
Results: search execution
Search of the 12k batch over 1TB using 100 nodes
Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
  ○ Throughput: 460ms per image
● 12k query batch
  ○ Throughput: 210ms per image
● Bigger batches improve throughput only insignificantly
  ○ bigger batch -> bigger lookup table -> more RAM per mapper required -> fewer mappers per node
Observations & implications
● HDFS block size limits scalability
  ○ 1TB dataset => 1186 blocks of 1024MB size
  ○ One map task processes one block, so assuming 8-core nodes and the reported searching method: no scaling beyond 149 nodes (8x149=1192 map slots, already more than 1186 blocks)
  ○ Solutions:
    ■ Smaller HDFS blocks: e.g., 512MB blocks scale up to ~280 nodes (see the sketch below)
    ■ Re-visit the search process: e.g., partial loading of the lookup table
● Big data is here, but not the resources to process it
  ○ E.g., indexing & searching >10TB was not possible given the resources we had
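The ceiling above is simple arithmetic: one map task per HDFS block, eight concurrent map tasks per 8-core node. A minimal sketch of the calculation (the slide's ~280-node figure for 512MB blocks differs slightly from this idealized count, presumably because of the exact block layout of the real files):

```java
// Back-of-the-envelope sketch of the scalability ceiling: adding nodes beyond
// ceil(#blocks / #map slots per node) cannot speed up the scan any further.
public class ScalingCeiling {
  public static void main(String[] args) {
    long datasetBytes = 1186L * 1024 * 1024 * 1024; // ~1TB: 1186 blocks of 1024MB
    int mapSlotsPerNode = 8;                        // one slot per core on 8-core nodes

    for (long blockMB : new long[] {1024, 512}) {
      long blocks = datasetBytes / (blockMB * 1024 * 1024);
      long maxUsefulNodes = (blocks + mapSlotsPerNode - 1) / mapSlotsPerNode;
      System.out.println(blockMB + "MB blocks -> " + blocks
          + " map tasks -> no scaling beyond ~" + maxUsefulNodes + " nodes");
    }
  }
}
```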
Things to share
● Our methods/system can be applied to audio datasets
  ○ No major changes expected
  ○ Contact me if interested
● Code for the MapReduce-eCP algorithm available on request
  ○ Should run smoothly on your Hadoop cluster
  ○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only those reported at CBMI) available on request
  ○ They describe the indexing/searching of our dataset, with details on map/reduce task execution
  ○ Insights on better analysis/visualization are welcome
  ○ Job logs for the CBMI'13 experiments: http://goo.gl/e06wE
Future directions
● Deal with big batches of query images
  ○ ~200k query images
● Share auxiliary data (index tree, lookup table) between mappers
  ○ Multithreaded map tasks (see the sketch below)
● (environment-specific) Test scalability on more nodes
  ○ Use several sites of the Grid5000 infrastructure
    ■ rennes+nancy sites (up to 300 nodes) -- in progress
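One plausible route to such multithreaded map tasks (a sketch, not our implementation) is Hadoop's stock MultithreadedMapper: several threads run inside one task JVM, so auxiliary data held in static fields, such as the index tree or lookup table, is loaded once per task rather than once per map slot. BatchSearchMapper is the hypothetical mapper from the earlier sketch; input/output setup is omitted.

```java
// Hedged sketch: running the search mapper under MultithreadedMapper so map
// threads within one JVM can share auxiliary data held in static fields.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedSearchDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multithreaded-batch-search");
    job.setJarByClass(MultithreadedSearchDriver.class);
    // MultithreadedMapper wraps the real mapper and runs 8 copies of it in
    // threads of a single JVM; data in static fields is shared across them.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, BatchSearchMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);
    // Input/output formats, paths and reducer wiring omitted in this sketch.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```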
Acknowledgements
● TEXMEX team, INRIA Rennes, http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes maintenance team, https://www.grid5000.fr
Thank you!
Questions?