HIPI: Computer Vision at Large Scale
Chris Sweeny, Liu Liu
Intro to MapReduce
SIMD at Scale
Mapper / Reducer
MapReduce, Main Takeaway
Data Centric, Data Centric, Data Centric!
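To make the mapper / reducer split concrete, here is a minimal single-process sketch in plain Java. It does not use Hadoop, and every name in it (`MapReduceSketch`, `map`, `reduce`, `run`) is hypothetical, chosen only for illustration. The example counts how many images share each resolution: the map step emits one (resolution, 1) pair per image, the pairs are grouped by key, and the reduce step sums each group.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of the mapper/reducer split. Not Hadoop code;
// all names are hypothetical.
class MapReduceSketch {

    // Map step: one input record (an image size) becomes one (key, value) pair.
    static Map.Entry<String, Integer> map(int width, int height) {
        return new AbstractMap.SimpleEntry<>(width + "x" + height, 1);
    }

    // Reduce step: combine all values that share a key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group by key (the "shuffle"), reduce each group.
    static Map<String, Integer> run(int[][] sizes) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int[] s : sizes) {
            Map.Entry<String, Integer> kv = map(s[0], s[1]);
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] sizes = { {640, 480}, {2592, 1944}, {640, 480} };
        System.out.println(run(sizes));
    }
}
```

In a real cluster the map calls run in parallel next to the data and the grouping happens over the network, but the division of labor is the same.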
Hadoop, a Java Implementation
An implementation of MapReduce that originated at Yahoo!
The cluster we worked on has 625.5 nodes, with a map task capacity of 2502 and a reduce task capacity of 834.
Computer Vision at Scale
The “computational vision”
The sheer size of datasets:
PCA of Natural Images (1992): 15 images, 4096 patches
High-perf Face Detection (2007): 75,000 samples
IM2GPS (2008): 6,472,304 images
HIPI Workflow
HIPI Image Bundle Setup
Moral of the story:
Many small files kill performance in a distributed file system.
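The idea of replacing many small files with one large file can be sketched as follows. This is a hypothetical, minimal container (a record count followed by length-prefixed blobs), not the actual HIPI Image Bundle format, and the names `BundleSketch`, `pack`, and `unpack` are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical bundling sketch: many small blobs become one byte stream.
// This is NOT the HIPI Image Bundle format.
class BundleSketch {

    // Write a record count, then each blob as (length, bytes).
    static byte[] pack(List<byte[]> blobs) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(blobs.size());
            for (byte[] blob : blobs) {
                out.writeInt(blob.length);
                out.write(blob);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the blobs back in the order they were written.
    static List<byte[]> unpack(byte[] bundle) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bundle));
            int count = in.readInt();
            List<byte[]> blobs = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                byte[] blob = new byte[in.readInt()];
                in.readFully(blob);
                blobs.add(blob);
            }
            return blobs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<byte[]> images = Arrays.asList(new byte[] {1, 2, 3}, new byte[] {4, 5});
        byte[] bundle = pack(images);
        System.out.println("bundle bytes: " + bundle.length
                + ", images recovered: " + unpack(bundle).size());
    }
}
```

A single large file of this kind can be read sequentially and split across map tasks, whereas each small file carries its own per-file overhead in the file system.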
Redo PCA in Natural Images at Scale
The first 15 principal components from 15 images (Hancock, 1992):
Redo PCA in Natural Images at Scale
Comparison:
Hancock, 1992
HIPI, 100
HIPI, 1,000
HIPI, 10,000
HIPI, 100,000
Optimize HIPI Performance
Culling: because decompression is costly
Decompress only when needed
A boolean cull(ImageHeader header) method for conditional decompression
Culling, to inspect specific camera effects
Canon Powershot S500, at 2592x1944
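A cull check for the camera example above might look roughly like the sketch below. The `Header` class is a hypothetical stand-in for HIPI's `ImageHeader` (its fields are assumptions, not the real API), and the convention that `cull` returns true to skip an image is also assumed here. The point is that the decision uses only cheap header metadata, so pixel data is decompressed only for images that pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of conditional decompression. Header is a hypothetical stand-in
// for HIPI's ImageHeader; its fields are assumptions for illustration.
class CullSketch {

    static final class Header {
        final String cameraModel;
        final int width;
        final int height;

        Header(String cameraModel, int width, int height) {
            this.cameraModel = cameraModel;
            this.width = width;
            this.height = height;
        }
    }

    // Assumed convention: return true to cull (skip) the image.
    static boolean cull(Header header) {
        boolean wanted = "Canon Powershot S500".equals(header.cameraModel)
                && header.width == 2592
                && header.height == 1944;
        return !wanted;
    }

    // Indices of the images that would actually be decompressed.
    static List<Integer> select(List<Header> headers) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < headers.size(); i++) {
            if (!cull(headers.get(i))) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Header> headers = Arrays.asList(
                new Header("Other Camera", 640, 480),
                new Header("Canon Powershot S500", 2592, 1944));
        System.out.println("images to decompress: " + select(headers));
    }
}
```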
HIPI, Glance at Performance Figures
An empty job (only decompressing and looping over images): 5 runs, reporting the minimum, in seconds, lower is better:
[Figure: running time in seconds for Many Small Files, Hadoop Sequence File, and HIPI Image Bundle; x-axis ticks on a log scale starting at 10, y-axis ticks from 100 to 500]
HIPI, Glance at Performance Figures
Im2gray job (converting images to grayscale): 5 runs, reporting the minimum, in seconds, lower is better:
[Figure: running time in seconds for Many Small Files, Hadoop Sequence File, and HIPI Image Bundle; x-axis ticks on a log scale starting at 10, y-axis ticks from 100 to 500]
HIPI, Glance at Performance Figures
Covariance job (computing the covariance matrix of patches, 100 patches per image): 1 to 3 runs*, reporting the minimum, in seconds, lower is better:
[Figure: running time in seconds for Many Small Files, Hadoop Sequence File, and HIPI Image Bundle; x-axis ticks on a log scale starting at 10, y-axis ticks from 1000 to 8000]
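One way a covariance matrix can be computed in a map/reduce style is with mergeable partial sums: each mapper accumulates a count, a sum vector, and a sum of outer products over its patches, and the reducer adds these together before forming the final matrix. The sketch below illustrates that general technique only; it is not taken from the HIPI covariance job, and the names `CovarianceSketch` and `Partial` are hypothetical. It computes the population covariance (dividing by n).

```java
// Mergeable partial statistics for a covariance matrix.
// Illustrative only; not the HIPI implementation.
class CovarianceSketch {

    static final class Partial {
        long n;                   // number of patches seen
        final double[] sum;       // sum of patch vectors
        final double[][] outer;   // sum of outer products x * x^T

        Partial(int dim) {
            sum = new double[dim];
            outer = new double[dim][dim];
        }

        // Mapper side: fold one patch into the partial statistics.
        void add(double[] x) {
            n++;
            for (int i = 0; i < x.length; i++) {
                sum[i] += x[i];
                for (int j = 0; j < x.length; j++) {
                    outer[i][j] += x[i] * x[j];
                }
            }
        }

        // Reducer side: combine the statistics of two partials.
        void merge(Partial other) {
            n += other.n;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += other.sum[i];
                for (int j = 0; j < sum.length; j++) {
                    outer[i][j] += other.outer[i][j];
                }
            }
        }

        // Population covariance: E[x x^T] - E[x] E[x]^T.
        double[][] covariance() {
            int d = sum.length;
            double[][] cov = new double[d][d];
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    cov[i][j] = outer[i][j] / n - (sum[i] / n) * (sum[j] / n);
                }
            }
            return cov;
        }
    }

    public static void main(String[] args) {
        Partial a = new Partial(2);
        a.add(new double[] {1, 2});
        a.add(new double[] {3, 4});
        Partial b = new Partial(2);
        b.add(new double[] {5, 6});
        a.merge(b);
        System.out.println(java.util.Arrays.deepToString(a.covariance()));
    }
}
```

Because the partial statistics simply add, the result does not depend on how the patches are divided among mappers.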
HIPI, Glance at Performance Figures
Culling job (decompressing all images vs. decompressing only the images we care about): 1 to 3 runs, reporting the minimum, in seconds, lower is better:
[Figure: running time in seconds for Without Culling and With Culling; x-axis ticks on a log scale starting at 10, y-axis ticks from 100 to 700]
Conclusion
Everything at large scale gets better.
HIPI provides an image-centric interface that performs on par with or better than the leading alternative.
The cull method provides a significant improvement and convenience.
HIPI offers noticeable improvements!
Future work
Release HIPI as an open-source project.
Work on deep integration with Hadoop.
Make the HIPI workload more configurable.
Make the workload more balanced.