Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Kaibo Wang 1, Yin Huai 1, Rubao Lee 1, Fusheng Wang 2,3, Xiaodong Zhang 1, Joel H. Saltz 2,3
1 Department of Computer Science and Engineering, The Ohio State University
2 Center for Comprehensive Informatics, Emory University
3 Department of Biomedical Informatics, Emory University

Background: Digital Pathology
• Digital pathology imaging has become an increasingly important field in the past decade
• Examination of high-resolution tissue images enables more effective prediction, diagnosis, and therapy of diseases
[Figure: workflow from glass slides, through scanning, to whole slide images and image analysis]

Background: Image Algorithm Evaluation
• High-quality image analysis algorithms are essential to support biomedical research and diagnosis
  – Validate algorithms with human annotations
  – Compare and consolidate multiple algorithm results
  – Sensitivity study of algorithm parameters
[Figure: overlaid segmentation results. Green: algorithm one. Red: algorithm two.]

Problem: Spatial Cross-Comparison
• Spatial cross-comparison: identify and compare derived spatial objects belonging to different observations or analyses
• Jaccard similarity: the overlap ratio of intersecting polygons from two result sets
[Figure: two intersecting polygons, p and q]
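
The Jaccard similarity of two polygons p and q is the area of their intersection divided by the area of their union. The following is a minimal sketch of that ratio, assuming the two areas are already known; the function names and the axis-aligned rectangle example are illustrative and do not come from the slides.

```python
def jaccard(intersection_area, union_area):
    """Overlap ratio area(p ∩ q) / area(p ∪ q) of two polygons."""
    if union_area <= 0:
        raise ValueError("union area must be positive")
    return intersection_area / union_area


def rect_area(r):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1); 0 if it is empty."""
    x0, y0, x1, y1 = r
    return max(0, x1 - x0) * max(0, y1 - y0)


def rect_jaccard(p, q):
    """Jaccard similarity of two axis-aligned rectangles (illustration only)."""
    # The intersection of two axis-aligned rectangles is again a rectangle.
    inter = rect_area((max(p[0], q[0]), max(p[1], q[1]),
                       min(p[2], q[2]), min(p[3], q[3])))
    # Inclusion-exclusion gives the union area without building the union.
    union = rect_area(p) + rect_area(q) - inter
    return jaccard(inter, union)
```

For example, `rect_jaccard((0, 0, 8, 8), (4, 4, 12, 12))` evaluates 16 / 112. Real segmentation results are arbitrary polygons, for which computing the two areas is the expensive part.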

Both Data- and Compute-Intensive
• Increasingly large data sets
  – 10^5 x 10^5 pixels per image
  – 1 million objects per image
  – Hundreds to thousands of images per study
  – Big data demanding high throughput
• High computation intensity
  – Computing Jaccard similarity requires heavy-duty geometric operations
  – Demanding high performance
Parallel computing techniques must be utilized to handle such intensive workloads

Existing Approach: Spatial DBMS
• Extension of RDBMS with spatial data types and operators (PostGIS, DB2, etc.)
• A typical cross-comparison takes many hours to finish on a single machine
  – 90+% of computing time is spent on computing the areas of polygon intersections and unions
  – Algorithms used by SDBMS are highly branch-intensive and difficult to parallelize
• Task-based (MIMD) parallel computing can be applied on large-scale clusters
  – Expensive: requires a facility with many high-end nodes
A high-performance/throughput and cost-effective solution is highly desirable

Graphics Processing Units (GPU)
• Low-cost and powerful data-parallel devices
• SIMD data parallel architecture
All cores on a streaming multiprocessor (SM) execute the same instruction on different data
E.g., NVIDIA GTX 580 has 512 cores (16 SMs, 32 cores each)
Applications must exploit SIMD data parallelism in order to best utilize the power of GPUs
[Figure: peak performance (GFLOPS, 0 to 1800) from Jun-06 to Sep-10. GPU panel: 7900 GTX, 8800 GTX, 9800 GTX, GTX 280, GTX 285, GTX 480, GTX 580. CPU panel: E4300, E6850, Q9650, X7460, 980 XE.]
USENIX ATC’11

Our Solution: SCCG
• Spatial Cross-comparison on CPUs and GPUs
  – Utilize GPUs with CPUs for both high throughput and high performance in a cost-effective way
• Critical challenges
  – SIMD data-parallel algorithms on GPU
  – CPU-GPU hybrid computing framework
  – Load balancing between CPU and GPU

Outline
• Introduction
• SCCG
  – PixelBox GPU algorithm
  – Cross-comparing framework
  – Load balancing
• Experiments
• Conclusions

The PixelBox GPU Algorithm
• Given an array of polygon pairs, compute the area of intersection and area of union for each polygon pair
• Algorithm principles
  – Exploit SIMD data parallelism: compute areas of polygon intersections and unions in a SIMD data-parallel mode
  – Maximize data parallelism and minimize unnecessary compute intensity: reduce compute intensity while maintaining high data parallelism

Exploit SIMD Data Parallelism
• Monte-Carlo approach (a basic method)
Consider polygons lying on a pixel map
Compute areas of intersection/union by counting the number of pixels lying within each region
Perfect data parallelism, but high compute intensity when polygons are large
[Figure: polygons p and q on a pixel map, with the intersection and union regions highlighted]
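
The pixel-counting idea can be sketched sequentially as follows. This is a minimal illustration, assuming polygons are given as vertex lists with integer coordinates and that a pixel belongs to a polygon when its centre lies inside it; the function names and the ray-casting test are choices made here, not details taken from the slides.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: is (x, y) inside the polygon given as a vertex list?"""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        # Count the edges crossed by a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pixel_count_areas(p, q):
    """Return (intersection, union) areas in pixels for polygons p and q."""
    xs = [x for x, _ in p + q]
    ys = [y for _, y in p + q]
    inter = union = 0
    # Every pixel of the bounding rectangle is tested independently of the
    # others, which is why the method maps naturally onto SIMD threads.
    for px in range(int(min(xs)), int(max(xs))):
        for py in range(int(min(ys)), int(max(ys))):
            cx, cy = px + 0.5, py + 0.5  # pixel centre
            in_p = point_in_polygon(cx, cy, p)
            in_q = point_in_polygon(cx, cy, q)
            inter += in_p and in_q
            union += in_p or in_q
    return inter, union
```

The cost grows with the number of pixels in the bounding rectangle times the number of polygon edges, which illustrates why this basic method becomes expensive for large polygons.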

Reduce Unnecessary Compute Intensity
• Partition the bounding rectangle of a polygon pair into boxes (like grid cells)
• Use sampling boxes
• Compute areas box by box, thus avoiding lots of costly per-pixel testing
• Recursively explore unsettled boxes by partitioning them into smaller sub-boxes
According to our testing, 50+% of areas can be determined with only one level of box partitioning
[Figure: bounding rectangle of p and q partitioned into boxes. Legend: completely within the intersection of p and q; completely within the union of p and q; don’t belong to either intersection or union; unsettled, thus need further exploration (further partitioned).]

The PixelBox Algorithm
• PixelBox works on both pixels and boxes
  – First, box by box to quickly finish the testing of large regions
  – Then, pixel by pixel within small sub-boxes that need further testing
In this way, PixelBox preserves the benefits of both high data parallelism and low compute intensity

Implementation for GPU
• Use a shared stack to store the boxes to be tested
  – The bounding rectangle of the two polygons is pushed on the stack as the first box
  – Each sampling box carries a mark: 1 (needs further testing) or 0 (no further testing)
• The stack top is fetched by all threads, and partitioned if it needs further testing
  – The box is partitioned into sub-boxes, which are then tested by different threads in parallel
  – The contribution of each sub-box is computed, along with whether it needs further testing
• New sub-boxes are pushed to the top of the stack again
• All threads keep popping boxes from the stack for processing
  – A box with mark 0 needs no further testing, and is thus ignored by all threads
  – A box with mark 1 needs to be further partitioned, or tested with Monte Carlo if it is already small enough
• Computation finishes when the stack becomes empty
Using a stack improves data parallelism: both popping and pushing can be done in SIMD fashion
[Figure: threads 0, 1, 2, 3, … operating on a shared stack of sampling boxes marked 0 or 1]
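
The stack-driven procedure can be emulated sequentially as below. This is only a sketch under stated assumptions, not the authors' GPU kernel: vertices have integer coordinates, a box counts as settled only when neither polygon boundary may pass through it (a conservative simplification), and the names `box_position`, `pixelbox_areas`, `grid`, and `pixel_threshold` are hypothetical. On a GPU, the inner loops over sub-boxes and pixels would be spread across threads.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test for a point against a polygon (list of vertices)."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def box_position(box, poly):
    """Classify a box against one polygon: 'in', 'out', or 'cross'."""
    x0, y0, x1, y1 = box
    n = len(poly)
    for i in range(n):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % n]
        # Conservative test: an edge whose bounding box overlaps the open
        # interior of the box may cross it, so the box stays unsettled.
        if (min(ax, bx) < x1 and max(ax, bx) > x0
                and min(ay, by) < y1 and max(ay, by) > y0):
            return "cross"
    # No edge passes through the box, so its centre decides for all of it.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return "in" if point_in_polygon(cx, cy, poly) else "out"


def pixelbox_areas(p, q, grid=2, pixel_threshold=16):
    """Return (intersection, union) areas in pixels; needs pixel_threshold >= 1."""
    xs = [x for x, _ in p + q]
    ys = [y for _, y in p + q]
    # The bounding rectangle is pushed as the first box, with mark 1.
    stack = [((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))), 1)]
    inter = union = 0
    while stack:  # computation finishes when the stack becomes empty
        (x0, y0, x1, y1), mark = stack.pop()
        if mark == 0:
            continue  # settled box: its contribution was already added
        w, h = x1 - x0, y1 - y0
        if w * h <= pixel_threshold:
            # Small unsettled box: test it pixel by pixel.
            for px in range(x0, x1):
                for py in range(y0, y1):
                    in_p = point_in_polygon(px + 0.5, py + 0.5, p)
                    in_q = point_in_polygon(px + 0.5, py + 0.5, q)
                    inter += in_p and in_q
                    union += in_p or in_q
            continue
        # Large unsettled box: partition it into grid x grid sub-boxes.
        for i in range(grid):
            for j in range(grid):
                sub = (x0 + w * i // grid, y0 + h * j // grid,
                       x0 + w * (i + 1) // grid, y0 + h * (j + 1) // grid)
                if sub[2] <= sub[0] or sub[3] <= sub[1]:
                    continue  # empty sub-box
                pos_p, pos_q = box_position(sub, p), box_position(sub, q)
                if "cross" in (pos_p, pos_q):
                    sub_mark = 1  # needs further testing
                else:
                    sub_mark = 0  # settled: add the whole box area at once
                    area = (sub[2] - sub[0]) * (sub[3] - sub[1])
                    inter += area * (pos_p == "in" and pos_q == "in")
                    union += area * (pos_p == "in" or pos_q == "in")
                stack.append((sub, sub_mark))
    return inter, union
```

For two overlapping 8 x 8 squares offset by 4 pixels in each direction, the result matches plain pixel counting, while settled boxes are accounted for without per-pixel tests.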

The Cross-Comparing Framework
• A pipelined framework that executes the whole cross-comparing workflow
• Pipelined execution reduces resource contention over GPUs
  – GPU is an exclusive, non-preemptive device
  – Multiple threads trying to access a GPU simultaneously are serialized
  – A single initiator (Aggregator) reduces blocking over GPUs
[Figure: pipeline stages. CPU: load data from disk. CPU: find intersecting polygon pairs from the data. GPU (PixelBox): compute Jaccard similarity.]
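
A three-stage pipeline with a single aggregator can be sketched with threads and bounded queues. Everything here is an assumption made for illustration: the stage functions, the batch size, the rectangle data, and `gpu_stand_in`, which replaces a real GPU kernel launch with plain CPU arithmetic. Only the overall shape (load, find intersecting pairs, one initiator feeding the GPU) follows the slide.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def loader(tiles, out_q):
    """Stage 1 (CPU): 'load' tiles; here they are already in memory."""
    for tile in tiles:
        out_q.put(tile)
    out_q.put(STOP)


def pair_finder(in_q, out_q):
    """Stage 2 (CPU): keep only rectangle pairs that actually overlap."""
    while (tile := in_q.get()) is not STOP:
        set_a, set_b = tile
        for a in set_a:
            for b in set_b:
                if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]:
                    out_q.put((a, b))
    out_q.put(STOP)


def gpu_stand_in(batch):
    """Placeholder for a GPU kernel: Jaccard similarity of rectangle pairs."""
    out = []
    for a, b in batch:
        inter = ((min(a[2], b[2]) - max(a[0], b[0]))
                 * (min(a[3], b[3]) - max(a[1], b[1])))
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        out.append(inter / union)
    return out


def aggregator(in_q, results, batch_size=2):
    """Stage 3: the only thread that submits batches to the (simulated) GPU."""
    batch = []
    while (pair := in_q.get()) is not STOP:
        batch.append(pair)
        if len(batch) == batch_size:
            results.extend(gpu_stand_in(batch))
            batch = []
    if batch:  # flush the last, possibly partial, batch
        results.extend(gpu_stand_in(batch))


def run_pipeline(tiles):
    """Run the three stages concurrently and return the similarities."""
    q1, q2, results = queue.Queue(maxsize=4), queue.Queue(maxsize=64), []
    stages = [threading.Thread(target=loader, args=(tiles, q1)),
              threading.Thread(target=pair_finder, args=(q1, q2)),
              threading.Thread(target=aggregator, args=(q2, results))]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Because only the aggregator thread touches the device, producer threads never block on it directly; they block only when a bounded queue fills up.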

CPU-GPU Load Balancing
• Both CPUs and GPUs have to be fully utilized in order to maximize throughput
  – Polygon pairs are produced on CPUs and consumed on GPUs
  – CPUs idle if production speed > consumption speed
  – GPUs idle if production speed < consumption speed
Tasks have to be dynamically migrated among CPUs and GPUs to achieve load balancing (please read paper for details)
[Figure: pipeline stages: load data from disk; find intersecting polygon pairs from the data; compute Jaccard similarity]

Experiments
• Methodology
  – Spatial DBMS: PostgreSQL 9.1.3 + PostGIS 1.5.3
  – Disk loading time not considered
    • “First-load-then-query” scheme of database is known to be inefficient to process one-time data
    • Storage devices are improving (e.g., SSD)
  – Ignore data format conversion and data partitioning time in SDBMS
  Both simplifications favor the performance of PostGIS
• Platforms
  – Dell T1500 workstation
    • One Intel Core i7 860 CPU (4 cores)
    • One NVIDIA GTX 580 GPU
  – Amazon EC2 instance
    • Two Intel Xeon X5570 CPUs (8 cores, 16 threads)
• Data sets
  – 18 pairs of polygon sets extracted from 18 real-world brain tumor pathology images
  – 12GiB in raw text format

Effectiveness of PixelBox on GPU
• Algorithm performance
On Dell T1500, compute the Jaccard similarity of 619609 pairs of polygons in a representative data set
  – GEOS on a single core: over 430 s
  – CPU-version PixelBox on a single core: over 290 s (1.5x speedup over GEOS)
  – PixelBox on GPU: only 3.6 s (120x speedup over GEOS)
[Figure: two log-scale bar charts (1 to 1000): execution time (s) for GEOS, PixelBox-CPU, and PixelBox; speedup over GEOS for PixelBox-CPU and PixelBox]
This experiment shows the effectiveness of PixelBox and how well it utilizes the SIMD data parallelism of GPUs

Overall Performance
• Cross-comparing performance
  – PostGIS-M: parallelized PostGIS on EC2
  – SCCG: our solution on Dell T1500
  – 18x speedup on average
[Figure: execution time (s, 0 to 300) of PostGIS-M and SCCG for data sets 1 to 18; speedup over PostGIS-M (0 to 50) for data sets 1 to 18 and their geometric mean]
Hardware cost: two Intel Xeon X5570: $2000; Core i7 860 + GTX 580: $800
Power: two Intel Xeon X5570: 190 W; Core i7 860 + GTX 580: 339 W
Our solution is 2.4x lower in hardware cost, and 10x higher in performance per watt
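
The performance-per-watt figure can be checked from the numbers on this slide. The sketch below assumes performance per watt is throughput divided by the listed power, so the ratio is the average speedup scaled by the two power figures; the helper names are illustrative.

```python
def cost_ratio(baseline_cost, our_cost):
    """How many times cheaper our hardware is than the baseline."""
    return baseline_cost / our_cost


def perf_per_watt_ratio(speedup, baseline_watts, our_watts):
    """Ratio of (performance / watt) of our system to that of the baseline."""
    # perf_ours / perf_base = speedup, so the per-watt ratio is
    # speedup * baseline_watts / our_watts.
    return speedup * baseline_watts / our_watts


if __name__ == "__main__":
    print(round(perf_per_watt_ratio(18, 190, 339), 1))  # about 10
    print(cost_ratio(2000, 800))                        # 2.5 with these prices
```

With an 18x average speedup, 190 W, and 339 W, the ratio comes to roughly 10x, in line with the slide. The rounded prices listed give a cost ratio of 2.5x, close to the quoted 2.4x; the slide does not state the exact prices behind that figure.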

Conclusions
• Spatial cross-comparison is a data- and compute-intensive operation
• Existing approach with SDBMS is neither high-performance nor low-cost
• We provide a software solution based on GPUs and CPUs to significantly accelerate the work at low cost

Thank You
• Q & A