accelerating pathology image data cross-comparison on cpu-gpu hybrid systems

Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

Kaibo Wang 1, Yin Huai 1, Rubao Lee 1, Fusheng Wang 2,3, Xiaodong Zhang 1, Joel H. Saltz 2,3

1 Department of Computer Science and Engineering, The Ohio State University
2 Center for Comprehensive Informatics, Emory University
3 Department of Biomedical Informatics, Emory University

2

Background: Digital Pathology

• Digital pathology imaging has become an increasingly important field in the past decade
• Examination of high-resolution tissue images enables more effective prediction, diagnosis, and therapy of diseases

[Figure: glass slides → scanning → whole slide images → image analysis]

3

Background: Image Algorithm Evaluation

• High-quality image analysis algorithms are essential to support biomedical research and diagnosis
– Validate algorithms with human annotations
– Compare and consolidate multiple algorithm results
– Sensitivity study of algorithm parameters

[Figure legend: green: algorithm one; red: algorithm two]

4

Problem: Spatial Cross-Comparison

• Spatial cross-comparison: identify and compare derived spatial objects belonging to different observations or analyses
• Jaccard similarity: the overlap ratio of intersecting polygons from two result sets

[Figure: two intersecting polygons, p and q]
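For reference, the overlap ratio can be written out as follows. This is the standard definition of Jaccard similarity applied to polygon areas; the notation J(p, q) and area(·) is chosen here for illustration and does not come from the slides.

```latex
\[
J(p, q) = \frac{\operatorname{area}(p \cap q)}{\operatorname{area}(p \cup q)},
\qquad
\operatorname{area}(p \cup q) = \operatorname{area}(p) + \operatorname{area}(q) - \operatorname{area}(p \cap q)
\]
```

The value ranges from 0 (disjoint polygons) to 1 (identical polygons), so only the intersection and union areas of each polygon pair need to be computed.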

5

Both Data- and Compute-Intensive

• Increasingly large data sets
– 10^5 x 10^5 pixels per image
– 1 million objects per image
– Hundreds to thousands of images per study
– Big data demanding high throughput
• High computation intensity
– Computing Jaccard similarity requires heavy-duty geometric operations
– Demanding high performance

Parallel computing techniques must be utilized to handle such intensive workloads

6

Existing Approach: Spatial DBMS

• Extension of RDBMS with spatial data types and operators (PostGIS, DB2, etc.)
• A typical cross-comparison takes many hours to finish on a single machine
– 90+% of computing time is spent on computing the areas of polygon intersections and unions
– Algorithms used by SDBMS are highly branch-intensive and difficult to parallelize
• Task-based (MIMD) parallel computing can be applied on large-scale clusters
– Expensive, since it requires a facility of many high-end nodes

A high-performance, high-throughput, and cost-effective solution is highly desirable

7

Graphics Processing Units (GPU)

• Low-cost and powerful data-parallel devices
• SIMD data-parallel architecture
– All cores on a streaming multiprocessor (SM) execute the same instruction on different data
– E.g., NVIDIA GTX 580 has 512 cores (16 SMs, 32 cores each)

Applications must exploit SIMD data parallelism in order to best utilize the power of GPUs

[Figure: peak performance (GFLOPS, 0 to 1800) of GPUs and CPUs from 2006 to 2010. GPUs: 7900 GTX, 8800 GTX, 9800 GTX, GTX 280, GTX 285, GTX 480, GTX 580. CPUs: E4300, E6850, Q9650, X7460, 980 XE]

USENIX ATC’11

8

Our Solution: SCCG

• Spatial Cross-comparison on CPUs and GPUs
– Utilize GPUs with CPUs for both high throughput and high performance in a cost-effective way
• Critical challenges
– SIMD data-parallel algorithms on GPU
– CPU-GPU hybrid computing framework
– Load balancing between CPU and GPU

9

Outline

• Introduction
• SCCG
– PixelBox GPU algorithm
– Cross-comparing framework
– Load balancing
• Experiments
• Conclusions

10

The PixelBox GPU Algorithm

• Given an array of polygon pairs, compute the area of intersection and area of union for each polygon pair
• Algorithm principles
– Exploit SIMD data parallelism: compute areas of polygon intersections and unions in a SIMD data-parallel mode
– Maximize data parallelism and minimize unnecessary compute intensity: reduce compute intensity while maintaining high data parallelism

11

Exploit SIMD Data Parallelism

• Monte-Carlo approach (a basic method)
– Consider polygons lying on a pixel map
– Compute areas of intersection/union by counting the number of pixels lying within each region
– Perfect data parallelism, but high compute intensity when polygons are large

[Figure: polygons p and q on a pixel map, with the union and intersection regions marked]
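The pixel-counting idea can be sketched in a few lines of Python. This is a sequential illustration, not the authors' GPU code: the function names `point_in_polygon` and `pixel_count_areas`, the even-odd ray-casting test, and the choice to sample each pixel at its center are assumptions made for this example. Polygons are assumed to be lists of integer (x, y) vertices. On a GPU, each pixel test would be assigned to a different thread.

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray casting; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pixel_count_areas(p, q):
    """Return (intersection_area, union_area) in pixels for polygons p and q."""
    xs = [v[0] for v in p + q]
    ys = [v[1] for v in p + q]
    inter = union = 0
    # Visit every pixel of the common bounding rectangle.
    for py in range(int(min(ys)), int(max(ys))):
        for px in range(int(min(xs)), int(max(xs))):
            cx, cy = px + 0.5, py + 0.5  # sample the pixel at its center
            in_p = point_in_polygon(cx, cy, p)
            in_q = point_in_polygon(cx, cy, q)
            inter += in_p and in_q
            union += in_p or in_q
    return inter, union


p = [(0, 0), (4, 0), (4, 4), (0, 4)]
q = [(2, 2), (6, 2), (6, 6), (2, 6)]
inter, union = pixel_count_areas(p, q)
print(inter, union, inter / union)
```

Every pixel of the bounding rectangle is tested independently, which is what makes the method data-parallel, and also why its cost grows with polygon size.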

12

Reduce Unnecessary Compute Intensity

• Partition the bounding rectangle of a polygon pair into boxes (like grid cells)
• Use sampling boxes
• Compute areas box by box, thus avoiding lots of costly per-pixel testing
– Boxes completely within the intersection of p and q
– Boxes completely within the union of p and q
– Boxes that do not belong to either the intersection or the union
– Unsettled boxes, which need further exploration
• Recursively explore unsettled boxes by partitioning them into smaller sub-boxes

According to our testing, 50+% of areas can be determined with only one level of box partitioning

[Figure: the bounding rectangle of p and q partitioned into boxes, with unsettled boxes further partitioned]
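A minimal Python sketch of one level of box partitioning follows. It is an illustration under stated assumptions rather than the authors' implementation: the names `boundary_may_cross`, `classify_box`, `partition`, and `one_level_summary` are hypothetical, and the "settled" test is deliberately conservative (a box counts as unsettled whenever the bounding box of any polygon edge overlaps its interior, which is exact for axis-aligned edges and pessimistic for diagonal ones). The label `union_only` means "inside the union but not the intersection".

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray casting; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def boundary_may_cross(poly, box):
    """Conservative: True if any edge's bounding box overlaps the box interior."""
    x0, y0, x1, y1 = box
    n = len(poly)
    for i in range(n):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % n]
        if (min(ax, bx) < x1 and max(ax, bx) > x0 and
                min(ay, by) < y1 and max(ay, by) > y0):
            return True
    return False


def classify_box(box, p, q):
    """Classify a box (x0, y0, x1, y1) against polygons p and q."""
    if boundary_may_cross(p, box) or boundary_may_cross(q, box):
        return "unsettled"
    # No boundary passes through the box, so one test at its center decides it.
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    in_p = point_in_polygon(cx, cy, p)
    in_q = point_in_polygon(cx, cy, q)
    if in_p and in_q:
        return "intersection"
    if in_p or in_q:
        return "union_only"
    return "outside"


def partition(box, n):
    """Split a box into an n x n grid of sub-boxes."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) / n, (y1 - y0) / n
    return [(x0 + i * w, y0 + j * h, x0 + (i + 1) * w, y0 + (j + 1) * h)
            for j in range(n) for i in range(n)]


def one_level_summary(p, q, n):
    """Count how many sub-boxes fall into each class after one partitioning."""
    xs = [v[0] for v in p + q]
    ys = [v[1] for v in p + q]
    counts = {"intersection": 0, "union_only": 0, "outside": 0, "unsettled": 0}
    for sub in partition((min(xs), min(ys), max(xs), max(ys)), n):
        counts[classify_box(sub, p, q)] += 1
    return counts


p = [(0, 0), (4, 0), (4, 4), (0, 4)]
q = [(2, 2), (6, 2), (6, 6), (2, 6)]
print(one_level_summary(p, q, 3))
```

Each settled box costs two point-in-polygon tests regardless of how many pixels it covers, which is where the savings over per-pixel testing come from.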

13

The PixelBox Algorithm

• PixelBox works on both pixels and boxes
– First, box by box to quickly finish the testing of large regions
– Then, pixel by pixel within small sub-boxes that need further testing

In this way, PixelBox preserves the benefits of both high data parallelism and low compute intensity

[Figure: polygons p and q tested first by boxes, then by pixels]

14

Implementation for GPU

• Use a shared stack to store the boxes to be tested
– The bounding rectangle of the two polygons is pushed on the stack as the first box
– Each sampling box carries a mark: 1 means it needs further testing, 0 means no further testing
• The stack top is fetched by all threads, and partitioned if it needs further testing
– It is partitioned into sub-boxes, which are then tested by different threads in parallel
– The contribution of each sub-box is computed, along with whether further testing is needed
• New sub-boxes are pushed to the top of the stack again
• All threads keep popping boxes from the stack for processing
– A box with mark 0 needs no further testing, and is thus ignored by all threads
– A box with mark 1 needs to be further partitioned, or tested with Monte Carlo if it is already small enough
• Computation finishes when the stack becomes empty

Using a stack improves data parallelism: both popping and pushing can be done in SIMD fashion

[Figure: threads 0 to 3 testing sub-boxes in parallel against a shared stack of marked sampling boxes]
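The stack-driven control flow can be emulated sequentially in Python. The sketch below is an assumption-laden illustration, not the authors' GPU kernel: one loop iteration stands in for what a group of GPU threads would do together, the 0/1 mark is replaced by deciding on the spot whether a popped box is settled, and the names `pixelbox_areas`, `boundary_may_cross`, and the parameter `pixel_threshold` are hypothetical. Polygons are assumed to have integer vertex coordinates.

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray casting; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def boundary_may_cross(poly, box):
    """Conservative: True if any edge's bounding box overlaps the box interior."""
    x0, y0, x1, y1 = box
    n = len(poly)
    for i in range(n):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % n]
        if (min(ax, bx) < x1 and max(ax, bx) > x0 and
                min(ay, by) < y1 and max(ay, by) > y0):
            return True
    return False


def pixelbox_areas(p, q, pixel_threshold=2):
    """Return (intersection_area, union_area) using boxes first, pixels last."""
    xs = [v[0] for v in p + q]
    ys = [v[1] for v in p + q]
    # The bounding rectangle of both polygons is the first box on the stack.
    stack = [(min(xs), min(ys), max(xs), max(ys))]
    inter = union = 0
    while stack:  # computation finishes when the stack becomes empty
        x0, y0, x1, y1 = box = stack.pop()
        if x0 >= x1 or y0 >= y1:
            continue  # empty box produced by splitting a 1-pixel-wide box
        if not (boundary_may_cross(p, box) or boundary_may_cross(q, box)):
            # Settled box: one test at the center decides the whole area.
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            in_p = point_in_polygon(cx, cy, p)
            in_q = point_in_polygon(cx, cy, q)
            area = (x1 - x0) * (y1 - y0)
            inter += area if (in_p and in_q) else 0
            union += area if (in_p or in_q) else 0
        elif x1 - x0 <= pixel_threshold and y1 - y0 <= pixel_threshold:
            # Small unsettled box: fall back to pixel-by-pixel counting.
            for py in range(y0, y1):
                for px in range(x0, x1):
                    in_p = point_in_polygon(px + 0.5, py + 0.5, p)
                    in_q = point_in_polygon(px + 0.5, py + 0.5, q)
                    inter += in_p and in_q
                    union += in_p or in_q
        else:
            # Large unsettled box: partition into four sub-boxes and push them.
            mx, my = (x0 + x1) // 2, (y0 + y1) // 2
            stack.extend([(x0, y0, mx, my), (mx, y0, x1, my),
                          (x0, my, mx, y1), (mx, my, x1, y1)])
    return inter, union


p = [(0, 0), (4, 0), (4, 4), (0, 4)]
q = [(2, 2), (6, 2), (6, 6), (2, 6)]
print(pixelbox_areas(p, q))
```

In this sketch a larger `pixel_threshold` shifts work toward pixel counting and a smaller one toward box partitioning; the value used here is arbitrary.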

15

The Cross-Comparing Framework

• A pipelined framework that executes the whole cross-comparing workflow
• Pipelined execution reduces resource contention over GPUs
– GPU is an exclusive, non-preemptive device
– Multiple threads trying to access a GPU simultaneously are serialized
– A single initiator (Aggregator) reduces blocking over GPUs

Pipeline stages:
CPU: load data from disk
CPU: find intersecting polygon pairs from the data
GPU (PixelBox): compute Jaccard similarity
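The three-stage pipeline with a single initiator can be sketched with Python threads and bounded queues. Everything below is illustrative: only the Aggregator role is named in the slides, while the stage function names, the queue sizes, the batching policy, the bounding-box overlap filter, and the `kernel` callback (a stand-in for the GPU computation) are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def bbox(poly):
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def load_stage(tiles, out_q):
    # Stand-in for reading polygon sets from disk.
    for tile in tiles:
        out_q.put(tile)
    out_q.put(DONE)


def filter_stage(in_q, out_q):
    # Find candidate intersecting pairs (here: overlapping bounding boxes).
    while True:
        tile = in_q.get()
        if tile is DONE:
            break
        set_a, set_b = tile
        for p in set_a:
            for q in set_b:
                if boxes_overlap(bbox(p), bbox(q)):
                    out_q.put((p, q))
    out_q.put(DONE)


def aggregator_stage(in_q, kernel, batch_size, results):
    # The only thread that calls the kernel, so device access is never contended.
    batch = []
    while True:
        pair = in_q.get()
        if pair is DONE:
            break
        batch.append(pair)
        if len(batch) == batch_size:
            results.extend(kernel(batch))
            batch = []
    if batch:
        results.extend(kernel(batch))


def run_pipeline(tiles, kernel, batch_size=2):
    q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=16)
    results = []
    stages = [
        threading.Thread(target=load_stage, args=(tiles, q1)),
        threading.Thread(target=filter_stage, args=(q1, q2)),
        threading.Thread(target=aggregator_stage,
                         args=(q2, kernel, batch_size, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Because the bounded queues block when full, a slow stage naturally throttles the stages before it, and batching lets the aggregator hand the device an array of polygon pairs at a time.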

16

CPU-GPU Load Balancing

• Both CPUs and GPUs have to be fully utilized in order to maximize throughput
– Polygon pairs are produced on CPUs and consumed on GPUs
– CPUs idle if production speed > consumption speed
– GPUs idle if production speed < consumption speed

Tasks have to be dynamically migrated among CPUs and GPUs to achieve load balancing (please read the paper for details)

[Figure: the pipeline stages (load data from disk, find intersecting polygon pairs from the data, compute Jaccard similarity), with polygon pairs produced on CPUs and consumed on GPUs]

17

Experiments

• Methodology
– Spatial DBMS: PostgreSQL 9.1.3 + PostGIS 1.5.3
– Disk loading time not considered
  • “First-load-then-query” scheme of database is known to be inefficient to process one-time data
  • Storage devices are improving (e.g., SSD)
– Ignore data format conversion and data partitioning time in SDBMS
– Both simplifications favor the performance of PostGIS
• Platforms
– Dell T1500 workstation
  • One Intel Core i7 860 CPU (4 cores)
  • One NVIDIA GTX 580 GPU
– Amazon EC2 instance
  • Two Intel Xeon X5570 CPUs (8 cores, 16 threads)
• Data sets
– 18 pairs of polygon sets extracted from 18 real-world brain tumor pathology images
– 12GiB in raw text format

18

Effectiveness of PixelBox on GPU

• Algorithm performance
– On Dell T1500, compute the Jaccard similarity of 619609 pairs of polygons in a representative data set
– GEOS on a single core: over 430 s
– CPU-version PixelBox on a single core: over 290 s (1.5x speedup over GEOS)
– PixelBox on GPU: only 3.6 s (120x speedup over GEOS)

[Figure: two log-scale bar charts (1 to 1000): execution time (s) for GEOS, PixelBox-CPU, and PixelBox; speedup over GEOS for PixelBox-CPU and PixelBox]

This experiment shows the effectiveness of PixelBox and its effective use of the SIMD data parallelism of GPUs
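As a quick consistency check, the quoted speedups follow from the quoted running times. The times for GEOS and the CPU version are given only as lower bounds ("over 430 s", "over 290 s"), so the ratios below are approximate.

```latex
\[
\frac{430\ \mathrm{s}}{290\ \mathrm{s}} \approx 1.5,
\qquad
\frac{430\ \mathrm{s}}{3.6\ \mathrm{s}} \approx 119 \approx 120
\]
```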

19

Overall Performance

• Cross-comparing performance
– PostGIS-M: parallelized PostGIS on EC2
– SCCG: our solution on Dell T1500
– 18x speedup on average
• Hardware cost
– Two Intel Xeon X5570: $2000
– Core i7 860 + GTX 580: $800
• Power
– Two Intel Xeon X5570: 190 W
– Core i7 860 + GTX 580: 339 W

[Figure: two charts over data set index 1 to 18: execution time (s, 0 to 300) for PostGIS-M and SCCG; speedup over PostGIS-M (0 to 50), including the geometric mean (GMean)]

Our solution is 2.4x lower in hardware cost, and 10x higher in performance per watt
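The two summary ratios can be reproduced from the figures listed on the slide, assuming that performance per watt is taken as the average speedup divided by the ratio of the two power figures.

```latex
\[
\text{cost ratio} = \frac{\$2000}{\$800} = 2.5,
\qquad
\text{performance per watt ratio} = 18 \times \frac{190\ \mathrm{W}}{339\ \mathrm{W}} \approx 10.1
\]
```

The cost ratio computed from the listed prices is 2.5 rather than the quoted 2.4, presumably because the listed prices are rounded.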

20

Conclusions

• Spatial cross-comparison is a data- and compute-intensive operation
• The existing approach with SDBMS is neither high-performance nor low-cost
• We provide a software solution based on GPUs and CPUs to significantly accelerate the work at low cost

21

Thank You

• Q & A
