numagic: data-intensive applications need large machines ......numagic: a garbage collector for...

10
NumaGiC: A garbage collector for big-data on big NUMA machines Lokesh Gidra , Gaël Thomas , Julien Sopena , Marc Shapiro , Nhan Nguyen LIP6/UPMC-INRIA Telecom SudParis Chalmers University Motivation Data-intensive applications need large machines with plenty of cores and memory Lokesh Gidra 2 Motivation Data-intensive applications need large machines with plenty of cores and memory But, for large heaps, GC is inefficient on such machines Lokesh Gidra 3 GC Throughput (GB collected per second) Baseline PS # of cores Ideal Scalability Page rank computation of 100million edge Friendster dataset with Spark on Hotspot/Parallel Scavenge with 40GB on a 48-core machine Motivation Data-intensive applications need large machines with plenty of cores and memory But, for large heaps, GC is inefficient on such machines Lokesh Gidra 4 GC Throughput (GB collected per second) Baseline PS # of cores Ideal Scalability GC takes roughly 60% of the total time Page rank computation of 100million edge Friendster dataset with Spark on Hotspot/Parallel Scavenge with 40GB on a 48-core machine

Upload: others

Post on 22-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

NumaGiC: A garbage collector for big-data

on big NUMA machines

Lokesh Gidra‡, Gaël Thomas�, Julien Sopena‡,Marc Shapiro‡, Nhan Nguyen♀

‡ LIP6/UPMC-INRIA �Telecom SudParis ♀Chalmers University

Motivation

◼Data-intensive applications need large machines with plenty of cores and memory

Lokesh Gidra 2

Motivation

◼Data-intensive applications need large machines with plenty of cores and memory

◼But, for large heaps, GC is inefficient on such machines

Lokesh Gidra 3

GC

Thr

ough

put

(GB

col

lect

ed p

er se

cond

)

Baseline PS

# of cores

Ideal Scalability

Page rank computation of 100million edge Friendster datasetwith Spark on Hotspot/Parallel Scavenge with 40GB on a 48-core machine

Motivation

◼Data-intensive applications need large machines with plenty of cores and memory

◼But, for large heaps, GC is inefficient on such machines

Lokesh Gidra 4

GC

Thr

ough

put

(GB

col

lect

ed p

er se

cond

)Baseline PS

# of cores

Ideal Scalability

GC takes roughly 60% of the total time

Page rank computation of 100million edge Friendster datasetwith Spark on Hotspot/Parallel Scavenge with 40GB on a 48-core machine

Page 2: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Outline

◼Why GC doesn’t scale?

◼Our Solution: NumaGiC

◼Evaluation

Lokesh Gidra 5

GCs don’t scale because machines are NUMA

Hardware hides the distributed memory � application silently creates inter-node references

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Lokesh Gidra 6

GCs don’t scale because machines are NUMA

But memory distribution is also hidden to the GC threadswhen they traverse the object graph

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

GC thread

Lokesh Gidra 7

GCs don’t scale because machines are NUMA

But memory distribution is also hidden to the GC threadswhen they traverse the object graph

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

GC thread

Lokesh Gidra 8

Page 3: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

GCs don’t scale because machines are NUMA

A GC thread thus silently traverses remote references

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

GC thread

Lokesh Gidra 9

GCs don’t scale because machines are NUMA

A GC thread thus silently traverses remote referencesand continues its graph traversal on any node

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

GC thread

Lokesh Gidra 10

GCs don’t scale because machines are NUMA

A GC thread thus silently traverses remote referencesand continues its graph traversal on any node

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

GC thread

Lokesh Gidra 11

GCs don’t scale because machines are NUMA

When all GC threads access any memory nodes, the inter-connect potentially saturates

=> high memory access latency

Node 0 Node 1

Node 2 Node 3

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Lokesh Gidra 12

Page 4: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Outline

◼Why GC doesn’t scale?

◼Our Solution: NumaGiC

◼Evaluation

Lokesh Gidra 13

How can we fix the memory locality issue?

Simply by preventing any remote memory access

Lokesh Gidra 14

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

Mem

ory

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Lokesh Gidra 15

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

Mem

ory

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Lokesh Gidra 16

Page 5: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

Mem

ory

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Remote reference � sends it to its home-nodeLokesh Gidra 17

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

Remote reference � sends it to its home-node

Mem

ory

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Lokesh Gidra 18

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

And continue the graph traversal locally

Mem

ory

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Lokesh Gidra 19

Prevent remote access using messages

Enforces memory access localityby trading remote memory accesses by messages

And continue the graph traversal locallyM

emor

y

Mem

ory

Node 0 Node 1

Thread 0 Thread 1

Lokesh Gidra 20

Page 6: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Using messages enforces local access…

…but opens up other performance challenges

Lokesh Gidra 21

Problem1: a msg is costlier than a remote access

Node 0 Node 1

Too many messages

Lokesh Gidra 22

Inter-node messages must be minimized

Problem1: a msg is costlier than a remote access

Node 0 Node 1

Too many messages

Lokesh Gidra 23

Inter-node messages must be minimized

• Observation: app threads naturally create clusters of new allocated objs• 99% of recently allocated objects are clustered

Node 0 Node 1

Problem1: a msg is costlier than a remote access

Node 0 Node 1

Too many messages

Lokesh Gidra 24

Inter-node messages must be minimized

• Observation: app threads naturally create clusters of new allocated objs• 99% of recently allocated objects are clustered

Approach: let objects allocated by a thread stay on its node

Page 7: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Problem2: Limited parallelism

◼Due to serialized traversal of object clusters across nodesNode 0 Node 1

Node 1 idles while node 0 collects its memory

Lokesh Gidra 25

Problem2: Limited parallelism

◼Due to serialized traversal of object clusters across nodes

◼ Solution: adaptive algorithmTrade-off between locality and parallelism1. Prevent remote access by using messages when not idling2. Steal and access remote objects otherwise

Node 0 Node 1

Node 1 idles while node 0 collects its memory

Lokesh Gidra 26

Outline

◼Why GC doesn’t scale?

◼Our Solution: NumaGiC

◼Evaluation

Lokesh Gidra 27

Evaluation

◼Comparison of NumaGiC with –1. ParallelScavenge (PS): baseline stop-the-world GC of Hotspot2. Improved PS: PS with lock-free data structures and interleaved

heap space3. NAPS: Improved PS + slightly better locality, but no messages

◼Metrics• GC throughput –

– amount of live data collected per second (GB/s)– Higher is better

• Application performance –– Relative to improved PS– Higher is better

Lokesh Gidra 28

Page 8: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Name Description Heap SizeAmd48 Intel80

Spark In-memory dataanalytics(pagerankcomputation)

110to160GB

250to350GB

Neo4j Objectgraphdatabase(SingleSourceShortestPath)

110to160GB

250to350GB

SPECjbb2013 Business-logic server 24to40GB 24to40GB

SPECjbb2005 Business-logicserver 4 to8GB 8to 12GB

Experiments

1 billion edgeFriendster dataset

The 1.8 billion edgeFriendster dataset

Lokesh Gidra 29

Hardwaresettings–1. AMDMagny Cours with8nodes,48threads,256GBofRAM2. XeonE7-2860with4nodes,80threads,512GBofRAM

GC Throughput (GB collected per second)

GC

Thr

ough

put

Heap Sizes

Spark Neo4j SpecJBB13 SpecJBB05

onAmd48

Lokesh Gidra 30

Improved PS NAPS NumaGiC

GC Throughput (GB collected per second)

GC

Thr

ough

put

Heap Sizes

Spark Neo4j SpecJBB13 SpecJBB05

onAmd48

NumaGiC multiplies GC performance up to 5.4X

Lokesh Gidra 31

Improved PS NAPS NumaGiC

5.4X

2.9X

GC Throughput (GB collected per second)

Heap Sizes

Spark Neo4j SpecJBB13 SpecJBB05

onAmd48

onIntel80

Lokesh Gidra 32

GC

Thr

ough

put

3.6X

Page 9: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

GC Throughput Scalability

Spark on Amd48 with asmaller dataset of 40GB

GC

Thr

ough

put

# of nodes

Improved PS

Baseline PS

NumaGiC

Ideal Scalability

Lokesh Gidra 33

NAPS

Application speedup

Spee

dup

rela

tive

to Im

prov

ed P

S

Lokesh Gidra 34

Spark Neo4j SpecJBB13 SpecJBB05

NAPS NumaGiC

94%82%

36%

64%55%

61%

27%42%

Application speedup

Spee

dup

rela

tive

to Im

prov

ed P

S

Lokesh Gidra 35

Spark Neo4j SpecJBB13 SpecJBB05

NAPS NumaGiC

12%

21%37%

35%

26%

37%

33%37%

Conclusion

◼ Performance of data-intensive apps relies on GC performance

◼Memory access locality has huge effect on GC performance

◼Enforcing locality can be detrimental for parallelism in GCs

◼ Future work: NUMA-aware concurrent GCs

Lokesh Gidra 36

Page 10: NumaGiC: Data-intensive applications need large machines ......NumaGiC: A garbage collector for big-data on big NUMA machines LokeshGidra‡, GaëlThomas , JulienSopena‡, Marc Shapiro‡,

Conclusion

◼ Performance of data-intensive apps relies on GC performance

◼Memory access locality has huge effect on GC performance

◼Enforcing locality can be detrimental for parallelism in GCs

◼ Future work: NUMA-aware concurrent GCs

Lokesh Gidra 37

Thank You J

Large multicores provide this power

But scalability is hard to achieve because software stack was not designed for

Data analytic

Cores Memory Banks

I/O controllers

Operating system

Application

Language runtime

Middleware

Hypervisor

Hadoop, Spark, Neo4j, Cassandra…

JVM, CLI, Python, R…

Linux, Windows…

Xen, VMWare…

Lokesh Gidra 38

Large multicores provide this power

But scalability is hard to achieve because software stack was not designed for

Data analytic

Cores Memory Banks

I/O controllers

Operating system

Application

Language runtime

Middleware

Hypervisor

Hadoop, Spark, Neo4j, Cassandra…

JVM, CLI, Python, R…

Linux, Windows…Do not consider hypervisors in this talk:

Software stack is already complexand hard to analyze!

Lokesh Gidra 39