NumaGiC: A garbage collector for big data on big NUMA machines
Lokesh Gidra‡, Gaël Thomas†, Julien Sopena‡, Marc Shapiro‡, Nhan Nguyen♀
‡LIP6/UPMC-INRIA  †Telecom SudParis  ♀Chalmers University
Motivation

◼ Data-intensive applications need large machines with plenty of cores and memory
◼ But, for large heaps, GC is inefficient on such machines

[Figure: GC throughput (GB collected per second) vs. # of cores for the baseline PS, against ideal scalability. GC takes roughly 60% of the total time.]

Page-rank computation of the 100-million-edge Friendster dataset with Spark on Hotspot/Parallel Scavenge, with a 40 GB heap on a 48-core machine
Outline

◼ Why doesn't GC scale?
◼ Our solution: NumaGiC
◼ Evaluation
GCs don't scale because machines are NUMA

Hardware hides the distributed memory: the application silently creates inter-node references

[Figure: four NUMA nodes (Node 0 to Node 3), each with its own memory, connected by an interconnect; objects on different nodes reference each other.]
But the memory distribution is also hidden from the GC threads when they traverse the object graph

[Figure: a GC thread following references across the four nodes' memories.]
A GC thread thus silently traverses remote references and continues its graph traversal on any node
When all GC threads access all the memory nodes, the interconnect potentially saturates, resulting in high memory access latency
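To make the problem concrete, here is a minimal sketch of a NUMA-oblivious traversal in the style of a standard stop-the-world parallel collector. All names (Obj, ObliviousTracer) are hypothetical illustrations, not Hotspot code; the point is that nothing in the loop knows or cares which node an object lives on.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a NUMA-oblivious GC traversal: the thread follows
 *  every reference it pops, regardless of which node's memory
 *  the object lives in, so remote accesses happen silently. */
class ObliviousTracer {
    private final Deque<Obj> stack = new ArrayDeque<>();

    void trace(Iterable<Obj> roots) {
        for (Obj root : roots) stack.push(root);
        while (!stack.isEmpty()) {
            Obj obj = stack.pop();       // may live on any NUMA node
            if (obj.isMarked()) continue;
            obj.mark();                  // possibly a remote write
            for (Obj child : obj.children())
                stack.push(child);       // possibly a remote read later
        }
    }
}

interface Obj { Iterable<Obj> children(); boolean isMarked(); void mark(); }
```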
Outline

◼ Why doesn't GC scale?
◼ Our solution: NumaGiC
◼ Evaluation
How can we fix the memory locality issue?
Simply by preventing any remote memory access
Prevent remote access using messages

Enforces memory access locality by trading remote memory accesses for messages:
• a GC thread that finds a remote reference sends it to the reference's home node
• the receiving node's thread continues the graph traversal locally

[Figure: Thread 0 on Node 0 and Thread 1 on Node 1 each trace their own node's memory, exchanging references instead of issuing remote accesses.]
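A minimal sketch of this message-passing traversal, reusing the hypothetical Obj interface from the earlier sketch. The per-node inbox queues, the homeNode helper, and the omitted termination detection are all assumptions for illustration, not the actual NumaGiC implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Sketch of a per-node GC thread that traces its own node's
 *  memory and forwards remote references as messages. */
class NodeLocalTracer {
    final int myNode;                             // node this GC thread runs on
    final List<ConcurrentLinkedQueue<Obj>> inbox; // one message queue per node
    final Deque<Obj> localStack = new ArrayDeque<>();

    NodeLocalTracer(int myNode, List<ConcurrentLinkedQueue<Obj>> inbox) {
        this.myNode = myNode;
        this.inbox = inbox;
    }

    void trace() {
        Obj obj;
        while ((obj = nextWork()) != null) {
            if (obj.isMarked()) continue;     // duplicates may arrive by message
            obj.mark();                       // always a local write here
            for (Obj child : obj.children()) {
                if (homeNode(child) == myNode) {
                    localStack.push(child);   // local: trace it ourselves
                } else {
                    // remote: send the reference to its home node instead
                    inbox.get(homeNode(child)).add(child);
                }
            }
        }
        // Termination detection across nodes is omitted from this sketch.
    }

    Obj nextWork() {
        if (!localStack.isEmpty()) return localStack.pop();
        return inbox.get(myNode).poll();      // then drain our own inbox
    }

    // Hypothetical stub: the home node would be derived from the
    // object's address range (see the allocation sketch below).
    int homeNode(Obj o) { return 0; }
}
```

Sending a reference is a single queue push and every dereference stays local, but, as the next slides show, the messages themselves become the new cost.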
Using messages enforces local access…
…but opens up other performance challenges
Problem 1: a message is costlier than a remote access

[Figure: too many messages flowing between Node 0 and Node 1]

Inter-node messages must be minimized
• Observation: application threads naturally create clusters of newly allocated objects; 99% of recently allocated objects are clustered
• Approach: let objects allocated by a thread stay on its node
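One way to realize this placement policy is to split the young space into per-node segments, as sketched below. The design (per-node bump pointers, a home node recoverable from the address alone) is a plausible reading of the slide, not the actual Hotspot/NumaGiC code, and the sketch ignores synchronization.

```java
/** Sketch: a young generation split into per-node segments so that
 *  objects allocated by a thread stay on that thread's node. */
class NumaYoungSpace {
    private final long base, segmentBytes;
    private final long[] top;   // bump pointer, per node segment
    private final long[] end;   // segment limit, per node segment

    NumaYoungSpace(int nodes, long segmentBytes, long base) {
        this.base = base;
        this.segmentBytes = segmentBytes;
        top = new long[nodes];
        end = new long[nodes];
        for (int n = 0; n < nodes; n++) {
            // Segment n is assumed bound to node n's physical memory
            // (e.g., with mbind/numa_tonode_memory on Linux).
            top[n] = base + n * segmentBytes;
            end[n] = top[n] + segmentBytes;
        }
    }

    /** Bump-pointer allocation in the allocating thread's node segment. */
    long allocate(int node, long size) {
        if (top[node] + size > end[node]) return 0; // segment full: GC needed
        long addr = top[node];
        top[node] += size;
        return addr;
    }

    /** Home node of an object, derived from its address alone. */
    int homeNode(long addr) {
        return (int) ((addr - base) / segmentBytes);
    }
}
```

With this layout, homeNode needs no per-object metadata: the address by itself tells a tracer whether an object is local or must be forwarded.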
Problem 2: limited parallelism

◼ Due to serialized traversal of object clusters across nodes

[Figure: Node 1 idles while Node 0 collects its memory]

◼ Solution: an adaptive algorithm that trades off locality against parallelism (see the sketch below)
1. Prevent remote accesses by using messages when not idling
2. Steal and access remote objects otherwise
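A sketch of this adaptive loop, building on the NodeLocalTracer sketch above. The helpers are hypothetical stand-ins; the real NumaGiC switching heuristics are more involved.

```java
/** Sketch: each GC thread alternates between a local mode
 *  (messages only, good locality) and a work-stealing mode
 *  (remote accesses allowed, good parallelism). */
abstract class AdaptiveTracer {
    abstract Obj nextLocalWork();       // local stack, then own inbox
    abstract void traceLocally(Obj o);  // local mode: remote refs become messages
    abstract void traceAnywhere(Obj o); // stealing mode: remote accesses allowed
    abstract Obj stealFromAnyNode();    // grab work from another node's queues
    abstract boolean collectionDone();  // global termination detection (omitted)

    void run() {
        while (!collectionDone()) {
            Obj obj = nextLocalWork();
            if (obj != null) {
                traceLocally(obj);          // 1. not idling: stay local
            } else {
                Obj stolen = stealFromAnyNode();
                if (stolen != null) {
                    traceAnywhere(stolen);  // 2. idling: steal, access remotely
                }
            }
        }
    }
}
```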
Outline

◼ Why doesn't GC scale?
◼ Our solution: NumaGiC
◼ Evaluation
Evaluation

◼ Comparison of NumaGiC with:
  1. Parallel Scavenge (PS): the baseline stop-the-world GC of Hotspot
  2. Improved PS: PS with lock-free data structures and an interleaved heap space
  3. NAPS: Improved PS plus slightly better locality, but no messages

◼ Metrics:
  • GC throughput: amount of live data collected per second (GB/s); higher is better
  • Application performance: relative to Improved PS; higher is better
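Concretely, the GC throughput metric can be computed from per-collection logs roughly as follows (a hypothetical helper for clarity, not part of the benchmark harness):

```java
/** GC throughput = live data collected (GB) / total GC time (s). */
class GcThroughput {
    static double gbPerSecond(long[] liveBytesPerGC, double[] gcSeconds) {
        long bytes = 0;
        double seconds = 0;
        for (long b : liveBytesPerGC) bytes += b;   // live data, per collection
        for (double s : gcSeconds) seconds += s;    // pause time, per collection
        return (bytes / 1e9) / seconds;
    }
}
```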
Experiments

Hardware settings:
1. Amd48: AMD Magny-Cours with 8 nodes, 48 threads, 256 GB of RAM
2. Intel80: Intel Xeon E7-2860 with 4 nodes, 80 threads, 512 GB of RAM

Name          Description                                        Heap (Amd48)    Heap (Intel80)
Spark         In-memory data analytics: page rank on the         110 to 160 GB   250 to 350 GB
              1-billion-edge Friendster dataset
Neo4j         Object graph database: single-source shortest      110 to 160 GB   250 to 350 GB
              path on the 1.8-billion-edge Friendster dataset
SPECjbb2013   Business-logic server                              24 to 40 GB     24 to 40 GB
SPECjbb2005   Business-logic server                              4 to 8 GB       8 to 12 GB
GC Throughput (GB collected per second)

[Figure: GC throughput vs. heap size for Improved PS, NAPS, and NumaGiC, on Spark, Neo4j, SPECjbb2013, and SPECjbb2005. Callouts: 5.4X and 2.9X on Amd48, 3.6X on Intel80.]

NumaGiC multiplies GC performance by up to 5.4X
GC Throughput Scalability

Spark on Amd48 with a smaller dataset of 40 GB

[Figure: GC throughput vs. # of cores.]