NUMA and Java Databases

Post on 15-Apr-2017

Page 1: NUMA and Java Databases

NUMA & Java Databases: Should we worry?

Raghavendra Prabhu, @randomsurfer

Page 2: NUMA and Java Databases

NUMA Reference architecture

Page 3: NUMA and Java Databases
Page 4: NUMA and Java Databases

What is NUMA

● Stands for Non-Uniform Memory Access

○ Non-uniform to whom?

○ Von Neumann bottleneck.

○ Cache-coherent NUMA (ccNUMA)

● How does it work?

○ Memory is placed local to the processes that use it.

○ Access to data is balanced over the available processors on multiple nodes.

● Large-memory installations are becoming the norm

○ The i2 series on AWS.

○ Databases are the main consumers.

● Constraints

○ Speed of light

○ Interconnect saturation
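To make "memory is placed local to the processes" concrete, it helps to see which CPUs belong to which node. The following is a minimal sketch, assuming a Linux host that exposes `/sys/devices/system/node/node<N>/cpulist`; the function names `parse_cpulist` and `node_topology` are hypothetical helpers, not part of any library.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # a node without processors has an empty cpulist
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def node_topology(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU ids; returns {} when the sysfs directory is absent."""
    topology = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return topology
    for node_dir in root.glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return topology
```

Calling `node_topology()` on a two-socket machine would typically return one entry per socket; on a non-NUMA or non-Linux host it returns a single node or an empty mapping.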

Page 5: NUMA and Java Databases

What is NUMA

● Constraints

○ Speed of light

■ Higher latency when accessing remote memory.

○ Interconnect saturation

■ Performance counters.

● Slow abundant memory

○ Fast limited memory

● Cache coherence

○ Processor threads and cores share resources

■ Execution units (between HT threads)

■ Cache (between threads and cores)
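The remote-memory latency constraint can be estimated from the node distance table the kernel exports. This is a rough sketch, assuming the per-node `distance` files under `/sys/devices/system/node/`, whose values are relative (10 conventionally means local). The sample rows are made up for illustration, and `remote_penalty` is a hypothetical helper.

```python
def parse_distance_row(text):
    """Parse one sysfs distance row such as '10 21' into a list of ints."""
    return [int(value) for value in text.split()]


def remote_penalty(distances, node):
    """Ratio of the farthest node's distance to the local distance, seen from `node`."""
    row = distances[node]
    remote = [d for other, d in enumerate(row) if other != node]
    return max(remote) / row[node] if remote else 1.0


# Hypothetical two-node machine; real rows would be read from
# /sys/devices/system/node/node<N>/distance.
sample_rows = ["10 21", "21 10"]
distances = [parse_distance_row(row) for row in sample_rows]
print(remote_penalty(distances, 0))
```

The ratio is only a hint about relative cost; actual latency under interconnect saturation has to be measured.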

Page 6: NUMA and Java Databases

Exotic cases

● Network cards

● PCIe storage

● NVRAM

● Nodes without memory

● Nodes without processors

● Unbalanced

● Central/Large memory

● big.LITTLE architecture

● GPU

Page 7: NUMA and Java Databases

NUMA statistics

Page 8: NUMA and Java Databases

Tools/libraries for NUMA

● Supported by Linux since 2.5

○ Symmetric and CPU/Memory

● numactl

● hwloc / lstopo

● numad

● numatop

● libnuma

● numastat

● taskset

● KVM for simulation and testing

● perf

Page 9: NUMA and Java Databases

Tools/libraries for NUMA

● KVM for simulation and testing

● Useful for testing databases.

qemu-system-x86_64 -enable-kvm \
    -drive file=./debian-8.1-lxc-puppet.qcow2 \
    -net nic,macaddr=52:54:00:00:EE:03 -net vde \
    -smp sockets=2,cores=2,threads=2,maxcpus=16 \
    -numa node,nodeid=0,cpus=0-3 \
    -numa node,nodeid=1,cpus=4-7 \
    -numa node,nodeid=2,cpus=8-15 \
    -m 2G
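When experimenting with different simulated topologies, the `-numa` arguments are easy to get wrong by hand. The sketch below generates them from a list of CPU ranges and checks that every CPU id up to `maxcpus` is assigned to exactly one node. The helper names are hypothetical, and the ranges mirror the command above.

```python
def numa_args(node_cpu_ranges):
    """Build QEMU -numa arguments from inclusive (first_cpu, last_cpu) ranges."""
    args = []
    for node_id, (first, last) in enumerate(node_cpu_ranges):
        args += ["-numa", f"node,nodeid={node_id},cpus={first}-{last}"]
    return args


def covers_all_cpus(node_cpu_ranges, maxcpus):
    """True when every CPU id in [0, maxcpus) belongs to exactly one node."""
    assigned = [cpu for first, last in node_cpu_ranges
                for cpu in range(first, last + 1)]
    return sorted(assigned) == list(range(maxcpus))


ranges = [(0, 3), (4, 7), (8, 15)]  # same three-node layout as the command above
if covers_all_cpus(ranges, maxcpus=16):
    print(" ".join(numa_args(ranges)))
```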

Page 10: NUMA and Java Databases

NUMA Policies

● MPOL_DEFAULT

● MPOL_BIND

● MPOL_INTERLEAVE

○ Memory striping in hardware

● MPOL_PREFERRED

● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL
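The difference between the policies is easiest to see in a toy model of page placement. The sketch below is a simplification written for illustration, not the kernel's algorithm: it ignores reclaim, zones and migration, and the function `place_pages` and its parameters are hypothetical.

```python
def place_pages(policy, n_pages, free, local_node=0, nodes=None):
    """Return the node chosen for each page under a simplified policy model.

    free: dict mapping node id -> free pages (not modified).
    nodes: node list the policy applies to (defaults to all nodes).
    """
    free = dict(free)
    all_nodes = sorted(free)
    nodes = list(nodes) if nodes is not None else all_nodes
    placement = []
    for i in range(n_pages):
        if policy == "default":      # local first, then spill to other nodes
            order = [local_node] + [n for n in all_nodes if n != local_node]
        elif policy == "bind":       # only the listed nodes, no fallback
            order = nodes
        elif policy == "interleave": # round-robin over the listed nodes
            start = i % len(nodes)
            order = nodes[start:] + nodes[:start]
        elif policy == "preferred":  # one preferred node, then any other
            order = [nodes[0]] + [n for n in all_nodes if n != nodes[0]]
        else:
            raise ValueError(f"unknown policy: {policy}")
        chosen = next((n for n in order if free[n] > 0), None)
        if chosen is None:
            raise MemoryError("no free page on any allowed node")
        free[chosen] -= 1
        placement.append(chosen)
    return placement


print(place_pages("interleave", 4, {0: 10, 1: 10}))
print(place_pages("default", 3, {0: 2, 1: 10}, local_node=0))
```

In this model the key contrast is that "bind" fails when its nodes are full, whereas "preferred" and "default" quietly spill to remote nodes.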

Page 11: NUMA and Java Databases

JVM GC spaces

● Concepts

○ Weak Generational Hypothesis:

■ Most objects soon become unreachable.

■ Those that do not usually survive for a (very) long time.

■ References from old objects to young objects exist only in small numbers.

○ Garbage Collection Roots

○ Mark &

■ Copy

■ Compact

■ Sweep

○ Minor and Major GC

○ Stop-the-World
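The roots and mark-and-sweep ideas above can be sketched on a tiny object graph. This is a toy illustration only: objects are plain strings, the `references` dictionary is a hypothetical stand-in for heap pointers, and nothing here reflects how HotSpot lays out or scans memory.

```python
def mark(roots, references):
    """Return the set of objects reachable from the GC roots."""
    reachable = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in reachable:
            continue
        reachable.add(obj)
        stack.extend(references.get(obj, ()))
    return reachable


def sweep(heap, reachable):
    """Keep only marked objects; everything else is garbage."""
    return {obj for obj in heap if obj in reachable}


heap = {"a", "b", "c", "d"}
references = {"a": ["b"], "c": ["d"], "d": ["c"]}  # c and d form an unreachable cycle
survivors = sweep(heap, mark(["a"], references))
print(sorted(survivors))
```

Note that the cycle between `c` and `d` is collected, because reachability from roots, not reference counts, decides liveness.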

Page 12: NUMA and Java Databases

GC graphs

Page 13: NUMA and Java Databases

JVM GC spaces

● Generations:

○ Young Generation

■ Eden space

● Mutable Space.

● Thread Local Allocation Buffer.

● Mark and Copy.

■ Survivor spaces (S0 and S1).

○ Old/Tenured Generation

○ Permanent Generation

■ Replaced by native Metaspace in Java 8

● Cross-generation links.

● Card-marking
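Card-marking lets a minor GC find old-to-young references without scanning the whole old generation. The following is a toy sketch: the heap is divided into fixed-size cards, and a write barrier marks the card containing any updated reference field. The 512-byte card size and the class `CardTable` are assumptions made for illustration.

```python
CARD_SIZE = 512  # bytes per card (assumed here for illustration)


class CardTable:
    def __init__(self, heap_bytes):
        n_cards = (heap_bytes + CARD_SIZE - 1) // CARD_SIZE
        self.dirty = bytearray(n_cards)  # one byte per card, 0 = clean

    def write_barrier(self, field_address):
        """Called on every reference store: mark the card holding the field."""
        self.dirty[field_address // CARD_SIZE] = 1

    def dirty_cards(self):
        """Cards a minor GC must scan for cross-generation references."""
        return [index for index, flag in enumerate(self.dirty) if flag]

    def clear(self):
        """Reset all cards after the collection has processed them."""
        self.dirty = bytearray(len(self.dirty))


table = CardTable(heap_bytes=4096)
for address in (1030, 1500, 4000):  # hypothetical field addresses
    table.write_barrier(address)
print(table.dirty_cards())
```

Two of the three stores land in the same card, so only two cards need scanning; that coarse granularity is what keeps the barrier cheap.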

Page 14: NUMA and Java Databases

Garbage collectors

Located in hotspot/src/share/vm/gc_implementation

● Serial

● Parallel

○ The only GC that is fully NUMA-aware.

● ParNew

● Concurrent Mark Sweep (CMS)

● Garbage First (G1)

● Official Oracle documentation is notoriously bad!

○ Code and comments are the (only) documentation (sadly).

■ Try searching for ‘NUMAPageScanRate’ - you find a page from 2008 with links to sun.com and Solaris examples.

Page 15: NUMA and Java Databases

GC Options

Page 16: NUMA and Java Databases

NUMA options

● UseNUMA

● UseNUMAInterleaving

● ForceNUMA

● NUMAStats

● ParallelGC only

○ NUMAChunkResizeWeight

○ NUMASpaceResizeRate

○ UseAdaptiveNUMAChunkSizing

○ NUMAPageScanRate

Defined in hotspot/src/share/vm/runtime/globals.hpp and used in hotspot/src/os/linux/vm

Page 17: NUMA and Java Databases

NUMA and Collectors

● -XX:+UseNUMA -XX:+UseNUMAInterleaving: All GC spaces.

○ Independent of GC choices.

○ NUMA interleaved allocation. (numactl --interleave)

● ParallelGC (in addition to above)

○ Supports all exotic NUMA options.

○ Eden MutableSpace (even without NUMA)

■ Pretouching the pages.

○ Eden MutableNUMASpace (with above NUMA options)

■ Space split into per-locality-group (lgroup) chunks.

● Adaptive resizing.

■ Does thread-local NUMA allocation.

● Allocations are performed in the chunk corresponding to the thread's home locality.
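The chunked Eden idea can be modelled in a few lines. This is a deliberately simplified sketch, not HotSpot's implementation: the class `NumaEden` is hypothetical, and the resizing step simply divides Eden in proportion to recent allocation volume, whereas the real collector applies weighting and rate limits (see the NUMAChunkResizeWeight and NUMASpaceResizeRate options).

```python
class NumaEden:
    """Toy Eden split into one bump-allocated chunk per NUMA node."""

    def __init__(self, eden_bytes, nodes):
        chunk = eden_bytes // len(nodes)
        self.capacity = {node: chunk for node in nodes}
        self.used = {node: 0 for node in nodes}

    def allocate(self, home_node, size):
        """Allocate in the chunk of the thread's home node.

        Returns (node, offset), or None when the chunk is full, which in a
        real collector would trigger a young collection.
        """
        if self.used[home_node] + size > self.capacity[home_node]:
            return None
        offset = self.used[home_node]
        self.used[home_node] += size
        return (home_node, offset)

    def resize(self, allocated_bytes):
        """After a collection, size chunks in proportion to allocation volume."""
        total = sum(allocated_bytes.values())
        if total == 0:
            return
        eden = sum(self.capacity.values())
        for node in self.capacity:
            self.capacity[node] = eden * allocated_bytes[node] // total
            self.used[node] = 0


eden = NumaEden(eden_bytes=1024, nodes=[0, 1])
print(eden.allocate(0, 400), eden.allocate(0, 100), eden.allocate(0, 100))
eden.resize({0: 300, 1: 100})
print(eden.capacity)
```

The point of the model is that a busy node can exhaust its chunk while another node's chunk sits idle, which is why adaptive resizing matters.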

Page 18: NUMA and Java Databases

Cassandra

● JVM options are supported through an environment variable.

● Cassandra’s ‘supported’ NUMA is through numactl in the shell wrapper.

○ This interleaves ‘everything’.

○ When you have numactl (hammer), everything looks like a (binary?) nail.

● Cassandra memory model

○ JVM GC spaces.

○ OHC - off heap cache: https://github.com/snazy/ohc

■ Written specifically for Cassandra 2.x

○ MemoryUtil.java

■ com.sun.jna.Native - Native.malloc

■ sun.nio.ch.DirectBuffer

■ sun.misc.Unsafe - unsafe.allocateMemory

■ java.nio.ByteBuffer - ByteBuffer.allocateDirect

Page 19: NUMA and Java Databases

Cassandra off-heap

● Why off-heap

○ Reduce GC pressure

○ Access patterns

○ Lack of support for primitives such as O_DIRECT. (https://bugs.openjdk.java.net/browse/JDK-8164900)

○ Lack of NUMA support in newer GCs.

■ (JEP 157: G1 GC: NUMA-Aware Allocation, http://openjdk.java.net/jeps/157)

● Off-heap caches are used for:

○ Row cache

○ Key cache

○ Counter cache

● Available from 2.x onwards; works better from 2.2.

Page 20: NUMA and Java Databases

Cassandra off-heap

● Cache Providers:

○ SerializingCache

■ Issues with serialization and CPU usage.

○ OHCP - org.caffinitas.ohc.OHCacheBuilder - 2.2 onwards

■ “OHC shall provide a good performance on both commodity hardware and big systems using non-uniform-memory-architectures.”

■ sun.misc.Unsafe: unsafe.allocateMemory

■ Linked: for larger entries

● Malloc and fragmentation

■ Chunked: for smaller entries

Page 21: NUMA and Java Databases

NUMA issues

● numactl --interleave:

○ Thread-local native allocations - Bad [X]

■ Tons of them throughout the code bypass the JVM.

○ JVM’s Eden space will also be interleaved - Bad [X]

● JVM’s options only:

○ Native allocations will be local.

○ Large off-heap allocations can suffer.

● numactl + JVM

■ NUMA-aware GC (Parallel)

● Best possible combination (without invasive code changes in Cassandra).

● JVM’s memory options will override numactl.

● But ParallelGC is not comparable to newer collectors (G1).

Page 22: NUMA and Java Databases

Interpretation

● Low off-heap usage

○ Use the JVM NUMA options. Don’t interleave with numactl; it is a hammer.

● High off-heap usage (like Cassandra)

○ Just go with the flow, and use numactl.

■ -XX:+AlwaysPreTouch? (MAP_POPULATE)

○ Cost-benefit analysis.

● ParallelGC is too old (and bad for latency) - don’t use it just for NUMA.

○ Well-implemented NUMA can easily pique anyone’s geeky senses. :)

○ Ask Cassandra or Oracle to add NUMA support to G1 ;)

● In newer kernels (e.g. Ubuntu Xenial), one can try AutoNUMA.

○ Completely managed by the kernel, based on access patterns.

○ Has caveats, but one can always benchmark and see. :)
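The low versus high off-heap guidance can be written down as a small launcher helper. This is only a sketch of the rule of thumb on this slide: the function `launch_command`, its parameters, and the jar name are hypothetical, and whether a workload counts as off-heap heavy is something to decide by benchmarking.

```python
def launch_command(heavy_off_heap, jar_path, pretouch=False):
    """Build an argv list following the slide's rule of thumb.

    heavy_off_heap=True  -> interleave everything with numactl.
    heavy_off_heap=False -> rely on the JVM's own NUMA option instead.
    """
    jvm_flags = []
    if pretouch:
        jvm_flags.append("-XX:+AlwaysPreTouch")
    if heavy_off_heap:
        prefix = ["numactl", "--interleave=all"]
    else:
        prefix = []
        jvm_flags.append("-XX:+UseNUMA")
    return prefix + ["java"] + jvm_flags + ["-jar", jar_path]


print(" ".join(launch_command(False, "db.jar")))
print(" ".join(launch_command(True, "db.jar", pretouch=True)))
```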

Page 23: NUMA and Java Databases

Interpretation

● The JVM is (still) not good with native primitives such as O_DIRECT or NUMA (there is jnuma, which is not that well maintained).

○ Many database authors write their own off-JVM implementations for these. (There are so many Java databases these days.)

○ Some also do things like this.

○ MySQL (InnoDB) can (and does) take advantage of these for good performance.

■ InnoDB was in Cassandra’s place about two years ago, till fixes landed.

● How InnoDB does it.

○ Maybe ScyllaDB in the future. ;)

Page 24: NUMA and Java Databases

Wishlist for Cassandra

● Use whatever GC fits best. (G1?)

■ Ask for NUMA support in it.

● Use the JVM NUMA options when supported.

■ Having NUMA support for Eden spaces will help a lot.

● Don’t use numactl.

○ Let all native allocations be local (OS default).

○ Use jnuma (or equivalent, it is just a JNI wrapper) for OHCP and other large non-local caches.

■ Use NUMA interleaving here.

■ This requires Cassandra or OHCP code to be changed.

● Changing OHCP code is easier.

● Benchmark

○ ??

○ Profit!

Page 25: NUMA and Java Databases

AutoNUMA

● Foundation merged in the 3.8 kernel; matured through later 3.x and 4.x releases

● CPU follows memory

○ Reschedule tasks on the same nodes as their memory

● Memory follows CPU

○ Copy memory pages to the same nodes as the tasks/threads

● Heuristics

○ Fault statistics

○ Task grouping

○ Multi-resource optimization - cache, cpu, memory, starvation

■ Avoid thrashing
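One way to see whether automatic balancing is helping is to look at its fault statistics. The sketch below assumes a kernel built with NUMA balancing, which exposes counters such as `numa_hint_faults` and `numa_hint_faults_local` in `/proc/vmstat`. The sample text is made up, and `hint_fault_locality` is a hypothetical helper.

```python
def parse_counters(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def hint_fault_locality(counters):
    """Fraction of NUMA hinting faults that hit local memory, or None."""
    faults = counters.get("numa_hint_faults", 0)
    if faults == 0:
        return None  # balancing disabled or no samples yet
    return counters.get("numa_hint_faults_local", 0) / faults


# Made-up sample; on a real host read the text from /proc/vmstat.
sample = """numa_pte_updates 5000
numa_hint_faults 1000
numa_hint_faults_local 800
numa_pages_migrated 50
"""
print(hint_fault_locality(parse_counters(sample)))
```

A locality fraction that stays low while `numa_pages_migrated` keeps climbing is a hint that pages are bouncing between nodes, which is the thrashing the heuristics try to avoid.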

Page 26: NUMA and Java Databases

Tunings and observables

● /proc/zoneinfo

○ sysctl vm.zone_reclaim_mode OR /proc/sys/vm/zone_reclaim_mode

○ /proc/sys/vm/min_unmapped_ratio

● /proc/meminfo

● /proc/vmstat

● ftrace / perf

● Cgroup hierarchy

○ Memory

● Per process:

○ /proc/<pid>/numa_maps

○ /proc/<pid>/sched
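For the per-process view, `numa_maps` lists each mapping with per-node page counts as `N<node>=<pages>` tokens. A minimal sketch that totals them, using made-up sample lines (on a live system the text would come from `/proc/<pid>/numa_maps`); `pages_per_node` is a hypothetical helper.

```python
import re
from collections import Counter

NODE_TOKEN = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> tokens across all mappings."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = NODE_TOKEN.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


# Made-up sample resembling numa_maps output.
sample_maps = """00400000 default file=/usr/bin/java mapped=1 N0=1 kernelpagesize_kB=4
7f0000000000 interleave:0-1 anon=300 dirty=300 N0=150 N1=150 kernelpagesize_kB=4
"""
print(pages_per_node(sample_maps))
```

Comparing these totals before and after changing the launch options is a quick way to confirm whether the heap or the off-heap caches actually ended up interleaved.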

Page 27: NUMA and Java Databases

NUMA statistics

Page 29: NUMA and Java Databases

Credits!

● http://queue.acm.org/detail.cfm?id=2513149

● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf

● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf

● https://en.wikipedia.org/wiki/Non-uniform_memory_access

● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png

Page 30: NUMA and Java Databases
Page 31: NUMA and Java Databases