NUMA and Java databases
NUMA Reference architecture
What is NUMA
● Stands for Non-Uniform Memory Access
○ Non-uniform to whom?
○ Von Neumann bottleneck.
○ Cache-coherent NUMA
● How does it work
○ Memory is placed local to the processes.
○ Balancing access to data over the available processors on multiple nodes.
● Large memory installations are becoming the norm
○ The i2 series on AWS.
○ Databases are the main consumers.
● Constraints
○ Speed of light
○ Interconnect saturation
What is NUMA
● Constraints
○ Speed of light
■ Higher latency when accessing remote memory.
○ Interconnect saturation
■ Performance counters.
● Slow abundant memory
○ Fast limited memory
● Cache coherence
○ Processor threads and cores share resources
■ Execution units (between HT threads)
■ Cache (between threads and cores)
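The "non-uniform to whom" question and the remote-latency constraint can be made concrete with the per-node distance rows that Linux exposes in /sys/devices/system/node/node<N>/distance (local distance is conventionally 10). The sketch below only parses such rows; the 4-node values are made up for illustration and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Sketch: interpret per-node distance rows as found in
// /sys/devices/system/node/node<N>/distance. Sample rows are invented.
class NodeDistances {

    // Parses one whitespace-separated distance row, e.g. "10 21".
    static int[] parseRow(String row) {
        return Arrays.stream(row.trim().split("\\s+"))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    // Returns the closest node other than 'self', or -1 if there is none.
    static int nearestRemote(int self, int[] distances) {
        int best = -1;
        for (int node = 0; node < distances.length; node++) {
            if (node == self) continue;
            if (best == -1 || distances[node] < distances[best]) best = node;
        }
        return best;
    }

    // Relative cost of reaching 'target' compared with local memory.
    static double remotePenalty(int self, int target, int[] distances) {
        return (double) distances[target] / distances[self];
    }

    public static void main(String[] args) {
        // Hypothetical 4-node machine: the same memory looks different
        // depending on which node is asking.
        int[] fromNode0 = parseRow("10 16 21 21");
        int[] fromNode3 = parseRow("21 21 16 10");
        System.out.println("nearest to node 0: " + nearestRemote(0, fromNode0));
        System.out.println("nearest to node 3: " + nearestRemote(3, fromNode3));
        System.out.println("node 0 -> node 2 penalty: " + remotePenalty(0, 2, fromNode0));
    }
}
```

The distance values are relative weights reported by firmware, not measured latencies, so they are best read as a ranking rather than as timings.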
Exotic cases
● Network cards
● PCIe storage
● NVRAM
● Nodes without memory
● Nodes without processors
● Unbalanced
● Central/Large memory
● big.LITTLE architecture
● GPU
NUMA statistics
Tools/libraries for NUMA
● Supported by Linux since kernel 2.5
○ Symmetric and CPU/Memory
● numactl
● hwloc / lstopo
● numad
● numatop
● libnuma
● numastat
● taskset
● KVM for simulation and testing
● perf
Tools/libraries for NUMA
● KVM for simulation and testing
○ Useful for testing databases.
qemu-system-x86_64 -enable-kvm \
  -drive file=./debian-8.1-lxc-puppet.qcow2 \
  -net nic,macaddr=52:54:00:00:EE:03 -net vde \
  -smp sockets=2,cores=2,threads=2,maxcpus=16 \
  -numa node,nodeid=0,cpus=0-3 \
  -numa node,nodeid=1,cpus=4-7 \
  -numa node,nodeid=2,cpus=8-15 \
  -m 2G
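When building a simulated topology like the qemu command above, it is easy to leave a CPU unassigned or assigned to two nodes. A minimal sketch of a sanity check follows; it only understands simple "A-B" ranges, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: sanity-check the cpus= ranges of "-numa node,..." against maxcpus.
class NumaLayoutCheck {

    // Expands a CPU range such as "0-3", or a single id such as "5".
    static List<Integer> expand(String range) {
        String[] parts = range.split("-");
        int first = Integer.parseInt(parts[0]);
        int last = parts.length > 1 ? Integer.parseInt(parts[1]) : first;
        List<Integer> cpus = new ArrayList<>();
        for (int cpu = first; cpu <= last; cpu++) cpus.add(cpu);
        return cpus;
    }

    // True when every CPU id in [0, maxCpus) belongs to exactly one node.
    static boolean coversExactlyOnce(List<String> nodeRanges, int maxCpus) {
        int[] owners = new int[maxCpus];
        for (String range : nodeRanges) {
            for (int cpu : expand(range)) {
                if (cpu >= maxCpus) return false; // CPU id beyond maxcpus
                owners[cpu]++;
            }
        }
        for (int count : owners) {
            if (count != 1) return false; // unassigned, or assigned twice
        }
        return true;
    }

    public static void main(String[] args) {
        // Ranges and maxcpus taken from the qemu command line.
        List<String> nodes = Arrays.asList("0-3", "4-7", "8-15");
        System.out.println("layout valid: " + coversExactlyOnce(nodes, 16));
    }
}
```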
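NUMA Policies
● MPOL_DEFAULT
○ Fall back to the default policy (normally local allocation).
● MPOL_BIND
○ Allocate only from the given set of nodes.
● MPOL_INTERLEAVE
○ Spread pages round-robin over the given set of nodes.
○ Memory striping in hardware
● MPOL_PREFERRED
○ Try the given node first, fall back to others.
● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL
○ mbind() flags that migrate already-allocated pages to match the policy.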
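The difference between the policies is easiest to see as "which node gets page n". The following is a deliberately simplified model, not the kernel's algorithm: it ignores zone fallback lists, node distances and cpusets, and all names in it are hypothetical.

```java
// Simplified model of which node receives a page under each policy.
// Real kernels also consult fallback lists, cpusets and node distances;
// none of that is modelled here.
class NumaPolicySketch {

    enum Policy { DEFAULT, BIND, INTERLEAVE, PREFERRED }

    // localNode: node of the CPU that first touches the page.
    // nodes:     node set attached to the policy (unused for DEFAULT).
    // pageIndex: index of the page within the mapping (used by INTERLEAVE).
    // hasFree:   hasFree[n] is true while node n still has free pages.
    // Returns the chosen node, or -1 if the allocation cannot be satisfied.
    static int place(Policy policy, int localNode, int[] nodes,
                     long pageIndex, boolean[] hasFree) {
        switch (policy) {
            case DEFAULT:
                // Local allocation, falling back to any node with free memory.
                return hasFree[localNode] ? localNode : firstFree(hasFree);
            case BIND:
                // Strict: only nodes in the set are ever used.
                for (int node : nodes) {
                    if (hasFree[node]) return node;
                }
                return -1;
            case INTERLEAVE:
                // Round-robin over the node set, page by page.
                return nodes[(int) (pageIndex % nodes.length)];
            case PREFERRED:
                // Try the preferred node first, then fall back elsewhere.
                return hasFree[nodes[0]] ? nodes[0] : firstFree(hasFree);
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    static int firstFree(boolean[] hasFree) {
        for (int node = 0; node < hasFree.length; node++) {
            if (hasFree[node]) return node;
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] free = {true, true};
        int[] both = {0, 1};
        for (long page = 0; page < 4; page++) {
            System.out.println("interleave page " + page + " -> node "
                    + place(Policy.INTERLEAVE, 0, both, page, free));
        }
        // Node 1 is full: BIND to {1} fails, PREFERRED {1} falls back to node 0.
        boolean[] node1Full = {true, false};
        System.out.println("bind      -> " + place(Policy.BIND, 0, new int[] {1}, 0, node1Full));
        System.out.println("preferred -> " + place(Policy.PREFERRED, 0, new int[] {1}, 0, node1Full));
    }
}
```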
JVM GC spaces
● Concepts
○ Weak Generational Hypothesis:
■ Most objects soon become unreachable.
■ The ones that do not usually survive for a (very) long time.
■ References from old objects to young objects exist only in small numbers.
○ Garbage Collection Roots
○ Mark and:
■ Copy
■ Compact
■ Sweep
○ Minor and Major GC
○ Stop-the-World
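Marking from the GC roots is a plain graph traversal, and whatever is left unmarked can be swept. A toy sketch, using string names instead of real objects (the heap contents and all names are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy mark-and-sweep over a reference graph keyed by object name.
class MarkSweepSketch {

    // a -> b -> c is reachable from root "a"; d <-> e is an unreachable cycle.
    static Map<String, List<String>> sampleHeap() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("a", Arrays.asList("b"));
        refs.put("b", Arrays.asList("c"));
        refs.put("c", Collections.<String>emptyList());
        refs.put("d", Arrays.asList("e"));
        refs.put("e", Arrays.asList("d"));
        return refs;
    }

    // Mark: collect every object reachable from the roots.
    static Set<String> mark(Map<String, List<String>> refs, Collection<String> roots) {
        Set<String> marked = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String object = pending.pop();
            if (marked.add(object)) { // first visit only
                pending.addAll(refs.getOrDefault(object, Collections.<String>emptyList()));
            }
        }
        return marked;
    }

    // Sweep: everything that was not marked is garbage.
    static Set<String> sweep(Set<String> allObjects, Set<String> marked) {
        Set<String> garbage = new TreeSet<>(allObjects);
        garbage.removeAll(marked);
        return garbage;
    }

    public static void main(String[] args) {
        Map<String, List<String>> heap = sampleHeap();
        Set<String> live = mark(heap, Arrays.asList("a"));
        System.out.println("live:    " + new TreeSet<>(live));
        System.out.println("garbage: " + sweep(heap.keySet(), live));
    }
}
```

Because liveness is defined by reachability from the roots, the d/e cycle is collected even though each object is still referenced.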
GC graphs
JVM GC spaces
● Generations:
○ Young Generation
■ Eden space
● Mutable Space.
● Thread Local Allocation Buffer.
● Mark and Copy.
■ Survivor spaces (S0 and S1).
○ Old/Tenured Generation
○ Permanent Generation
■ Replaced by native Metaspace in Java 8.
● Cross-generation links.
● Card-marking
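Card marking is what keeps a minor GC from scanning the whole old generation for cross-generation links: a write barrier flags the small region ("card") that was written, and only flagged cards are scanned. A minimal sketch, assuming 512-byte cards (HotSpot's default card size) and hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;

// Toy card table over an old generation addressed by byte offset.
class CardTableSketch {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards

    private final boolean[] dirty;

    CardTableSketch(long oldGenBytes) {
        long cardBytes = 1L << CARD_SHIFT;
        dirty = new boolean[(int) ((oldGenBytes + cardBytes - 1) >> CARD_SHIFT)];
    }

    // Write barrier: run after a reference field at this offset is updated.
    void recordWrite(long fieldOffset) {
        dirty[(int) (fieldOffset >> CARD_SHIFT)] = true;
    }

    // Minor GC: return the cards that must be scanned for old-to-young
    // references, and clean them for the next cycle.
    List<Integer> drainDirtyCards() {
        List<Integer> cards = new ArrayList<>();
        for (int card = 0; card < dirty.length; card++) {
            if (dirty[card]) {
                cards.add(card);
                dirty[card] = false;
            }
        }
        return cards;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(4096); // 8 cards
        table.recordWrite(0);
        table.recordWrite(511);  // same card as offset 0
        table.recordWrite(1030); // card index 2
        System.out.println("cards to scan:  " + table.drainDirtyCards());
        System.out.println("after cleaning: " + table.drainDirtyCards());
    }
}
```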
Garbage collectors
Located in hotspot/src/share/vm/gc_implementation
● Serial
● Parallel
○ The only GC which is fully NUMA-aware.
● ParNew
● Concurrent Mark Sweep (CMS)
● Garbage First (G1)
● Official Oracle documentation is notoriously bad!
○ Code and comments are the (only) documentation (sadly).
■ Try searching for ‘NUMAPageScanRate’: you find a page from 2008 with links to sun.com and Solaris examples.
GC Options
● UseNUMA
● UseNUMAInterleaving
● ForceNUMA
● NUMAStats
● ParallelGC only
○ NUMAChunkResizeWeight
○ NUMASpaceResizeRate
○ UseAdaptiveNUMAChunkSizing
○ NUMAPageScanRate
Defined in hotspot/src/share/vm/runtime/globals.hpp and used in hotspot/src/os/linux/vm
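Since documentation for these flags is thin, one way to see what a running JVM actually uses is to ask it. The sketch below reads the flags listed above through com.sun.management.HotSpotDiagnosticMXBean; it assumes a HotSpot-based JVM, and reports "unavailable" for any flag the running build does not expose. The class name is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Prints the current values of the NUMA-related HotSpot flags.
class NumaFlagDump {

    static final String[] FLAGS = {
        "UseNUMA", "UseNUMAInterleaving", "ForceNUMA", "NUMAStats",
        "NUMAChunkResizeWeight", "NUMASpaceResizeRate",
        "UseAdaptiveNUMAChunkSizing", "NUMAPageScanRate"
    };

    static Map<String, String> read() {
        Map<String, String> values = new LinkedHashMap<>();
        // Null on JVMs that do not provide the HotSpot diagnostic bean.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : FLAGS) {
            try {
                values.put(flag, bean == null
                        ? "unavailable"
                        : bean.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                // Thrown when this JVM build does not know the flag.
                values.put(flag, "unavailable");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        read().forEach((flag, value) -> System.out.println(flag + " = " + value));
    }
}
```

Running it once with default options and once with -XX:+UseParallelGC -XX:+UseNUMA shows which values the JVM changes on its own; the output depends on the JVM version and the machine.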
NUMA options
NUMA and Collectors
● -XX:+UseNUMA -XX:+UseNUMAInterleaving: All GC spaces.
○ Independent of GC choices.
○ NUMA interleaved allocation (like numactl --interleave).
● ParallelGC (in addition to above)
○ Supports all exotic NUMA options.
○ Eden MutableSpace (even without NUMA)
■ Pretouching the pages.
○ Eden MutableNUMASpace (with above NUMA options)
■ Space split into per-locality-group (lgrp) chunks.
● Adaptive resizing.
■ Does thread-local NUMA allocation.
● Allocations are performed in the chunk corresponding to the thread’s home locality.
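The idea of an Eden split into per-node chunks can be sketched as one bump pointer per node, with each thread allocating from the chunk of its home node. This is a toy model under stated assumptions: equal-sized chunks, no adaptive resizing, no GC, and hypothetical names throughout; it is not HotSpot's implementation.

```java
// Toy model of an Eden space divided into one chunk per NUMA node.
class NumaEdenSketch {
    private final long chunkBytes;
    private final long[] top; // next free address in each node's chunk

    NumaEdenSketch(long edenBytes, int nodes) {
        chunkBytes = edenBytes / nodes; // equal split; resizing is not modelled
        top = new long[nodes];
        for (int node = 0; node < nodes; node++) {
            top[node] = node * chunkBytes;
        }
    }

    // Bump-pointer allocation in the chunk of the caller's home node.
    // Returns the address, or -1 when that chunk is full, even if other
    // chunks still have room.
    long allocate(int homeNode, long size) {
        long end = (homeNode + 1) * chunkBytes;
        if (top[homeNode] + size > end) return -1;
        long address = top[homeNode];
        top[homeNode] += size;
        return address;
    }

    // Node whose chunk contains the given address.
    int nodeOf(long address) {
        return (int) (address / chunkBytes);
    }

    public static void main(String[] args) {
        NumaEdenSketch eden = new NumaEdenSketch(1024, 2);
        System.out.println("thread on node 0 gets " + eden.allocate(0, 100));
        System.out.println("thread on node 1 gets " + eden.allocate(1, 100));
        System.out.println("node 1 chunk full:     " + eden.allocate(1, 500));
    }
}
```

The last call shows why chunk sizing matters: a chunk can run out while the rest of Eden is still empty, which is the imbalance adaptive resizing is meant to reduce.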
Cassandra
● JVM options are supported through an environment variable.
● Cassandra’s ‘supported’ NUMA handling is numactl in the shell wrapper.
○ This interleaves ‘everything’.
○ When you have numactl (hammer), everything looks like a (binary?) nail.
● Cassandra memory model
○ JVM GC spaces.
○ OHC - off heap cache: https://github.com/snazy/ohc
■ Written specifically for Cassandra 2.x
○ MemoryUtil.java
■ com.sun.jna.Native - Native.malloc
■ sun.nio.ch.DirectBuffer
■ sun.misc.Unsafe - unsafe.allocateMemory
■ java.nio.ByteBuffer - ByteBuffer.allocateDirect
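Of the allocation routes listed above, ByteBuffer.allocateDirect is the one that is a public, supported Java API. A minimal sketch of using it for off-heap storage (the class name is hypothetical); the memory lives outside the GC heap, so where it physically lands is decided by the OS memory policy rather than by the collector:

```java
import java.nio.ByteBuffer;

// Stores data outside the Java heap using a direct buffer.
class OffHeapBufferSketch {

    // Writes a long into freshly allocated off-heap memory and reads it back.
    static long roundTrip(long value) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(Long.BYTES);
        buffer.putLong(0, value); // absolute put, position is untouched
        return buffer.getLong(0);
    }

    public static void main(String[] args) {
        ByteBuffer cache = ByteBuffer.allocateDirect(64 * 1024);
        System.out.println("direct:   " + cache.isDirect());
        System.out.println("capacity: " + cache.capacity());
        System.out.println("value:    " + roundTrip(42L));
    }
}
```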
Cassandra off-heap
● Why off-heap
○ Reduce GC pressure
○ Access patterns
○ Lack of support for primitives such as O_DIRECT. (https://bugs.openjdk.java.net/browse/JDK-8164900)
○ Lack of NUMA support in newer GCs.
■ (JEP 157: G1 GC: NUMA-Aware Allocation http://openjdk.java.net/jeps/157)
● Off-heap caches are used for:
○ Row cache
○ Key cache
○ Counter cache
● Available from 2.x onwards, but actually better with 2.2.
Cassandra off-heap
● Cache Providers:
○ SerializingCache
■ Issues with serialization and CPU usage.
○ OHCP - org.caffinitas.ohc.OHCacheBuilder - 2.2 onwards
■ “OHC shall provide a good performance on both commodity hardware and big systems using non-uniform-memory-architectures.”
■ sun.misc.Unsafe: unsafe.allocateMemory
■ Linked: for larger entries
● malloc and fragmentation
■ Chunked: for smaller entries
NUMA issues
● numactl --interleave:
○ Thread-local native allocations get interleaved too - Bad [X]
■ Tons of them throughout the code bypass the JVM.
○ JVM’s Eden space will also be interleaved - Bad [X]
● JVM’s options only:
○ Native allocations will be local.
○ Large off-heap allocations can suffer.
● numactl + JVM options:
○ With a NUMA-aware GC (Parallel)
■ Best possible combination (without invasive code changes in Cassandra).
■ JVM’s memory options will override numactl.
■ But ParallelGC is not comparable to the newer collectors (G1).
Interpretation
● Low off-heap usage
○ Use the JVM NUMA options. Don’t interleave with numactl; it is a hammer.
● High off-heap usage (like Cassandra)
○ Just go with the flow, and use numactl.
■ -XX:+AlwaysPreTouch? (MAP_POPULATE)
○ Cost-benefit analysis.
● ParallelGC is too old (and bad for latency) - don’t use it just for NUMA.
○ Well-implemented NUMA can easily pique anyone’s geeky senses. :)
○ Ask Cassandra or Oracle to add NUMA support to G1 ;)
● With newer kernels (e.g. Ubuntu Xenial), one can try AutoNUMA.
○ Completely managed by the kernel, based on access patterns.
○ Has caveats, but one can always benchmark and see. :)
Interpretation
● The JVM is (still) not good with native primitives such as O_DIRECT or NUMA (there is a jnuma, which is not that well maintained).
○ Many database authors write their own off-JVM implementations for these. (There are so many Java databases these days.)
○ Some also do things like this.
○ MySQL (InnoDB) can (and does) take advantage of these for good performance.
■ InnoDB was in Cassandra’s place about two years ago, till fixes landed.
● How InnoDB does it.
○ Maybe ScyllaDB in the future. ;)
Wishlist for Cassandra
● Use whatever GC fits best. (G1?)
■ Ask for NUMA support in this.
● Use the JVM NUMA options when supported.
■ Having NUMA support for Eden spaces will help a lot.
● Don’t use numactl.
○ Let all native allocations be local (OS default).
○ Use jnuma (or equivalent, it is just a JNI wrapper) for OHCP and other large non-local caches.
■ Use NUMA interleaving here.
■ This requires Cassandra or OHCP code to be changed.
● Changing OHCP code is easier.
● Benchmark
○ ??
○ Profit!
AutoNUMA
● In mainline since kernel 3.8 (automatic NUMA balancing), refined through later 3.x and 4.x kernels
● CPU follows memory
○ Reschedule tasks on the same nodes as their memory
● Memory follows CPU
○ Copy memory pages to the same nodes as the tasks/threads
● Heuristics
○ Fault statistics
○ Task grouping
○ Multi-resource optimization - cache, CPU, memory, starvation
■ Avoid thrashing
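The fault-statistics heuristic can be pictured as a per-task counter of NUMA hinting faults per node: the node where a task faults most is where it should run, and older samples are decayed so the choice can follow a changing access pattern. The sketch below is only loosely modelled on that idea; the halving decay, the tie-breaking and all names are assumptions made for illustration.

```java
// Toy per-task fault statistics used to pick a preferred NUMA node.
class NumaFaultStats {
    private final long[] faults; // hinting faults seen per node

    NumaFaultStats(int nodes) {
        faults = new long[nodes];
    }

    // Record that the task touched 'pages' pages residing on 'node'.
    void recordFaults(int node, long pages) {
        faults[node] += pages;
    }

    // Age the history so that recent behaviour dominates.
    void decay() {
        for (int node = 0; node < faults.length; node++) {
            faults[node] /= 2;
        }
    }

    // Node with the most recorded faults, or -1 without any samples.
    // Ties go to the lowest node id.
    int preferredNode() {
        int best = -1;
        for (int node = 0; node < faults.length; node++) {
            if (faults[node] > 0 && (best == -1 || faults[node] > faults[best])) {
                best = node;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NumaFaultStats task = new NumaFaultStats(2);
        task.recordFaults(0, 10);
        task.recordFaults(1, 30);
        System.out.println("preferred: " + task.preferredNode()); // mostly node 1
        task.decay();              // history becomes 5 and 15
        task.recordFaults(0, 20);  // access pattern shifts to node 0
        System.out.println("preferred: " + task.preferredNode());
    }
}
```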
Tunings and observables
● /proc/zoneinfo
○ sysctl vm.zone_reclaim_mode or /proc/sys/vm/zone_reclaim_mode
○ /proc/sys/vm/min_unmapped_ratio
● /proc/meminfo
● /proc/vmstat
● ftrace / perf
● cgroup hierarchy
○ Memory
● Per process:
○ /proc/<pid>/numa_maps
○ /proc/<pid>/sched
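/proc/<pid>/numa_maps has one line per mapping, with N<node>=<pages> fields that say where the pages of that mapping currently live. A minimal sketch that totals pages per node follows; the sample lines are made up, the class name is hypothetical, and page counts are in units of each mapping's kernelpagesize_kB, so mappings with different page sizes should not be summed blindly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sums the N<node>=<pages> fields of numa_maps lines per node.
class NumaMapsSketch {
    private static final Pattern NODE_PAGES = Pattern.compile("N(\\d+)=(\\d+)");

    static Map<Integer, Long> pagesPerNode(Iterable<String> lines) {
        Map<Integer, Long> totals = new TreeMap<>();
        for (String line : lines) {
            for (String field : line.trim().split("\\s+")) {
                Matcher m = NODE_PAGES.matcher(field);
                if (m.matches()) { // whole field must look like N0=123
                    totals.merge(Integer.parseInt(m.group(1)),
                            Long.parseLong(m.group(2)), Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Invented sample; on Linux the real input could come from
        // Files.readAllLines(Paths.get("/proc/self/numa_maps")).
        List<String> sample = Arrays.asList(
            "00400000 default file=/usr/bin/java mapped=2 N0=2 kernelpagesize_kB=4",
            "7f3c00000000 interleave:0-1 anon=1024 dirty=1024 N0=512 N1=512 kernelpagesize_kB=4",
            "7f3d00000000 default anon=300 dirty=300 N1=300 kernelpagesize_kB=4");
        System.out.println("pages per node: " + pagesPerNode(sample));
    }
}
```

A heavily skewed total for a process that was expected to be interleaved (or the reverse) is usually the first visible sign that numactl and the JVM options are not combining as intended.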
NUMA statistics
Further
● http://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/
● http://queue.acm.org/detail.cfm?id=2852078
● https://plumbr.eu/java-garbage-collection-handbook
● http://mechanical-sympathy.blogspot.in/2013/07/java-garbage-collection-distilled.html
Credits!
● http://queue.acm.org/detail.cfm?id=2513149
● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
● https://en.wikipedia.org/wiki/Non-uniform_memory_access
● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png