performance van java 8 en verder - jeroen borgers


Upload: nljug

Post on 11-Jul-2015


TRANSCRIPT

Page 1: Performance van Java 8 en verder - Jeroen Borgers

Performance of Java 8 and beyond

Performance van Java 8 en verder

By Jeroen Borgers

1

Page 2: Performance van Java 8 en verder - Jeroen Borgers

Contents
1. Introduction
2. Lambda expressions
3. Stream API
4. Parallel execution & cores
5. Filter map reduce, parallel streams internals
6. Fork-join framework use
7. Lambdas versus inner classes
8. Tiered compilation
9. PermGen removal
10. java.time performance
11. Accumulators and Adders
12. Map improvements
13. Java 9+ improvements
14. Utilization of GPUs
15. Value Types
16. Arrays 2.0
17. Summary and conclusions

Page 3: Performance van Java 8 en verder - Jeroen Borgers

Introduction to lambdas and streams
• Java 8 introduces lambda expressions for functional programming
• With the Stream API, iteration can be handled internally by a library
• Tell, don't ask: tell the library to apply a function to a collection
• or tell it to do that in parallel, on multiple cores
• The question is whether this improves your response time

3

Page 4: Performance van Java 8 en verder - Jeroen Borgers

Lambda expressions and streams
• Example

4

Page 5: Performance van Java 8 en verder - Jeroen Borgers

Lambda expressions and streams
• Example with method references

5
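The code shown on these two example slides did not survive in the transcript. As a stand-in, here is a minimal sketch of the same idea: one pipeline written with lambdas and once more with method references. The class name, the helper `isLong` and the sample names are hypothetical and are not taken from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LambdaStreamExample {
    // Lambda version: keep names longer than 3 characters and upper-case them.
    static List<String> withLambdas(List<String> names) {
        return names.stream()
                .filter(n -> n.length() > 3)
                .map(n -> n.toUpperCase())
                .collect(Collectors.toList());
    }

    // Same pipeline, with method references instead of lambdas.
    static List<String> withMethodRefs(List<String> names) {
        return names.stream()
                .filter(LambdaStreamExample::isLong)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Hypothetical helper so the filter can be written as a method reference.
    static boolean isLong(String s) {
        return s.length() > 3;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ann", "Jeroen", "Bob", "Stephen");
        System.out.println(withLambdas(names));
        System.out.println(withMethodRefs(names));
    }
}
```

In both versions the library does the iterating; the caller only says what should happen to each element.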

Page 6: Performance van Java 8 en verder - Jeroen Borgers

Lambda expressions: short notation
• Instance of anonymous inner class of functional interface
• Functional interface has only one abstract method
  • Runnable: void run()
  • Executor: void execute(Runnable r)
  • Iterable<T>: Iterator<T> iterator()
• New: java.util.function
  • Consumer<T>: void accept(T t)
  • Function<T, R>: R apply(T t)
  • Predicate<T>: boolean test(T t)
• Annotation: @FunctionalInterface
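The interfaces listed above can each be implemented with a lambda. The sketch below uses `Predicate`, `Function` and `Consumer` from `java.util.function`, plus a hypothetical interface `IntTransformer` that is only there to show the `@FunctionalInterface` annotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class FunctionalInterfaces {
    // Hypothetical interface: the annotation makes the compiler verify
    // that it has exactly one abstract method.
    @FunctionalInterface
    interface IntTransformer {
        int transform(int x);
    }

    static final Predicate<String> IS_EMPTY = s -> s.isEmpty();      // boolean test(T t)
    static final Function<String, Integer> LENGTH = s -> s.length(); // R apply(T t)
    static final IntTransformer SQUARE = x -> x * x;

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        Consumer<String> remember = s -> seen.add(s);                // void accept(T t)
        remember.accept("java");
        System.out.println(IS_EMPTY.test("") + " " + LENGTH.apply("java")
                + " " + SQUARE.transform(7) + " " + seen);
    }
}
```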

Page 7: Performance van Java 8 en verder - Jeroen Borgers

Anonymous inner class instance example

7

Page 8: Performance van Java 8 en verder - Jeroen Borgers

Inner class has boilerplate code

8

Page 9: Performance van Java 8 en verder - Jeroen Borgers

Lambda expression is concise

9
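The slide code for this comparison is missing from the transcript. A minimal sketch of the contrast, using a hypothetical length-based `Comparator` rather than the author's example, could look like this:

```java
import java.util.Comparator;

class InnerClassVsLambda {
    // Anonymous inner class: five lines of ceremony around one expression.
    static final Comparator<String> BY_LENGTH_INNER = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }
    };

    // Lambda expression: only the parameters and the expression remain.
    static final Comparator<String> BY_LENGTH_LAMBDA =
            (a, b) -> Integer.compare(a.length(), b.length());

    public static void main(String[] args) {
        System.out.println(BY_LENGTH_INNER.compare("ab", "abc"));
        System.out.println(BY_LENGTH_LAMBDA.compare("ab", "abc"));
    }
}
```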

Page 10: Performance van Java 8 en verder - Jeroen Borgers

Stream pipeline

10

Source → intermediate operations (lazy evaluation) → terminal operations (eager evaluation)

Page 11: Performance van Java 8 en verder - Jeroen Borgers

Stream lazy evaluation

11

Page 12: Performance van Java 8 en verder - Jeroen Borgers

Stream lazy evaluation optimizes with short-circuiting - can be big win

12
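To make lazy evaluation and short-circuiting visible, the sketch below counts how many elements the filter actually inspects. The class and method names are hypothetical, and the counter is a deliberate side effect used for demonstration only; it is not something to do in production pipelines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyEvaluation {
    // Builds a pipeline but never calls a terminal operation:
    // the filter is never run, so the counter stays at zero.
    static int inspectedWithoutTerminalOp(List<Integer> numbers) {
        AtomicInteger inspected = new AtomicInteger();
        Stream<Integer> pending = numbers.stream()
                .filter(n -> { inspected.incrementAndGet(); return n % 2 == 0; });
        return inspected.get();
    }

    // findFirst() is short-circuiting: on a sequential stream the filter
    // stops being called as soon as the first even number is found.
    static int inspectedUntilFirstMatch(List<Integer> numbers) {
        AtomicInteger inspected = new AtomicInteger();
        numbers.stream()
                .filter(n -> { inspected.incrementAndGet(); return n % 2 == 0; })
                .map(n -> n * n)
                .findFirst();
        return inspected.get();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 3, 4, 5, 6, 7);
        System.out.println(inspectedWithoutTerminalOp(numbers));
        System.out.println(inspectedUntilFirstMatch(numbers));
    }
}
```

With the sample list, only 1, 3 and 4 are inspected; the remaining elements are never touched, which is where the "big win" on large sources comes from.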

Page 13: Performance van Java 8 en verder - Jeroen Borgers

Stream executed in parallel

13

Page 14: Performance van Java 8 en verder - Jeroen Borgers

Parallel execution & hardware threads
• Parallel != concurrent
• CPU frequency has reached its maximum
• Number of cores/hardware threads keeps increasing: 64+
• Must be able to utilize those cores
  • need to process data faster: Big Data, IoT
• Runtime.getRuntime().availableProcessors()
  • reports the number of hardware threads
  • my Mac: 2 cores with 2 hyper-threads each = 4
• Can we get a speedup of ~4?

14

Page 15: Performance van Java 8 en verder - Jeroen Borgers

Parallel streams utilize ForkJoinPool
• Java 8 ForkJoinPool introduces a common pool for any ForkJoinTask
  • one per JVM
• Used in Arrays.parallelSort, Arrays.parallelSetAll and parallelStream
• Size defaults to Runtime.getRuntime().availableProcessors() - 1
• Can be set with:
  • -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
• Multiple JVMs on a machine
  • consider lowering the pool size
• Tasks waiting for I/O
  • consider increasing the pool size

15
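A small sketch for checking these numbers on your own machine. It only reads the values; the class and method names are hypothetical. Running it with `-Djava.util.concurrent.ForkJoinPool.common.parallelism=N` should change the second value.

```java
import java.util.concurrent.ForkJoinPool;

class CommonPoolInfo {
    // Number of hardware threads the JVM sees (cores times hyper-threads).
    static int hardwareThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Target parallelism of the JVM-wide common pool used by parallel streams.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    public static void main(String[] args) {
        System.out.println("hardware threads:        " + hardwareThreads());
        System.out.println("common pool parallelism: " + commonPoolParallelism());
    }
}
```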

Page 16: Performance van Java 8 en verder - Jeroen Borgers

Fork-join framework: divide-and-conquer
• Divide task recursively into smaller tasks
  • Divide array of 640 elements into 64 leaf tasks of 10 elements
  • e.g. sum or sort on each level
• Many ForkJoinTasks processed by a limited number of threads, e.g. ForEachTask
  • like ThreadPoolExecutor
  • worse: overhead of creating tasks
  • better: work stealing from the queues of other threads
  • great for unbalanced tasks!

16
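The 640-element example can be sketched with a `RecursiveTask`. This is an illustration of the divide-and-conquer pattern, not the code that the Stream API uses internally; `SumTask` and `LEAF_SIZE` are hypothetical names.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    static final int LEAF_SIZE = 10; // ranges of this size or smaller are summed directly

    private final int[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    SumTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= LEAF_SIZE) {          // leaf task: plain loop
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;           // split the range in two halves
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                           // queue left half; idle workers may steal it
        long rightSum = right.compute();       // compute right half in the current thread
        return left.join() + rightSum;
    }

    static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] data = new int[640];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data));
    }
}
```

Halving 640 elements six times gives 64 leaf ranges of 10 elements, matching the numbers on the slide.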

Page 17: Performance van Java 8 en verder - Jeroen Borgers

Performance of lambdas versus inner classes
• Lambdas seem to be syntactic sugar around creating an anonymous class
  • in fact, they are not
• Inner class
  • Actual class loaded by class loader
  • New object created: allocation, initialization, GC
• Lambda
  • creates a static method called through a helper class
• Performance is similar
  • Only the first time, loading the inner class in the class loader is slower

17

Page 18: Performance van Java 8 en verder - Jeroen Borgers

When to use parallel streams?
• source.parallelStream().operation(F)
• F is independent
  • computation on an element does not rely on or impact other elements
  • stateless, non-interfering
• source is efficiently splittable
  • Collections, Arrays, SplittableRandom
  • not I/O based: designed for sequential use
• computationally expensive
  • rule of thumb (ROT): sequential version > 100 µs

18

Page 19: Performance van Java 8 en verder - Jeroen Borgers

Parallel when computationally expensive
• source.parallelStream().operation(F)
• ROT: sequential version > 100 µs
• N * Q > 10 000
  • N = #elements
  • Q = cost per element of F: #operations
  • small function like x -> x * x: N > 10 000 elements
  • moderately large function Q = 100: N > 100 elements

19
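The N * Q rule of thumb is simple arithmetic, sketched here as a helper. The class, the method name and the way Q is passed in are hypothetical; in practice Q is an estimate, so treat the outcome as a hint to go and measure, not as a decision.

```java
class ParallelRuleOfThumb {
    static final long THRESHOLD = 10_000; // N * Q must exceed this value

    // n = number of elements, q = estimated operations per element
    static boolean worthTryingParallel(long n, long q) {
        return n * q > THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(worthTryingParallel(1_000, 1));     // tiny function, few elements
        System.out.println(worthTryingParallel(100_000, 1));   // tiny function, many elements
        System.out.println(worthTryingParallel(1_000, 100));   // moderately large function
    }
}
```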

Page 20: Performance van Java 8 en verder - Jeroen Borgers

Overhead of parallel execution
• Startup of power-controlled cores
• Sequential part of setting up the parallel calculation
• Splittability = ease of partitioning
  • efficient if random access or efficient search:
    • ArrayLists, [Concurrent]HashMaps, arrays
  • inefficient: LinkedLists, BlockingQueues, IO-based
• Stream from BufferedReader.lines() is currently meant for sequential use
  • might be improved in a future JDK, for highly efficient bulk processing of buffered IO

20

Page 21: Performance van Java 8 en verder - Jeroen Borgers

Creating the micro benchmark
Tiny calculation per element

21

Page 22: Performance van Java 8 en verder - Jeroen Borgers

Creating the micro benchmark 2

22
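The benchmark source on these slides is not in the transcript. The sketch below is not the author's code; it only shows one way to compare an old-school loop, a serial stream and a parallel stream. The per-element `work` function, the repetition count and the reading of "speedup" as old time divided by new time are all assumptions. For serious measurements a harness such as JMH is a better choice than hand-rolled timing.

```java
import java.util.function.LongSupplier;
import java.util.stream.IntStream;

class MicroBenchmarkSketch {
    // Hypothetical "tiny calculation" per element.
    static long work(int x) {
        return (long) x * x;
    }

    static long oldSchool(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += work(i);
        }
        return sum;
    }

    static long serialLambda(int n) {
        return IntStream.range(0, n).mapToLong(MicroBenchmarkSketch::work).sum();
    }

    static long parallelLambda(int n) {
        return IntStream.range(0, n).parallel().mapToLong(MicroBenchmarkSketch::work).sum();
    }

    static long sink; // results are added here so the work is less likely to be optimized away

    // Average nanoseconds per call, measured after an equally long warm-up.
    static double averageNanos(LongSupplier task, int reps) {
        for (int i = 0; i < reps; i++) {
            sink += task.getAsLong();
        }
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += task.getAsLong();
        }
        return (System.nanoTime() - start) / (double) reps;
    }

    public static void main(String[] args) {
        int n = 100_000;
        int reps = 200;
        double old = averageNanos(() -> oldSchool(n), reps);
        double serial = averageNanos(() -> serialLambda(n), reps);
        double parallel = averageNanos(() -> parallelLambda(n), reps);
        System.out.println("Speedup by using serial lambdas        = " + old / serial);
        System.out.println("Speedup of parallel over serial lambdas = " + serial / parallel);
        System.out.println("Speedup of parallel over oldSchool      = " + old / parallel);
    }
}
```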

Page 23: Performance van Java 8 en verder - Jeroen Borgers

Micro benchmark demo

23

Page 24: Performance van Java 8 en verder - Jeroen Borgers

Medium sized calculation benchmark

• 1000 elements
  • Speedup by using serial lambdas = 0.95884454
  • Speedup of parallel over serial lambdas = 1.2968781
  • Speedup of parallel over oldSchool = 1.2435045
• 100_000 elements
  • Speedup by using serial lambdas = 0.9760258
  • Speedup of parallel over serial lambdas = 2.1337924
  • Speedup of parallel over oldSchool = 2.0826366

24

Page 25: Performance van Java 8 en verder - Jeroen Borgers

Utilization of cores
Medium calculation, 1000 and 100_000 elements

[Figure: core utilization over time, with the parallel part marked]

Page 26: Performance van Java 8 en verder - Jeroen Borgers

26

Page 27: Performance van Java 8 en verder - Jeroen Borgers

27

Page 28: Performance van Java 8 en verder - Jeroen Borgers

Tiny calculation benchmark

• 1000 elements
  • Speedup by using serial lambdas = 0.12944984
  • Speedup of parallel over serial lambdas = 0.46804
  • Speedup of parallel over oldSchool = 0.0605877
• 100_000 elements
  • Speedup by using serial lambdas = 0.10920245
  • Speedup of parallel over serial lambdas = 5.905797
  • Speedup of parallel over oldSchool = 0.64492756

28

Page 29: Performance van Java 8 en verder - Jeroen Borgers

Utilization of cores
Tiny calculation, 1000 and 100_000 elements

29

Page 30: Performance van Java 8 en verder - Jeroen Borgers

Micro benchmark conclusions
(for this benchmark, on this computer)
• For high performance and small functions: use old school loops
  • the lambda infrastructure costs more than the function itself
• For high performance and large functions
  • serial
  • if N * Q > 100 000 then parallel
• I need more cores!

30

Page 31: Performance van Java 8 en verder - Jeroen Borgers

Tiered compilation
• The JIT compiler came in 2 flavors, now 3
  • -client (C1)
    • quick startup time
  • -server (C2)
    • best performance in the long run
  • -XX:+TieredCompilation
    • first C1, then C2
• Only in Java 8 is TieredCompilation the default
• Java 7: often need to increase the code cache
  • -XX:ReservedCodeCacheSize: 96M (Java 7), 240M (Java 8)

31

Page 32: Performance van Java 8 en verder - Jeroen Borgers

PermGen removal
• Up to Java 7: PermGen; Java 8: Metaspace
• PermGen (a misnomer)
  • also held data not related to classes: String pool
• Metaspace
  • only class metadata
  • Class objects themselves on the heap
  • String pool on the heap
  • -XX:[Max]MetaspaceSize=N
  • Default max 'unlimited' (1 GB)
• OutOfMemoryError: Metaspace instead of PermGen space

Page 33: Performance van Java 8 en verder - Jeroen Borgers

java.time performance
• Finally a proper library for date and time that replaces the crappy stuff:
  • java.util.Date
    • mutable: defensive copies needed
  • java.util.Calendar
    • 540 bytes to store timestamp, Locale, TZ: heap/GC pressure
  • java.text.SimpleDateFormat
    • not thread-safe, so it has to be re-created
• Stephen Colebourne is spec lead, coming from Joda-Time

33
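A short sketch of the java.time counterparts. `LocalDate` is immutable, so no defensive copies are needed, and `DateTimeFormatter` is immutable and thread-safe, so one shared instance can replace the re-created `SimpleDateFormat` objects. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class JavaTimeExample {
    // Immutable and thread-safe: safe to share in a static field.
    static final DateTimeFormatter ISO = DateTimeFormatter.ISO_LOCAL_DATE;

    // Parses an ISO date, adds days and formats the result.
    // plusDays returns a new LocalDate; the parsed one is never modified.
    static String plusDays(String isoDate, long days) {
        LocalDate date = LocalDate.parse(isoDate, ISO);
        return date.plusDays(days).format(ISO);
    }

    public static void main(String[] args) {
        System.out.println(plusDays("2014-11-24", 2));
    }
}
```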

Page 34: Performance van Java 8 en verder - Jeroen Borgers

java.util.concurrent.atomic Accumulators and Adders

34
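The content of this slide is missing from the transcript. As a brief illustration of the two class families named in the title, the sketch below uses `LongAdder` for a counter and `LongAccumulator` for a running maximum, both updated from a parallel stream. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

class AddersExample {
    // LongAdder spreads updates over internal cells to reduce contention;
    // sum() combines the cells once all updates are done.
    static long countInParallel(int n) {
        LongAdder adder = new LongAdder();
        IntStream.range(0, n).parallel().forEach(i -> adder.increment());
        return adder.sum();
    }

    // LongAccumulator generalizes the idea to any associative function, here max.
    static long maxInParallel(int n) {
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
        IntStream.range(0, n).parallel().forEach(i -> max.accumulate(i));
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(100_000));
        System.out.println(maxInParallel(1_000));
    }
}
```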

Page 35: Performance van Java 8 en verder - Jeroen Borgers

Map improvements
• HashMap, LinkedHashMap and ConcurrentHashMap
• Collisions on keys: keys end up in the same bucket
  • access time O(1) -> O(n)
  • follow the linked list until key.equals() returns true
• Balanced tree instead of linked list
  • if size > TREEIFY_THRESHOLD (8)
  • worst case access time O(n) -> O(log(n))
  • keys should implement Comparable
  • branches on hashCode, then compareTo

35
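To see why keys should implement Comparable, consider a key type whose hashCode always collides. The sketch below uses a hypothetical `CollidingKey` class; it demonstrates that lookups stay correct, and it is the `compareTo` method that gives a treeified bucket an ordering to search by. It does not measure the access times.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKey implements Comparable<CollidingKey> {
    final int id;

    CollidingKey(int id) {
        this.id = id;
    }

    @Override
    public int hashCode() {
        return 42; // deliberately bad: every key lands in the same bucket
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CollidingKey && ((CollidingKey) o).id == id;
    }

    @Override
    public int compareTo(CollidingKey other) {
        return Integer.compare(id, other.id); // ordering usable inside a tree bucket
    }

    // Fills a map with n keys that all share one hash code.
    static Map<CollidingKey, Integer> fill(int n) {
        Map<CollidingKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new CollidingKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<CollidingKey, Integer> map = fill(10_000);
        System.out.println(map.get(new CollidingKey(5_000)));
    }
}
```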

Page 36: Performance van Java 8 en verder - Jeroen Borgers

Java 9+ performance improvements

36

Page 37: Performance van Java 8 en verder - Jeroen Borgers

Sumatra: Utilization of GPUs
• GPUs have 100_000's of stream cores
• SIMD: single instruction, multiple data
• Work offloaded to the GPU
  • implemented an off-loadable version of parallel().forEach()
• Use parallel streams and lambdas

37

Page 38: Performance van Java 8 en verder - Jeroen Borgers

Value Types (JEP 169)
The next big thing!
• Currently:
  • limited set of primitives, by value: no identity
  • others by reference: identity
  • footprint:
    • heap allocated
    • object headers
    • 1+ pointers pointing to it
    • burden for small objects
• Object identity only serves mutability
• JVM attempts to figure out if identity is needed
  • escape analysis and object elision can unwrap in some cases
  • fragile
  • object might be used as a lock, then it needs identity

38

Page 39: Performance van Java 8 en verder - Jeroen Borgers

Integer overhead

39

[Figure: an Integer is reached through an object pointer and consists of mark word, class pointer and value; an int is just the value]

Page 40: Performance van Java 8 en verder - Jeroen Borgers

Point example

40

Page 41: Performance van Java 8 en verder - Jeroen Borgers

Point - class versus value type
• Point object layout: object pointer → mark word, class pointer, x, y, padding
• @Value Point layout: x, y

Page 42: Performance van Java 8 en verder - Jeroen Borgers

Arrays 2.0 Improvements
• array[(long)i] = 5;
• array[i, j, k] = 7;
• Arrays.chop(T[] a, int newLen);
  • prevents copying in StringBuilder.toString()
• arrays become real Java objects
  • indexes of other types than int, long
  • like Map
• thread-safe access for array slices
  • final/volatile

42

Page 43: Performance van Java 8 en verder - Jeroen Borgers

Summary and conclusions
• Lambdas and streams offer possible performance improvement
  • lazy evaluation
  • tiny calcs, or small #elements & medium size calc
    • don't use parallel()
    • consider old school iterations if performance is important
• Many performance improvements in Java 8
  • Use it if you can and get better performance
• Several performance improvements planned for Java 9+ (10?)
  • Better support for Big Data & number crunching

43

Page 44: Performance van Java 8 en verder - Jeroen Borgers

Want to know more?
• www.jpinpoint.com / www.profactive.com
  • references, presentations
• Accelerating Java Applications
  • 3 days technical training
  • 24-25-26 November 2014
  • nl-jug members 10% discount
  • hand in your business card today: 20% discount

44

Page 45: Performance van Java 8 en verder - Jeroen Borgers

Questions?

45