java performance tweaks

Post on 13-Jan-2017

29 June 2016

Java Performance Tweaks

How I brought TopicViewer’s analysis runtime from 2:10 down to 18 seconds

Repo: https://github.com/jimbethancourt/topic-viewer

Contents

Use a Faster Library

Avoid the Unnecessary

Looping

Parallelization

Finally

Use a Faster Library

Migrating from Colt to ParallelColt

• Swapping out off-the-shelf components is the first step to take and will likely provide the most immediate gain.

• Replacing Colt with ParallelColt was relatively straightforward.

• There were no migration guides, but both libraries are open source and their classes have similar names, which helped.

• This reduced the runtime from 2:10 to 0:55.

Avoid the Unnecessary

Avoid Unnecessary Object Creation

Don’t use objects when primitives will provide the same functionality.

Old:

    clustersTree.findSet(new Vertex(i)).index == rootIndex

New:

    clustersTree.findSet(i).index == rootIndex

Use Object Arrays instead of Maps

Use object arrays instead of Maps when you primarily read values and can index by number.

In DisjointTree:

Original:

    Map<Vertex, Vertex> vertexMapping = new HashMap<Vertex, Vertex>();
    findSet(Vertex v) { Vertex mapped = vertexMapping.get(v); … }

New:

    Vertex[] vertexMapping = new Vertex[numDocuments];
    findSet(int v) { Vertex mapped = vertexMapping[v]; … }

Between using arrays and avoiding object creation, this brought the runtime down to 42 seconds (if I remember correctly).
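To illustrate both ideas together (look up by int, back the lookup with an array), here is a minimal sketch of an array-backed disjoint set. It is not the TopicViewer DisjointTree: the class name ArrayDisjointSet, the union method, and the use of an int[] of parent indices instead of a Vertex[] are assumptions made for this example.

```java
/** Hypothetical array-backed disjoint set; NOT the actual TopicViewer DisjointTree. */
public class ArrayDisjointSet {
    // parent[i] holds the parent of element i; a root is its own parent.
    private final int[] parent;

    public ArrayDisjointSet(int numDocuments) {
        parent = new int[numDocuments];
        for (int i = 0; i < numDocuments; i++) {
            parent[i] = i; // every document starts in its own set
        }
    }

    /** Returns the root index of the set containing v without allocating any object. */
    public int findSet(int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]]; // path halving keeps the trees shallow
            v = parent[v];
        }
        return v;
    }

    /** Merges the sets containing a and b. */
    public void union(int a, int b) {
        int rootA = findSet(a);
        int rootB = findSet(b);
        if (rootA != rootB) {
            parent[rootB] = rootA;
        }
    }
}
```

A lookup such as findSet(i) == rootIndex then costs a few array reads, with no hashing, no equals() call, and no temporary Vertex.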

Avoid Unnecessary Operations

• Avoid unnecessary calls when the collection already performs the operation for you.

• Constantly calling collection.contains() added up, which surprised me.

• Avoid complex, object-heavy equals() operations when possible.

• Splitting the work into producer + consumer and leveraging the natural properties of collections cut 15 seconds off the runtime.

• See the updateCorrelationMatrix() method and Pair.equals().
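One common case of a collection doing the work for you is Set.add(), which already reports whether the element was present. The sketch below is a generic illustration, not code from updateCorrelationMatrix(); the class AddOnce and the method countDistinct are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AddOnce {
    /** Counts distinct values, letting Set.add() detect duplicates itself. */
    public static int countDistinct(List<String> values) {
        Set<String> seen = new HashSet<>();
        int distinct = 0;
        for (String value : values) {
            // Slower pattern: if (!seen.contains(value)) { seen.add(value); ... }
            // add() returns false when the value is already present,
            // so one hash lookup replaces two.
            if (seen.add(value)) {
                distinct++;
            }
        }
        return distinct;
    }
}
```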

Use Method Parameter Values

• Use method parameter values when looping heavily.

• Method parameters live on the thread's stack (or in registers), not on the heap like the this.numDocuments field.

• As a result, accessing them is faster.

• This shaved off a second or two.

Original:

    private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, boolean force) {
        for (int i = 0; i < this.numDocuments; i++)
            for (int j = 0; j < this.numDocuments; j++)

New:

    private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, int numDocuments, boolean force) {
        for (int i = 0; i < numDocuments; i++)
            for (int j = 0; j < numDocuments; j++)
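The same change, reduced to a self-contained sketch. The class MatrixSum and its methods are hypothetical and stand in for the real method; whether the second form is measurably faster depends on what the JIT compiler does with the field read, so treat it as something to measure rather than a guarantee.

```java
public class MatrixSum {
    private final int numDocuments;
    private final double[][] matrix;

    public MatrixSum(double[][] matrix) {
        this.matrix = matrix;
        this.numDocuments = matrix.length;
    }

    /** Loop bounds read the field through `this` on every test. */
    public double sumViaField() {
        double total = 0;
        for (int i = 0; i < this.numDocuments; i++)
            for (int j = 0; j < this.numDocuments; j++)
                total += matrix[i][j];
        return total;
    }

    /** Loop bounds come from a parameter, which is a plain local value. */
    public double sumViaParameter(int numDocuments) {
        double total = 0;
        for (int i = 0; i < numDocuments; i++)
            for (int j = 0; j < numDocuments; j++)
                total += matrix[i][j];
        return total;
    }
}
```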

Make Absolutely Sure you Need I/O

Original:

    for (int i = 0; i < this.numDocuments; i++)
        for (int j = 0; j < this.numDocuments; j++) {
            double similarity = correlationMatrix.get(i, j);
            if (i < j && similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity)

New:

    for (int i = 0; i < numDocuments; i++)
        for (int j = 0; j < numDocuments; j++) {
            if (i < j) {
                double similarity = correlationMatrix.get(i, j);
                if (similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity)

The matrix read now happens only for pairs that pass the i < j check.

Looping

Don’t Get Fancy With Inner Loops

    for (int i = 0; i < numDocuments; i++)
        for (int j = 0; j < numDocuments; j++)
            if (i < j)

is faster than

    for (int j = 0; j < numDocuments; j++)
        for (int i = 0; i < j; i++)

This is likely due to reuse of register values and JIT compilation.
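Both loop shapes visit exactly the same (i, j) pairs with i < j, which the sketch below checks by summing a code for each visited pair. The class PairLoops and its method names are made up for this example, and it says nothing about speed: the guarded form runs numDocuments² iterations against numDocuments(numDocuments − 1)/2 for the triangular form, so the timing difference reported above is worth re-measuring on your own workload.

```java
public class PairLoops {
    /** Square loop with a guard: simple, fixed bounds on both loops. */
    public static long guardedSquare(int numDocuments) {
        long checksum = 0;
        for (int i = 0; i < numDocuments; i++)
            for (int j = 0; j < numDocuments; j++)
                if (i < j)
                    checksum += (long) i * numDocuments + j; // unique code per pair
        return checksum;
    }

    /** Triangular loop: the inner bound depends on the outer variable. */
    public static long triangular(int numDocuments) {
        long checksum = 0;
        for (int j = 0; j < numDocuments; j++)
            for (int i = 0; i < j; i++)
                checksum += (long) i * numDocuments + j;
        return checksum;
    }
}
```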

Don't Cache Where it Won't Help

It was cheaper to re-run the operation than to cache the result in a map:

    private Set<Integer> getClusterSet(int rootIndex, int numDocuments) {
        Set<Integer> clusterSet = new LinkedHashSet<>();
        for (int i = 0; i < numDocuments; i++)
            if (this.clustersTree.findSet(i).index == rootIndex)
                clusterSet.add(i);
        return clusterSet;
    }

Use the Fastest Collection for your Dominant Operation

https://dzone.com/articles/java-collection-performance was pure gold!

Parallelization

Need to Parallelize an Indexed For Loop?

What do you do when you need to parallelize something like this?

    for (int i = 0; i < correlationMatrix2D.rows()-1; i++) {…}

Collect the indices into a list, then process the list with a parallel stream:

    for (int i = 0; i < correlationMatrix2D.rows()-1; i++) {
        rowsForUpdating.add(i);
    }
    rowsForUpdating.parallelStream().forEach(i -> {…});

This shaved 10 seconds off the runtime. If the number of rows is fixed, populate rowsForUpdating only once.
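A runnable sketch of the index-list approach, next to an alternative that the talk does not use: IntStream.range(...).parallel(), which supplies the indices without building a list. The class ParallelRows and the doubling operation are placeholders for the real per-row work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class ParallelRows {
    /** Doubles each value in parallel, driven by a prebuilt list of indices. */
    public static double[] viaIndexList(double[] rows) {
        List<Integer> rowsForUpdating = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            rowsForUpdating.add(i);
        }
        double[] result = new double[rows.length];
        // Each index writes only its own slot, so no locking is needed.
        rowsForUpdating.parallelStream().forEach(i -> {
            result[i] = rows[i] * 2;
        });
        return result;
    }

    /** Same work, with the indices generated by IntStream instead of a list. */
    public static double[] viaIntStream(double[] rows) {
        double[] result = new double[rows.length];
        IntStream.range(0, rows.length).parallel().forEach(i -> {
            result[i] = rows[i] * 2;
        });
        return result;
    }
}
```

Either form is only safe when iterations do not write to shared state other than their own slot.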

Parallelize (only) at the Highest Level Possible

When organizing parallel processing, parallelize at the highest / outermost level possible. This reduces the cost of context switching for the CPU.

Nesting parallelization was detrimental to runtime when I attempted it: processing took longer and CPU utilization was higher.
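As a small sketch of what "outermost only" looks like in code: the rows are distributed across threads, while the work inside each row stays a plain sequential loop. The class OuterParallel and the row-sum task are assumptions chosen to keep the example short.

```java
import java.util.stream.IntStream;

public class OuterParallel {
    /** Sums each row; rows run in parallel, cells within a row run sequentially. */
    public static double[] rowSums(double[][] matrix) {
        double[] sums = new double[matrix.length];
        IntStream.range(0, matrix.length).parallel().forEach(i -> {
            double total = 0;
            // Deliberately a plain loop: making this inner level parallel
            // as well would nest parallelism inside the outer stream.
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
            }
            sums[i] = total; // each task writes only its own slot
        });
        return sums;
    }
}
```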

Use a ConcurrentHashMap for the Key

There is no ConcurrentHashSet class in java.util.concurrent. Leverage the fact that a ConcurrentHashMap’s keys form a thread-safe set:

    ConcurrentHashMap<Pair, Integer> calculatedClusters = new ConcurrentHashMap<>();

    calculatedClusters.put(newPair, 0);

Before Java 8, a ConcurrentHashMap was split into 16 segments by default, each of which could be written to at the same time. Since Java 8 it locks individual bins instead, so writes to different bins still proceed concurrently.
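Since Java 8 the JDK also offers this trick directly: ConcurrentHashMap.newKeySet() returns a Set view backed by a ConcurrentHashMap, so no dummy value is needed. The sketch below uses it in place of the put(newPair, 0) approach shown in the talk; the class ConcurrentSetDemo and the remainder task are made up for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class ConcurrentSetDemo {
    /** Collects i % modulus for 0 <= i < count from many threads into one set. */
    public static int distinctRemainders(int count, int modulus) {
        // A thread-safe Set backed by a ConcurrentHashMap (available since Java 8).
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        IntStream.range(0, count).parallel().forEach(i -> seen.add(i % modulus));
        return seen.size();
    }
}
```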

Non-Blocking Data Structures for Heavy I/O

• When performing a large volume of writes, use a non-blocking data structure.

• br.ufmg.aserg.topicviewer.util.Double2DMatrix is pretty amazing once I realized how it worked:

• Each matrix row has its own channel

• Each cell in the row has a calculated offset

• As a result, there is no contention

Finally

Use the G1 Garbage Collector

-XX:+UseG1GC

Saved 2 seconds of runtime!

Profile Profile Profile

Java Mission Control is your new best friend

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder

Final Runtime

• 18 seconds

• Over 7X performance improvement (2:10 down to 0:18)
