Vectors for Java

Ian Graves ([email protected])
Paul Sandoz (@PaulSandoz)
Vladimir Ivanov (@iwan0www)

JavaOne 2016, session CON1560
Source: cr.openjdk.java.net/~psandoz/conferences/2016-JavaOne/j1-2016-ve…

Copyright © 2016 Oracle, Intel and/or its affiliates. All rights reserved.


This presentation is not about improvements to java.util.Vector


Safe Harbor Statement: Intel

•  INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

•  Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

•  The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

•  Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

•  Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

•  Intel, the Intel logo, Intel Xeon, and Xeon logos are trademarks of Intel Corporation in the U.S. and/or other countries.

•  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number

•  *Other names and brands may be claimed as the property of others.

•  Copyright © 2016 Intel Corporation. All rights reserved.


Intel Cont’d

•  Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

•  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

•  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

•  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

•  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.

•  TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.

•  Intel® Advanced Vector Extensions (Intel® AVX)* are designed to achieve higher throughput to certain integer and floating point operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you should consult your system manufacturer for more information.

•  Intel® Advanced Vector Extensions refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512. For more information on Intel® Turbo Boost Technology 2.0, visit http://www.intel.com/go/turbo


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


This presentation is about the design of high-level Java APIs and their implementations that leverage modern hardware for high-performance data processing!


More specifically… about experimental Java APIs and their implementations that can leverage Single Instruction, Multiple Data instructions on modern CPUs


Overview

•  Implementing explicit Vector/Data-Parallel ops

•  Underpinnings with Code Snippets

•  The Vector API

•  Expression Languages for Vectorization


Going parallel

•  Machines

•  Hadoop (Map/Reduce), Apache Spark

•  Cores/hardware threads

•  Java Stream API and Fork/Join framework

•  CPU instructions or co-processors


CPU instructions/Co-processors

•  Single Instruction, Multiple Data (SIMD)

•  ARM NEON*, Intel AVX, Power AltiVec*, SPARC VIS*

•  Co-processors

•  SPARC Data Analytics Accelerator* (DAX)

•  GPUs, FPGAs

* Other names and brands may be claimed as the property of others


SIMD example: 32 bits

float[] a =   1   2   3   4   5   6   7   8
              +   +   +   +   +   +   +   +
float[] b =   8   7   6   5   4   3   2   1
              =   =   =   =   =   =   =   =
float[] r =   9   9   9   9   9   9   9   9


SIMD example: 64 bits

float[] a = [ 1   2 ] [ 3   4 ] [ 5   6 ] [ 7   8 ]
                +         +         +         +
float[] b = [ 8   7 ] [ 6   5 ] [ 4   3 ] [ 2   1 ]
                =         =         =         =
float[] r = [ 9   9 ] [ 9   9 ] [ 9   9 ] [ 9   9 ]


SIMD example: 128 bits

float[] a = [ 1   2   3   4 ] [ 5   6   7   8 ]
                    +                 +
float[] b = [ 8   7   6   5 ] [ 4   3   2   1 ]
                    =                 =
float[] r = [ 9   9   9   9 ] [ 9   9   9   9 ]


SIMD example: 256 bits

float[] a = [ 1   2   3   4   5   6   7   8 ]
                            +
float[] b = [ 8   7   6   5   4   3   2   1 ]
                            =
float[] r = [ 9   9   9   9   9   9   9   9 ]
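The four SIMD example slides can be summarised in plain Java: a register of a given width holds width/32 float lanes, so adding eight floats takes 8, 4, 2 or 1 steps. The sketch below only counts chunk steps in scalar code; it is an illustration of the arithmetic, not vectorized code, and the class and method names are hypothetical.

```java
final class LaneCountSketch {
    /** Number of 32-bit float lanes that fit in a register of the given width. */
    static int lanes(int registerBits) {
        return registerBits / Float.SIZE;
    }

    /**
     * Adds a and b into r, one register-sized chunk at a time, and returns
     * how many chunk steps were needed. Each chunk stands in for a single
     * SIMD add instruction on real hardware.
     */
    static int addInChunks(float[] a, float[] b, float[] r, int registerBits) {
        int lanes = lanes(registerBits);
        int steps = 0;
        for (int i = 0; i < a.length; i += lanes) {
            int end = Math.min(i + lanes, a.length);
            for (int j = i; j < end; j++) {
                r[j] = a[j] + b[j];
            }
            steps++;
        }
        return steps;
    }
}
```

With the slide's inputs (a = 1..8, b = 8..1), every element of r is 9 regardless of the register width; only the step count changes.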


Java and SIMD today

•  HotSpot supports some of Intel’s AVX instructions

•  Superword optimizations in the HotSpot C2 compiler derive SIMD code from sequential code

•  Array copying, filling and comparison intrinsics


Java and SIMD today

•  Roll your own 64-bit operations in Java: VarHandle array views or Unsafe.get/putLong

•  JNI: worthwhile with large data sets and low marshalling costs, which ameliorate the invocation cost
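As an illustration of the "roll your own 64-bit operations" option, the sketch below XORs two byte arrays eight bytes at a time through a long view over byte[] obtained from MethodHandles.byteArrayViewVarHandle (available since Java 9). XOR is chosen because its result does not depend on byte order; the class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class LongViewXorSketch {
    // Views a byte[] as a sequence of longs in the platform's native byte order.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    /** dst[i] = a[i] ^ b[i], processing 8 bytes per step where possible. */
    static void xorInto(byte[] dst, byte[] a, byte[] b) {
        int n = Math.min(dst.length, Math.min(a.length, b.length));
        int i = 0;
        for (; i + Long.BYTES <= n; i += Long.BYTES) {
            long x = (long) LONG_VIEW.get(a, i);
            long y = (long) LONG_VIEW.get(b, i);
            LONG_VIEW.set(dst, i, x ^ y);
        }
        // Scalar tail for the bytes that do not fill a whole long.
        for (; i < n; i++) {
            dst[i] = (byte) (a[i] ^ b[i]);
        }
    }
}
```

This handles 64 bits per operation at most, which is the limitation the deck goes on to discuss.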


Java and SIMD today

•  Superword optimizations can be very brittle

•  Intrinsics are point fixes, not general

•  Rolling your own in Java is limited and might not be portable (an endianness problem, an alignment problem, or both: you might have a “byte odor” issue)

•  JNI is hard to develop and maintain
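The endianness half of the portability concern is easy to reproduce. This minimal sketch reads the same eight bytes as a long under both byte orders using java.nio.ByteBuffer; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
    /** Interprets the first 8 bytes of the array as a long in the given byte order. */
    static long readLong(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong(0);
    }
}
```

For the bytes {1, 0, 0, 0, 0, 0, 0, 0}, the little-endian reading is 1 and the big-endian reading is 0x0100000000000000, so hand-rolled code that packs values into longs has to pin down the byte order explicitly.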


Motivation

•  Better Java support for crunching on data laid out in memory in a regular pattern

•  Big data applications such as Apache Flink and Apache Spark currently use Unsafe and JNI (e.g. the netlib-java wrapper for BLAS/*PACK)

•  Machine learning applications, where in some cases, apparently, 8-bit precision can be good enough


Part of the big picture

•  Project Panama: moving the Java platform “closer to the metal”

•  Project Valhalla: bringing value types to the Java platform

•  The binary star system that is the bright future of the Java platform


Two design approaches

•  Lower-level Vector API and implementation

•  Width-specific, stream-like expression of a data-oriented computation

•  Higher-level API and vector-based implementation

•  Java bytecode derived from runtime compilation of an expression/AST of a data-oriented computation


Two design approaches

•  Both are general solutions

•  Not point solutions such as Dart’s explicit types (Float32x4), or Mozilla’s (Int32x4)

•  Each built on the foundations of some basic value-based types and code snippets

•  Operations on registers and memory for certain bit sizes


Some value-based types

•  Long2 for 128 bits, Long4 for 256 bits, Long8 for 512 bits

•  Boxes without identity: special treatment in HotSpot, with some escape analysis enhancements to avoid box allocation

•  Views over packed values: extract/apply from/to arrays of primitive values
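To make "views over packed values" concrete, here is a plain-Java sketch of a 256-bit box holding four longs, with eight floats packed two per long. It is not the prototype's Long4: the class name, the method names and the lane layout (lower-indexed float in the low 32 bits) are assumptions made for illustration, and an ordinary class like this has identity and is allocated unless the JIT removes the allocation.

```java
final class Packed256Sketch {
    final long l0, l1, l2, l3;

    Packed256Sketch(long l0, long l1, long l2, long l3) {
        this.l0 = l0; this.l1 = l1; this.l2 = l2; this.l3 = l3;
    }

    /** Packs two floats into one long; 'lo' occupies the low 32 bits (assumed layout). */
    static long pack(float lo, float hi) {
        return (Float.floatToRawIntBits(lo) & 0xFFFF_FFFFL)
                | ((long) Float.floatToRawIntBits(hi) << 32);
    }

    static float lo(long v) { return Float.intBitsToFloat((int) v); }

    static float hi(long v) { return Float.intBitsToFloat((int) (v >>> 32)); }

    /** "Extract": reads 8 consecutive floats starting at index i. */
    static Packed256Sketch fromFloatArray(float[] a, int i) {
        return new Packed256Sketch(
                pack(a[i], a[i + 1]), pack(a[i + 2], a[i + 3]),
                pack(a[i + 4], a[i + 5]), pack(a[i + 6], a[i + 7]));
    }

    /** "Apply": writes the 8 packed floats back starting at index i. */
    void intoFloatArray(float[] a, int i) {
        long[] ls = {l0, l1, l2, l3};
        for (int k = 0; k < ls.length; k++) {
            a[i + 2 * k] = lo(ls[k]);
            a[i + 2 * k + 1] = hi(ls[k]);
        }
    }
}
```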


Code snippets

•  “Codes like a Java expression, works like a HotSpot intrinsic”

•  Wrap a few machine code instructions, with a specified calling convention, in a MethodHandle

•  e.g. using Intel AVX instructions


Code snippets

    MethodHandle MHm256_vaddps = MachineCodeSnippet.make(
            "mm256_vaddps",
            methodType(Long4.class, Long4.class, Long4.class),
            requires(AVX),
            regs -> <<VADDPS ymm?,ymm?,ymm?>>
    );


Code snippets

    MethodHandle MHm256_vaddps = MachineCodeSnippet.make(
            "mm256_vaddps",
            methodType(Long4.class, Long4.class, Long4.class),
            requires(AVX),
            regs -> <<VADDPS ymm?,ymm?,ymm?>>
    );

    float[] a = ...;
    float[] b = ...;
    float[] c = ...;
    Long4 v1 = extractFromArray(a, index);
    Long4 v2 = extractFromArray(b, index);
    // Add 8 floats with one instruction
    Long4 r  = (Long4) MHm256_vaddps.invokeExact(v1, v2);
    applyToArray(c, index, r);


A Specific Example

•  Code snippets are bound to a MethodHandle

•  invoke*() is declared to throw Throwable, which we want to catch

•  Arguments unchecked, but we accept one type

•  Further examples encapsulate these side effects

•  Inside PatchableVecUtils
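The encapsulation described above can be sketched with standard java.lang.invoke calls. Machine code snippets are not available outside the Panama prototype, so an ordinary static method (add8, hypothetical) stands in for the snippet, and float[] stands in for Long4. The wrapper hides both the exact-signature cast and the checked Throwable, which is the role the slide assigns to PatchableVecUtils.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class SnippetWrapperSketch {
    /** Stand-in for a machine code snippet: adds 8 floats lane by lane. */
    static float[] add8(float[] x, float[] y) {
        float[] r = new float[8];
        for (int i = 0; i < 8; i++) {
            r[i] = x[i] + y[i];
        }
        return r;
    }

    private static final MethodHandle MH_ADD8;
    static {
        try {
            MH_ADD8 = MethodHandles.lookup().findStatic(
                    SnippetWrapperSketch.class, "add8",
                    MethodType.methodType(float[].class, float[].class, float[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Typed wrapper: callers see a normal method with no Throwable and no casts. */
    static float[] vaddps(float[] x, float[] y) {
        try {
            // invokeExact requires the call site to match the handle's type exactly.
            return (float[]) MH_ADD8.invokeExact(x, y);
        } catch (RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```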


A Specific Example

    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        Long4 l  = PatchableVecUtils.long4FromFloatArray(left, i);
        Long4 r  = PatchableVecUtils.long4FromFloatArray(right, i);
        Long4 rr = PatchableVecUtils.vaddps(l, r);
        PatchableVecUtils.long4ToFloatArray(res, i, rr);
    }


Assembly

    ; -XX:+PrintAssembly
    mov     DWORD PTR [rsp-0x16000],eax
    push    rbp
    sub     rsp,0x10
    mov     r13,rcx
    mov     rbx,rdx
    mov     ebp,r8d

    ; Kernel follows
    vmovdqu ymm0,YMMWORD PTR [rsi+r8*4+0x18]
    vmovdqu ymm1,YMMWORD PTR [rbx+rbp*4+0x18]
    vaddps  ymm0,ymm0,ymm1
    vmovdqu YMMWORD PTR [r13+rbp*4+0x18],ymm0
    ; Kernel ends

    vzeroupper
    add     rsp,0x10
    pop     rbp
    test    DWORD PTR [rip+…],eax
    ret


A Specific Example

    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        Long4 l  = PatchableVecUtils.long4FromFloatArray(left, i);
        Long4 rr = PatchableVecUtils.vaddps(l, right, i);
        PatchableVecUtils.long4ToFloatArray(res, i, rr);
    }


Assembly

    ; -XX:+PrintAssembly
    mov     DWORD PTR [rsp-0x16000],eax
    push    rbp
    sub     rsp,0x10
    mov     r13,rcx
    mov     rbx,rdx
    mov     ebp,r8d

    ; Kernel follows
    vmovdqu ymm0,YMMWORD PTR [rsi+r8*4+0x18]
    vaddps  ymm0,ymm0,YMMWORD PTR [rbx+rbp*1+0x18]
    vmovdqu YMMWORD PTR [r13+rbp*4+0x18],ymm0
    ; Kernel ends

    vzeroupper
    add     rsp,0x10
    pop     rbp
    test    DWORD PTR [rip+…],eax
    ret


Matrix Multiplication Example

•  Matrix multiplication with sgemm

•  32-bit floating point, generalized matrix multiply

•  Naïve approach + cache efficiency tweaks

•  Scalar, Vectorized, Threaded + Vectorized

•  Baseline is modestly hand-optimized

•  Vectorized version is aggressively hand-optimized


Scalar

    for (int i = 0; i < m; i++) {
        int ixn = i * n;
        for (int j = 0; j < p; j++) {
            int jxn = j * n;
            float sum = 0f;
            for (int k = 0; k < n; k++) {
                sum += (A[ixn + k] * alpha) * b_cmajor[jxn + k];
            }
            C[j * m + i] = C[j * m + i] * beta + sum;
        }
    }


Vector

    for (int i = 0; i < m; i++) {
        int ixn = i * n;
        for (int j = 0; j < p; j++) {
            int jxn = j * n;
            float sum = 0;
            for (int k = 0; k < n; k += 8) {
                Long4 row_p = vmulps(valpha, A, ixn + k);
                Long4 col_p = long4FromFloatArray(b_cmajor, jxn + k);
                sum = dot_prod(sum, row_p, col_p);
            }
            // C is pre-scaled vector-wise by beta in this example
            C[j * m + i] = C[j * m + i] + sum;
        }
    }
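The inner loop of the vector version walks k in steps of 8 because a 256-bit value holds eight floats. The plain-Java sketch below emulates that shape with eight independent accumulators followed by a horizontal sum. It is a variation rather than the slide's dot_prod, which folds each chunk into a scalar sum directly; the names are hypothetical and the code is scalar, so any SIMD instructions would come from the JIT, not from this source.

```java
final class LaneDotProductSketch {
    static final int LANES = 8; // eight 32-bit floats per 256-bit register

    /** Scalar reference: sum over k of (a[k] * alpha) * b[k]. */
    static float dotScalar(float[] a, float[] b, float alpha) {
        float sum = 0f;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] * alpha) * b[k];
        }
        return sum;
    }

    /** Lane-wise accumulation: one partial sum per lane, reduced at the end. */
    static float dotLanes(float[] a, float[] b, float alpha) {
        float[] acc = new float[LANES];
        int k = 0;
        for (; k + LANES <= a.length; k += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                acc[lane] += (a[k + lane] * alpha) * b[k + lane];
            }
        }
        float sum = 0f;
        for (float partial : acc) {
            sum += partial; // horizontal reduction across lanes
        }
        for (; k < a.length; k++) {
            sum += (a[k] * alpha) * b[k]; // scalar tail when length is not a multiple of 8
        }
        return sum;
    }
}
```

Because floating-point addition is not associative, the two methods can differ in the last bits for general inputs; they agree exactly only when every intermediate value is exactly representable.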


Stream Parallel

    IntStream.range(0, m)
            .parallel()
            .forEach((i) -> {
                int ixn = i * n;
                for (int j = 0; j < p; j++) {
                    int jxn = j * n;
                    int dst = j * m + i;
                    float sum = 0;
                    for (int k = 0; k < n; k += 8) {
                        Long4 row_p = vmulps(valpha, A, ixn + k);
                        Long4 col_p = long4FromFloatArray(b_cmajor, jxn + k);
                        sum = dot_prod(sum, row_p, col_p);
                    }
                    // C is pre-scaled vector-wise by beta in this example
                    C[dst] = C[dst] + sum;
                }
            });
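The row-parallel structure can be tried without the prototype by keeping the scalar kernel and distributing rows with IntStream.parallel(). In this sketch A is row-major (m x n), B is stored column-major as p columns of length n, and C is column-major (m x p), matching the indexing on the slides; the class and method names are hypothetical. Each row index i writes only to C[j * m + i], so parallel rows never touch the same element.

```java
import java.util.stream.IntStream;

final class ParallelSgemmSketch {
    /** Computes row i of C = beta * C + alpha * A * B using the scalar kernel. */
    static void sgemmRow(int i, int m, int n, int p, float alpha, float beta,
                         float[] A, float[] bColMajor, float[] C) {
        int ixn = i * n;
        for (int j = 0; j < p; j++) {
            int jxn = j * n;
            float sum = 0f;
            for (int k = 0; k < n; k++) {
                sum += (A[ixn + k] * alpha) * bColMajor[jxn + k];
            }
            int dst = j * m + i;
            C[dst] = C[dst] * beta + sum;
        }
    }

    static void sequential(int m, int n, int p, float alpha, float beta,
                           float[] A, float[] bColMajor, float[] C) {
        for (int i = 0; i < m; i++) {
            sgemmRow(i, m, n, p, alpha, beta, A, bColMajor, C);
        }
    }

    static void parallel(int m, int n, int p, float alpha, float beta,
                         float[] A, float[] bColMajor, float[] C) {
        IntStream.range(0, m)
                .parallel()
                .forEach(i -> sgemmRow(i, m, n, p, alpha, beta, A, bColMajor, C));
    }
}
```

Since each output element is computed by exactly one task in the same order as the sequential loop, the parallel result matches the sequential one bit for bit.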


Performance Estimates

•  Tested using JMH. Intel 6700K. Default C2 config.

•  Highly preliminary. YMMV!

•  More precise analysis to come.

•  Multithreaded analysis left as an exercise to the reader.


[Chart: Relative Speedup. Y-axis: speedup vs. baseline sgemm (0 to 3.5). X-axis: square matrix length (8, 16, 32, 64, 256, 512, 1024, 2048).]


Vector API

•  Data-parallel operations on sized types

•  Abstracting over vector ISA extensions

•  Platform independent (not ISA-specific)


Vector API

•  Encapsulates code snippets

•  Draft proposed in 2015

•  Prototypes living in Project Panama


Vector API

•  Vector<E,S extends Shape<Vector<?, S>>>

•  Element type E

•  Vector size S (in bits)

•  Broad support for E ∈ {int, float, double}


Small Example

    public static void addArrays(float[] left, float[] right,
                                 float[] res, int i) {
        FloatVector<Shapes.S256Bit> l, r, lr;
        l  = float256FromArray(left, i);
        r  = float256FromArray(right, i);
        lr = l.add(r);
        lr.intoArray(res, i);
    }


Basics

    interface Vector<E,S extends Shape<Vector<?,S>>> {
        …
        Vector<E,S> add(Vector<E,S> v2);
        Vector<E,S> mul(Vector<E,S> v2);
        …
        Vector<E,S> and(Vector<E,S> v2);
        …
    }


Intermediate

    interface Vector<E,S extends Shape<Vector<?,S>>> {
        …
        E getElement(int i);
        Vector<E,S> putElement(int i, E elem);
        …
        E sumAll();
        …
        E[] toArray();
        Vector<E,S> fromArray(E[] ary, int offset);
        …
    }


Advanced

    interface Vector<E,S extends Shape<Vector<?,S>>> {
        …
        Vector<E,S> map(UnaryOperator<E> op);
        Vector<E,S> mapWhere(Mask<S> mask, UnaryOperator<E> op);
        …
        Vector<E,S> map(BinaryOperator<E> op, Vector<E,S> v2);
        Vector<E,S> mapWhere(Mask<S> mask, BinaryOperator<E> op,
                             Vector<E,S> this2);
        …
    }
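The intent of map and mapWhere can be shown with a toy, array-backed class. Everything here is hypothetical: ToyFloatVector is not the prototype's Vector, a boolean[] stands in for Mask<S>, and the rule that unselected lanes pass through unchanged is an assumption about mapWhere rather than something the slides state.

```java
import java.util.function.BinaryOperator;
import java.util.function.UnaryOperator;

final class ToyFloatVector {
    private final float[] lanes;

    ToyFloatVector(float... lanes) {
        this.lanes = lanes.clone();
    }

    float[] toArray() {
        return lanes.clone();
    }

    /** Applies a scalar-defined operator to every lane. */
    ToyFloatVector map(UnaryOperator<Float> op) {
        float[] r = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            r[i] = op.apply(lanes[i]);
        }
        return new ToyFloatVector(r);
    }

    /** Applies the operator only where the mask is set (assumed: other lanes are copied). */
    ToyFloatVector mapWhere(boolean[] mask, UnaryOperator<Float> op) {
        float[] r = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            r[i] = mask[i] ? op.apply(lanes[i]) : lanes[i];
        }
        return new ToyFloatVector(r);
    }

    /** Combines this vector with v2 lane by lane. */
    ToyFloatVector map(BinaryOperator<Float> op, ToyFloatVector v2) {
        float[] r = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            r[i] = op.apply(lanes[i], v2.lanes[i]);
        }
        return new ToyFloatVector(r);
    }
}
```

Note that each call to op.apply boxes its arguments and result as Float, because the operators are defined over the boxed element type E.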


Mandelbrot Example

•  Based on an example from GitHub

•  To the editor!


Performance

•  Depends upon boxing elimination

•  Escape Analysis

•  Sees significant boxing effects (i.e. not there yet)

•  Value Types for better implementation & API design

•  Vector<Float,S> ☹

•  FloatVector<S> extends Vector<Float,S> ☹

•  Vector<float,S> ☺


Value Types to the Rescue

•  Part of Project Valhalla

•  Introducing broader value support

•  Minimal design proposed

•  Gets Code Snippet types “outside the box”

•  Timeline TBD


Expression Language

•  Remember those advanced features?!

Vector<E,S> map(BinaryOperator<E> op, Vector<E,S> v2);

•  Inbound lambdas are scalar-defined over <E>

•  We want lambdas to be customizable

•  Java doesn’t (yet) have a way to reify lambdas

•  How to inspect a lambda?
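A toy sketch of what "scalar-defined" means in practice: because the lambda is an opaque `BinaryOperator<E>`, an implementation of `map` can only call it once per lane and has no way to look inside it. `ToyVector` is a hypothetical stand-in that omits the shape parameter `S`; it is not the prototype's `Vector<E,S>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical toy type, not the prototype's Vector<E,S>; the shape parameter S is omitted.
final class ToyVector<E> {
    private final List<E> lanes;

    ToyVector(List<E> lanes) {
        this.lanes = new ArrayList<>(lanes);
    }

    // The lambda is defined over scalars of type E, so all the
    // implementation can do is call it once per lane.
    ToyVector<E> map(BinaryOperator<E> op, ToyVector<E> v2) {
        if (lanes.size() != v2.lanes.size()) {
            throw new IllegalArgumentException("lane counts differ");
        }
        List<E> out = new ArrayList<>(lanes.size());
        for (int i = 0; i < lanes.size(); i++) {
            out.add(op.apply(lanes.get(i), v2.lanes.get(i)));
        }
        return new ToyVector<>(out);
    }

    E get(int lane) {
        return lanes.get(lane);
    }
}
```

A call such as `a.map((x, y) -> (x + y) * y, b)` works, but the per-lane invocation is exactly what a vectorizing implementation would like to avoid, which motivates passing the expression in an inspectable form instead.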


Expression Language

•  Vectorize expressions provided as an AST

•  Presented inside lambdas for composability

•  Based on CodeSnippets + MethodHandles API

•  Code inside the lane!


Expression Language

•  Many vector operations are simple expressions

•  Expressions are trees

•  Parameterized by element type (Float, Integer, etc.)

•  MethodHandles can be combined in a tree-like way

•  Good code from MethodHandle combinators!


Expression Language

// f = (x,y) -> (x+y) * y;
MethodType mt = MethodType
        .methodType(Long4.class, Long4.class, Long4.class);

MethodHandle MHm256_vaddps = CodeSnippet.make(…,mt,…),
             MHm256_vmulps = CodeSnippet.make(…,mt,…);

MethodHandle f_pre = MethodHandles
        .collectArguments(MHm256_vmulps, 0, MHm256_vaddps);

MethodHandle f = MethodHandles.permuteArguments(f_pre, mt, 0, 1, 1);
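The same combinator steps can be tried on a stock JDK by substituting plain static methods for the CodeSnippet-backed handles. This is a minimal sketch under that assumption: it uses scalar `float` arithmetic rather than `Long4` vectors, and `CombinatorSketch`, `add`, `mul`, `build` and `apply` are hypothetical names. Only `MethodHandles.collectArguments` and `MethodHandles.permuteArguments` are taken from the slide.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Scalar stand-in: plain static methods replace the CodeSnippet-backed handles.
final class CombinatorSketch {
    static float add(float a, float b) { return a + b; }
    static float mul(float a, float b) { return a * b; }

    // Builds f = (x, y) -> (x + y) * y from the two binary handles.
    static MethodHandle build() throws ReflectiveOperationException {
        MethodType mt = MethodType.methodType(float.class, float.class, float.class);
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle addH = lookup.findStatic(CombinatorSketch.class, "add", mt);
        MethodHandle mulH = lookup.findStatic(CombinatorSketch.class, "mul", mt);

        // (x, y, z) -> mul(add(x, y), z)
        MethodHandle fPre = MethodHandles.collectArguments(mulH, 0, addH);
        // (x, y) -> fPre(x, y, y): the second argument is used twice
        return MethodHandles.permuteArguments(fPre, mt, 0, 1, 1);
    }

    // Convenience wrapper that hides the checked exceptions.
    static float apply(float x, float y) {
        try {
            return (float) build().invokeExact(x, y);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

`collectArguments` plugs the adder into the first operand of the multiplier, which yields a three-argument handle; `permuteArguments` then folds it back to two arguments by passing `y` in both remaining positions.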


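Trees to MethodHandles

[Figure: the AST for (x,y) -> (x + y) * y, with * at the root over + (children x and y) and y, is mapped by an AST Visitor to a MethodHandle tree of vmulps over vaddps taking x and y.]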


Expression Language

interface Expression<E> {
    default Expression<E> add(Expression<E> right) {
        return new AddExpression<>(this, right);
    }

    default Expression<E> mul(Expression<E> right) {
        return new MulExpression<>(this, right);
    }

    default Expression<E> not() {
        return new NotExpression<>(this);
    }
    …
    default Expression<E> peek(Consumer<E> f) {
        return new PeekExpression<>(this, f);
    }
    …
    default Expression<Float> fromFloat(Float f) {
        return new ConstExpression<>(f);
    }
    …
    <R> R evaluate(ExpressionEvaluator<E,R> e);
}
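One way an `evaluate`-style visitor could turn such a tree into a MethodHandle is sketched below. It is a scalar-only approximation under several assumptions: `TreeToHandle`, `Visitor`, `Expr` and `HandleBuilder` are hypothetical names, only argument, add and mul nodes are modelled, and ordinary `float` methods stand in for the vector CodeSnippets. The prototype's `ExpressionEvaluator` is not shown in the slides and may differ.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical, scalar-only sketch: float handles stand in for vector CodeSnippets.
final class TreeToHandle {
    // Visitor with one callback per node kind.
    interface Visitor<R> { R arg(int index); R add(R l, R r); R mul(R l, R r); }

    // Minimal expression node; evaluate() dispatches bottom-up to a visitor.
    interface Expr { <R> R evaluate(Visitor<R> v); }

    static Expr arg(int i) {
        return new Expr() {
            public <R> R evaluate(Visitor<R> v) { return v.arg(i); }
        };
    }
    static Expr add(Expr l, Expr r) {
        return new Expr() {
            public <R> R evaluate(Visitor<R> v) { return v.add(l.evaluate(v), r.evaluate(v)); }
        };
    }
    static Expr mul(Expr l, Expr r) {
        return new Expr() {
            public <R> R evaluate(Visitor<R> v) { return v.mul(l.evaluate(v), r.evaluate(v)); }
        };
    }

    static float addF(float a, float b) { return a + b; }
    static float mulF(float a, float b) { return a * b; }

    static final MethodType BINARY = MethodType.methodType(float.class, float.class, float.class);

    private static MethodHandle find(String name) {
        try {
            return MethodHandles.lookup().findStatic(TreeToHandle.class, name, BINARY);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Every node becomes a handle of type (float x, float y) -> float.
    static final class HandleBuilder implements Visitor<MethodHandle> {
        public MethodHandle arg(int index) {
            // (x, y) -> x for index 0, (x, y) -> y for index 1
            return MethodHandles.permuteArguments(MethodHandles.identity(float.class), BINARY, index);
        }
        public MethodHandle add(MethodHandle l, MethodHandle r) { return binary(find("addF"), l, r); }
        public MethodHandle mul(MethodHandle l, MethodHandle r) { return binary(find("mulF"), l, r); }

        private static MethodHandle binary(MethodHandle op, MethodHandle l, MethodHandle r) {
            MethodHandle h = MethodHandles.collectArguments(op, 0, l); // (x, y, b) -> op(l(x, y), b)
            h = MethodHandles.collectArguments(h, 2, r);               // (x, y, x2, y2) -> op(l(x, y), r(x2, y2))
            return MethodHandles.permuteArguments(h, BINARY, 0, 1, 0, 1);
        }
    }

    // Compiles (x, y) -> (x + y) * y and invokes it once.
    static float run(float x, float y) {
        Expr e = mul(add(arg(0), arg(1)), arg(1));
        MethodHandle f = e.evaluate(new HandleBuilder());
        try {
            return (float) f.invokeExact(x, y);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Keeping every sub-handle at the same `(float, float) -> float` type makes the composition uniform: each binary node collects both child handles and then permutes the four resulting parameters back down to `(x, y)`.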


Expression Language

MethodHandle binaryReduction(float[] left,
                             float[] right,
                             float[] dst,
                             BinaryOperator<Expression<Float>> op);

MethodHandle br = binaryReduction(left, right, dst, (l, r) -> {
    Expression<Float> e1 = l.add(r);
    return e1.mul(r);
});

// Execute the entire computation
br.invokeExact();
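The key idea in this call pattern is that the lambda runs once, on symbolic operands, to record the expression; the recorded kernel is then applied across the arrays when the result is invoked. The sketch below shows that deferral with a scalar interpreter and a `Runnable` in place of the vectorized `MethodHandle`. `ReductionSketch` and its nested `Expr` are hypothetical, and the arrays are assumed to have equal length.

```java
import java.util.function.BinaryOperator;

// Hypothetical scalar sketch; the prototype returns a MethodHandle backed by vector code.
final class ReductionSketch {
    // Symbolic expression: knows how to compute itself from one (l, r) pair.
    interface Expr {
        float eval(float l, float r);

        default Expr add(Expr o) { return (l, r) -> eval(l, r) + o.eval(l, r); }
        default Expr mul(Expr o) { return (l, r) -> eval(l, r) * o.eval(l, r); }
    }

    // Assumes left, right and dst all have the same length.
    static Runnable binaryReduction(float[] left, float[] right, float[] dst,
                                    BinaryOperator<Expr> body) {
        // Call the lambda once, with placeholders, to record the expression.
        Expr kernel = body.apply((l, r) -> l, (l, r) -> r);
        // Nothing is computed until the returned Runnable is run.
        return () -> {
            for (int i = 0; i < dst.length; i++) {
                dst[i] = kernel.eval(left[i], right[i]);
            }
        };
    }
}
```

With `binaryReduction(left, right, dst, (l, r) -> l.add(r).mul(r))`, the user lambda is invoked a single time regardless of array length, which is what lets an implementation replace the interpreter with generated vector code.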


Assembly

<start>:  vmovdqu ymm0,YMMWORD PTR [rbx+rbp*4+0x18]
          vmovdqu ymm1,YMMWORD PTR [r13+rbp*4+0x18]
          vaddps  ymm1,ymm1,ymm0
          vmulps  ymm0,ymm1,ymm0
          vmovdqu YMMWORD PTR [r14+rbp*4+0x18],ymm0
          add     ebp,0x8
          cmp     ebp,0x200000
          jl      <start>
          ;Boilerplate follows
          vzeroupper
          add     rsp,0x10
          pop     rbp
          test    DWORD PTR [rip+…],eax
          ret
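Read as scalar Java, the loop body above loads eight floats from each of two arrays, adds them, multiplies the sum by the first operand, and stores eight results, advancing the index by 8 until it reaches 0x200000 (2,097,152). The sketch below spells that out; `KernelLoop` is a hypothetical name, and which of the slide's `left` and `right` arrays corresponds to `a` (the `rbx`-based load) is not stated in the source.

```java
// Scalar rendering of the loop; 8 floats fill one 256-bit ymm register.
final class KernelLoop {
    static final int LANES = 8;

    // Assumes a, b and dst have the same length, a multiple of LANES.
    static void run(float[] a, float[] b, float[] dst) {
        for (int i = 0; i < dst.length; i += LANES) {   // add ebp,0x8 / cmp / jl
            for (int lane = 0; lane < LANES; lane++) {  // one vector instruction covers all lanes
                float x = a[i + lane];                  // first vmovdqu into ymm0
                float y = b[i + lane];                  // second vmovdqu into ymm1
                float sum = y + x;                      // vaddps ymm1,ymm1,ymm0
                dst[i + lane] = sum * x;                // vmulps ymm0,ymm1,ymm0, then store
            }
        }
    }
}
```

The inner loop exists only in the scalar rendering; in the generated code each of its statements is a single AVX instruction, so the whole expression costs two loads, two arithmetic instructions and one store per eight elements.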


Applications

•  Vectorize without Vector (or Superword)

•  Higher-order flavor of programming works

•  Can focus on loop kernels


Tradeoffs

•  Control flow is hard to get right

•  ISA needs to have the right features

•  Deep branching can be penalized

•  No Vector-level operations, blending/shuffling, etc.

•  These aren’t regular lambdas. Care is required.

•  Trusting the compiler is necessary


Conclusion

•  A Vector API makes sense in Java

•  Different, complementary APIs

•  So far so good

•  Good code quality with CodeSnippets

•  Speedups observed vs. an optimized baseline


Continuing API Work

•  Enhancing the baseline Vector API

•  Exploring higher order functionality more

•  Synergizing with JVM features on the horizon


Interested?

•  Check out the Panama Project

•  jdk/test/panama/vector-api-patchable

•  Code samples from here will be up soon.

•  panama/jdk/tests/panama/vector-api-patchable

•  Builds with Maven

•  <legalese>Code Samples GPLv2 w/ Classpath Exception</legalese>

•  Prototype on Linux/MacOS* x86-64 with AVX2

* Other names and brands may be claimed as the property of others