Working Together - Linaro Connect SFO17 (connect.linaro.org.s3.amazonaws.com/sfo17/presentations…)


TRANSCRIPT

ENGINEERS AND DEVICES WORKING TOGETHER


K (Android 4.4): Dalvik + JIT compiler
L (Android 5.0): ART + AOT compiler
M (Android 6.0): ART + AOT compiler
N (Android 7.0): ART + JIT/AOT compiler
O (Android 8.0): ART + JIT/AOT compiler + vectorization



A SIMD instruction performs a single operation on multiple operands in parallel

ARM: NEON Technology (128-bit)

Intel: SSE* (128-bit) AVX* (256-bit, 512-bit)

MIPS: MSA (128-bit)

All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit)

Example: 4 x 32-bit operations in a single 128-bit register


● Many vectorizing compilers were developed by supercomputer vendors
● Intel introduced the first vectorizing compiler for SSE in 1999
● Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers

www.aartbik.com


Before:
for (int i = 0; i < 256; i++) {
  a[i] = b[i] + 1;
}

After (vectorized, 4 lanes):
for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}
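For reference, a plain Java version of the same loop (the method name and array lengths here are illustrative, not from the slides); this is exactly the shape of loop that ART's optimizing compiler can autovectorize:

// A minimal sketch; arrays are assumed to be 256-element int[].
static void addOne(int[] a, int[] b) {
  for (int i = 0; i < 256; i++) {
    a[i] = b[i] + 1;   // four iterations collapse into one 4x32-bit vector add
  }
}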


VectorOperation   (has vector length, has packed data type)
  VectorBinOp:  VectorAdd, VectorSub, ...
  VectorMemOp   (has alignment):  VectorLoad, VectorStore, ...

A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures.
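As an illustration only (ART's actual IR nodes are C++ classes inside the compiler; the Java names and fields below are a sketch of the diagram above, not a real API), the hierarchy could be modeled like this:

// Sketch of the vector-operation hierarchy shown on the slide; all names and fields are illustrative.
abstract class VectorOperation {
  final int vectorLength;          // number of lanes
  final Class<?> packedDataType;   // e.g. int.class for 4x32-bit lanes
  VectorOperation(int length, Class<?> type) { vectorLength = length; packedDataType = type; }
}
abstract class VectorBinOp extends VectorOperation {
  VectorBinOp(int length, Class<?> type) { super(length, type); }
}
abstract class VectorMemOp extends VectorOperation {
  final int alignment;             // memory operations additionally track alignment
  VectorMemOp(int length, Class<?> type, int alignment) { super(length, type); this.alignment = alignment; }
}
class VectorAdd   extends VectorBinOp { VectorAdd(int l, Class<?> t) { super(l, t); } }
class VectorSub   extends VectorBinOp { VectorSub(int l, Class<?> t) { super(l, t); } }
class VectorLoad  extends VectorMemOp { VectorLoad(int l, Class<?> t, int a) { super(l, t, a); } }
class VectorStore extends VectorMemOp { VectorStore(int l, Class<?> t, int a) { super(l, t, a); } }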

Before (vectorized):
for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}

After (constant [1,1,1,1] hoisted into t, loop unrolled by 2):
t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i  :i+3] = b[i  :i+3] + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}

Generated ARM64 code:

movi v0.4s, #0x1, lsl #0

mov w3, #0xc

mov w0, #0x0

Loop: cmp w0, #0x100 (256)

b.hs Exit

add w4, w0, #0x4 (4)

add w0, w3, w0, lsl #2

add w5, w3, w4, lsl #2

ldr q1, [x2, x0]

add v1.4s, v1.4s, v0.4s

str q1, [x1, x0]

ldr q1, [x2, x5]

add v1.4s, v1.4s, v0.4s

str q1, [x1, x5]

add w0, w4, #0x4 (4)

ldrh w16, [tr] ; suspend check

cbz w16, Loop

VecReplicateScalar(x)

ARM64:   dup v0.4s, w2
x86-64:  movdq xmm0, rdx
         pshufd xmm0, xmm0, 0
MIPS64:  fill.w w0, a2

/**
 * Cross-fade byte arrays x1 and x2 into byte array x_out.
 */
private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
  // Compute minimum length of the three byte arrays.
  int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
  // Morph with rounding halving add (unsigned).
  for (int i = 0; i < min; i++) {
    x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
  }
}

SEQUENTIAL (ARMv8 AArch64)

L:  cmp   w5, w0
    b.hs  Exit
    add   w4, w2, #0xc (12)
    add   w6, w3, #0xc (12)
    ldrsb w4, [x4, x5]
    ldrsb w6, [x6, x5]
    and   w4, w4, #0xff
    and   w6, w6, #0xff
    add   w4, w4, w6
    add   w6, w1, #0xc (12)
    add   w4, w4, #0x1 (1)
    asr   w4, w4, #1
    strb  w4, [x6, x5]
    add   w5, w5, #0x1 (1)
    ldrh  w16, [tr]        ; suspend check
    cbz   w16, L

SIMD (ARMv8 AArch64 + NEON Technology)

L:  cmp    w5, w4
    b.hs   Exit
    add    w16, w2, w5
    ldur   q0, [x16, #12]
    add    w16, w3, w5
    ldur   q1, [x16, #12]
    urhadd v0.16b, v0.16b, v1.16b
    add    w16, w1, w5
    stur   q0, [x16, #12]
    add    w5, w5, #0x10 (16)
    ldrh   w16, [tr]        ; suspend check
    cbz    w16, L

Runs about 10x faster!
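A rough way to reproduce that kind of comparison on a device (a minimal sketch; the array size, warm-up count, and iteration count are arbitrary choices here, and results depend heavily on whether the method runs interpreted, JIT-compiled, or AOT-compiled):

// Unscientific timing harness for the avg() cross-fade above; numbers are illustrative.
public class AvgBench {
  static final int N = 1 << 20;   // 1 MiB per array (assumption, not from the slides)

  private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
    int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
    for (int i = 0; i < min; i++) {
      x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
    }
  }

  public static void main(String[] args) {
    byte[] out = new byte[N], x1 = new byte[N], x2 = new byte[N];
    new java.util.Random(42).nextBytes(x1);
    new java.util.Random(43).nextBytes(x2);
    for (int i = 0; i < 1000; i++) avg(out, x1, x2);   // warm up so the loop gets compiled
    long t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) avg(out, x1, x2);
    System.out.println("avg(): " + (System.nanoTime() - t0) / 1000 + " ns per call");
  }
}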


Java code:

void mul_add(int[] a, int[] b) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}


Autovectorization result (initially only 64-bit / 2-lane vectors):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.2s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.2s}, [x16]
    mul  v1.2s, v0.2s, v1.2s
    add  v0.2s, v0.2s, v1.2s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v0.2s}, [x16]
    add  w0, w0, #0x2
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.2s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.2s}, [x16]
    mul  v1.2s, v0.2s, v1.2s
    add  v0.2s, v0.2s, v1.2s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v0.2s}, [x16]
    add  w0, w0, #0x2
    ldrh w16, [tr]
    cbz  w16, L

After (full 128-bit / 4-lane vectors, 68% perf boost):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.4s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.4s}, [x16]
    mul  v1.4s, v0.4s, v1.4s
    add  v0.4s, v0.4s, v1.4s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v0.4s}, [x16]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.4s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.4s}, [x16]
    mul  v1.4s, v0.4s, v1.4s
    add  v0.4s, v0.4s, v1.4s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v0.4s}, [x16]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L

After (multiply and add fused into mla, 11% perf boost):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.4s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.4s}, [x16]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v2.4s}, [x16]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.4s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.4s}, [x16]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v2.4s}, [x16]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L

After (array-data offset +12 folded into ldur/stur addressing, 23% perf boost):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, w0, lsl #2
    ldur q0, [x16, #12]
    add  w16, w2, w0, lsl #2
    ldur q1, [x16, #12]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    add  w16, w1, w0, lsl #2
    stur q2, [x16, #12]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, w0, lsl #2
    ldur q0, [x16, #12]
    add  w16, w2, w0, lsl #2
    ldur q1, [x16, #12]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    add  w16, w1, w0, lsl #2
    stur q2, [x16, #12]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L

After (offset constant #0xc hoisted into w3 outside the loop, 10% perf boost):
    mov  w3, #0xc
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L

After (loop body unrolled by 2, 2.5% perf boost):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L


Before:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    add  w4, w3, w0, lsl #2
    ldr  q0, [x1, x4]
    ldr  q1, [x2, x4]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x4]
    add  w0, w0, #0x4
    ldrh w16, [tr]
    cbz  w16, L

After (both unrolled iterations' offsets computed up front, 12% perf boost):
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w0, #0x4
    add  w0, w3, w0, lsl #2
    add  w5, w3, w4, lsl #2
    ldr  q0, [x1, x0]
    ldr  q1, [x2, x0]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x0]
    ldr  q0, [x1, x5]
    ldr  q1, [x2, x5]
    mov  v2.16b, v0.16b
    mla  v2.4s, v0.4s, v1.4s
    str  q2, [x1, x5]
    add  w0, w4, #0x4
    ldrh w16, [tr]
    cbz  w16, L


for (int i = 0; i < LENGTH; i++) {
  c[i] = (byte)(a[i] + b[i]);
}

i87  Add [i80,i79]
i102 IntermediateAddressIndex [i87,i98,i3]
i99  IntermediateAddressIndex [i80,i98,i3]
d89  VecLoad [l35,i102]
d84  VecLoad [l35,i99]
d83  VecLoad [l29,i99]
d88  VecLoad [l29,i102]
d85  VecAdd [d83,d84]
d90  VecAdd [d88,d89]
d86  VecStore [l27,i99,d85]
d91  VecStore [l27,i102,d90]
i92  Add [i87,i79]
v78  Goto


Java code:
static final int LENGTH = 1024 * 256; // 256K elements, 0x40000
static byte [] a = new byte[LENGTH];
static byte [] b = new byte[LENGTH];
static byte [] c = new byte[LENGTH];

(gdb) x/64u 0xefc0b000
0xefc0b000:   0  28 192  18   0   0   0   0    <- object header
0xefc0b008:   0   0   4   0 100 101 102 103    <- header ends after 4 more bytes; data[0] starts at offset 12 (0xc)
0xefc0b010: 104 105 106 107 108 109 110 111
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151

One VecLoad / VecStore covers 16 consecutive bytes of this array data.


0xefc0b000: 0 28 192 18 0 0 0 0

0xefc0b008: 0 0 4 0 100 101 102 103

0xefc0b010: 104 105 106 107 108 109 110 111

0xefc0b018: 112 113 114 115 116 117 118 119

0xefc0b020: 120 121 122 123 124 125 126 127

0xefc0b028: 128 129 130 131 132 133 134 135

0xefc0b030: 136 137 138 139 140 141 142 143

0xefc0b038: 144 145 146 147 148 149 150 151

SIMD from here->

Avoid SIMD from here
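A quick arithmetic check of where the aligned region begins, using only numbers visible in the dump above (an illustrative calculation, not ART code; the 12-byte figure is the header size implied by the dump):

// Alignment arithmetic for the dump: array object at 0xefc0b000, data[0] at offset 12.
public class AlignmentCheck {
  public static void main(String[] args) {
    long base = 0xefc0b000L;                  // object start (from the gdb output)
    int headerBytes = 12;                     // 8-byte header + 4-byte length field (0x00040000 = 256K)
    long data0 = base + headerBytes;          // first element: 0xefc0b00c (not 16-byte aligned)
    long aligned = (data0 + 15) & ~15L;       // next 16-byte boundary: 0xefc0b010
    System.out.printf("data[0] at 0x%x, first 16-byte aligned element at 0x%x (element index %d)%n",
        data0, aligned, aligned - data0);
  }
}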



● Analyzable and flexible: CHECKED!
● Embeddable: CHECKED!
● Stable and reproducible: CHECKED!
● Recognized: CHECKED!


● LDR q1, [x16] + LDR q2, [x16, #16]  ->  LDP q1, q2, [x16]


Java:
void mul_add(int[] a, int[] b, int[] c) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}

Scalar version:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w4, w1, #0xc
    ldr  w6, [x4, x0, lsl #2]
    add  w5, w2, #0xc
    ldr  w5, [x5, x0, lsl #2]
    madd w5, w6, w5, w6
    str  w5, [x4, x0, lsl #2]
    add  w0, w0, #0x1
    ldrh w16, [tr]
    cbz  w16, L

Initial SIMD version:
L:  cmp  w0, #0x200
    b.hs Exit
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v0.2s}, [x16]
    add  w16, w2, #0xc
    add  x16, x16, x0, lsl #2
    ld1  {v1.2s}, [x16]
    mul  v1.2s, v0.2s, v1.2s
    add  v0.2s, v0.2s, v1.2s
    add  w16, w1, #0xc
    add  x16, x16, x0, lsl #2
    st1  {v0.2s}, [x16]
    add  w0, w0, #0x2
    ldrh w16, [tr]
    cbz  w16, L
