October 20-24. Primitive Performance. Roger Hui, Morten Kromberg, Dyalog LTD. Dyalog’13

Primitive Performance

Roger Hui, Morten Kromberg
Dyalog LTD

Dyalog’13

Page 3: October 20-24. Primitive Performance Roger Hui, Morten Kromberg Dyalog LTD Dyalog13

”Primitive Performance”

Goal: Constantly improve the performance of the existing primitive functions and operators
• Two main problems...
• Hard: Deciding what to optimise
– Easy: Clever people must think of better algorithms
• Hard: Don’t ”accidentally” cause slowdowns
– Hard: Even understanding whether it happened

Prioritizing Tuning

Deciding where to start:
• APLMON: Profiles APL Interpreter
• ]PROFILE: Profiles application code
• Customer benchmarks
– Please send us your code!
• Comparisons with other array languages
– Internal testing
– External benchmarks
– Conversion projects to Dyalog APL

Don’t slow anything down!!!

• Over time, there is a tendency for things to slow down as features are added
– Unicode, 64-bit, OO, better error messages, etc...
– Sometimes even as a side-effect of tuning work
• Solution: The Performance Quality Assurance (PQA) Framework:
• Internal tool for the Dyalog Development team to measure the performance of individual primitives and the ”execution framework” on a daily basis

PQA Project Goals

• Reliably detect slowdowns greater than 2%, in any primitive function or operator expression
• Publish a ”performance certificate” for each release
– No surprises for customers: Slowdowns that we cannot compensate for should be explained (e.g. 64-bit project)
– Hard evidence of speed-ups for the world to see
• Run PQA continuously during development, catch performance degradation immediately!
– Avoid the expensive search for the bad code change ”sometime last year”
– Important: Avoid false positives (they are VERY expensive)
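The 2% goal can be illustrated with a small Python sketch (not PQA code, which is written in APL). It compares two sets of recorded timings per expression and flags any expression whose typical time grew by more than the threshold. The function name `find_slowdowns`, the use of medians, and the sample numbers are all assumptions made for illustration; the slides do not say which statistic PQA compares.

```python
from statistics import median

SLOWDOWN_THRESHOLD = 0.02  # the 2% goal stated on the slide


def find_slowdowns(baseline, current, threshold=SLOWDOWN_THRESHOLD):
    """Return {expression: relative_change} for expressions slower than threshold.

    baseline and current map an expression string to a list of positive
    timings in the same unit. Comparing medians is an assumption of this
    sketch, not necessarily what PQA does.
    """
    flagged = {}
    for expr, old_times in baseline.items():
        new_times = current.get(expr)
        if not new_times:
            continue  # expression missing from the new run: nothing to compare
        old, new = median(old_times), median(new_times)
        change = (new - old) / old
        if change > threshold:
            flagged[expr] = change
    return flagged


# Made-up timings for two expressions taken from the sample list.
baseline = {"xs4+ys4": [100.0, 101.0, 99.0], "+z2": [50.0, 50.0, 51.0]}
current = {"xs4+ys4": [105.0, 104.0, 106.0], "+z2": [50.0, 50.5, 50.0]}
slow = find_slowdowns(baseline, current)
print(slow)
```

With these made-up numbers only `xs4+ys4` is flagged. A single fixed threshold like this is exactly where false positives come from when timings are noisy, which is why the repeatability work described later matters.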

Challenges

• A huge number of different cases to generate and test

• Getting repeatable timings is extraordinarily difficult
– Some timings are TINY e.g. ”0+0”

• Huge volume of data to analyse

Huge Number of Cases

• APL is our friend
• PQA framework generates ~14,000 different APL expressions
• ~600 different variables are created for use by different expressions
• Each expression is repeated for approximately 3-4 seconds
– Currently split into 10 runs of <0.5 secs each
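The "10 runs of <0.5 secs each" scheme could look roughly like the following Python sketch. It is an illustration only: `time_in_runs` and its parameters are hypothetical names, and the real framework times APL expressions inside the interpreter rather than Python callables.

```python
import time


def time_in_runs(func, runs=10, run_budget=0.4):
    """Call func repeatedly, grouped into `runs` runs of about `run_budget` seconds.

    Returns one list of per-call timings (in seconds) per run. A run ends
    after the first call that finishes past the budget, so it can overshoot
    the budget by at most one call.
    """
    results = []
    for _ in range(runs):
        timings = []
        run_start = time.perf_counter()
        while True:
            t0 = time.perf_counter()
            func()
            t1 = time.perf_counter()
            timings.append(t1 - t0)
            if t1 - run_start >= run_budget:
                break
        results.append(timings)
    return results


# Small budgets keep this demonstration fast; the defaults give about 4 seconds.
demo_runs = time_in_runs(lambda: sum(range(1000)), runs=3, run_budget=0.01)
print(len(demo_runs), [len(r) for r in demo_runs])
```

Splitting the total time into several short runs gives several independent samples per expression, so one disturbed run does not spoil the whole measurement.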

100 Expressions (selected at random)

+z2   ×i2   |l2   ⌈s4   ÷zn0   xs4+ys4   xi1×yb1   xl2|yi2   xi4⌈yl4   xi4⌊yl4
xi1<ys1   xi1≥ys1   xl0>yb0   xi0≠yl0   ¯11○yd1   xa4∪ya2   xb2⍳yi1   xb2∪ys2   xs0∪yl2   xs0∩ys0
xs1~ys0   xs4∊yb4   xi2∪yl0   xd0~yd1   xz1⍳yz1   ⌈⍀bw4   +\iw2   ⌊\dw2   ⌊\bt1   ⌈⌿lt4
⌊⍀lt4   -/bq2   ⌈\bq1   ⌊⌿sq4   xb2∘.+yb1   10↑[0] bw1   10↑[0] zt0   ¯10↓[0] bt1   ,dv1   ⊖av2
⊣lv4   ⍒iv1   ⊖at1   ,xq2   ,zq1   ⍕aq4   s⍴xv4   s⍴dv2   ¯10↓lv0   bv1,sv1
sv4,dv4   av4≢lv4   xv0≢av0   xv0≢dv4   iv4≡zv0   lv0≡zv4   dv0≡xv0   dv0≢iv0   dv4≢bv4   s⍷zv4
¯1↑aw0   ¯1↑lw2   10↑lw2   10↓lw2   11 ¯10↓sw2   ¯1⌽xw4   ¯10⊖lw1   sw2,iw1   zw2,lw2   xw0≢dw4
xw4≢sw4   bw4≢lw0   iw4≡sw0   iw4≢dw0   iw4≢dw4   zw0≢sw4   zw4≡xw4   zw4≢sw4   s⍴at2   10↑it2
¯10↓dt4   11 ¯1↑at0   11 ¯1↑zt2   10⌽bt2   ¯10⊖bt4   zt0⍪it2   zt2⍪lt2   st0,lt0   at4≢dt0   xt4≢it0
st0≡it0   lt4≡zt4   11 1↑bq0   11 ¯10↑sq2   (j0 k)⊃xv4   (k k4)⌷zw4   bv2[j0]←bv0   dv2[j0]←dv0   paren10   {⍵[⍋⍵]}iv4

(... and about 13,900 others)
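Expressions such as `xs4+ys4` and `xi1×yb1` suggest that many test cases are built by combining a function with systematically named arguments. The Python sketch below shows how such a family could be enumerated. The function subset `+-×`, the restriction to equal-length arguments, and the name `dyadic_expressions` are assumptions for illustration; the actual generator is part of PQA and is not shown in the slides.

```python
from itertools import product

DATATYPES = "bsildz"     # numeric datatype letters used in the variable names
SIZES = "0124"           # size digits: 1e0, 1e1, 1e2, 1e4 elements
SCALAR_DYADICS = "+-×"   # hypothetical subset of dyadic scalar functions


def dyadic_expressions(functions=SCALAR_DYADICS, datatypes=DATATYPES, sizes=SIZES):
    """Build strings such as 'xs4+ys4' (left argument x.., right argument y..).

    Assumption: both arguments get the same size digit so that their
    shapes conform for a scalar function.
    """
    return [
        f"x{left}{size}{func}y{right}{size}"
        for func, left, right, size in product(functions, datatypes, datatypes, sizes)
    ]


exprs = dyadic_expressions()
print(len(exprs), exprs[:3])
```

Three functions, six left types, six right types and four sizes already give 432 expressions, which shows how quickly the case count grows towards the ~14,000 mentioned earlier.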

Variables (~600)

left (x) or right (y) argument

datatype
b - boolean, 1 bit
s - short integer, 1 byte
i - integer, 2 bytes
l - long integer, 4 bytes
d - double, 8 bytes
z - complex, 2 doubles, 16 bytes
f - DECF
a - alphanumeric, 1-byte character
x – enclosed (char vectors)

length, usually number of elements, but can be number of rows or number of columns
0 - 1e0, vector or scalar or singleton
1 - 1e1
2 - 1e2
4 - 1e4

kind of array
v - vector (but *v0 is a scalar)
t - tall matrix, 11-column matrix with 10*0 1 2 4 rows
w - wide matrix, 11-row matrix with 10*0 1 2 4 columns
q - square matrix with ⌈10*0 1 2 4×0.6 (1 4 16 252) rows/columns

domain (for variables used in scalar functions)
n - non-zero
p - positive and ~ ∊ 0 1
u - unit circle; used in inverse trig functions

special
c k - scalar indices ?6i – int vector of file com nos or native file indices
j0 j1 j2 - index vectors of length 1e0 1e1 1e2
k1 k2 k4 - index vectors of length 7 in the range 1e1 1e2 1e4
bvc svc ivc lvc ... - 11-element vectors of various types

Examples:
zn0: complex non-zero scalar
l4: 10,000 element long integers
sp1: 10 element 1-byte ints >1
d2: 100 element double
bt2: 100x11 boolean matrix
xw4: 11x10000 matrix of enclosed char vectors
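The naming scheme is regular enough to decode mechanically. The Python sketch below interprets names of the form datatype letter, optional kind or domain letter, size digit. It deliberately ignores the leading x/y argument prefix and the special variables, and it treats every size-0 name without a matrix kind as a scalar; both are simplifying assumptions, and `decode` is a hypothetical helper, not part of PQA.

```python
import math
import re

DATATYPES = {
    "b": "boolean", "s": "short integer", "i": "integer", "l": "long integer",
    "d": "double", "z": "complex", "f": "DECF", "a": "alphanumeric",
    "x": "enclosed",
}
DOMAINS = {"n": "non-zero", "p": "positive and not 0 or 1", "u": "unit circle"}
NAME = re.compile(r"^([bsildzfax])([vtwqnpu]?)([0124])$")


def decode(name):
    """Decode a variable name such as 'bt2' into datatype, domain and shape."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised variable name: {name!r}")
    dtype, middle, digit = match.groups()
    n = 10 ** int(digit)
    if middle == "t":            # tall matrix: n rows, 11 columns
        shape = (n, 11)
    elif middle == "w":          # wide matrix: 11 rows, n columns
        shape = (11, n)
    elif middle == "q":          # square matrix: ceiling of 10 to the power 0.6 x digit
        side = math.ceil(10 ** (0.6 * int(digit)))
        shape = (side, side)
    elif n == 1:                 # size 0 is treated as a scalar in this sketch
        shape = ()
    else:                        # vector, with or without a domain letter
        shape = (n,)
    return {"datatype": DATATYPES[dtype], "domain": DOMAINS.get(middle), "shape": shape}


for example in ("zn0", "l4", "sp1", "bt2", "xw4"):
    print(example, decode(example))
```

The square-matrix rule reproduces the side lengths 1, 4, 16 and 252 given in the table for sizes 0, 1, 2 and 4.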

Repeatable Timings...

• Use a dedicated machine (real, NOT virtual!)
– At Dyalog: 4 cores, 96Gb RAM, ”nothing installed except APL”
• Run processes at high or realtime priority
• ”Pre-expand” workspaces using 2000⌶
• Control workspace compactions carefully
• Carefully craft the execution loop to have minimum variable overhead

The Inner Loop (1/2)

    ∇ r←a TIMEX b;ai;cnt;n;rep;min;e;sum;kt;m
      cnt←0                             ⍝ We will try 3 times
      :Repeat
          kt←GetPrivilegedProcessorTime ⍝ Will be checked at end
          cnt←cnt+1
          ai←⎕AI                        ⍝ Record CPU & Elapsed time
          {}⎕WA                         ⍝ Compact workspace
          pqa_cal_wait                  ⍝ Check time of calibration expression
          pqa_redef b                   ⍝ ensure args are in new pockets
          :If 0≠rep←pqa_REPS[pqa_I] ⋄ r←⍬ ⋄ min←pqa_TIME[pqa_I;1] ⋄ e←b ⍝ (use reps set in file to be compared with)
          :Else
              min←1⌈⌊/r←10 timefx e←b
              :If ∨/m←pqa_reps_EXPR∧.=(¯1↑⍴pqa_reps_EXPR)↑b
                  rep←pqa_reps_REPS[m⍳1]
              :Else ⋄ rep←1⌈pqa_rep_ticks⌊⌈pqa_rep_ticks÷min ⍝ reps required to get 200 ticks (70 microsec)
              :EndIf
          :EndIf
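The repetition count in the final `:Else` branch is dense when read right to left, so here is the same arithmetic spelled out as a Python sketch. It assumes that the trial timings returned by `timefx` are in ticks and that `pqa_rep_ticks` is 200, as the comment in the listing suggests; `choose_reps` is a hypothetical name.

```python
import math


def choose_reps(trial_ticks, target_ticks=200):
    """Mirror  min←1⌈⌊/r  followed by  rep←1⌈pqa_rep_ticks⌊⌈pqa_rep_ticks÷min.

    trial_ticks: timings of the trial executions, assumed to be in ticks.
    target_ticks: stands in for pqa_rep_ticks (assumed to be 200).
    """
    fastest = max(1, min(trial_ticks))             # 1⌈⌊/r : fastest trial, at least 1 tick
    reps = math.ceil(target_ticks / fastest)       # ⌈pqa_rep_ticks÷min
    return max(1, min(target_ticks, reps))         # clamp into 1..pqa_rep_ticks


print(choose_reps([50, 40, 60]))   # fastest trial is 40 ticks
print(choose_reps([0]))            # immeasurably fast expression
print(choose_reps([1000]))         # already longer than the target
```

Fast expressions are therefore repeated many times inside the timed loop, up to 200 repetitions, while an expression that already takes more than 200 ticks is executed once per timing.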

Recorded Data

• The complete distribution of [several thousand] timings for each expression is recorded
– The inner loop size for each expression is recorded and can be used as input to the next recording to create [more] comparable timings
• Deciding what the data ”means” is not easy...

Reporting

• Providing useful reports on such a large quantity of data is a huge challenge.

• Report needs to quickly identify bad (and good) news, without ”false positives”.
– A report with many false positives is ”worse than useless”
• Current run-time for data collection is ~13 hours, which makes tool hard to use during development

The Hardest Part (for me)

A More Interesting Report

V14.0 With 3 Months To Go

Planned Work

• Finalize report format and issue official 13.2 report – then 14.0

• Hook reporting tool up to internal (MiServer-based) web server so entire development team can ”drill down” and schedule runs

• Create shorter test and web-based scheduler for ”ad hoc” use by developers needing short turnaround to verify a change

• Bring APLMON categories in line with PQA, so an APLMON profile can be combined with PQA data to predict performance (might work)

• Holy Grail: Hook PQA up to overnight build system, so updates are blocked if a fix causes a degradation (and responsible developer fined for not running the test himself)

Credits

• Most of the real work done by Roger Hui
– (and most of the tuning, too!)

• Morten is still working on getting reproducible/stable numbers and reporting
