performance of mathematical software agner fog technical university of denmark

21
Performance of mathematical software Agner Fog Technical University of Denmark www.agner.org

Upload: sara-wiggins

Post on 03-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance of mathematical software Agner Fog Technical University of Denmark

Performance of mathematical software

Agner FogTechnical University of Denmark

www.agner.org

Page 2: Performance of mathematical software Agner Fog Technical University of Denmark

• Intel and AMD microarchitecture• Performance bottlenecks• Parallelization• C++ vector classes, example• Instruction set dispatching• Performance measuring• Hands-on examples

Agenda

Page 3: Performance of mathematical software Agner Fog Technical University of Denmark

AMD Microarchitecture

• 2 threads per CPU unit

• 4 instructions per clock

• Out-of-order execution

Page 4: Performance of mathematical software Agner Fog Technical University of Denmark

Intel Micro-architecture

Page 5: Performance of mathematical software Agner Fog Technical University of Denmark

Typical bottlenecks• Installation of program• Program start, load framework and

libraries• System database• Network• File input / output• Graphics• RAM access, cache utilization• Algorithm• Dependency chains• CPU pipeline• CPU execution units

Sp

eed

Page 6: Performance of mathematical software Agner Fog Technical University of Denmark

Efficient cache use

• Store data in contiguous blocks• Avoid advanced data containers

such as resizable data structures, linked lists, etc.

• Store local data inside the function they are used in

Page 7: Performance of mathematical software Agner Fog Technical University of Denmark

Branch prediction

Loop

Branch

BA

Page 8: Performance of mathematical software Agner Fog Technical University of Denmark

Out-Of-Order Execution

x = a / b;y = c * d;z = x + y;

Page 9: Performance of mathematical software Agner Fog Technical University of Denmark

Dependency chains

x = a + b + c + d;

x = (a + b) + (c + d);

Page 10: Performance of mathematical software Agner Fog Technical University of Denmark

Loop-carrieddependency chain

for (i = 0; i < n; i++) { sum += x[i]; }

for (i = 0; i < n; i += 2) { sum1 += x[i]; sum2 += x[i+1]; }sum = sum1 + sum2;

Page 11: Performance of mathematical software Agner Fog Technical University of Denmark

Levels of parallelization

• Multiple threads running in different CPU units

• Out-of-order execution• SIMD instructions

Page 12: Performance of mathematical software Agner Fog Technical University of Denmark

SIMD programming methods

• Assembly language• Intrinsic functions• Vector classes• Automatic vectorization by

compiler

Page 13: Performance of mathematical software Agner Fog Technical University of Denmark
Page 14: Performance of mathematical software Agner Fog Technical University of Denmark

Obstacles to automatic vectorization

• Pointer aliasing• Array alignment• Array size• Algebraic reduction• Branches• Table lookup

Page 15: Performance of mathematical software Agner Fog Technical University of Denmark

Vector math libraries

Short vectors

• Intel SVML• AMD libm• Agner’s VCL

Long vectors

• Intel VML• Intel IPP• Yeppp

Page 16: Performance of mathematical software Agner Fog Technical University of Denmark

ExampleExponential function

in vector classes

Page 17: Performance of mathematical software Agner Fog Technical University of Denmark

Instruction set dispatching

SSE

SSE2

SSE3

Generic x86

SSE code

SSE2 code

AVX2 code

N

J

N

N

J

J

GenuineIntel

N

J

Future extensions

Page 18: Performance of mathematical software Agner Fog Technical University of Denmark

Performance measurement

• Profiler

• Insert measuring instruments into code

• Performance monitor counters

Page 19: Performance of mathematical software Agner Fog Technical University of Denmark

When timing results are unstable

• Stop other running programs• Warm up CPU with heavy work• Disable power saving features in

BIOS setup• Disable hyperthreading• Use “core clock cycles” counter

(Intel only)

Page 20: Performance of mathematical software Agner Fog Technical University of Denmark

ExampleMeasuring latency and throughput of

machine instruction

Page 21: Performance of mathematical software Agner Fog Technical University of Denmark

Hands-on example

Compare performance of different mathematical function libraries

http://cern.agner.org