Introduction to application optimizations with usage of Intel® performance tools


Page 1: Introduction to application optimizations with usage of  Intel ®  performance tools

Software & Services Group, Developer Products Division. Copyright © 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Introduction to application optimizations with usage of Intel® performance tools.

Andrei Anufrienko, Intel Compiler Group

Page 2: Introduction to application optimizations with usage of  Intel ®  performance tools


The objectives of this course: get a basic understanding of
• the main factors of processor performance,
• basic performance improvement techniques,
• Intel® tools for performance analysis,
• the main options and components of the Intel compiler,
• the theoretical foundations of some performance optimizations.

Page 3: Introduction to application optimizations with usage of  Intel ®  performance tools


You will be able to:
• describe the main problems of processor performance;
• investigate an application using the VTune™ Performance Analyzer and find problem areas;
• identify the main performance problems of the analyzed application;
• develop a strategy to improve application performance;
• describe the main components of the compiler and their functions;
• control the level of optimization with command line options.

Page 4: Introduction to application optimizations with usage of  Intel ®  performance tools


Course plan
• Intel microprocessor architecture and main factors affecting processor performance;
• VTune™ Performance Analyzer usage;
• The role of the compiler in improving application performance;
• Some theoretical concepts. Control flow graph, data-flow analysis;
• Permutation optimizations and their applicability. Dependencies;
• Vectorization;
• Parallelization using OMP directives and auto parallelization;
• The main components of the compiler, their tasks and interconnection.

Page 5: Introduction to application optimizations with usage of  Intel ®  performance tools


Intel microprocessor architecture and the main factors affecting the processor performance.

Page 6: Introduction to application optimizations with usage of  Intel ®  performance tools


Simplified processor model

[Block diagram: the processor (control unit, ALU, registers) exchanges commands and data with operative memory (RAM) over the system bus; the input-output unit and external memory are reached over the input-output bus.]

Page 7: Introduction to application optimizations with usage of  Intel ®  performance tools


Simplified processor model: Control Unit (CU), Arithmetic and Logic Unit (ALU), system registers, Front Side Bus (FSB), memory, peripheral devices.

Control Unit (CU):
• decodes instructions received from memory;
• controls the ALU;
• performs data transfers between the CPU registers, memory and peripheral devices.

The ALU consists of several units that perform arithmetic and logical operations on the system registers.

System registers are a small amount of memory inside the CPU used for temporary storage of the information being processed.

The system bus is used for data transfer between the CPU and memory, as well as between the CPU and peripherals.

Page 8: Introduction to application optimizations with usage of  Intel ®  performance tools


High performance is one of the key factors in the competition among computer system manufacturers.

Processor performance is directly related to the amount of computational work that can be processed in a given time.

Roughly speaking: Performance = Number of instructions / Time

We will talk about performance on the basis of the IA-32 and IA-32e architectures (IA-32 with EM64T).

Factors affecting processor performance:
• CPU clock frequency;
• accessible memory amount and speed;
• the performance of the instructions and the completeness of the instruction set;
• the usage of the internal registers;
• the quality of pipelining;
• the quality of branch prediction;
• the quality of prefetching;
• superscalarity;
• the quality of vectorization;
• parallelization and multi-core.

Page 9: Introduction to application optimizations with usage of  Intel ®  performance tools


Clock rate
Because the processor is made of different components working at different speeds, a processor clock provides synchronization by sending periodic pulses. Its frequency is called the clock speed of the processor.

Memory speed and amount
• 8086 – 1 MB of memory.
• 80286 – new system registers and a new memory mode – 16 MB of memory.
• 80386 – the first 32-bit processor – 4 GB.
• EM64T technology (Extended Memory 64 Technology) – up to 2^64 bytes.

Page 10: Introduction to application optimizations with usage of  Intel ®  performance tools


The performance of the instructions and completeness of the instruction set

Performance depends on how well the instructions are implemented and how well the basic instruction set covers all possible tasks.

CISC, RISC (complex / reduced instruction set computing). Modern Intel processors are a hybrid of CISC and RISC: before execution, the processor converts CISC instructions into a simpler RISC-like instruction set.

Page 11: Introduction to application optimizations with usage of  Intel ®  performance tools


Registers and memory
System registers have the smallest access time, so the number of available registers affects the performance of the microprocessor. Register spilling: a lack of system registers causes heavy traffic between the registers and the application stack.

IA-32e: EM64T technology added additional system registers.

Today the memory access speed is much lower than the speed of calculations. Two characteristics describe the properties of memory:
• Response time (latency) – the number of processor cycles required to transfer data from a memory unit.
• Bandwidth – the number of items that can be transferred between the processor and memory in one cycle.

Two possible performance improvement strategies: reduce the response time or prefetch the necessary memory.

Page 12: Introduction to application optimizations with usage of  Intel ®  performance tools


Memory access time is reduced by the cache system (a small amount of fast memory located on the processor).

Memory blocks are preloaded into the cache. If the requested address is already in the cache, there is a "hit" and data acquisition is greatly accelerated. Otherwise there is a "cache miss" and additional time is needed: the block of memory is read into the cache over one or more bus cycles, which is called filling a cache line. (The size of a cache line is 64 bytes.)

There are different kinds of cache:
• fully associative cache memory (each block can appear anywhere inside the cache);
• direct-mapped cache (each block can be loaded into exactly one place);
• various hybrid options (for example, set-associative caches).

– Set-associative access: the least significant bits of the address determine the cache set this memory can be loaded into; a cache line may contain a few words from main memory, and the mapping inside the set is held on an associative basis.

The quality of memory access is a main key to performance.
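To make the effect of memory-access quality concrete, here is a minimal C sketch (the array size and function names are illustrative, not taken from the slides): both functions compute the same sum, but the row-wise loop walks memory sequentially and uses every loaded cache line fully, while the column-wise loop strides across rows and may trigger a cache-line fill on almost every access.

#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch consecutive
 * addresses, so each loaded cache line is fully used. */
double sum_rows(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same C array: consecutive iterations
 * jump N*sizeof(double) bytes, so almost every access may start a new
 * cache-line fill (a cache miss). */
double sum_cols(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}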

Page 13: Introduction to application optimizations with usage of  Intel ®  performance tools


Modern computing architectures contain a complicated cache hierarchy. Nehalem (Core i7):
• L1 – latency 4 cycles
• L2 – latency 11 cycles
• L3 – latency 38 cycles
• Operative memory – latency > 100 cycles

A proactive memory access mechanism is implemented with hardware prefetching, which is based on the history of cache misses. It tries to detect and prefetch independent streams of data.

There is also a special set of instructions that allows the program to induce the processor to load specified memory into the cache (software prefetching).
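As a hedged illustration of software prefetching (a sketch only; the prefetch distance of 16 elements and the function name are assumptions made for this example), SSE provides the _mm_prefetch intrinsic, which hints the processor to start loading an address into the cache:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

/* Sum an array while asking the processor to start loading data that
 * will be needed a few iterations ahead. The distance of 16 elements
 * (two 64-byte cache lines of doubles) is only a guess and would have
 * to be tuned for a real workload. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

On processors with a good hardware prefetcher such a simple streaming loop usually gains little; explicit prefetching tends to help for access patterns the hardware cannot detect.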

Page 14: Introduction to application optimizations with usage of  Intel ®  performance tools


The principle of locality. The quality of the prefetch.
Locality of reference makes it possible to reuse variables or related data. There is a difference between temporal locality – the reuse of certain data and resources – and spatial locality – the use of data located nearby in memory.

The caching mechanism exploits temporal locality: before a new cache line is loaded into the cache, some cache line has to be freed, and the cache mechanism selects the one with the oldest access time.

The prefetching engine exploits spatial locality: it tries to detect the pattern of memory access and pre-load into the cache the memory that will be needed soon. The size of a preloaded block (a cache line) is 64 bytes. Thus, with good spatial locality (data used together during a calculation is located nearby in memory) fewer cache lines have to be loaded into the cache.

One known performance problem is "cache aliasing": an unfortunate memory layout of the objects participating in a calculation causes useful cache lines to be replaced while other needed addresses are loaded.

[Diagram: computing z = sqrt(y² + x²). When x, y and z lie next to each other in memory, only one cache line has to be loaded; when they are far apart, up to three cache lines have to be loaded.]
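A minimal C sketch of the x, y, z example above (the struct and function names are invented for illustration): when the three values used by the formula are adjacent, they normally share one cache line; when they live in three separate arrays, the same update may touch up to three cache lines.

#include <math.h>
#include <stddef.h>

/* Good spatial locality for this formula: x, y and z of one point are
 * adjacent, so a single 64-byte cache line normally covers all three. */
struct point { double x, y, z; };

void update_packed(struct point *p) {
    p->z = sqrt(p->y * p->y + p->x * p->x);
}

/* Poor spatial locality for the same formula: the three coordinates
 * live in separate arrays, so one update may touch up to three
 * different cache lines. */
void update_split(double *x, double *y, double *z, size_t i) {
    z[i] = sqrt(y[i] * y[i] + x[i] * x[i]);
}

Which layout is better overall depends on the access pattern; the point here is only how many cache lines one calculation touches.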

Page 15: Introduction to application optimizations with usage of  Intel ®  performance tools


Pipeline

Stages: instruction fetch (IF), register fetch (RF), instruction decode (ID), execution (EX), data fetch (DF), writeback (WB).

tick 0: IF = instr. 1
tick 1: IF = instr. 2, RF = instr. 1
tick 2: IF = instr. 3, RF = instr. 2, ID = instr. 1
tick 3: IF = instr. 4, RF = instr. 3, ID = instr. 2, EX = instr. 1
tick 4: IF = instr. 5, RF = instr. 4, ID = instr. 3, EX = instr. 2, DF = instr. 1
tick 5: IF = instr. 6, RF = instr. 5, ID = instr. 4, EX = instr. 3, DF = instr. 2, WB = instr. 1
tick 6: IF = instr. 7, RF = instr. 6, ID = instr. 5, EX = instr. 4, DF = instr. 3, WB = instr. 2

Page 16: Introduction to application optimizations with usage of  Intel ®  performance tools


The quality of pipelining, instruction-level parallelism
Pipelining assumes that successive instructions are processed together during execution, but in different phases of the pipeline. Typical instruction execution can be divided into the following steps:
• instruction fetch – IF;
• instruction decoding / register selection – ID;
• operation / calculation of effective memory addresses – EX;
• memory access – MEM;
• storing the result – WB.

Pipelining improves the throughput of the processor, but if instructions depend on the results of previous instructions, there will be delays. Thus the benefit of pipelining depends on the level of instruction parallelism.

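To illustrate the point (a hedged sketch; the function names are invented for this example), compare two ways of summing an array in C: the first forms a serial chain of additions through a single accumulator, while the second keeps two independent partial sums that the pipeline can overlap.

#include <stddef.h>

/* Every iteration depends on the previous value of s, so the additions
 * form a serial chain and the pipeline cannot overlap them. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent partial sums give the pipeline independent work to
 * overlap (at the cost of a slightly different rounding order). */
double sum_two_accumulators(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];
    return s0 + s1;
}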
Page 17: Introduction to application optimizations with usage of  Intel ®  performance tools


The quality of prediction
Instructions may depend on data and on control logic (data dependence and control-flow dependence). The efficiency of the pipeline is limited by the conditional branches inside the instruction flow: if there is a conditional branch, the following instructions are not known until the condition has been evaluated. Should the pipeline be stopped?

The branch predictor is designed to solve this problem. The predictor selects one possible way and continues fetching and processing instructions. All processed instructions are kept in the pipeline storage. If the predictor's assumption was correct, all of them are marked as valid; otherwise a "branch misprediction" happens: the pipeline storage must be cleaned and new instructions must be fetched.

There are static and dynamic predictors:
• A static predictor uses simple rules, e.g. trivial prediction – the branch is assumed not taken if the transition is carried forward and taken if it is a backward jump;
• A dynamic predictor collects statistics on every branch and bases its choice on this information.

There is also branch target prediction, which predicts unconditional jumps.
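A hedged C sketch of why mispredictions matter (the function names and the data assumption are mine, not from the slides): when the branch condition is effectively random, the first loop suffers frequent mispredictions, while the second replaces the branch with arithmetic on the comparison result.

#include <stddef.h>

/* A data-dependent branch: if the sign of a[i] is effectively random,
 * the predictor will often guess wrong and the pipeline is flushed. */
long count_positive_branchy(const int *a, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            count++;
    }
    return count;
}

/* Branchless variant: the comparison result is used as a value, so
 * there is no conditional jump for the predictor to mispredict. */
long count_positive_branchless(const int *a, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] > 0);
    return count;
}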

Page 18: Introduction to application optimizations with usage of  Intel ®  performance tools


Superscalarity
A superscalar processor is a processor capable of performing multiple operations per clock cycle. It has several execution units.

The superscalar technique has several identifying characteristics:
• instructions are issued from a sequential instruction stream;
• special hardware detects data dependences between instructions at run time;
• the CPU accepts multiple instructions per clock cycle.

A modern CPU is always superscalar and pipelined. Each execution unit has its own specialization: a diverse instruction mix and a high level of instruction parallelism give the best CPU effectiveness.

Page 19: Introduction to application optimizations with usage of  Intel ®  performance tools


Page 20: Introduction to application optimizations with usage of  Intel ®  performance tools


Simplified processor model

[Block diagram: the simplified processor model extended with the mechanisms discussed above. Branch prediction, prefetching, caches and several superscalar execution units (ALUs with their registers) sit between the control unit and operative memory (RAM); the input-output unit and external memory are reached over the input-output bus.]

Page 21: Introduction to application optimizations with usage of  Intel ®  performance tools


Vector instructions and vectorization
A typical vector instruction performs an elementary operation on two vector sequences in memory or in vector registers of fixed length: C(1:n) = A(1:n) + B(1:n). Fortran array sections are a convenient notation for vector operations.

Vectorization is the process of converting scalar calculations, in which an operation is performed on a pair of operands, to a vector representation, in which an operation is performed on a pair of vector operands. Each vector contains several scalar operands.

The Pentium III introduced SSE (Streaming SIMD Extensions) to the x86 family: eight 128-bit registers (XMM0–XMM7) and 70 new instructions, including instructions working with real numbers.

SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AVX are further extensions of SSE.
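A minimal C sketch of the same idea (the function names are illustrative; an optimizing compiler can typically auto-vectorize the scalar loop, so the intrinsic version only shows what the generated code roughly does): the SSE variant adds four floats per instruction using 128-bit XMM registers.

#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps */
#include <stddef.h>

/* Scalar version: one addition per iteration. */
void add_scalar(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-vectorized SSE version: four float additions per instruction. */
void add_sse(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (; i < n; i++)        /* remainder loop */
        c[i] = a[i] + b[i];
}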

Page 22: Introduction to application optimizations with usage of  Intel ®  performance tools


Look-ahead and out-of-order execution
Modern x86 microprocessors have advanced mechanisms that look ahead in the instruction flow and identify instructions that can be computed in parallel. If the look-ahead buffer contains enough instructions that can be processed together, the processor pipeline works with maximum effectiveness.

This approach leads to execution in a changed instruction order (out-of-order execution).

Implementing out-of-order mechanisms makes the processor architecture more complicated and causes additional energy costs. There are Intel processors without out-of-order support (Itanium, Atom). In that case instruction scheduling is a key factor in good processor performance.

Page 23: Introduction to application optimizations with usage of  Intel ®  performance tools


Page 24: Introduction to application optimizations with usage of  Intel ®  performance tools


Parallelization and multi-core
Multitasking is a method whereby multiple tasks, also known as processes, share the common resources of the microprocessor.

Multithreading computers have hardware support to efficiently execute multiple threads. Threads are parts of a process and share the same memory. Multithreading makes it possible to divide a calculation into several parts that are processed in parallel. Hyper-Threading technology allows mixing the instruction sequences of different threads to improve instruction-level parallelism (Pentium 4 – Core i7).

Multi-core: the microprocessor contains several superscalar pipelines (cores) that have their own calculation resources but share the system bus, memory and upper-level caches. Multiprocessor solutions contain several processors.

Multiprocessor and multi-core systems make it possible to increase application performance by creating multiple threads.
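The course plan mentions parallelization with OMP directives; as a hedged sketch of what that looks like in C (the function name is invented, and the build flag depends on the compiler, e.g. -qopenmp for the Intel compiler or -fopenmp for GCC), a single pragma splits the loop across threads and combines the per-thread partial sums:

#include <stddef.h>

/* Each thread processes a chunk of the iteration space; the
 * reduction(+:s) clause merges the per-thread partial sums. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < (long)n; i++)
        s += a[i] * b[i];
    return s;
}

Without the OpenMP flag the pragma is simply ignored and the loop runs serially, which makes this kind of parallelization easy to switch on and off.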

Page 25: Introduction to application optimizations with usage of  Intel ®  performance tools


Main characteristics of an application affecting its performance:
• calculation efficiency,
• memory usage effectiveness,
• correct branch prediction,
• efficient use of vector instructions,
• effectiveness of parallelization,
• instruction-level parallelism.

Page 26: Introduction to application optimizations with usage of  Intel ®  performance tools


Performance measuring
What factors affect the performance of a specific program?
• Compiler quality
• Performance of the computer system

Consumers need criteria to determine computer system performance:
• a representative set of typical tasks;
• a universal testing scheme;
• independence from microprocessor manufacturers.

Spec.org (Standard Performance Evaluation Corporation) is a non-profit organization for creating, supporting and maintaining a standard set of tests to compare the performance of different computer systems. This organization develops and publishes standard suites for performance measuring.
• CPU2006 – designed to measure performance; can be used to compare programs running on different computer systems.
• OMP2001 – measures performance on tests using the OpenMP standard for shared-memory parallel processing.

Page 27: Introduction to application optimizations with usage of  Intel ®  performance tools


Optimizing compiler role
The compiler translates the entire source program into an equivalent program in machine code or assembly language.

Does the compiler have any role in the struggle for microprocessor performance?
• The compiler is used during testing and debugging of the functionality of a new microprocessor.
• The performance gains of a new computer system related to a new instruction set or an increased number of registers can be demonstrated only with an optimizing compiler that supports these innovations.
• The compiler is able to hide the architects' misses.

Page 28: Introduction to application optimizations with usage of  Intel ®  performance tools


List of literature for deeper study

1. Randy Allen & Ken Kennedy, "Optimizing Compilers for Modern Architectures"
2. David F. Bacon, Susan L. Graham and Oliver J. Sharp, "Compiler Transformations for High-Performance Computing"
3. Aart J.C. Bik, "The Software Vectorization Handbook"
4. Richard Gerber, Aart J.C. Bik, Kevin B. Smith, Xinmin Tian, "The Software Optimization Cookbook"
5. Intel® 64 and IA-32 Architectures Software Developer's Manual
6. Intel® 64 and IA-32 Architectures Optimization Reference Manual
7. Agner Fog, "Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms", http://www.agner.org/optimize/

Page 29: Introduction to application optimizations with usage of  Intel ®  performance tools


Thank you!