pragmatic optimization in modern programming - modern computer architecture concepts
TRANSCRIPT
1
PRAGMATICOPTIMIZATION
IN MODERN PROGRAMMINGMODERN COMPUTER ARCHITECTURE CONCEPTS
Created by for / 2015-2016Marina (geek) K olpakova UNN
2
COURSE TOPICSOrdering optimization approachesDemystifying a compilerMastering compiler optimizationsModern computer architectures concepts
3
OUTLINEThree aspects of the computer architectureLatency vs Throughput architecturesArchitecture families
CISCRISCVLIWVector
Why is it doing to be load/store?Latest trendssummary
4 . 1
1-ST ASPECT OF COMPUTER ARCHITECTUREInstruction Set Architecture or ISA (interface)
is a contract between HW and SW, which speci�es right, possibilities & limitations.
Class of ISA (load-store, register-memory)Memory addressing modes & rules (base-immediate,alignment requirements)Types & sizes of operands (size of byte, short)Operations (general arithmetic, control, logical)Control �ow instructions (branches, jumps, calls, returns)Encoding an ISA (�xed or variable length)All the conceptual aspects of the architecture
4 . 2
2-ND ASPECT OF COMPUTER ARCHITECTUREMicroarchitecture (organization) is a concrete
implementation of the ISA, the high-level aspects of aprocessor design (memory system, memory interconnect,
design of the processor internals).
Pipeline widthInstruction latenciesIssue wight and schedulingSpeculation capabilitiesAll the concrete aspects of the architecture
4 . 3
3-ND ASPECT OF COMPUTER ARCHITECTUREHardware or chip (design) is the speci�cs of a computer,
including the logic design and packaging. This is a concreteimplementation of the microarchitecture.
Tech-processClock ratesOn die placementAll the properties of the chip
5 . 1
ARCHITECTURE: ARMV8-Aμarch IP Hardware
Cortex-A53 ARM Octa Exynos 7(7580) 1.6GHz28nmHKMG
Cortex-A57 ARM Octa Exynos 7(7420) big.LITTLE 2.1/1.514FF ( LPE) (Samsung)
Cortex-A72 ARM Deca MediaTek Helio X20 big.LITTLE
2.5/2.0/1.4 20nmHKMG (TSMC)
Cortex-A35 ARM -
5 . 2
ARCHITECTURE: ARMV8-Aμarch IP Hardware
Denver NVIDIA Dual Tegra K1 2.3GHz28nmHPM
Kryo Qualcomm Tetra S820 big.LITTLE 2.2/1.6GHz14FF ( LPP) (Samsung)
Exynos M1 Samsung Quad Exynos 8890 big.LITTLE
2.6/2.29GHz 14FF ( LPP)(Samsung)
5 . 3
ARCHITECTURE: ARMV8-Aμarch IP Hardware
Cyclone Apple Dual A7 (APL0698) 1.4GHz28nmHKMG (Samsung)
Typhoon Apple Dual A8 (APL1012) 1.3GHz20nmHKMG (TSMC)
Twister Apple Dual A9 (APL0898) 1.85GHz16nmFF+ (TSMC)
Dual A9 (APL1022) 1.85GHz14nmFF( LPP) (Samsung)
6 . 1
LATENCY VS THROUGHPUT ARCHITECTURESLatency oriented architecture
addresses latency hiding issues;features sophisticated pipelining;out-of-order;employs advanced cache hierarchies;widely uses speculation.Compute cores occupy only a small part of a die.
6 . 2
LATENCY VS THROUGHPUT ARCHITECTURESThroughput oriented architecture
performs a bunch of operations in �y;features many simple compute units/cores;employs simple pipelines and large register �le toprovide a low-cost thread scheduling;uses wide basses, tiling, programmable local memory.Compute cores occupy most part of a die.
7
KEY ARCHITECTURE FAMILIESRISC
Reduced Instruction Set Computer
CISCComplex Instruction Set Computer
VLIWVery Long Instruction Word
Vector architecture
8 . 1
CISCComplex Instruction Set Computer
Designed in the 1970s which was a time where transistorswere expensive while compilers were naive. Additionally,instruction packaging was the main concern due toshortage of memory. The latency of the memory was just abit higher then registers.The goal was to de�ne an instruction set that allows highlevel language constructs be translated into as fewassembly language instructions as possible, improvingperformance as well as code density.Examples are VAX, x86, AMD64.Latency-oriented architecture.
8 . 2
CISCHeritage
instructions access memory, a plenty of addressing modes,many instruction families and a very rich variable lengthISA (alignment counts!),consequently, complicated instruction decoding logic.Moreover, a few registers are available for programmers.
Nowadays1. Instructions are broken down into μcode which are
much easy to pipeline and process power ef�ciently.2. Transistors are spent to cache hierarchies, out-of-order
execution, large RB and speculation to eliminate stalls.3. Symmetric multi-processing.
9 . 1
RISCReduced Instruction Set Computer
Designed in the 1980s which was a time there IPL was thegreat concern. The memory-processor gap already beganto come out.The goal was to decrease the number of clocks perinstruction (CPI) while pipeline instructions as much aspossible employing hardware to help with it. Uniform ISA,pipelining and large register �le is a must-have.Examples are MIPS, ARM, PowerPC.Latency-oriented architecture.
9 . 2
RISCHeritage
Relatively few instructions, all are the same length.Only load and store instructions access memory.Large resister �le than typical CISC processors have.No μcode
Nowadays Most architectures that comes from RISC arecalled Load-Store architectures, while may employ μops.They combine concepts of a classic RISC with usage ofmodern hardware enhancements: 1. deep pipelines, multi-cycle instructions,2. out-of-order execution,3. speculation.
10
THE HARDWARE/SOFTWARE GAPCompiler
analyzes control �ow, analyzes dependencyschedules instructionsmaps variables to limited register set
Hardware
analyzes control �ow, analyzes dependencyschedules instructionsremaps ISA register to large internal register set
11
A WORD TOWARDS REGISTERSIn deed, registers are temporary storage locations inside the
processor that hold data and addresses.
Local variables are not the same as registers in ISA, sincecompiler uses IR internally and does register allocationclose to the end of optimization process.Registers provided by ISA is not the same as actualregisters on the processor. Internal reorder buffers whichhold decoded instruction parameters and intermediateresults are closer to classic de�nition of a register �le.
12 . 1
VLIWVery Long Instruction Word
Designed in the 1980s which was a time there IPL was thegreat concern.The goal was to pipeline instructions as much as possibleemploying software to help with it reducing complexity ofthe hardware and mitigate the the Hardware/Softwaregap. Boost processor clock simplifying work per cycle.Example is Intel HP Itanium.Throughput-oriented architecture
12 . 2
VLIWHeritage
Compiler determines which instructions can beperformed in parallel,bundles this information and the instructions,and passes the bundle(word) to the hardware.No data dependencies between instructions in a word.Each operation in a word assigned to speci�c issue slot(dedicated FU).
12 . 3
VLIWNowadays
hardly any generic processor implements VLIWbrunchy nature of production codes (in contrast toHPC or scienti�c codes),need to follow binary compatibility across theμarchitecture families.
Whereas architecture is widely adopted forprogrammable co-processors where shrink in powerconsumption without lose of performance is crucial(DSP, GPU).
13 . 1
VECTOR PROCESSORSFirst introduced in 1976 and dominated for HPC in the1980s because of high instruction throughput.The goal was to perform operations on vectors of dataexposing data level parallelism (DLP) to increaseinstruction throughput. Vector pipelining is also calledchaining.Example is CrayThroughput-oriented architectures.
13 . 2
VECTOR PROCESSORSHeritage
Process the data in vectors, each element in a vector(lane) is independent on any other.Deep pipelines, wide execution units, not necessary thesame width (batch length) as size of vector in elements(vector length).Most ef�cient for simple memory patterns, butgetter/scatter is usually possible too.Wide memory interfaces to saturate execution units.Large vector register �le, cache is not a strictrequirement and absent for classical vector processors.
13 . 3
VECTOR PROCESSORSNowadays
They aren't used in generic processors design, but usedas a co-processors for a speci�c workloads: HPC,multimedia.Precursors of most designs of modern GPUs.Vector pipes with short vector length (8-16 bytes) calledSIMD units are widely integrated in modern generalpurpose processors to accelerate most demandingloops.
14 . 1
WHY IS IT DOING TO BE RISC LOAD/STORE?1. Simple �xed-width instructions & few addressing modes
Cache-ef�cient instruction fetch, branches are aligned.Simple hardware logic → power ef�cient chips.Drive a higher clock rate.
2. Concise ISA with orthogonal functionalityComplex instructions are ignored by compilers due tosemantic gap → simple instructions simplify scheduling.Complex addressing lead either to variable lengthinstructions or big instruction size → inef�cientdecoding and scheduling as well as alignment issues.
3. Large register setExpose possible instruction parallelism to the compiler.
15
LATEST TRENDSArchitecture is seen as Load-store RISC-inheritedInternally instructions are broken down into single-pipeμopsμops are reordered and optionally organized into wordsμops or words are scheduled for execution, caching in thehighest level is usually performed on this preprocessedview.Latest generations of Intel processors, NVIDIA Denverarchitecture and 64-bit ARM Cortex-A processors alreadyemploy this approach.
16 . 1
SUMMARYThere are three key aspects of computer architecture:Instruction Set Architecture, μarchitecture and design.Some architectures aim to hide latency while others aim tomaximize instruction throughput.CISC is created for compact code size and exact instructionencoding and used only on ISA level nowadays.RISC leads to less complicated decoding and pipeline stagesallow boosting clock in affordable power budget.VLIW targets power ef�cient high performance devices forspeci�c tasks or used internally on μarchitecture level.Vector processors transformed into SIMD-extensions andSIMT-like GPU designs.
16 . 2
SUMMARYLoads-Store architectres with its simple �xed-widthinstructions, few addressing modes, concise ISA and optimal size register size is a winner solution.Architecture can expose different properties for it's different levels (ISA, μarchitecture).