TRANSCRIPT
1
Computer Architectures M
Core
2
CMP Chip Multi Processor
In this context I/O indicates any communication with the external world (real I/O, memory, external caches). Shared cache indicates L2 or L3. Very often L2 too is integrated in the processor
3
Advantages
• Minimum latency time for data transfer
• No bus use for interprocessor communication
• Possible dynamic cache allocation between the processors
Disadvantages
• Complexity. The controller must evaluate in real time the needs of the two CPUs, and an error can block one of them.
• The cache bandwidth must be much higher to serve two CPUs
• If the cache access is multiport, the complexity increases further; if it is queued only, efficiency is reduced
• This design does not cater for scaling (the cache cannot be divided)
4
Advantages
• Reduced handling complexity
• Easy scaling (i.e. one CPU only)
• No bus involvement
Disadvantages
• No dynamic balancing
• Accurate I/O controller design required
• Performance reduced because of the traffic between the two CPUs (which affects the I/O too)
5
Advantages
• It is a dual-CPU and therefore easier design
• Easy scaling with one CPU only
• Reduced test complexity (one CPU at a time can be tested)
• Shorter «time to market»
Disadvantages
• Very important: CPU communication loads the bus
• Double electrical load on the bus capacitance => slower behaviour
Shared package
7
Enhanced SpeedStep Technology
• It reduces the operating voltage together with the clock frequency
Voltage    Clock
1.484 V    1.6 GHz
1.420 V    1.4 GHz
1.276 V    1.2 GHz
1.164 V    1.0 GHz
1.036 V    800 MHz
0.956 V    600 MHz
(Pentium 1.6 GHz)
• There are different power lines for the functional units which can be selectively switched off
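The voltage/frequency pairs above can be used to see why lowering the voltage along with the clock pays off: dynamic power scales roughly as C·V²·f. A minimal sketch, assuming an arbitrary capacitance constant (relative comparison only; these are not official power figures):

```python
# Dynamic CPU power scales roughly as P ~ C * V^2 * f.
# Voltage/frequency pairs are taken from the SpeedStep table above;
# the capacitance factor c is an arbitrary constant.

OPERATING_POINTS = [  # (voltage in V, frequency in MHz)
    (1.484, 1600), (1.420, 1400), (1.276, 1200),
    (1.164, 1000), (1.036, 800), (0.956, 600),
]

def relative_power(v, f_mhz, c=1.0):
    """Dynamic power, up to the constant factor c."""
    return c * v * v * f_mhz

top = relative_power(*OPERATING_POINTS[0])
for v, f in OPERATING_POINTS:
    p = relative_power(v, f)
    print(f"{f:>5} MHz @ {v:.3f} V -> {p / top:.0%} of full power")
```

At the lowest operating point the chip runs at 37.5% of the clock but well under 20% of the full dynamic power, which is the whole point of scaling V and f together.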
10
64 bit extensions
• MMX registers 64 bit
RAX (bits 63–0) ⊃ EAX (31–0) ⊃ AX (15–0) = AH (15–8) : AL (7–0)
• It allows the execution of 64 bit OS and programs
• Addressing space up to 16 exabytes (2**64 bytes = 2**32 x 2**32 bytes = 4G x 4GB)
• 8 additional 64-bit registers/accumulators R8–R15
• All other accumulators 64 bits
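A quick sanity check of the addressing-space arithmetic above:

```python
# 64-bit addressing: 2**64 bytes = (2**32) * (2**32) bytes = 4G x 4GB.
space = 2 ** 64
assert space == (2 ** 32) * (2 ** 32)  # 4G x 4GB, as on the slide
print(space / 2 ** 60, "EiB")          # 16.0 EiB (binary exabytes)
```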
11
Core
• While the PIV tried to increase efficiency by increasing the clock frequency, the Core relies on multiprocessing, made possible by the reduced transistor size. This also allows an increase of the cache size.
• 2005-2007
• New architecture named Core for multicore
• Low power consumption
• 14-stage pipeline
• Developed in Israel
• Multicore with Out Of Order
12
[Figure: two identical Core pipelines – Core 0 and Core 1]
L1 – Core 0 L1 – Core 1
L2 - shared
FSB interface
There were many different Core versions:
1) Merom (mobile – low power)
2) Conroe (the first implemented – desktop)
3) Bloomfield (end 2008 – server – quadcore)
Core
NB: in this figure the prefetcher includes the L1 cache
13
Core
14
• The two cores' L1 caches can exchange information directly without using the bus
Core
• 1 + 3 decoders – 7 u-ops, greater ROB, increased number of EUs
• 2.66 GHz
• Smart power reduction. For instance, not only are the unused EUs powered down, but the internal bus paths are also activated only when necessary for each instruction
• For each core two L1 caches (Data and Instructions): Instructions => 32 (or 64) KB, 8-way; Data => 32 (or 64) KB, 2/8-way – No trace cache (inefficient!)
• L2 Cache: 2-4 MB unified
15
• Core has no multithreading, which returns in the following processor generations
Core Microarchitecture
• Dual core, superscalar degree 4, 36-bit physical addresses
• L2 shared, inclusive, unified (Data and Instructions). Each core uses the portion it needs. If the two cores use the same instructions, these can be shared
16
[Figure: Independent L2 Caches vs. Shared L2 Cache]
Advanced Smart Cache
NON-shared L2 has the following disadvantages:
• possible replication of the same data in the two caches
• snoop through the FSB
• static partitioning of the silicon
Shared L2 advantages:
• None of the previous disadvantages
17
• The prefetch algorithm (secret) considers the access sequences in order to predict the next requests and to anticipate them
Core Microarchitecture
• Intelligent prefetcher: 2x16=32 bytes buffer (as in P6). The system tries to guess the required data. For instance when the data at address 1-3-5 are requested then the system reads in advance the data at address 7 (if the bus is available)
• More precisely, each Core chip has 2x3+2=8 prefetchers (per core: two for the data and one for the instructions, plus two prefetchers for the shared L2). The prefetch policies differ between models according to the use (mobile, server, desktop)
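The 1-3-5 => 7 example above is a stride prediction. A minimal sketch of that idea (the actual Intel algorithm is secret, as the slide notes; this is only the constant-stride special case):

```python
# Minimal stride prefetcher sketch: once two consecutive accesses
# confirm the same stride, predict the next address ahead of demand,
# as in the 1-3-5 -> 7 example above.

class StridePrefetcher:
    def __init__(self):
        self.last = None    # previous demand address
        self.stride = None  # last observed stride

    def access(self, addr):
        """Record a demand access; return a prefetch address or None."""
        prediction = None
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride:       # stride seen twice in a row
                prediction = addr + stride  # prefetch the next element
            self.stride = stride
        self.last = addr
        return prediction

pf = StridePrefetcher()
for a in (1, 3, 5):
    hint = pf.access(a)
print(hint)  # 7
```

A real prefetcher would also check that the bus is available before issuing the prefetch, as the slide points out.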
18
Loop detector
• Exploits hardware loop detection: the loop detector analyzes the branches and determines whether they form a loop
– Avoids the repetitive fetch and branch prediction
– … but requires decoding each cycle
[Figure: Branch Prediction → Fetch → Decode pipeline, with a Loop Stream Detector buffer of 18 instructions]
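A hedged sketch of the detection idea (the real detector watches backward branches; here a loop is simply recognized as a repeating window of executed instruction addresses, capped at the 18-instruction buffer size from the slide):

```python
# Loop stream detection sketch: if the recent instruction-address trace
# ends in a repeating window no larger than the buffer, the buffered
# instructions can be replayed without re-fetching and re-predicting.

LSD_CAPACITY = 18  # buffer size from the slide

def detect_loop(trace):
    """Return (loop_target, body_len) if the trace ends in a small
    repeating loop body, else None."""
    for body_len in range(1, LSD_CAPACITY + 1):
        if (len(trace) >= 2 * body_len
                and trace[-body_len:] == trace[-2 * body_len:-body_len]):
            return trace[-body_len], body_len
    return None

# Addresses of a 3-instruction loop executed twice in a row:
print(detect_loop([100, 104, 108, 100, 104, 108]))  # (100, 3)
```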
19
Core Pipeline
• It must be noted that the number of «in flight» instructions is further increased by fusion (see later). The Core window is therefore greater than the growth of the RS and ROB alone would suggest
• The pipeline has 14 stages (P6: 12 stages). The two additional stages were inserted for delays and fusion handling
• ROB stores 96 u-ops (Xeon 126 because of the multithread)
• Unified RS handling (memory/non memory – no difference between the FUs) with increased entries for better FUs exploitation
20
Core Architecture
6 Ports
Data paths restructured for the vector ALUs. Operations on 128-bit data are split into two 64-bit operations
21
6 ports
Higher efficiency ALUs
Core Microarchitecture: 4 + 3 u-ops, 7 u-ops/clock
22
Macrofusion
The sequence
    load EAX, [mem1]
    cmp  EAX, [mem2]
    jne  Target
becomes
    load EAX, [mem1]
    cmp  EAX, [mem2] + jne Target   (test and branch)
• In the main decoder, couples of machine instructions can be fused (typically compare and test instructions are fused with the branch instructions). The only limit is that only one "macrofused" instruction per cycle can be generated
• This requires a more complex decoder, ALU and Branch EU, but grants a reduced number of «in flight» u-ops, faster ROB and RS emptying, and an apparently higher efficiency of the ALUs. This means lower power consumption for the same program.
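The fusion rule just described can be sketched as a pass over a decode group (the cmp/test + branch pairing and the one-fusion-per-cycle limit come from the slide; the string representation of instructions is purely illustrative):

```python
# Macrofusion sketch: a compare/test immediately followed by a
# conditional branch is emitted as a single fused u-op, with at most
# one fusion per decode group (i.e. per cycle).

FUSABLE_FIRST = {"cmp", "test"}

def decode_group(instrs):
    """Fuse at most one cmp/test + jcc pair in a group of instructions."""
    out, fused_done, i = [], False, 0
    while i < len(instrs):
        op = instrs[i].split()[0]
        nxt = instrs[i + 1].split()[0] if i + 1 < len(instrs) else ""
        # nxt.startswith("j") stands in for "any conditional branch"
        if not fused_done and op in FUSABLE_FIRST and nxt.startswith("j"):
            out.append(instrs[i] + " + " + instrs[i + 1])  # one fused u-op
            fused_done = True
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out

print(decode_group(["load EAX, [mem1]", "cmp EAX, [mem2]", "jne Target"]))
```

The group above shrinks from three instructions to two u-ops, which is exactly the ROB/RS saving the slide attributes to macrofusion.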
23
Microfusion
• Two distinct u-ops allocated in the same bit string
• When a microfused instruction reaches the RS, its u-ops are separately sent to the respective FUs, either in parallel or serially when they require the same FU (as in the case of LOAD and STORE)
• STORE operations are normally subdivided into two u-ops: one for the data and one for the address (two separate FUs). The data is sent to the store buffer while the address is calculated: when ready, it is retired by the store buffer.
• The same applies to the LOAD or READ-MODIFY: in this case the two operations are executed serially.
• The number of u-ops is reduced on average by 10%. The efficiency increase is 5% for integer operations and 10% for FP operations.
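The STORE split described above can be sketched as follows (using the conventional sta/std names for the store-address and store-data u-ops; the string encoding is an illustration, not the real bit-string format):

```python
# Microfusion sketch: a STORE travels through the front end as one
# fused u-op, then splits in the RS into an address-generation u-op
# (sta) and a data u-op (std) for two separate FUs.

def unfuse(uop):
    """Split a fused store u-op into its address and data components."""
    if uop.startswith("store"):
        _, dst, src = uop.split()            # e.g. "store [mem] EAX"
        return [f"sta {dst}", f"std {src}"]  # address u-op, data u-op
    return [uop]                             # not fused: pass through

print(unfuse("store [mem] EAX"))  # ['sta [mem]', 'std EAX']
```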
24
• Combining macrofusion and microfusion, an average 10% u-ops reduction is achieved: higher use of the FUs, higher parallelism, and a higher number of u-ops among which to choose the OOO sequence.
Core Front End
• Predecode and fusion stage: it detects the instruction lengths and their boundaries
• The trace cache is no longer present because of its statistically poor performance. 4 decoders (one complex and three simple) – 7 u-ops/clock – one more than P6
25
Front End
P6Core
One more simple decoder: 7 micro-ops/cycle
The simple decoders are able to decode a larger number of instructions: almost one u-op per instruction is achieved
26
Dispatching architecture
27
Core
• Many u-ops require multiple clock cycles for execution, but this doesn't block the ports. For instance port 1, once an FADD is started, is free for the IEU
Core 6 ports
• One more dispatch port is dedicated to the logical and arithmetical u-ops
• Increased integer units number
• Up to 3 u-ops (ports 0, 1 and 2) can be executed per clock (not counting the Branch Execution Unit and the Memory Address Units – ports 3, 4 and 5 – which don't produce results).
• The system is not symmetrical: FP multiplications can be executed only in one FPU, and the same holds for the FADD
28
Mathematical EUs
Floating point execution units
Two units able to execute scalar and FP u-ops. One unit for simple operations (i.e. FADD)
• Three integer EUs each one able to execute a 64 bit u-op per clock. One is for complex u-ops (CIU Complex Integer Unit) and two for simple u-ops (SIU) like additions. All of them operate in parallel with the branch execution unit
Integer execution units
29
Memory access instructions
• Load and Store – when committed - are moved from the ROB to a FIFO called MOB (Memory Reorder Buffer) which in some cases allows «overtakings» of the Loads
• Load and Store are much more complex than – for instance – the addition: first of all because they require access to the RF (for the address computation), and because they must access the data cache. L1 access is much slower than access to the renamed registers, and there is always the risk of an L2 access
30
Memory “disambiguation”
• u-ops commitments (and therefore memory and register updating and reading) must necessarily be executed "in order". But…
Memory aliasing
• There are two cases: the “Store” uses the same address as the “Load” (case A) or not (case B)
• In case A the “Store” must precede the “Load”; in case B it need not
• Case A is the “Memory aliasing”
• Statistically, case B occurs 97% of the time (it depends on the compiler too!), but in P6 and PIV, because of the A cases (3%), no Load can be executed before a Store. Big performance loss
31
If the processor detects that we are in case B, we need not wait for the memory update related to the Store and can overlap the operations as in the figure: one clock cycle is spared (more than 16%)
[Figure: timing diagram, clock cycles 1–6]
Memory “disambiguation”
In case A the address is computed at clock 1 and the store is executed at clock 2. Another cycle must elapse for the memory update (clock 3), and then the load can be executed, which requires cycles 4 and 5 for the register update. (It must be remembered that a Load in any case «occupies» the memory location, which cannot at the same time be used by a Store.) Eventually, on the 6th clock cycle, the sum can be executed.
32
In the case of Core we have the B-2 situation, where the load is anticipated before the store, sparing 3 cycles in comparison with A and two cycles in comparison with B. This is possible thanks to an algorithm which analyses the u-ops and predicts the memory aliasing. In this case too the prediction can be wrong and the pipeline must be flushed, but the percentage advantage is very significant
[Figure: timing diagrams B-1 and B-2, clock cycles 1–6]
Memory “disambiguation”
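The speculative scheme above can be modeled with a back-of-the-envelope cost calculation. A hedged sketch: the 6-cycle in-order cost and the 97%/3% split come from the slides, while the speculative latency and flush penalty are illustrative assumptions, not Core's actual numbers:

```python
# Speculative memory disambiguation (case B-2): predict the load does
# not alias the pending store and execute it early; verify when the
# store address resolves, flushing the pipeline on a misprediction.

IN_ORDER_CYCLES = 6     # case A sequence in the figure (from the slides)
SPECULATIVE_CYCLES = 3  # assumed cost of the hoisted load
FLUSH_PENALTY = 10      # assumed pipeline-flush cost

def run_load(store_addr, load_addr):
    """Cycles spent when the load is speculatively hoisted past a store."""
    if store_addr != load_addr:  # prediction right (case B, 97% of cases)
        return SPECULATIVE_CYCLES
    # Prediction wrong (case A): flush, then redo the access in order.
    return SPECULATIVE_CYCLES + FLUSH_PENALTY + IN_ORDER_CYCLES

# Average cost with the slides' 97% no-alias statistic:
avg = 0.97 * run_load(0x1000, 0x2000) + 0.03 * run_load(0x1000, 0x1000)
print(avg)  # well under the 6-cycle in-order cost
```

Even with a generous flush penalty, the 97% hit rate keeps the average well below the always-in-order cost, which is why the slide calls the percentage advantage very significant.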