instruction scheduling6.035.scripts.mit.edu/fa18/slides/f18-lecture-30.pdfscheduling loops •loop...
TRANSCRIPT
![Page 1: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/1.jpg)
Loop Optimizations
Instruction Scheduling
![Page 2: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/2.jpg)
Outline• Scheduling for loops• Loop unrolling• Software pipelining• Interaction with register allocation• Hardware vs. Compiler• Induction Variable Recognition• loop invariant code motion
5
![Page 3: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/3.jpg)
Scheduling Loops
• Loop bodies are small• But, lot of time is spend in loops due to large
number of iterations• Need better ways to schedule loops
![Page 4: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/4.jpg)
Loop Example• Machine
– One load/store unit • load 2 cycles• store 2 cycles
– Two arithmetic units• add 2 cycles• branch 2 cycles• multiply 3 cycles
– Both units are pipelined (initiate one op each cycle)• Source Code
for i = 1 to NA[i] = A[i] * b
![Page 5: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/5.jpg)
Loop Example
• Source Codefor i = 1 to N
A[i] = A[i] * b
• Assembly Codeloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
base
offset
![Page 6: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/6.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
imul
jz
mov
sub
2
2
3
0
mov
d=2
d=0
d=7
d=2
d=5
![Page 7: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/7.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedule (9 cycles per iteration)mov mov
mov mov imul bge
imul bge imul
sub sub
imul
jz
mov
sub
2
2
3
0
mov
d=2
d=0
d=7
d=2
d=5
![Page 8: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/8.jpg)
Outline• Scheduling for loops• Loop unrolling• Software pipelining• Interaction with register allocation• Hardware vs. Compiler• Induction Variable Recognition• loop invariant code motion
5
![Page 9: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/9.jpg)
Loop Unrolling
• Unroll the loop body few times• Pros:
– Create a much larger basic block for the body– Eliminate few loop bounds checks
• Cons:– Much larger program– Setup code (# of iterations < unroll factor)– beginning and end of the schedule can still have
unused slots
![Page 10: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/10.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
![Page 11: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/11.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxmov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
![Page 12: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/12.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxmov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedule (8 cycles per iteration)mov mov mov mov
mov mov mov mov imul imul bge
imul imul bgeimul imul
sub sub sub sub
mul
mov
mov
sub
2
mov
d=2
d=0
d=7
d=2
d=5mul
jz
mov
sub
2
3
0
d=9
d=14
d=9
d=12
2
3
0
2
![Page 13: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/13.jpg)
Loop Unrolling
• Rename registers– Use different registers in different iterations
![Page 14: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/14.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxmov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
mul
mov
mov
sub
2
mov
d=2
d=0
d=7
d=2
d=5mul
jz
mov
sub
2
3
0
d=9
d=14
d=9
d=12
2
3
0
2
![Page 15: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/15.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxmov (%rdi,%rax), %rcximul %r11, %rcxmov %rcx, (%rdi,%rax)sub $4, %raxjz loop
mul
mov
mov
sub
2
mov
d=2
d=0
d=7
d=2
d=5mul
jz
mov
sub
2
3
0
d=9
d=14
d=9
d=12
2
3
0
2
![Page 16: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/16.jpg)
Loop Unrolling
• Rename registers– Use different registers in different iterations
• Eliminate unnecessary dependencies – again, use more registers to eliminate true, anti and
output dependencies– eliminate dependent-chains of calculations when
possible
![Page 17: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/17.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxmov (%rdi,%rax), %rcximul %r11, %rcxmov %rcx, (%rdi,%rax) sub $4, %raxjz loop
2
d=2
d=0
d=7
d=2
d=52
3
0
d=9
d=14
d=9
d=12
2
3
0
2
mul
mov
mov
sub
mov
mul
jz
mov
sub
![Page 18: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/18.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $8, %raxmov (%rdi,%rbx), %rcximul %r11, %rcxmov %rcx, (%rdi,%rbx) sub $8, %rbxjz loop
2
d=2
d=0
d=7
d=2
d=52
3
0
d=0
d=5
d=0
d=3
2
3
0
2
mul
mov
mov
sub
mov
mul
jz
mov
sub
![Page 19: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/19.jpg)
Loop Exampleloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $8, %raxmov (%rdi,%rbx), %rcximul %r11, %rcxmov %rcx, (%rdi,%rbx) sub $8, %rbxjz loop
• Schedule (4.5 cycles per iterationmov mov mov mov
mov mov mov mov imul imul jz
imul imul jz imul imul
sub subsub sub
2
d=2
d=0
d=7
d=2
d=52
3
0
d=0
d=5
d=0
d=3
2
3
0
2
mul
mov
mov
sub
mov
mul
jz
mov
sub
![Page 20: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/20.jpg)
Outline• Scheduling for loops• Loop unrolling• Software pipelining• Interaction with register allocation• Hardware vs. Compiler• loop invariant code motion• Induction Variable Recognition
5
![Page 21: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/21.jpg)
Software Pipelining
• Try to overlap multiple iterations so that the slots will be filled
• Find the steady-state window so that:– all the instructions of the loop body is executed – but from different iterations
![Page 22: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/22.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov
mov movmul jz
mul jzmul
subsub
![Page 23: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/23.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov mov1
mov mov1 mov mov1mul mul1 jz jz1
mul mul1 jz jz1mul mul1
sub sub1sub sub1
![Page 24: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/24.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov2 mov mov1 mov2
mov mov1 mov2 mov mov1 mov2mul mul1 mul2 jz jz1 jz2
mul mul1 mul2 jz jz1 jz2mul mul1 mul2
sub sub1 sub2sub sub1 sub2
![Page 25: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/25.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov2 mov mov3 mov1 mov2 mov3
mov mov1 mov2 mov mov3 mov1 mov2 mov3mul mul1 mul2 jz mul3 jz1 jz2
mul mul1 mul2 jz mul3 jz1 jz2mul mul1 mul2 mul3
sub sub1 sub2 sub3sub sub1 sub2 sub3
![Page 26: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/26.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov2 mov mov3 mov1 mov4 mov2 mov3
mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov3mul mul1 mul2 jz mul3 jz1 mul4 jz2
mul mul1 mul2 jz mul3 jz1 mul4 jz2mul mul1 mul2 mul3 mul4
sub sub1 sub2 sub3sub sub1 sub2 sub3
![Page 27: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/27.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3
mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5
mul mul1 mul2 jz mul3 jz1 mul4 jz2mul mul1 mul2 mul3 mul4
sub sub1 sub2 sub3sub sub1 sub2 sub3
![Page 28: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/28.jpg)
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedulemov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6
mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5
mul mul1 mul2 jz mul3 jz1 mul4 jz2mul mul1 mul2 mul3 mul4
sub sub1 sub2 sub3sub sub1 sub2 sub3
![Page 29: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/29.jpg)
mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3
mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5mul mul1 mul2 jz mul3 jz1 mul4 jz2
mul mul1 mul2 mul3 mul4sub sub1 sub2 sub3
sub sub1 sub2 sub3
• Assembly Codeloop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedule
Loop Example
![Page 30: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/30.jpg)
mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3
mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5mul mul1 mul2 jz mul3 jz1 mul4 jz2
mul mul1 mul2 mul3 mul4sub sub1 sub2 sub3
sub sub1 sub2 sub3
Loop Example• Assembly Code
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
• Schedule (2 cycles per iteration)
![Page 31: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/31.jpg)
Loop Example• 4 iterations are overlapped
– value of %r11 don’t change
– 4 regs for (%rdi,%rax)– each addr. incremented by 4*4
– 4 regs to keep value %r10
– Same registers can be reused after 4 of these blocksgenerate code for 4 blocks, otherwise need to move
mov4 mov2mov1 mov4mul3 jz1jz mul3mul2
sub2sub1
loop:mov (%rdi,%rax), %r10imul %r11, %r10mov %r10, (%rdi,%rax)sub $4, %raxjz loop
![Page 32: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/32.jpg)
Software Pipelining• Optimal use of resources• Need a lot of registers
– Values in multiple iterations need to be kept• Issues in dependencies
– Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it should not have)
– Loads and stores are issued out-of-order (need to figure-out dependencies before doing this)
• Code generation issues– Generate pre-amble and post-amble code– Multiple blocks so no register copy is needed
![Page 33: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/33.jpg)
Outline• Scheduling for loops• Loop unrolling• Software pipelining• Interaction with register allocation• Hardware vs. Compiler• Induction Variable Recognition• loop invariant code motion
5
![Page 34: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/34.jpg)
Register Allocation and Instruction Scheduling
• If register allocation is before instruction scheduling– restricts the choices for scheduling
![Page 35: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/35.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %rax 4: add %rax, %rcx
![Page 36: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/36.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %rax 4: add %rax, %rcx
1
4
2
3
3
1
3
1 1
1
![Page 37: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/37.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %rax 4: add %rax, %rcx
1
4
2
3
3
1
3
1 1
1
2 4ALUop
1 3MEM 1
1 3MEM 2
![Page 38: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/38.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %rax4: add %rax, %rcx
Anti-dependenceHow about a different register?
1
4
2
3
3
1
3
1 1
1
![Page 39: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/39.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %r104: add %r10, %rcx
Anti-dependenceHow about a different register?
1
4
2
3
3
3
![Page 40: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/40.jpg)
Example1: mov 4(%rbp), %rax2: add %rax, %rbx3: mov 8(%rbp), %r10 4: add %r10, %rcx
1
4
2
3
3
3
2 4ALUop
1 3MEM 1
1 3MEM 2
![Page 41: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/41.jpg)
Register Allocation and Instruction Scheduling
• If register allocation is before instruction scheduling– restricts the choices for scheduling
![Page 42: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/42.jpg)
Register Allocation and Instruction Scheduling
• If register allocation is before instruction scheduling– restricts the choices for scheduling
• If instruction scheduling before register allocation– Register allocation may spill registers– Will change the carefully done schedule!!!
![Page 43: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/43.jpg)
Outline• Scheduling for loops• Loop unrolling• Software pipelining• Interaction with register allocation• Hardware vs. Compiler• Induction Variable Recognition• loop invariant code motion
5
![Page 44: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/44.jpg)
Superscalar: Where have all the transistors gone?
• Out of order execution– If an instruction stalls, go beyond that and start
executing non-dependent instructions– Pros:
• Hardware scheduling• Tolerates unpredictable latencies
– Cons:• Instruction window is small
![Page 45: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/45.jpg)
Superscalar: Where have all the transistors gone?
• Register renaming– If there is an anti or output dependency of a register
that stalls the pipeline, use a different hardware register
– Pros:• Avoids anti and output dependencies
– Cons:• Cannot do more complex transformations to eliminate
dependencies
![Page 46: Instruction Scheduling6.035.scripts.mit.edu/fa18/slides/F18-lecture-30.pdfScheduling Loops •Loop bodies are small •But, lot of time is spend in loops due to large number of iterations](https://reader033.vdocument.in/reader033/viewer/2022060802/608610b6e793480d6661c855/html5/thumbnails/46.jpg)
Hardware vs. Compiler• In a superscalar, hardware and compiler scheduling
can work hand-in-hand• Hardware can reduce the burden when not predictable
by the compiler• Compiler can still greatly enhance the performance
– Large instruction window for scheduling– Many program transformations that increase parallelism
• Compiler is even more critical when no hardware support– VLIW machines (Itanium, DSPs)