A Lightweight Instruction Scheduling Algorithm for Just-In-Time Compiler
DESCRIPTION
In this paper, we present a lightweight instruction scheduling algorithm to reduce pipeline stalls on XScale.
TRANSCRIPT
Presenter: Shuai-wei Huang
Date: 2007/11/21
A Lightweight Instruction Scheduling Algorithm for Just-In-Time Compiler on
XScale
Xiaohua Shi, Peng Guo
Programming System Lab / Microprocessor Technology Labs
{xiaohua.shi, peng.guo}@intel.com
Contents
1. Introduction
2. XScale Core Pipelines
3. LIS Algorithm
4. Performance Evaluation
5. Conclusions
Introduction
For a J2ME JIT, scheduling algorithms face two challenges: a small memory budget and a compilation-time constraint.
In this paper, we present a lightweight instruction scheduling algorithm to reduce pipeline stalls on XScale.
It uses only a small memory region of constant size, about 1 KB per thread.
Introduction: Related Work
Author                                     Publish year   Data structure   DAG construction complexity
Gibbons & Muchnick (List Scheduling)       1986           DAG              O(n^2)
Goodman & Hsu (IPS)                        1988           DAG              O(n^2)
Kurlander, Proebsting, and Fischer (DLS)   1995           DAG              O(n^2)
List scheduling is widely adopted in compilers. In practice, its time complexity can be close to linear, but it is O(n^2) in the code length in the worst case.
Introduction: DAG Example
LIS is not based on Directed Acyclic Graphs (DAGs) or expression trees, but on a novel data structure called the extended dependency matrix (EDM).
Introduction: XORP JIT
XORP (XScale Open Runtime Platform) is Intel's J2ME JVM for both CDC and CLDC configurations on XScale.
Most optimizations in the JIT compiler have linear time complexity, or almost linear in practice, under a constrained memory budget.
The instruction scheduling module is the last optimization before the JIT emits the result code.
XORP JIT does not pay the price of a global scheduling mechanism, which would have much higher complexity.
XScale Core Pipelines: Superpipeline
The XScale™ core consists of a main execution pipeline, a multiply/accumulate (MAC) pipeline, and a memory access pipeline.
XScale Core Pipelines: Out-of-Order Completion
Instructions in different pipelines may be completed out of order, if no data dependencies exist.
I0: ldr R1, [R0]
I1: add R2, R2, R3
I2: add R4, R1, R2
Instruction I1 could complete before I0, because they are processed in different pipelines.
Instruction I2 depends on the results of both I0 and I1, and therefore waits for both of them to complete.
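These dependence relations can be checked mechanically. Below is a minimal sketch in Python; the read/write-set model is a toy of my own for illustration, not code from the paper:

```python
# Read-after-write check for the three-instruction example above.
# Toy model: registers only, no memory aliasing.
def depends_on(consumer, producer):
    """True if `consumer` reads a register that `producer` writes."""
    return any(r in producer["writes"] for r in consumer["reads"])

insts = [
    {"name": "I0", "writes": {"R1"}, "reads": {"R0"}},        # ldr R1, [R0]
    {"name": "I1", "writes": {"R2"}, "reads": {"R2", "R3"}},  # add R2, R2, R3
    {"name": "I2", "writes": {"R4"}, "reads": {"R1", "R2"}},  # add R4, R1, R2
]

print(depends_on(insts[1], insts[0]))  # False: I1 may complete before I0
print(depends_on(insts[2], insts[0]))  # True: I2 must wait for I0
print(depends_on(insts[2], insts[1]))  # True: I2 must wait for I1
```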
XScale Core Pipelines: Resource Conflicts
Multiply instructions can cause pipeline stalls due to either result latencies or resource conflicts; a resource conflict means that two instructions cannot occupy the MAC pipeline concurrently.
For instance, the following two instructions, with no data dependence between them, will incur a stall of 0-3 cycles due to a resource conflict, depending on the actual execution cycles of I0:
I0: mul R0, R4, R5
I1: mul R1, R6, R7
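The 0-3 cycle figure can be reproduced with a toy occupancy model; the assumption that I0 holds the MAC pipeline for 1 to 4 cycles (depending on its operands) is mine, chosen to match the range quoted above:

```python
# Toy model of a MAC resource conflict: a second multiply cannot enter
# the MAC pipeline until the first one leaves it. If I0 occupies the MAC
# for k cycles, I1 stalls for k-1 cycles beyond the normal 1-cycle issue
# gap, giving 0..3 stall cycles for k in 1..4.
def mac_conflict_stall(k_cycles_i0):
    return max(0, k_cycles_i0 - 1)

print([mac_conflict_stall(k) for k in range(1, 5)])  # [0, 1, 2, 3]
```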
XScale Core Pipelines: Load-Use Instructions
In many Java applications, the typical pipeline stalls come from load-use instructions.
Pipeline stalls will happen before instructions I1 and I2, respectively.
...                          ; prepare outgoing arguments
I0: ldr R12, [R0]            ; get vtable from object handle
I1: ldr R12, [R12 + offset]  ; vtable + offset equals the address of the method entry
I2: blx R12                  ; indirect branch to the method entry
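As a rough illustration, this sequence can be run through a toy single-issue timing model. The 3-cycle load result latency matches the L1 latency the talk cites later, but the accounting below is my simplification, not the paper's model:

```python
# Count load-use stall cycles in a straight-line sequence.
# Assumptions: one instruction issues per cycle; ldr results arrive 3
# cycles after issue; every other result arrives 1 cycle after issue.
LOAD_LATENCY = 3

def count_stalls(seq):
    avail = {}   # register -> cycle its value becomes available
    t = 0        # earliest issue cycle of the current instruction
    total = 0
    for op, dst, srcs in seq:
        need = max((avail.get(r, 0) for r in srcs), default=0)
        stall = max(0, need - t)
        total += stall
        t += stall                                   # actual issue cycle
        if dst is not None:
            lat = LOAD_LATENCY if op == "ldr" else 1
            avail[dst] = t + lat
        t += 1                                       # next issue slot
    return total

seq = [
    ("ldr", "R12", ["R0"]),    # I0: get vtable from object handle
    ("ldr", "R12", ["R12"]),   # I1: load the method entry address
    ("blx", None,  ["R12"]),   # I2: indirect branch to the method entry
]
print(count_stalls(seq))  # 4: a 2-cycle stall before I1 and before I2
```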
LIS Algorithm: EDM
DM: -1 = no dependency; 0 = dependency, but no stall; positive integer = pipeline stall cycles
Cyl: estimated execution cycles from I0 to the other instructions
Stl: pipeline stall cycles before issuing an instruction
Ceil: the instruction causing the stall with the smallest index
UP, DWN: the boundaries within which an instruction can be safely moved without breaking data dependencies
I0: add R0, R5, R6
I1: sub R1, R7, R8
I2: ldr R2, [R4, 0x4]
I3: add R3, R2, R1
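The "DM" and "Stl" columns for this four-instruction example can be reproduced with a short sketch. The latency model (3 cycles for a ldr, 1 cycle otherwise) and the stall formula here are my assumptions for illustration, not the paper's exact construction:

```python
# Build the DM part of a toy EDM: DM[i][j] is -1 when I_i does not depend
# on I_j, 0 for a dependency with no stall, and the stall cycle count
# otherwise. Stl[i] is the stall before issuing I_i.
insts = [
    ("add", "R0", ["R5", "R6"]),   # I0
    ("sub", "R1", ["R7", "R8"]),   # I1
    ("ldr", "R2", ["R4"]),         # I2: ldr R2, [R4, 0x4]
    ("add", "R3", ["R2", "R1"]),   # I3
]

def latency(op):
    return 3 if op == "ldr" else 1

n = len(insts)
DM = [[-1] * n for _ in range(n)]
for i in range(n):
    for j in range(i):
        if insts[j][1] in insts[i][2]:
            # stall = producer latency minus issue distance, floored at 0
            DM[i][j] = max(0, latency(insts[j][0]) - (i - j))

Stl = [max([DM[i][j] for j in range(i) if DM[i][j] >= 0] + [0])
       for i in range(n)]
print(Stl)  # [0, 0, 0, 2]: a 2-cycle stall before I3, caused by the ldr
```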
LIS Algorithm: Re-ordered Native Instructions
According to the values in the "Stl" column, it is easy to see that there is a 2-cycle stall before instruction I3.
The "DWN" values of the first two instructions, I0 and I1, are equal to or larger than 3.
Both I0 and I1 can therefore be safely moved before I3, to overlap the pipeline stall between I2 and I3.
I2: ldr R2, [R4, 0x4]
I0: add R0, R5, R6
I1: sub R1, R7, R8
I3: add R3, R2, R1
LIS Algorithm: Stall Enclosure
The motivation of the LIS algorithm is to look for instructions that can be moved before the stalled ones.
Stl_En_n is a "stall enclosure", which includes all instructions from index Ceil_n to n.
LIS avoids moving instructions out of a Stl_En, but moves instructions into it.
LIS Algorithm
for (every Stl_n > 0) {
    t = Stl_n;
    for (m = Ceil_n - 1; m >= 0 && t > 0; m--) {
        if (I_m belongs to another Stl_En) break;
        if (I_m has been moved before) continue;
        if (DWN_m > Ceil_n) {
            move I_m after Ceil_n;
            t = t - issue_latency(I_m);
        } /* if */
    } /* for */
    for (m = n + 1; m <= last instruction && t > 0; m++) {
        if (I_m belongs to another Stl_En) break;
        if (I_m has been moved before) continue;
        if (UP_m < n) {
            move I_m before I_n;
            t = t - issue_latency(I_m);
        } /* if */
    } /* for */
} /* for */
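A runnable, simplified rendering of the upward scan (the first inner loop) is sketched below. The latency model, the hazard check, and decrementing t by 1 per moved instruction are my simplifications; the paper's version also runs the downward scan and uses the precomputed UP/DWN columns instead of rechecking hazards:

```python
# Simplified LIS: for each stalled instruction I_n, move independent
# instructions from above Ceil_n into the stall enclosure [Ceil_n, n].
insts = [
    ("add", "R0", ["R5", "R6"]),   # I0
    ("sub", "R1", ["R7", "R8"]),   # I1
    ("ldr", "R2", ["R4"]),         # I2: ldr R2, [R4, 0x4]
    ("add", "R3", ["R2", "R1"]),   # I3
]

def latency(op):
    return 3 if op == "ldr" else 1

def conflict(a, b):
    """RAW, WAR, or WAW hazard between two instructions."""
    return (b[1] in a[2]) or (a[1] in b[2]) or (a[1] is not None and a[1] == b[1])

def stall_before(code, i):
    return max([latency(code[j][0]) - (i - j)
                for j in range(i) if code[j][1] in code[i][2]] + [0])

def lis(code):
    code = list(code)
    for n in range(len(code)):
        t = stall_before(code, n)
        if t <= 0:
            continue
        # Ceil_n: the stall-causing producer with the smallest index
        ceil = min(j for j in range(n)
                   if code[j][1] in code[n][2]
                   and latency(code[j][0]) - (n - j) > 0)
        m = ceil - 1
        while m >= 0 and t > 0:
            # I_m may hop over instructions m+1..ceil only if independent
            if all(not conflict(code[m], code[k])
                   for k in range(m + 1, ceil + 1)):
                code.insert(ceil, code.pop(m))   # lands right after Ceil_n
                ceil -= 1                        # producer shifted one slot left
                t -= 1
            m -= 1
        # (the second loop, scanning below I_n, is omitted here)
    return code

order = lis(insts)
print([f"I{insts.index(x)}" for x in order])  # ['I2', 'I0', 'I1', 'I3']
```

On the four-instruction example from the EDM slide, this reproduces the reordering shown earlier: I2, I0, I1, I3.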
[Figure: the instruction stream split into regions (A) through (E); (B) is the stall enclosure Stl_En_n, spanning from Ceil_n to the nth instruction, and (D) is Stl_En_y, spanning from Ceil_y to the yth instruction.]
LIS Algorithm: Complexity
In practice, only a few instructions around the stalled ones, i.e., instructions with positive Stl values, are visited during scheduling. This is one of the key reasons this algorithm runs faster than other approaches.
In the worst case, every instruction is visited at most twice, as introduced in the previous section, so the time complexity is still linear in the code length.
LIS Algorithm: Static Counts of Total and Stalled Instructions
On average, the total instructions outnumber the stalled instructions by about 6.8 times, and by up to 8.31 times for kXML. This large ratio is what makes LIS run fast.
LIS Algorithm: Complexity (Building the EDM)
In XORP JIT, we use a scheduling window of constant size to build the EDM.
All columns except "DWN" can be calculated as the EDM grows.
On reaching a basic-block boundary, or when the scheduling window overflows, the algorithm updates all values in the "DWN" column.
For each thread, most of the memory required by this algorithm is the constant-size EDM; with 16 rows, its total size is less than 1 KB.
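The constant-memory behavior can be illustrated with a sketch of the window mechanism; the 16-row size comes from the slide, while the flush policy and interface below are my simplifications:

```python
# Fixed-size scheduling window: instructions stream through; whenever the
# window fills or a basic block ends, the window is scheduled and flushed,
# so per-thread memory stays constant regardless of method size.
WINDOW_ROWS = 16

def schedule_stream(instructions, schedule_window):
    out, window = [], []
    for inst, ends_block in instructions:
        window.append(inst)
        if len(window) == WINDOW_ROWS or ends_block:
            out.extend(schedule_window(window))
            window = []
    if window:
        out.extend(schedule_window(window))   # flush the tail
    return out

# With an identity "scheduler", the stream passes through unchanged:
stream = [(i, i % 5 == 4) for i in range(40)]
print(schedule_stream(stream, lambda w: list(w)) == list(range(40)))  # True
```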
Performance Evaluation
XORP JIT compiles every Java method to XScale native instructions the first time the method is called at runtime.
The implementation of list scheduling is based on simple heuristic rules and does not deal with register allocation.
List scheduling traverses the DAG from the roots toward the leaves, selects the ready instruction with the earliest execution time and the maximum possible delay, and updates the current time and the earliest execution times of its children.
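For reference, that baseline can be rendered as a compact greedy list scheduler; the DAG construction, latencies, and tie-breaking below are my assumptions for illustration, not XORP's actual implementation:

```python
# Greedy list scheduling over a RAW-dependence DAG: repeatedly pick, among
# ready instructions, the one with the earliest ready time, breaking ties
# by the longest latency path to a leaf (maximum possible delay).
def list_schedule(insts):
    lat = lambda op: 3 if op == "ldr" else 1
    n = len(insts)
    preds = [[] for _ in range(n)]
    succs = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if insts[j][1] in insts[i][2]:   # I_j produces a value I_i reads
                preds[i].append(j)
                succs[j].append(i)

    # maximum possible delay: longest latency path from a node to a leaf
    delay = [0] * n
    for i in reversed(range(n)):
        delay[i] = lat(insts[i][0]) + max((delay[s] for s in succs[i]), default=0)

    done, order, finish = set(), [], {}
    while len(order) < n:
        ready = [i for i in range(n) if i not in done
                 and all(p in done for p in preds[i])]
        est = {i: max((finish[p] for p in preds[i]), default=0) for i in ready}
        pick = min(ready, key=lambda i: (est[i], -delay[i]))
        start = max(est[pick], len(order))   # crude one-issue-per-cycle model
        finish[pick] = start + lat(insts[pick][0])
        done.add(pick)
        order.append(pick)
    return order

insts = [
    ("add", "R0", ["R5", "R6"]),   # I0
    ("sub", "R1", ["R7", "R8"]),   # I1
    ("ldr", "R2", ["R4"]),         # I2
    ("add", "R3", ["R2", "R1"]),   # I3
]
print(list_schedule(insts))  # the ldr is hoisted first to hide its latency
```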
Performance Evaluation
We chose six Java workloads from EEMBC [3], namely Chess, Cryptography, kXML, Parallel, PNG Decoding and Regular Expression to demonstrate the comparisons between list scheduling and LIS, in terms of the compilation time and runtime performance.
Performance Evaluation
Figure 7 shows that the average compilation time across the EEMBC workloads occupies 25.3% of the first-round execution time. List scheduling consumes 15.9% of the total compilation time, and 4% of the total execution time, on average.
Performance Evaluation
Figure 8 illustrates the compilation time, including scheduling plus building the EDM (for LIS) or building DAGs (for list scheduling).
Performance Evaluation
Figure 9 presents the efficiency of the result code generated by LIS.
Performance Evaluation
The runtime performance improvement from LIS is significant for the workloads we studied, as shown in Figure 10.
Conclusions
Because of resource constraints, especially power constraints, on embedded systems, processor capability and memory footprint are still bottlenecks for a high-performance JIT compiler.
For XScale™, the 3-cycle L1 cache latency can produce significant pipeline stalls at runtime, as introduced above.
Lightweight instruction scheduling mechanisms like LIS can reduce these pipeline stalls in an easier and faster way.