TRANSCRIPT
Fast Paths in Concurrent Programs
Wen Xu, Princeton University
Sanjeev Kumar, Intel Labs
Kai Li, Princeton University
Concurrent Programs

- Message-passing style: processes and channels, e.g. streaming languages
  [Diagram: processes P1-P4 connected by channels C1-C3, partitioned across Processor 1 and Processor 2]
- Uniprocessors: programming convenience
  ─ Embedded devices
  ─ Network software stack
  ─ Media processing
- Multiprocessors: exploit parallelism by partitioning the processes

Problem: compile a concurrent program to run efficiently on a uniprocessor.
Compiling Concurrent Programs

- Process-based approach: keep the processes separate and context switch between them
  ─ Small executable (the sum of the processes)
  ─ Significant overhead
- Automata-based approach: treat each process as a state machine and combine the state machines
  ─ Small overhead
  ─ Large executables (potentially exponential)

One study compared the two approaches and found that, compared to the process-based approach, the automata-based approach generates code that is
─ Twice as fast
─ 2-3 orders of magnitude larger

Neither approach is satisfactory.
Our Work

Our goal: compile concurrent programs
- Automated using a compiler
- Low overhead
- Small executable size

Our approach: combine the two approaches
- Use the process-based approach to handle all cases
- Use the automata-based approach to speed up the common cases
Outline

- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Fast Paths

- Path: a dynamic execution path in the program
- Fast path (or hot path): a well-known technique
  ─ Identify commonly executed paths (hot paths)
  ─ Specialize and optimize them (fast paths)
- Two components
  ─ A predicate that specifies the fast path
  ─ Optimized code to execute the fast path
- Compilers can be used to automate it
  ─ Mostly in sequential programs so far
Manually Implementing Fast Paths

To achieve good performance in concurrent programs:
- Start: insert code that identifies the common case and transfers control to the fast-path code
- Extract and optimize the fast-path code manually
- Finish: patch up state and return control at the end of the fast path

Obvious drawbacks
- Difficult to implement correctly
- Difficult to maintain
Outline

- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Our Approach

[Diagram: the baseline (process-based) code contains a test (1) that transfers control to the fast path (automata-based) when its predicate holds; the optimized code (2) executes; an abort check (3) returns control to the baseline code if the fast path cannot complete.]
Specifying Fast Paths

A concurrent program has multiple processes. A fast path is specified with:
- Regular expressions over statements
- Conditions (optional)
- Synchronization (optional)
- Support for early abort

Advantages: powerful, compact, a hint to the compiler.

    fastpath example {
        process first {
            statement A, B, C, D, #1;
            start A ? (size < 100);
            follows B ( C D )*;
            exit #1;
        }
        process second { ... }
        process third { ... }
    }
Extracting Fast Paths

- The automata-based approach is used to extract fast paths
- A fast path involves a group of processes
  ─ The compiler keeps track of the execution point of each involved process
  ─ On exit, control is returned to the appropriate location in each process
- Baseline: concurrent code. Fast path: sequential code
- Fairness on the fast path
  ─ Embed scheduling decisions in the fast path: avoids scheduling/fairness overhead there
  ─ Rely on the baseline code for fairness: it is always taken a fraction of the time
Optimization on the Fast Path

Enabling traditional optimizations
- Generate and optimize the baseline code
- Generate the fast-path code
  ─ Fast paths have exit/entry points into the baseline code
- Use data-flow information from the baseline code at the exit/entry points to seed the analysis and optimize the fast-path code

Speeding up the fast path using lazy execution
- Delay operations that are not needed on the fast path until the end of the fast path
- Such operations are performed immediately if the fast path aborts
Outline

- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Experimental Evaluation

Implemented the techniques in the paper in the ESP compiler, which supports concurrent programs.

Two classes of programs
- Filter programs
- VMMC firmware

Answer three questions
- Programming effort (annotation complexity) needed
- Size of the executable
- Performance
Filter Programs

- Streaming applications with a well-defined structure; we use the filter programs of Proebsting et al.
- Good for evaluating our technique: concurrency overheads dominate
  [Diagram: a pipeline of processes P1-P4 connected by channels C1-C3]

Experimental setup
- 2.66 GHz Pentium 4, 1 GB memory, Linux 2.4
- 4 versions of the code

Annotation complexity
- Program sizes: 153, 125, 190, 196 lines
- Annotation sizes: 7, 7, 10, 10 lines
Filter Programs (Cont'd)

[Charts: normalized executable size and performance for Programs 1-4, comparing Process-based, Automata-based, Process-based with Manual Fast Path, and Process-based with Automatic Fast Path. Off-scale bar labels include 4.17, 23.52, 28.33, 9.47 and 5.15, 5.53.]

- Better performance than both baseline approaches
- Relatively small executable
VMMC Firmware

- Firmware for a gigabit network (Myrinet)

Experimental setup
- Measure network performance between two machines connected with Myrinet: latency and bandwidth
- 3 versions of the firmware
  ─ Concurrent C version with manual fast paths
  ─ Process-based code without fast paths
  ─ Process-based code with compiler-extracted fast paths

Annotation complexity (3 fast paths)
- Fast path specifications: 20, 14, and 18 lines
- Manual fast paths in C: 1100 lines total
VMMC Firmware (Cont'd)

[Chart: latency in microseconds (0-70) vs. message size in bytes (4-512) for Hand-Optimized C with Manual Fast Paths, Process-based, and Process-based with Automatic Fast Paths.]

[Chart: generated code size in assembly instructions (0-40000) for Hand-Optimized C with Manual Fast Paths, Process-based Code, and Process-based with Automatic Fast Paths.]
Outline

- Motivation
- Fast Paths
- Fast Paths in Concurrent Programs
- Experimental Evaluation
- Conclusions
Conclusions

Fast paths in concurrent programs, evaluated using filter programs and VMMC firmware:
- The process-based approach handles all cases
  ─ Keeps the executable size reasonable
- The automata-based approach handles only the common cases (fast paths)
  ─ Avoids the high overhead of the process-based approach
  ─ Often outperforms the purely automata-based code