atlas (a.k.a. ramp red) parallel programming with transactional memory
Post on 31-Dec-2015
ATLAS (a.k.a. RAMP Red)
Parallel Programming with Transactional Memory
Njuguna Njoroge and Sewook Wee
Transactional Coherence and Consistency
Computer System Lab
Stanford University
http://tcc.stanford.edu/prototypes
Why we built ATLAS
Multicore processors expose the challenges of multithreaded programming
• Transactional Memory simplifies parallel programming
    • As simple as coarse-grain locks
    • As fast as fine-grain locks
Currently missing for evaluating TM
• Fast TM prototypes to develop software on
FPGAs' improving capabilities are attractive for CMP prototyping
• Fast: can operate > 100 MHz
• More logic, memory and I/Os
• Larger libraries of pre-designed IP cores
ATLAS: 8-processor Transactional Memory System
• 1st FPGA-based Hardware TM system
• Member of RAMP initiative: RAMP Red
ATLAS provides …
Speed
• > 100x speed-up over SW simulator [FPGA 2007]
Rich software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
Productivity
• Guided performance tuning
• Standard GDB environment + deterministic replay
TCC’s Execution Model
Transaction
• Building block of a program
• Critical region
• Executed atomically & isolated from others
[Figure: execution timelines for CPU 0, CPU 1 and CPU 2. Each CPU repeats Execute Code, Arbitrate, Commit. One CPU commits st 0xbeef; a CPU that had speculatively executed ld 0xbeef is violated, performs Undo, and Re-Executes its code. Loads to other addresses (0xaaaa, 0xbbbb, 0xdddd, 0xeeee) are unaffected.]
In TCC, All Transactions All The Time [PACT 2004]
CMP Architecture for TCC
[Figure: TCC processor node. Processor with Register Checkpoint; Data Cache with a 2-ported TAG array (V, R7:0 and W7:0 bits) and a single-ported DATA array; Store Address FIFO; Snoop Control; Commit Control. Signals: Load/Store Address, Violation, Commit Address In, Commit Address Out, Commit Data Out. Buses: Commit Bus, Refill Bus.]
Speculatively Read Bits: ld 0xdeadbeef
Speculatively Written Bits: st 0xcafebabe
Violation Detection: Compare incoming address to R bits
Commit: Read pointers from Store Address FIFO, flush addresses with W bits set
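The R/W-bit bookkeeping on this slide can be sketched in software. The following is a minimal model, not the ATLAS hardware: it assumes a small direct-mapped cache, keeps one R and one W bit per line (the slide shows 8 of each, R7:0 and W7:0), and treats overflow as a failed access. All type and function names (tcc_cache, tcc_load, tcc_store, tcc_snoop, tcc_commit, tcc_undo) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES 16  /* hypothetical direct-mapped cache size */

typedef struct {
    bool valid, r, w;  /* r/w: speculatively read/written this transaction */
    uint32_t addr;     /* address held by this line */
} tcc_line;

typedef struct {
    tcc_line line[NUM_LINES];
    size_t fifo[NUM_LINES];  /* Store Address FIFO: indices of written lines */
    size_t fifo_len;
} tcc_cache;

/* Finds the line for addr. Returns NULL if the slot holds speculative state
 * for another address (an overflow, which this sketch does not handle). */
static tcc_line *lookup(tcc_cache *c, uint32_t addr) {
    tcc_line *l = &c->line[addr % NUM_LINES];
    if (l->valid && l->addr != addr) {
        if (l->r || l->w) return NULL;
        l->valid = false;  /* evict a non-speculative line */
    }
    if (!l->valid) {
        l->valid = true;
        l->addr = addr;
        l->r = l->w = false;
    }
    return l;
}

/* Load: set the R bit, unless we are reading our own speculative write. */
bool tcc_load(tcc_cache *c, uint32_t addr) {
    tcc_line *l = lookup(c, addr);
    if (!l) return false;
    if (!l->w) l->r = true;
    return true;
}

/* Store: set the W bit and push a pointer to the line into the FIFO once. */
bool tcc_store(tcc_cache *c, uint32_t addr) {
    tcc_line *l = lookup(c, addr);
    if (!l) return false;
    if (!l->w) {
        l->w = true;
        c->fifo[c->fifo_len++] = (size_t)(l - c->line);
    }
    return true;
}

/* Violation detection: compare an incoming commit address to the R bits. */
bool tcc_snoop(const tcc_cache *c, uint32_t commit_addr) {
    const tcc_line *l = &c->line[commit_addr % NUM_LINES];
    return l->valid && l->addr == commit_addr && l->r;
}

/* Commit: read pointers from the FIFO, emit addresses whose W bit is set,
 * then clear all R/W bits. out must have room for NUM_LINES entries. */
size_t tcc_commit(tcc_cache *c, uint32_t *out) {
    size_t n = 0;
    for (size_t i = 0; i < c->fifo_len; i++) {
        tcc_line *l = &c->line[c->fifo[i]];
        if (l->w) out[n++] = l->addr;
    }
    for (size_t i = 0; i < NUM_LINES; i++) c->line[i].r = c->line[i].w = false;
    c->fifo_len = 0;
    return n;
}

/* Undo: drop speculatively written lines and clear all R/W bits. */
void tcc_undo(tcc_cache *c) {
    for (size_t i = 0; i < NUM_LINES; i++) {
        if (c->line[i].w) c->line[i].valid = false;
        c->line[i].r = c->line[i].w = false;
    }
    c->fifo_len = 0;
}
```

In use, one cache would call tcc_commit and every other cache would call tcc_snoop on each emitted address; a true result means that CPU must call tcc_undo and re-execute its transaction.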
ATLAS 8-way CMP on BEE2 Board
[Figure: BEE2 board. One Control FPGA containing the Linux PPC, DDR2 DRAM Controller, Commit Token Arbiter, I/O (disk, net) and a Control Switch; four User FPGAs, each containing a User Switch and two TCC PPCs, each TCC PPC with its own TCC$ and I$.]
User FPGAs
• 4 FPGAs for a total of 8 TCC CPUs
• PPC, TCC caches, BRAMs and busses run @ 100 MHz
Control FPGA
• Linux PPC @ 300 MHz
    • Launch TCC apps here
    • Handle system services for TCC PowerPCs
• Fabric runs @ 100 MHz
ATLAS Software Overview
[Figure: software stack, top to bottom: TM Application; TM API and ATLAS Profiler; ATLAS Subsystem; Linux OS; ATLAS HW on BEE2.]
TM applications can be written easily with the TM API
ATLAS profiler provides runtime profiling and guided performance tuning
ATLAS subsystem provides Linux OS support for the TM application
ATLAS subsystem
[Figure: the Linux PPC transfers the initial context to TCC PPC0. TCC PPC0 invokes parallel work on TCC PPC1, TCC PPC2, … TCC PPC7, joins the parallel work, and exits with app. stats. Events labeled in the timeline: Commit, Violation.]
ATLAS System Support
TCC PPC requests OS support (TLB miss, system call).
Linux PPC regenerates and services the request.
Linux PPC replies back to the requestor.
Serialize, if request is irrevocable
• System Call
• Page-out
[Figure: request and reply exchange between the Linux PPC and a TCC PPC.]
Coding with TM API: histogram
int main(int argc, char* argv[]) {
    … sequential code …
    TM_PARALLEL(run, NULL, numCpus);
    … sequential code …
}

// static scheduling with interleaved access to A[]
void* run(void* args) {
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        TM_BEGIN();
        bucket[A[i]]++;   // one transaction per bucket update
        TM_END();
    }
    return NULL;
}
OpenTM will provide high-level (OpenMP style) pragmas
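To try the histogram loop without ATLAS hardware, the TM macros can be replaced with sequential stand-ins. The sketch below is an assumption for illustration only, not the ATLAS TM API implementation: it runs the "threads" one after another, so each transaction is trivially atomic and isolated. The input array, NUM_BUCKETS and build_histogram are hypothetical.

```c
#include <stddef.h>

/* Hypothetical sequential stand-ins for the ATLAS TM API macros.
 * They are NOT the ATLAS implementation: "threads" run one after another. */
static int tm_thread_id;    /* id of the "thread" currently running */
static int tm_num_threads;  /* total number of "threads" */

#define TM_GET_THREAD_ID()  (tm_thread_id)
#define TM_GET_NUM_THREAD() (tm_num_threads)
#define TM_BEGIN()          ((void)0)
#define TM_END()            ((void)0)
#define TM_PARALLEL(fn, args, n)                                   \
    do {                                                           \
        tm_num_threads = (n);                                      \
        for (tm_thread_id = 0; tm_thread_id < (n); tm_thread_id++) \
            (void)(fn)(args);                                      \
    } while (0)

#define NUM_LOOP    12
#define NUM_BUCKETS 4

/* Example input: each element names the bucket it falls into. */
static const int A[NUM_LOOP] = {0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 3};
static int bucket[NUM_BUCKETS];

/* Static scheduling with interleaved access to A[]: thread t handles
 * elements t, t + T, t + 2T, ... where T is the thread count. */
void *run(void *args) {
    (void)args;
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        TM_BEGIN();
        bucket[A[i]]++;
        TM_END();
    }
    return NULL;
}

/* Clears the buckets and builds the histogram with num_cpus "threads". */
void build_histogram(int num_cpus) {
    for (int b = 0; b < NUM_BUCKETS; b++) bucket[b] = 0;
    TM_PARALLEL(run, NULL, num_cpus);
}
```

Calling build_histogram with any thread count should give the same bucket totals, since the interleaved schedule only changes which "thread" visits each element, not how often an element is counted.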
Guided Performance Tuning
TAPE: Light-weight runtime profiler [ICS 2005]
Tracking most significant violations (longest loss time)
• Violated object address
• PC where object was read
• Loss time & # of occurrences
• Committing thread’s ID and transaction PC
Tracking most significant overflows (longest duration)
• Overflows: when speculative state can no longer stay in TCC$
• PC where overflow occurs
• Overflow duration & number of occurrences
• Type of overflow (LRU or Write Buffer)
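The violation fields listed above suggest a small per-source record. The following is a sketch of one way to keep only the most significant violation sources in a fixed-size table. It is an assumption for illustration, not TAPE's actual data structure: the table size, the eviction policy, and all names (violation_record, record_violation, most_significant) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 8  /* hypothetical table size */

typedef struct {
    uint32_t object_addr;   /* violated object address */
    uint32_t read_pc;       /* PC where the object was read */
    uint64_t loss_cycles;   /* accumulated loss time */
    uint32_t occurrences;   /* number of violations from this source */
    uint32_t committer_id;  /* committing thread's ID (most recent) */
    uint32_t committer_pc;  /* committing transaction's PC (most recent) */
} violation_record;

typedef struct {
    violation_record rec[MAX_RECORDS];
    size_t used;
} violation_profile;

/* Adds one violation. Violations with the same (address, read PC) pair are
 * merged. When the table is full, the entry with the smallest loss time is
 * replaced, but only if the new violation lost more time than it did. */
void record_violation(violation_profile *p, uint32_t object_addr,
                      uint32_t read_pc, uint64_t loss_cycles,
                      uint32_t committer_id, uint32_t committer_pc) {
    violation_record *slot = NULL;
    for (size_t i = 0; i < p->used; i++) {
        violation_record *r = &p->rec[i];
        if (r->object_addr == object_addr && r->read_pc == read_pc) {
            slot = r;
            break;
        }
    }
    if (!slot) {
        if (p->used < MAX_RECORDS) {
            slot = &p->rec[p->used++];
        } else {
            slot = &p->rec[0];
            for (size_t i = 1; i < MAX_RECORDS; i++)
                if (p->rec[i].loss_cycles < slot->loss_cycles) slot = &p->rec[i];
            if (slot->loss_cycles >= loss_cycles) return;
        }
        slot->object_addr = object_addr;
        slot->read_pc = read_pc;
        slot->loss_cycles = 0;
        slot->occurrences = 0;
    }
    slot->loss_cycles += loss_cycles;
    slot->occurrences++;
    slot->committer_id = committer_id;
    slot->committer_pc = committer_pc;
}

/* Returns the record with the longest accumulated loss time, or NULL. */
const violation_record *most_significant(const violation_profile *p) {
    const violation_record *best = NULL;
    for (size_t i = 0; i < p->used; i++)
        if (!best || p->rec[i].loss_cycles > best->loss_cycles) best = &p->rec[i];
    return best;
}
```

An overflow table could follow the same shape, keyed by the overflowing PC and carrying the duration, occurrence count and overflow type.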
Deterministic Replay
All Transactions All The Time
• TM 101: Transaction is executed atomically and in isolation
• TM’s illusion: transaction starts after older transactions finish
Only need to record “the order of commit”
• Minimal runtime overhead & footprint size = 1B / transaction
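[Figure: two timelines, "Logging execution" and "Replay execution", each showing transactions T0, T1 and T2 with their write-sets, and T2 shown twice. During logging, the commit order T0, T1, T2 is written to LOG. During replay, the token arbiter enforces the commit order specified in LOG.]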
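Recording and enforcing the commit order can be sketched as a byte-per-transaction log. This is a minimal model under stated assumptions, not the ATLAS token arbiter: it assumes the logged byte is the ID of the committing CPU (the slides give only the 1B / transaction footprint), and the names commit_log, log_commit and replay_request_token are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 1024  /* hypothetical log size, in transactions */

typedef struct {
    uint8_t order[LOG_CAPACITY];  /* 1 byte per transaction: committing CPU id */
    size_t len;                   /* transactions recorded so far */
    size_t next;                  /* replay cursor into order[] */
} commit_log;

/* Logging run: record which CPU was granted the commit token.
 * Returns false when the log is full. */
bool log_commit(commit_log *log, uint8_t cpu_id) {
    if (log->len == LOG_CAPACITY) return false;
    log->order[log->len++] = cpu_id;
    return true;
}

/* Replay run: grant the token only to the CPU that the log names next.
 * A CPU that asks out of turn is refused and must ask again later. */
bool replay_request_token(commit_log *log, uint8_t cpu_id) {
    if (log->next >= log->len || log->order[log->next] != cpu_id) return false;
    log->next++;
    return true;
}
```

Because every instruction runs inside a transaction, forcing the same commit order reproduces the same interleaving; editing the order array before replay is one way to explore other schedules.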
Useful Features of Replay
Monitoring code in the transaction
• Remember we only record the transaction order
Verification
• Log is not written in stone
• Complete runtime scenario coverage is possible
Choice of running Replay on
• ATLAS itself
    • HW support for other debugging tools (see next slide)
• Local machine (your favorite desktop or workstation)
    • Runs natively on faster local machine, sequentially
    • Seamless access to existing debugging tools
GDB support
Current status
• GDB integrated with local machine replay
    • GDB provides debuggability while guaranteeing deterministic replay
• Below are work-in-progress
Breakpoint
• Thread local BP vs. global BP
• Stop the world by controlling commit token
Stepping
• Backward stepping: Transaction is ready to roll back
• Transaction stepping
Unlimited data-watch (ATLAS only)
• Separate monitor TCC cache to register data-watches
Conclusion: ATLAS provides
Speed
• > 100x speed-up over SW simulator [FPGA 2007]
Software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
Productivity
• TAPE: Guided performance tuning
• Deterministic replay
• Standard GDB environment
Future Work
• High-level language support (Java, Python, …)
Questions and Answers
tcc_fpga_xtreme@mailman.stanford.edu
ATLAS Team Members
• System Hardware – Njuguna Njoroge, PhD Candidate
• System Software – Sewook Wee, PhD Candidate
• High level languages – Jiwon Seo, PhD Candidate
• HW Performance – Lewis Mbae, BS Candidate
Past contributors
• Interconnection Fabric – Jared Casper, PhD Candidate
• Undergrads – Justy Burdick, Daxia Ge, Yuriy Teslar