Notes on an actor language
Jörn W. Janneck, Xilinx Inc.
13 February 2007 – 7th Ptolemy Miniconference
CAL Actor Language
• scripting actor specifications
  – make it easier to write atomic actors
• experimenting with domain polymorphism
• (code generation)
CAL @ Ptolemy
• the language
• domain-dependent interpretation
CAL @ Xilinx
• overview
• application
actors in CAL
encapsulated state
Actions
State
guarded atomic actions
simple actors
actor Sum () Input ==> Output:
  sum := 0;
  action [a] ==> [sum]
  do
    sum := sum + a;
  end
end
[Figure: Sum actor block]
actor SumAbs () Input ==> Output:
  sum := 0;
  action [a] ==> [sum]
  guard a >= 0
  do
    sum := sum + a;
  end
  action [a] ==> [sum]
  guard a < 0
  do
    sum := sum - a;
  end
end
[Figure: SumAbs actor block with ports Input and Output]
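To make the guarded actions concrete, the SumAbs actor can be modelled in a few lines of Python. This is only a sketch: the class and function names are hypothetical, and it assumes the output expression [sum] is evaluated after the action body has updated the state, which is what the running-sum examples suggest.

```python
# Hypothetical Python model of the SumAbs actor; not part of any CAL tooling.

class SumAbs:
    def __init__(self):
        self.sum = 0  # encapsulated state, initialised like "sum := 0"

    def fire(self, a):
        """Consume one input token and return the single output token."""
        if a >= 0:        # guard of the first action
            self.sum += a
        else:             # guard of the second action (a < 0)
            self.sum -= a
        # Assumption: the output expression sees the updated state.
        return self.sum


def run(actor, tokens):
    """Fire the actor once per input token and collect the outputs."""
    return [actor.fire(t) for t in tokens]


outputs = run(SumAbs(), [3, -4, 0, -1])
print(outputs)
```

Because the two guards are complementary, exactly one action is enabled for every token, so this actor behaves deterministically.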
nondeterminism
actor NDMerge () Input1, Input2 ==> Output:
  action Input1: [x] ==> [x] end
  action Input2: [x] ==> [x] end
end
[Figure: NDMerge actor block with inputs Input1, Input2 and output Output]
data-dependent token flow
actor Select () S, A, B ==> Output:
  action S: [sel], A: [v] ==> [v]
  guard sel
  end
  action S: [sel], B: [v] ==> [v]
  guard not sel
  end
end
[Figure: Select actor block with inputs S, A, B and output Output]
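The data-dependent token flow of Select can be sketched with plain queues. The helper names below are hypothetical, and the model assumes an action is enabled only when every port in its input pattern has a token and its guard holds.

```python
from collections import deque

# Hypothetical queue-based model of the Select actor.

def select_step(s, a, b, out):
    """Try to fire one Select action; return True if an action fired."""
    if not s:
        return False
    sel = s[0]                  # peek at the control token on S
    data = a if sel else b      # the guard decides which data port is read
    if not data:
        return False            # the matching action is not enabled yet
    s.popleft()
    out.append(data.popleft())
    return True


def run_select(s_tokens, a_tokens, b_tokens):
    """Fire until no action is enabled; return outputs and leftover tokens."""
    s, a, b, out = deque(s_tokens), deque(a_tokens), deque(b_tokens), []
    while select_step(s, a, b, out):
        pass
    return out, list(a), list(b)


print(run_select([True, False, False, True], [1, 2], [10, 20, 30]))
```

How many tokens are consumed from A and B depends on the values arriving on S, which is why no fixed token rate can be assigned to those ports.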
CAL and domain polymorphism
• two fundamental questions:
  1. Can an actor be interpreted/used in a given MoC?
  2. What is its interpretation?
domain-specific interpretation
Example: SDF
actor Add () Input1, Input2 ==> Output:
  action [a], [b] ==> [a + b] end
end
[Figure: Add with token rates Input1 = 1, Input2 = 1, Output = 1]
actor AddSeq () Input ==> Output:
  action [a, b] ==> [a + b] end
end
[Figure: AddSeq with token rates Input = 2, Output = 1]
Example: SDF (cont’d)
actor NDMerge () Input1, Input2 ==> Output:
  action Input1: [x] ==> [x] end
  action Input2: [x] ==> [x] end
end
[Figure: NDMerge with inputs Input1, Input2 and output Output, no rate annotations]
actor Merge () Input1, Input2 ==> Output:
  action [x1], [x2] ==> [x1, x2] end
end
[Figure: Merge with token rates Input1 = 1, Input2 = 1, Output = 2]
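The first question, whether an actor can be used in SDF at all, can be approximated by inspecting the token counts in each action's patterns. The sketch below uses a deliberately conservative criterion that is an assumption, not something stated on the slides: every action must be unguarded and all actions must agree on the tokens consumed and produced per port. The function name and the data layout are hypothetical.

```python
# Hypothetical, conservative SDF check based on action token counts.

def sdf_rates(in_ports, out_ports, actions):
    """actions: list of (consumed, produced, guarded) with per-port counts.

    Returns (input_rates, output_rates) if one static rate exists, else None.
    """
    rates = set()
    for consumed, produced, guarded in actions:
        if guarded:
            return None  # firing depends on token values, not only counts
        rates.add((
            tuple(consumed.get(p, 0) for p in in_ports),
            tuple(produced.get(p, 0) for p in out_ports),
        ))
    if len(rates) != 1:
        return None      # actions disagree, so no single static rate exists
    return rates.pop()


two_in, one_out = ["Input1", "Input2"], ["Output"]
add = sdf_rates(two_in, one_out, [({"Input1": 1, "Input2": 1}, {"Output": 1}, False)])
add_seq = sdf_rates(["Input"], one_out, [({"Input": 2}, {"Output": 1}, False)])
merge = sdf_rates(two_in, one_out, [({"Input1": 1, "Input2": 1}, {"Output": 2}, False)])
nd_merge = sdf_rates(two_in, one_out, [
    ({"Input1": 1}, {"Output": 1}, False),
    ({"Input2": 1}, {"Output": 1}, False),
])
print(add, add_seq, merge, nd_merge)
```

Under this criterion Add, AddSeq, and Merge get the rates annotated in the figures, while NDMerge is rejected because its two actions consume from different ports. The check also rejects actors such as SumAbs, whose complementary guards would in fact allow a static rate, so it is sufficient but not necessary.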
Some kind of “synchronous”...
[Figure: two networks annotated with token rates, one containing NDMerge (with blocks labelled A and F) and one containing Merge]
Example: CSP
actor NDMerge () Input1, Input2 ==> Output:
  action Input1: [x] ==> [x] end
  action Input2: [x] ==> [x] end
end
actor Add () Input1, Input2 ==> Output:
  action [a], [b] ==> [a + b] end
end
CSP interpretation of NDMerge:
  [ Input1 ? x -> Output ! x
  || Input2 ? x -> Output ! x ]
CSP interpretations of Add:
  Input1 ? a -> Input2 ? b -> Output ! a + b
  [ Input1 ? a -> Input2 ? b
  || Input2 ? b -> Input1 ? a ] ; Output ! a + b
Example: CSP (cont’d)
actor Select () S, A, B ==> Output:
  action S: [sel], A: [v] ==> [v]
  guard sel
  end
  action S: [sel], B: [v] ==> [v]
  guard not sel
  end
end
CSP interpretation of Select:
  S ? sel; [ sel -> A ? v -> Output ! v
  || not sel -> B ? v -> Output ! v ]
actor A () X, Y ==> Z:
  action X: [x1, x2] ==> [f(x1, x2)]
  guard P(x1, x2)
  end
  action Y: [y1, y2] ==> [f(y1, y2)]
  guard P(y1, y2)
  end
end
CSP interpretation of A:
  ?
CAL and dataflow at Xilinx
class MyActor {
  schedule();
  readPort( portNum );
  writePort( portNum );
}
[Figure: tool flow with the elements actor source + network, simulation, high-level synthesis, software, hardware]
new FPGA programming model & tools
• hardware code generation
• software (& mixed) code generation
driver application
• MPEG4 Simple Profile Decoder
MPEG standardization effort
• ISO/IEC 23001-4 (working draft): Codec Configuration Representation
• ISO/IEC 23002-4 (working draft): Video Tool Library
FPGA Programming In Practice
Networked MPEG-4 Viewer
[Figure: system diagram with the components Remote Video Stream Server, UDP over Ethernet, XUP Board (2VP30), Ethernet, UDP, Microblaze running LWIP protocol stack, Decoder Actor Network, Raster Scan Actor, Memory Controller, VGA Display IP, Local VGA Monitor]
MPEG-4 SP Decoder
quality of compiled code
Area columns: Slice, LUT, FF, BRAM, MULT

Version                      Slice  LUT   FF    BRAM    MULT  Performance
VHDL IP (1), 15000 lines     4637   7923  2637  26 (2)  34    4-CIF image size; 180K macroblock/s @ 100MHz; requires ZBT SRAM framebuf
CAL decoder, 4000 lines      3872   7720  3576  22 (3)  7     HD image size; 243K macroblock/s @ 120MHz; interfaces to DRAM framebuf; I-frame parsing: 50 Mbit/s

(1) http://www.xilinx.com/bvdocs/ipcenter/data_sheet/ds520_prod_brf.pdf
(2) BRAM-limited to 4-CIF image size.
(3) Supports HD image size. Reduces to 16 BRAMs for 4-CIF image size.
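The throughput and clock figures in the table can be turned into a per-macroblock cycle budget, which makes the two designs easier to compare. The short calculation below uses only the slice counts, macroblock rates, and clock frequencies quoted above; the variable names are illustrative.

```python
# Cycle budget per macroblock and relative size, from the table's figures.

vhdl = {"slices": 4637, "mb_per_s": 180e3, "clock_hz": 100e6}
cal = {"slices": 3872, "mb_per_s": 243e3, "clock_hz": 120e6}

vhdl_cycles_per_mb = vhdl["clock_hz"] / vhdl["mb_per_s"]
cal_cycles_per_mb = cal["clock_hz"] / cal["mb_per_s"]
throughput_ratio = cal["mb_per_s"] / vhdl["mb_per_s"]
slice_ratio = cal["slices"] / vhdl["slices"]

print(f"VHDL IP:     {vhdl_cycles_per_mb:.1f} cycles per macroblock")
print(f"CAL decoder: {cal_cycles_per_mb:.1f} cycles per macroblock")
print(f"throughput ratio (CAL / VHDL): {throughput_ratio:.2f}")
print(f"slice ratio (CAL / VHDL):      {slice_ratio:.2f}")
```

On these numbers the CAL decoder spends fewer cycles per macroblock while also running at a higher clock and using fewer slices.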
comparing decoder solutions
[Figure: relative area efficiency (ticks 1, 2, 5, 10) plotted against throughput in macroblocks/sec x1000 (ticks 10, 100, 1000, marked CIF, SD, HD), with data points a to d]
a  TI64xx MPEG-4 (CPU + L1 cache only)
b  ISSCC’06 H.264 capable (includes periphery)
c  FPGA MPEG-4 using traditional HDL flow (12 MM effort)
d  FPGA MPEG-4 using actor/dataflow synthesis (3 MM effort)
Thank You.
CAL actor language: embedded.eecs.berkeley.edu/caltrop
Credits: Dave B. Parlour, Ian D. Miller, Johan Eker, Edward A. Lee, and many others.
BACKUP
programming language adoption
Name        TPCI     TPCI cum.  Year
C           17.66%   17.66%     1973
C++         11.06%   28.73%     1985
Perl         5.48%   34.20%     1987
Python       3.47%   37.67%     1990
VB           9.73%   47.40%     1991
Delphi       2.15%   49.54%     1994
Java        21.17%   70.72%     1995
PHP          9.86%   80.58%     1995
JavaScript   2.20%   82.78%     1995
C#           3.07%   85.85%     2002
source: TIOBE Programming Community Index, TPCI, October 2006, http://www.tiobe.com/tpci.htm
[Figure: cumulative TPCI by language creation date (for top 10 languages); x-axis 1970 to 2005, y-axis ticks at 50 and 100; points labelled C, C++, Perl, Python, VB, Delphi, Java, PHP, JavaScript, C#]
Smaller, Faster, Easier Too good to be true?
• This is what happens when design effort is constrained.
• The key is enabling architectural exploration with rapid turn-around time.
• New decoder architecture incorporates many improvements over original design in motion compensation, AC/DC reconstruction, parser, 2-d IDCT.
• Approximate manpower numbers:
  – VHDL decoder: 12 months
  – Dataflow decoder: 3 months
Architectural Exploration
MPEG4 Motion Compensator
[Figure: motion compensator with video stream, feedback, and video frame buffer (off-chip DRAM)]
PROBLEM! Memory latency for random access reads and writes prevents real-world operation at HD rates.
First Step: Try on-chip cache
• Break the address and data streams, insert a cache placeholder.
• Insert different policies, see what happens.
policy1: Pass-through, just to make sure model is OK.
policy2: Insert a cache actor in the read path and monitor statistics.
Simulation result with policy2
Frame 1 OK time: 28111ms
Frame 2 OK time: 23834ms
Requests: 49456, Hits: 45360
Miss rate: 8.28%
Frame 3 OK time: 27369ms
Requests: 98704, Hits: 90512
Miss rate: 8.30%
Monitor console
• Memory controller performance
  – 133MHz clock
  – 32 pixel cache line fill in ~18 cycles
• Worst case compensation is 81 reads for an 8x8 block.
• 8.3% miss rate implies average read is ~ 2.4 cycles
• Rate limit is 44 Mpixel/s
• HD (1920x1080, 4:2:0, 30fps) rate target is 93.3 Mpixel/s
• Options for improvement
  – more expensive controller
  – much better cache policy
  – application-aware prefetch
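The numbers on this slide can be reproduced with a short calculation. The cost model is an assumption rather than something the slides state: a cache hit is taken to cost 1 cycle, a miss costs the full 18-cycle line fill, and the worst case of 81 reads per 8x8 block (64 pixels) sets the reads needed per pixel. The HD target assumes a 1920x1080 frame with 4:2:0 sampling, i.e. 1.5 samples per pixel position.

```python
# Reproducing the cache estimate; the hit/miss cost model is an assumption.

clock_hz = 133e6
requests, hits = 98704, 90512          # monitor console, after frame 3
miss_rate = (requests - hits) / requests

hit_cycles, fill_cycles = 1, 18        # assumed 1-cycle hit, ~18-cycle line fill
avg_read_cycles = (1 - miss_rate) * hit_cycles + miss_rate * fill_cycles

reads_per_pixel = 81 / 64              # worst case: 81 reads for an 8x8 block
rate_limit = clock_hz / (avg_read_cycles * reads_per_pixel)

hd_target = 1920 * 1080 * 1.5 * 30     # 4:2:0 adds half a sample per pixel

print(f"miss rate:    {miss_rate:.2%}")
print(f"average read: {avg_read_cycles:.2f} cycles")
print(f"rate limit:   {rate_limit / 1e6:.1f} Mpixel/s")
print(f"HD target:    {hd_target / 1e6:.1f} Mpixel/s")
```

With these assumptions the rate limit comes out close to the quoted 44 Mpixel/s, less than half of the HD target, which is what motivates the options listed above.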
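Step 2: Application-aware prefetch
[Figure: motion compensator with the cache replaced by a search window, annotated as follows]
• replace cache with “search window”
• compensation addresses now relative to search window
• search window senses block type
• prefetch requests to frame buffer
• prefetch data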
Results of prefetch strategy
• Better performance
  – prefetch needs to operate at 3x pixel rate
  – exploits longer burst read with application-awareness (longer cache line did not help policy2 significantly)
  – 64 pixels in 26 cycles → average read is ~ 0.4 cycles
  – peak theoretical performance is 111 Mpixel/s
  – exceeds HD rate target with cheap DRAM
• Substantial change to overall model behavior, but
  – impact limited to two actors
  – no refactoring of control in other actors needed
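The prefetch figures follow from similar arithmetic. The reading below is an assumption that happens to reproduce the slide's number: the peak rate is taken as the 133MHz memory clock divided by the average read cost and by the 3x factor at which prefetch must run relative to the pixel rate.

```python
# Reproducing the prefetch estimate; the formula for the peak is an assumption.

clock_hz = 133e6
cycles_per_read = 26 / 64                  # 64 pixels fetched in 26 cycles
rounded_read = round(cycles_per_read, 1)   # the slide quotes ~0.4 cycles
prefetch_factor = 3                        # prefetch runs at 3x the pixel rate

peak_rate = clock_hz / (rounded_read * prefetch_factor)
hd_target = 1920 * 1080 * 1.5 * 30         # 1920x1080, 4:2:0, 30fps

print(f"average read: {cycles_per_read:.3f} cycles (~{rounded_read})")
print(f"peak rate:    {peak_rate / 1e6:.0f} Mpixel/s")
print(f"exceeds HD target: {peak_rate > hd_target}")
```

Using the unrounded 0.406 cycles per read gives roughly 109 Mpixel/s instead, which still clears the 93.3 Mpixel/s target.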
The FPGA programming problem
• Big, heterogeneous chips
• circuit-design programming (+ C, Simulink, ...)
1985: 128 4-LUTs
2006: [V5-LX] 207360 6-LUTs 10Mbit BRAM 192 ALUs