Parallel Applications
Parallel Hardware
Parallel Software IT industry
(Silicon Valley) Users
Parallel Layout Automatic Generation and Optimization
01.14.2011
Leo Meyerovich
Adam Jiang
Rastislav Bodik
Why Generate a Layout Engine
Many and Growing Layout Languages
HTML, CSS, SMIL, XUL, Ext, jQuery, YUI, OpenOffice, JavaFX, Swing, Flex,
Adam & Eve, Thermo, XAML/WPF, Word, WinForms, Qt, LaTeX, music,
iPhone, Android, WAP, MathML, 3 competing CSS grid proposals
A lot of code
Firefox layout engine: 346111 lines
Approach
[Diagram: HTML and CSS input; parser, selector matcher, cascade; tree decorated with style constraints; sequential layout engine; scene graph; renderer. Compile time: CSS grammar specification, generator, tree traversals, fast tree library.]
HBox: Example of Specifying Layout
HBox with two child nodes
wInput = 50px
10px 5px
wInput = ShrinkToFit
10px 5px
w := case wInput:
       n px:        n
       shrinkToFit: sum(children.w)
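The width rule above can be read as a small function. A minimal Python sketch, assuming wInput is encoded either as a pixel count or as the string "shrinkToFit"; the function name and the encoding are illustrative and not taken from the slides.

```python
def hbox_width(w_input, child_widths):
    """Width of an HBox following the case rule above.

    Assumption: w_input is a number of pixels or the string "shrinkToFit".
    """
    if w_input == "shrinkToFit":
        # shrinkToFit: sum(children.w)
        return sum(child_widths)
    # n px: n
    return w_input


print(hbox_width(50, [10, 5]))             # fixed-width HBox
print(hbox_width("shrinkToFit", [10, 5]))  # HBox sized to its children
```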
HBox Traversal Functions
All absolute coordinates computed

def pass0():
    child1.y = y
    child2.y = y
def pass1():
    cursor = child1.w
    w = if (wInput is shrink): child1.w + child2.w
        else: wInput
    h = max(child1.h, child2.h)
def pass2():
    child1.x = x
    child2.x = x + cursor
[Diagram: tree Root > HBox > (Leaf, Leaf), traversed by pass0, pass1, pass2]
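The three traversal functions can be run over a whole tree as one top-down, one bottom-up, and one top-down pass. A minimal Python sketch under assumptions: the Node class, the "hbox"/"leaf" kinds, and the "shrink" marker are hypothetical names, every HBox has exactly two children as on the slide, and leaves carry fixed intrinsic sizes.

```python
class Node:
    def __init__(self, kind, w_input=None, w=0, h=0, children=()):
        self.kind = kind          # "hbox" or "leaf" (hypothetical encoding)
        self.w_input = w_input    # pixel count or "shrink" (HBoxes only)
        self.w, self.h = w, h     # leaves carry their intrinsic size
        self.x = self.y = 0
        self.cursor = 0
        self.children = list(children)


def pass0(n):
    """Top-down: children inherit y."""
    for c in n.children:
        c.y = n.y
        pass0(c)


def pass1(n):
    """Bottom-up: cursor, w, and h depend on the children."""
    for c in n.children:
        pass1(c)
    if n.kind == "hbox":
        c1, c2 = n.children
        n.cursor = c1.w
        n.w = c1.w + c2.w if n.w_input == "shrink" else n.w_input
        n.h = max(c1.h, c2.h)


def pass2(n):
    """Top-down: x depends on the parent's x and cursor."""
    if n.kind == "hbox":
        c1, c2 = n.children
        c1.x = n.x
        c2.x = n.x + n.cursor
    for c in n.children:
        pass2(c)


inner = Node("hbox", "shrink", children=[Node("leaf", w=5, h=8), Node("leaf", w=7, h=30)])
outer = Node("hbox", "shrink", children=[Node("leaf", w=10, h=20), inner])
for traversal in (pass0, pass1, pass2):
    traversal(outer)
print(outer.w, outer.h, inner.x, [c.x for c in inner.children])
```

The split into passes follows the data dependencies: y and x flow down from the parent, while w and h flow up from the children.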
[Diagram: the same traversal functions applied to a nested tree, Root > HBox > (Leaf, HBox > (Leaf, Leaf))]
Scheduling Traversals
See demo at the poster session
Leveraging Generation: % Width
w := case wInput:
       n px:        n
       n %:         parentWidth * n%
       shrinkToFit: sum(children.w)
10px 90% of parentWidth
def pass0(): …
def pass1():
    cursor = child1.w
    w = calculateWidth(wInput, child1.w, child2.w)
    h = max(child1.h, child2.h)
def pass2():
    child1.x = x
    child2.x = x + cursor
def pass0(): …
def pass1():
    cwpx = sumPx(children)
    cwperc = sumPercs(children)
    h = max(child1.h, child2.h)
def pass2():
    w = calculateWidth(wInput, cwpx, cwperc, parentWidth)
    child.parentWidth = w
def pass3():
    cursor = child1.w
def pass4():
    child1.x = x
    child2.x = x + cursor
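Percentage widths add a downward dependency (a child needs its parent's width), which is why the schedule grows from three passes to five. A minimal Python sketch of the width-related passes for one HBox with two leaf children. The slides do not define calculateWidth, so the body below is an assumption: widths are encoded as hypothetical (kind, n) tuples, and the shrinkToFit case solves w = cw_px + w * cw_perc / 100, which is one reading of sum(children.w) when some children are percentages. Heights are omitted.

```python
def calculate_width(w_input, cw_px, cw_perc, parent_width):
    """Assumed reading of calculateWidth; not taken from the slides."""
    kind, n = w_input
    if kind == "px":
        return n
    if kind == "percent":
        return parent_width * n / 100
    # shrinkToFit (assumption): fixed point of w = cw_px + w * cw_perc / 100,
    # only meaningful when cw_perc < 100
    return cw_px * 100 / (100 - cw_perc)


def layout_hbox(w_input, child_inputs, parent_width, x=0):
    # pass1 (bottom-up): summarize children before any width is final
    cw_px = sum(n for kind, n in child_inputs if kind == "px")
    cw_perc = sum(n for kind, n in child_inputs if kind == "percent")
    # pass2 (top-down): resolve own width, then hand it down as parentWidth
    w = calculate_width(w_input, cw_px, cw_perc, parent_width)
    child_ws = [calculate_width(ci, 0, 0, w) for ci in child_inputs]
    # pass3 (bottom-up): cursor = child1.w
    cursor = child_ws[0]
    # pass4 (top-down): absolute x coordinates
    child_xs = [x, x + cursor]
    return w, child_ws, child_xs


# The slide's example children: 10px and 90% of parentWidth
print(layout_hbox(("px", 200), [("px", 10), ("percent", 90)], parent_width=0))
print(layout_hbox(("shrinkToFit", None), [("px", 10), ("percent", 90)], parent_width=0))
```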
Other Advantages
• Correctness Wins
  • Finds spec inconsistencies
  • Can visually debug spec
• Performance Wins
  • Optimal scheduling
  • Extract parallelism
[Diagram: tree Root > HBox > (HBox, HBox), each inner HBox with two Leaf children; h is annotated on every node below Root]
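Because h only depends on a node's own subtree, sibling subtrees can be evaluated independently. A minimal Python sketch of that independence, assuming a hypothetical encoding where a leaf is a number (its height) and an HBox is a list of children. Python threads will not give a real speedup here; the point is only that the two subtree tasks share no data.

```python
from concurrent.futures import ThreadPoolExecutor


def height(node):
    """Sequential bottom-up h: a leaf is its own height, an HBox is the max of its children."""
    if isinstance(node, (int, float)):
        return node
    return max(height(c) for c in node)


def parallel_height(hbox, workers=2):
    """Evaluate each child subtree as a separate task, then combine at the parent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return max(pool.map(height, hbox))


# HBox > (HBox > (Leaf, Leaf), HBox > (Leaf, Leaf))
tree = [[3, 7], [5, 2]]
print(parallel_height(tree))
```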
[Diagram: HTML and CSS input; parser, selector matcher, cascade; tree decorated with style constraints; layout engine; scene graph; renderer. Compile time: CSS grammar specification, ALE synthesizer, tree traversals, fast tree library.]
Fast Tree Library
Overview of Tree Eval Strategies
Sequential
Multicore
core 1 core 2
SIMD (“SIMTask”)
2. Optimizing memory
[Plot: speedup (1 to 1.4) vs. nodes per block (50 to 350)]
2. Optimizing memory
[Plot: speedup (1 to 1.4) vs. nodes per block (50 to 350); series: dfs, bfs]
Order within block: bfs, dfs
Traversal order
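The two orders differ in which nodes end up next to each other in memory. A minimal Python sketch that computes both orders for a small tree, assuming a hypothetical dict-of-children encoding. It shows ordering only; splitting the resulting sequence into fixed-size blocks is not modeled.

```python
from collections import deque


def dfs_order(tree, root):
    """Preorder: each node is followed by its entire subtree."""
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(reversed(tree.get(n, [])))
    return order


def bfs_order(tree, root):
    """Level order: siblings and cousins sit next to each other."""
    order, queue = [], deque([root])
    while queue:
        n = queue.popleft()
        order.append(n)
        queue.extend(tree.get(n, []))
    return order


tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
print(dfs_order(tree, "root"))
print(bfs_order(tree, "root"))
```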
2. Optimizing memory
[Plot: speedup (1 to 1.4) vs. nodes per block (50 to 350); series: dfs with rel pointers, dfs, bfs]
Order within block: bfs, dfs
Pointer representation: leftChild = 0x00ffaa00, leftChild = 1200
Pointer compression How much compression hardware
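A relative pointer stores a small offset instead of a full address. A minimal Python sketch, assuming nodes live in one flat array and that "relative" means an offset from the node's own index; the slides do not say what the offset is relative to, so this is one possible reading, and the function names are illustrative.

```python
def to_relative(left_child_abs):
    """left_child_abs[i] is the absolute index of node i's left child, or None."""
    return [None if c is None else c - i for i, c in enumerate(left_child_abs)]


def left_child(rel, i):
    """Recover the absolute index of node i's left child from the relative form."""
    return None if rel[i] is None else i + rel[i]


# Nodes stored in dfs order: root, a, a1, a2, b, b1, b2
left_abs = [1, 2, None, None, 5, None, None]
rel = to_relative(left_abs)
print(rel)
print([left_child(rel, i) for i in range(len(rel))])
```

In dfs (preorder) storage the left child always sits directly after its parent, so every offset in this example is 1 and would fit in far fewer bits than an absolute pointer.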
2. Optimizing memory
[Plot: speedup (1 to 1.4) vs. nodes per block (50 to 350); series: bfs with rel pointers, dfs with rel pointers, dfs, bfs]
Order within block: bfs, dfs
Pointer representation: leftChild = 0x00ffaa00, leftChild = 1200
And more: packing, coallocation / lazy defaults, structure splitting / phasing …
Challenge Problem for Task Parallelism?
[Plot: TBB tree traversal on dual-core Atom 330; speedup (0.5 to 1) vs. threads (1 to 4); series are different TBB algorithms: base.h, tbbcont.h, tbbgraph.h, tbb.h, tbbopt.h]
• Seconds, milliseconds instead of hours, minutes
• Sequence of traversals (locality implications)
Dynamic task allocation?
Runtime queues?
Locality across traversals?
Semi-Static Work Stealing
1. Before parallel traversal: approximate work stealing schedule
2. Traversal: reuse schedule
Tuned locking scheme
Locality across passes!
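The two steps can be sketched as "build a schedule once, then replay it for every pass". A minimal Python sketch under assumptions: the slides do not say how the work stealing schedule is approximated, so a greedy largest-subtree-first assignment stands in for that step, workers are simulated sequentially, and all names are hypothetical.

```python
def subtree_size(tree, n):
    return 1 + sum(subtree_size(tree, c) for c in tree.get(n, []))


def make_schedule(tree, roots, workers):
    """Step 1 (stand-in): assign whole subtrees to the least-loaded worker."""
    loads = [0] * workers
    schedule = [[] for _ in range(workers)]
    for r in sorted(roots, key=lambda r: -subtree_size(tree, r)):
        w = loads.index(min(loads))
        schedule[w].append(r)
        loads[w] += subtree_size(tree, r)
    return schedule


def run_pass(tree, schedule):
    """Step 2: every pass replays the same schedule; returns (worker, node) pairs."""
    touched = []
    for worker, roots in enumerate(schedule):
        for r in roots:
            stack = [r]
            while stack:
                n = stack.pop()
                touched.append((worker, n))
                stack.extend(tree.get(n, []))
    return touched


tree = {"a": ["a1", "a2", "a3"], "b": ["b1"], "c": []}
schedule = make_schedule(tree, ["a", "b", "c"], workers=2)
print(schedule)
print(run_pass(tree, schedule) == run_pass(tree, schedule))
```

Because each worker visits the same nodes in every pass, data a worker pulled into its cache during one pass is likely to be the data it needs in the next.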
[Plots: Opteron speedup (2 sockets x 4 cores); speedup (0 to 8) vs. pthreads (1 to 8); series: 0i, 0s, 1i, 1s, SUM. Panels: 1,000 nodes; 10,000 nodes; 1,000 nodes, repeat each 10x]
Strong scaling: small workload (1 ms each)
SIMD Task Evaluation (MSR)
pointwise parallel instructions over similar tasks
Irregularity in the task tree
structure mining
Microbenchmarks: 2-7x speedup
see poster for challenges and opportunities
Demo
Parallel layout
[Diagram: HTML and CSS input; parser, multicore selector matcher, multicore cascade; tree style template; tree decorated with style constraints; layout engine; scene graph; renderer (OpenGL, Qt). Compile time: grammar specification, ALE synthesizer, tree traversals, Fast Tree Library.]
Status and Future Work
MUD language
widget definition
incrementalizer
multicore parser