Parallel Applications

Parallel Hardware

Parallel Software IT industry

(Silicon Valley) Users

Parallel Layout Automatic Generation and Optimization

01.14.2011

Leo Meyerovich

Adam Jiang

Rastislav Bodik

Why Generate a Layout Engine

Many and Growing Layout Languages

HTML, CSS, SMIL, XUL, Ext, jQuery, YUI, OpenOffice, JavaFX, Swing, Flex, Adam & Eve, Thermo, XAML/WPF, Word, WinForms, Qt, LaTeX, music, iPhone, Android, WAP, MathML, 3 competing CSS grid proposals

A lot of code

Firefox layout engine: 346111 lines

Approach

[Diagram labels: HTML, CSS; parser; selector matcher; cascade; tree decorated with style constraints; layout engine (tree traversals, fast tree library); scene graph; renderer. Compile time: CSS grammar specification; generator; sequential layout engine.]

HBox: Example of Specifying Layout

HBox with two child nodes

wInput = 50px

10px 5px

wInput = ShrinkToFit

10px 5px

w := case wInput:
       n px: n
       shrinkToFit: sum(children.w)

All Absolute Coordinates Computed

HBox Traversal Functions

def pass0():
    child1.y = y
    child2.y = y

def pass1():
    cursor = child1.w
    w = if (wInput is shrink): child1.w + child2.w else: wInput
    h = max(child1.h, child2.h)

def pass2():
    child1.x = x
    child2.x = x + cursor

Root

HBox

Leaf Leaf

pass0 pass1 pass2

[Diagram: the same traversal functions applied to a nested tree. Labels: Root; HBox; Leaf, HBox; Leaf, Leaf]
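The three HBox passes can be tried out on a nested tree with a small sketch. This is an illustration under assumptions, not the generated engine: the `Node` class, the `SHRINK` sentinel, the `h_input` field, and the leaf heights are made up for the example, and the two-child `cursor` from the slide is generalized to any number of children.

```python
from dataclasses import dataclass, field
from typing import List, Union

SHRINK = "shrinkToFit"  # hypothetical sentinel standing in for wInput = ShrinkToFit


@dataclass
class Node:
    w_input: Union[int, str] = SHRINK   # pixel count, or SHRINK
    h_input: int = 0                    # leaf height (assumed input, not in the slides)
    children: List["Node"] = field(default_factory=list)
    x: int = 0
    y: int = 0
    w: int = 0
    h: int = 0


def pass0(node: Node) -> None:
    """Top-down: every child of an HBox shares its parent's y."""
    for child in node.children:
        child.y = node.y
        pass0(child)


def pass1(node: Node) -> None:
    """Bottom-up: compute w and h after the children are done."""
    for child in node.children:
        pass1(child)
    children_w = sum(c.w for c in node.children)
    node.w = children_w if node.w_input == SHRINK else node.w_input
    node.h = max((c.h for c in node.children), default=node.h_input)


def pass2(node: Node) -> None:
    """Top-down: place children left to right, starting at the parent's x."""
    cursor = node.x
    for child in node.children:
        child.x = cursor
        cursor += child.w
        pass2(child)


def layout(root: Node) -> Node:
    pass0(root)
    pass1(root)
    pass2(root)
    return root


# Root HBox holding a leaf and a nested HBox with two leaves (10px and 5px wide).
inner = Node(children=[Node(w_input=10, h_input=8), Node(w_input=5, h_input=12)])
root = layout(Node(children=[Node(w_input=10, h_input=20), inner]))
print(root.w, root.h, inner.x, [c.x for c in inner.children])
```

pass1 must run bottom-up because a shrink-to-fit width needs the children's widths, while pass0 and pass2 run top-down because a child's position depends on its parent's.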

Scheduling Traversals

See Demo @ Poster Session

Leveraging Generation: % Width

w := case wInput:
       n px: n
       n %: parentWidth * n%
       shrinkToFit: sum(children.w)

10px 90% of parentWidth

Three passes:

def pass0(): …

def pass1():
    cursor = child1.w
    w = calculateWidth(wInput, child1.w, child2.w)
    h = max(child1.h, child2.h)

def pass2():
    child1.x = x
    child2.x = x + cursor

Five passes:

def pass0(): …

def pass1():
    cwpx = sumPx(children)
    cwperc = sumPercs(children)
    h = max(child1.h, child2.h)

def pass2():
    w = calculateWidth(wInput, cwpx, cwperc, parentWidth)
    child.parentWidth = w

def pass3():
    cursor = child1.w

def pass4():
    child1.x = x
    child2.x = x + cursor
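A runnable sketch of the longer schedule may help show why percentage widths force an extra top-down pass. Everything beyond the names `calculateWidth`, `sumPx`, `sumPercs`, and `parentWidth` is an assumption made for illustration: the `Box` class, the tuple encoding of `wInput`, and in particular the shrink-to-fit rule, which here solves w = cwpx + w * cwperc / 100. The sketch also omits pass0 and heights, and folds pass3 and pass4 into a single positioning traversal.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box:
    # Assumed encoding of wInput: ("px", n), ("pct", n) or ("shrink", 0).
    w_input: Tuple[str, float] = ("shrink", 0)
    children: List["Box"] = field(default_factory=list)
    cw_px: float = 0.0         # sum of child widths that do not depend on this box
    cw_perc: float = 0.0       # sum of child percentages
    parent_width: float = 0.0
    w: float = 0.0
    x: float = 0.0


def calculate_width(w_input, cw_px, cw_perc, parent_width):
    kind, n = w_input
    if kind == "px":
        return n
    if kind == "pct":
        return parent_width * n / 100.0
    # Shrink-to-fit (assumed rule): solve w = cw_px + w * cw_perc / 100.
    if cw_perc >= 100:
        raise ValueError("percentage children leave no room for fixed widths")
    return cw_px / (1.0 - cw_perc / 100.0)


def sums_pass(box: Box) -> None:
    """Bottom-up (pass1): gather pixel and percentage sums of the children."""
    box.cw_px, box.cw_perc = 0.0, 0.0
    for child in box.children:
        sums_pass(child)
        kind, n = child.w_input
        if kind == "pct":
            box.cw_perc += n
        else:  # px and shrink-to-fit widths are known without the parent
            box.cw_px += calculate_width(child.w_input, child.cw_px, child.cw_perc, 0.0)


def width_pass(box: Box) -> None:
    """Top-down (pass2): resolve w, then hand it to the children as parentWidth."""
    box.w = calculate_width(box.w_input, box.cw_px, box.cw_perc, box.parent_width)
    for child in box.children:
        child.parent_width = box.w
        width_pass(child)


def position_pass(box: Box) -> None:
    """Top-down (pass3 and pass4 combined): place children left to right."""
    cursor = box.x
    for child in box.children:
        child.x = cursor
        cursor += child.w
        position_pass(child)


def layout(root: Box) -> Box:
    sums_pass(root)
    width_pass(root)
    position_pass(root)
    return root


# A 200px box holding a 10px child and a child that takes 90% of parentWidth.
root = layout(Box(("px", 200), [Box(("px", 10)), Box(("pct", 90))]))
print([c.w for c in root.children], [c.x for c in root.children])
```

The key dependency is that a percentage child cannot know its width until the parent's width is final, so the widths can no longer be settled in the single bottom-up pass used for pixel and shrink-to-fit sizes.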

Other Advantages

• Correctness Wins
  • Finds spec inconsistencies
  • Can visually debug spec

• Performance Wins
  • Optimal scheduling
  • Extract parallelism

[Diagram: tree of Root, HBox, two child HBoxes, and four Leaves, with h marked on the leaves and HBoxes]

[Diagram labels: HTML, CSS; parser; selector matcher; cascade; tree decorated with style constraints; layout engine (tree traversals, fast tree library); scene graph; renderer. Compile time: CSS grammar specification; ALE synthesizer.]

Fast Tree Library

Overview of Tree Eval Strategies

Sequential

Multicore

core 1 core 2

SIMD (“SIMTask”)

2. Optimizing memory

[Plot: speedup (1.0 to 1.4) vs. nodes per block (50 to 350)]


Traversal order

Order within block: bfs, dfs

[Plot: speedup (1.0 to 1.4) vs. nodes per block (50 to 350); series: dfs, bfs]
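The two in-memory orderings compared here can be sketched as follows. The `Node` class and the `into_blocks` helper are hypothetical, and the slides do not say how blocks are formed, so this sketch simply chunks the linear order into fixed-size blocks.

```python
from collections import deque


class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def dfs_order(root):
    """Depth-first (preorder) layout: each subtree is stored contiguously."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(reversed(node.children))  # keep left-to-right child order
    return out


def bfs_order(root):
    """Breadth-first layout: each tree level is stored contiguously."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        out.append(node)
        queue.extend(node.children)
    return out


def into_blocks(nodes, nodes_per_block):
    """Assumed blocking: cut the linear order into fixed-size chunks."""
    return [nodes[i:i + nodes_per_block]
            for i in range(0, len(nodes), nodes_per_block)]


tree = Node("a", [Node("b", [Node("d"), Node("e")]), Node("c", [Node("f")])])
print([n.name for n in dfs_order(tree)])  # ['a', 'b', 'd', 'e', 'c', 'f']
print([n.name for n in bfs_order(tree)])  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Which order wins depends on how the traversal touches memory: a depth-first layout keeps a parent near its whole subtree, while a breadth-first layout keeps siblings and cousins adjacent.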


Pointer compression

Pointer representation: leftChild = 0x00ffaa00, leftChild = 1200

How much compression hardware

[Plot: speedup (1.0 to 1.4) vs. nodes per block (50 to 350); series: dfs with rel pointers, dfs, bfs]
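The slide contrasts a full-width absolute pointer (leftChild = 0x00ffaa00) with a small relative one (leftChild = 1200). A minimal sketch of the idea, assuming nodes live in one flat array and that a 16-bit signed offset with 0 meaning "no child" is an acceptable encoding (both are choices made for this example, not details from the talk):

```python
from array import array


def encode_relative(left_child_abs):
    """Turn absolute child indices (-1 = no child) into 16-bit relative offsets.

    array('h') raises OverflowError if an offset does not fit in 16 bits,
    which is the point at which a wider type or a different layout is needed.
    """
    rel = array("h")
    for i, child in enumerate(left_child_abs):
        rel.append(0 if child < 0 else child - i)
    return rel


def left_child(rel, i):
    """Decode: add the stored offset back onto the node's own index."""
    offset = rel[i]
    return None if offset == 0 else i + offset


# Breadth-first layout of a(b(d, e), c(f)): indices a=0, b=1, c=2, d=3, e=4, f=5.
absolute = [1, 3, 5, -1, -1, -1]
relative = encode_relative(absolute)
print(list(relative), relative.itemsize, array("q").itemsize)
```

A node can never be its own child, so offset 0 is free to serve as the null value. The saving comes from the element width: the offsets above fit in 2 bytes each, against 8 bytes for a 64-bit index or address.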


And More

packing, coallocation / lazy defaults, structure splitting / phasing …

[Plot: speedup (1.0 to 1.4) vs. nodes per block (50 to 350); series: bfs with rel pointers, dfs with rel pointers, dfs, bfs]

Challenge Problem for Task Parallelism?

[Plot: TBB tree traversal on dual-core Atom 330; speedup (0.5 to 1.0) vs. threads (1 to 4); series (different TBB algorithms): base.h, tbbcont.h, tbbgraph.h, tbb.h, tbbopt.h]

• Seconds, milliseconds instead of hours, minutes

• Sequence of traversals (locality implications)

Dynamic task allocation?

Runtime queues?

Locality across traversals?

Semi-Static Work Stealing

1. Before parallel traversal: approximate work stealing schedule

2. Traversal: reuse schedule

tuned locking scheme

Locality across passes!
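The two steps can be sketched as follows. This is a simplified stand-in, not the scheme from the talk: the schedule is built by greedily cutting the tree into subtrees and balancing them across workers rather than by simulating work stealing, the tuned locking scheme is not modeled, and all names (`make_schedule`, `run_pass`, `tasks_per_worker`) are hypothetical. Under CPython's global interpreter lock the threads mainly illustrate the structure rather than deliver speedup.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, children=()):
        self.children = list(children)
        self.weight = 1   # subtree node count, used only for scheduling
        self.size = 0     # result of the example bottom-up pass


def subtree_weight(node):
    node.weight = 1 + sum(subtree_weight(c) for c in node.children)
    return node.weight


def make_schedule(root, workers, tasks_per_worker=2):
    """Step 1: cut the tree into subtrees once and assign them to workers."""
    subtree_weight(root)
    tasks, top = [root], []          # top = nodes above the cut, parents first
    while len(tasks) < workers * tasks_per_worker:
        splittable = [t for t in tasks if t.children]
        if not splittable:
            break
        biggest = max(splittable, key=lambda t: t.weight)
        tasks.remove(biggest)
        top.append(biggest)
        tasks.extend(biggest.children)
    plan, load = [[] for _ in range(workers)], [0] * workers
    for task in sorted(tasks, key=lambda t: t.weight, reverse=True):
        w = load.index(min(load))    # heaviest subtree to the least-loaded worker
        plan[w].append(task)
        load[w] += task.weight
    return plan, top


def bottom_up(node, visit):
    for child in node.children:
        bottom_up(child, visit)
    visit(node)


def run_pass(plan, top, visit, pool):
    """Step 2: one bottom-up traversal that reuses the precomputed schedule."""
    def work(subtrees):
        for subtree in subtrees:
            bottom_up(subtree, visit)
    list(pool.map(work, plan))       # wait for every worker to finish
    for node in reversed(top):       # then finish above the cut, children first
        visit(node)


def count(node):
    node.size = 1 + sum(c.size for c in node.children)


root = Node([Node([Node() for _ in range(4)]) for _ in range(3)])
plan, top = make_schedule(root, workers=2)
with ThreadPoolExecutor(max_workers=2) as pool:
    for _ in range(3):               # several passes, same worker per subtree
        run_pass(plan, top, count, pool)
print(root.size)
```

Because the plan is computed once and reused, each worker visits the same subtrees on every pass, which is where the locality across passes comes from.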

[Plots: Opteron speedup (2 sockets x 4 cores); speedup (0 to 8) vs. pthreads (1 to 8); series: 0i, 0s, 1i, 1s, SUM; panels: 1,000 nodes; 10,000 nodes; 1,000 nodes, repeat each 10x]

strong scaling: small workload (1ms each)

SIMD Task Evaluation (MSR)

pointwise parallel instructions over similar tasks

Irregularity in the task tree

structure mining

Microbenchmarks: 2-7x speedup

see poster for challenges and opportunities

Demo

Parallel layout

Status and Future Work

[Diagram labels: HTML, CSS; parser; multicore parser; multicore selector matcher; multicore cascade; tree style template; tree decorated with style constraints; layout engine (tree traversals, Fast Tree Library); incrementalizer; scene graph; renderer; OpenGL Qt Renderer. Compile time: grammar specification; MUD language; widget definition; ALE synthesizer.]
