parallel hardware applications parallel it industry...

Parallel Applications

Parallel Hardware

Parallel Software IT industry

(Silicon Valley) Users

Parallel Layout Automatic Generation and Optimization

01.14.2011

Leo Meyerovich

Adam Jiang

Rastislav Bodik

Why Generate a Layout Engine

Many and Growing Layout Languages

HTML, CSS, SMIL, XUL, Ext, jQuery, YUI OpenOffice, JavaFX, Swing, Flex,

Adam & Eve, Thermo, XAML/WPF, Word, WinForms, Qt, LaTeX, music,

iPhone, Android, WAP, MathML, 3 competing CSS grid proposals

A lot of code

Firefox layout engine: 346111 lines

layout engine

renderer

scene graph

parser

selector matcher

cascade

HTML CSS

tree decorated with style constraints

tree traversals

fast tree library

CSS grammar specification

generator

Compile Time

Approach

sequential layout engine

HBox: Example of Specifying Layout

HBox with two child nodes

wInput = 50px

10px 5px

wInput = ShrinkToFit

10px 5px

w := case wInput: n px: n shrinkToFit: sum (children.w)

All Absolute Coordinates

Computed

HBox Traversal Functions

def pass0(): child1.y = y child2.y = y def pass1(): cursor = child1.w w = if (wInput is shrink): child1.w + child2.w else: wInput h = max (child1.h, child2.h) def pass2(): child1.x = x; child2.x = x + cursor;

Leaf Leaf

pass0 pass1 pass2

All Absolute Coordinates

Computed

HBox Traversal Functions

def pass0(): child1.y = y child2.y = y def pass1(): cursor = child1.w w = if (wInput is shrink): child1.w + child2.w else: wInput h = max (child1.h, child2.h) def pass2(): child1.x = x; child2.x = x + cursor; Leaf Leaf

Leaf HBox

Scheduling Traversals

See Demo @

Poster Session

Leveraging Generation: % Width

w := case wInput: n px: n n %: parentWidth * n% shrinkToFit: sum(children.w)

10px 90% of parentWidth

def pass0(): … def pass1(): cursor = child1.w w = calculateWidth(wInput, child1.w, child2.w) h = max (child1.h, child2.h)

def pass2(): child1.x = x child2.x = x + cursor

def pass0(): …

def pass1():

cwpx =sumPx(children)

cwperc = sumPercs(children)

h = max (child1.h, child2.h)

def pass2():

w = calculateWidth(wInput,

cwpx, cwperc, parentWidth)

child.parentWidth = w

def pass3():

cursor = child1.w

def pass4():

child1.x = x

child2.x = x + cursor

Other Advantages

• Correctness Wins • Finds spec inconsistencies • Can visually debug spec

• Performance Wins • Optimal scheduling • Extract parallelism

Leaf Leaf

HBox HBox

Leaf Leaf

h h h h

layout engine

renderer

scene graph

parser

selector matcher

cascade

HTML CSS

tree decorated with style constraints

tree traversals

fast tree library

CSS grammar specification

ALE synthesizer

Compile Time

Fast Tree Library

Overview of Tree Eval Strategies

Sequential

Multicore

core 1 core 2

SIMD (“SIMTask”)

2. Optimizing memory

50 150 250 350

nodes per block

50 150 250 350

nodes per block

Order within block: bfs, dfs

Traversal order

50 150 250 350

nodes per block

dfs, rel pointers

Pointer representation: leftChild = 0x00ffaa00,

leftchild = 1200

Pointer compression How much compression hardware

50 150 250 350

nodes per block

bfs, rel pointers

dfs, rel pointers

Pointer representation: leftChild = 0x00ffaa00,

leftchild = 1200

And More packing, coallocation / lazy defaults, structure splitting / phasing …

Challenge Problem for Task Parallelism?

1 2 3 4

threads

TBB tree traversal on dual-core Atom 330

base.h tbbcont.h tbbgraph.h tbb.h tbbopt.h

Different TBB algorithms

• Seconds, milliseconds instead of hours, minutes

• Sequence of traversals (locality implications) Dynamic task allocation?

Runtime queues?

Locality across traversals?

Semi-Static Work Stealing

1. Before parallel traversal: approximate work stealing

schedule 2. Traversal: reuse schedule

tuned locking scheme

Locality across

passes!

1 2 3 4 5 6 7 8

pthreads

Opteron Speedup (2 sockets x 4 cores); 1,000 nodes

1 2 3 4 5 6 7 8

pthreads

1 2 3 4 5 6 7 8

pthreads

strong scaling: small workload (1ms

10,000 nodes 1,000 nodes; repeat each 10x

SIMD Task Evaluation (MSR)

pointwise parallel

instructions

over similar tasks

Irregularity in the task tree

structure mining

Microbenchmarks: 2-7x speedup

see poster for challenges and opportunities

Parallel layout

layout engine

scene graph

renderer

parser

multicore selector matcher

multicore cascade

HTML CSS

tree style

template

tree decorated with style constraints OpenGL Qt Renderer

tree traversals

Fast Tree Library

grammar specification

ALE synthesizer

Compile Time

Status and Future Work

MUD language

widget definition

incrementalizer

multicore parser

parallel hardware applications parallel it industry...

Documents

fast and parallel webpage layout

fast and parallel webpage layout - lmeyerov · fast and...

introducing parallel programming with the .net framework 4...

steps towards heterogeneity and the uc berkeley...

ndseq: specifying and checking parallelism correctness...

code generators for stencil auto-tuning - par lab...

end of parlab celebration - university of california,...

presentation title goes here - snia · and parallel nfs...

poski: an extensible autotuning framework to perform...

superconductor: gpu-accelerated big data visualization for...

optimization and performance modeling of stencil...

parlab parallel boot camp - university of california,...

has: heterogeneity-aware selective data layout...

parallel (lpt) port switching interface - sh · +12v 0v +5v...

coordinated control of parallel dr-hvdc and mmc-hvdc ... ·...

separating functional and parallel correctness...

cloud computing using mapreduce, hadoop, spark -...

refactoring the os around explicit resource containers with...

intel® parallel studio xe 2016 update 1 for os x ... ·...

welcome to comp4300/8300 – parallel systems ·...