parallel webpage layout - university of california,...
Parallel Webpage Layout
Leo Meyerovich, Chan Siu Man, Chan Siu On, Heidi Pan, Krste Asanovic, Rastislav Bodik
and many others from the UPCRC Berkeley project
UC Berkeley
Par Lab Research Overview
[Figure: Par Lab research overview diagram. Labels: Personal Health; Image Retrieval; Hearing, Music; Speech; Parallel Browser; Motifs; Sketching; Legacy Code; Schedulers; Communication & Synch. Primitives; Legacy OS; Multicore/GPGPU; OS Libraries & Services; RAMP Manycore; Hypervisor; Correctness; Composition & Coordination Language (C&CL); Parallel Libraries; Parallel Frameworks; Static Verification; Dynamic Checking; Debugging with Replay; Directed Testing; Autotuners; C&CL Compiler/Interpreter; Efficiency Languages; Type Systems; Diagnosing Power/Performance; Efficiency Language Compilers.]
Parallel Web Browser
• Why the browser?
  – an important application platform
  – browser wars again: competing on performance (latency)
  – how important? handheld page load is tens of CPU seconds
• Why a parallel browser?
  – soon in your phone? 4 cores x 2 threads x 8-wide SIMD = 64
  – parallelism is more energy efficient
• Technical challenge
  – parallelize the browser to run with 100-way parallelism
This Talk: Parallelize Single Page Layout
• Page layout (HTML + CSS) is the LaTeX of the Web
  – LaTeX takes seconds to format a document
  – but page load should be 20-100 ms
  – page load is a bottleneck: 51% of CPU time on IE8
• Page layout is a challenging "desktop" application
  – not parallelized before
  – specifications: often ambiguous and sequential
  – low-latency: problems are short-running
  – less understood motif: tree computation
• Knuth: "Multiprocessors are no help to TeX"
Our Contributions
1. Analyzed browser performance
  – layout is a bottleneck; we identified its critical motifs
2. Distilled essential CSS and wrote a declarative spec for it
  – crucial step for exposing parallelism hidden by today's spec
3. Developed first parallel page layout algorithms
  (1) matching: task parallel, 20x speedup, strongly scales to 16
  (2) solving: task parallel, 4x speedup, strongly scales to 3 cores
4. Future steps: components and algorithms
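Overall Page Layout Problem
Input: document tree + CSS rules
Output: sizes and positions of tree nodes
Steps: determine styling rules; solve constraints
CSS styling rules:
  p { width: 100% }
  img { width: 100px; float: left }
  p img { width: 10pt }
HTML:
  <body>hello<img src="http:..."><p><b>world</b>ok ok ok ok ok
[Figure: the CSS styling rules and the HTML document tree (<body>; <p>, <p>; <img>, "hello", <b>, "ok ok ok ok"; "world"; "ok") are combined (+ →) into a tree whose nodes carry their matched properties (p: width=100%; img: width=100px, float=left; p img: width=10px; resolved: float=left, width=10px; align=right) and, after solving, positions (e.g. x=12, y=17; x=?, y=?).]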
What the browser does
[Figure: breakdown of what the browser does. Labels: "Our page layout subproblem"; 25%; 25%.]
The layout spec is confusing
• Example of spec:
  – "In general, the left edge of a line box touches the left edge of its containing block… However, floating boxes may come between [them]."
• Hard to implement correctly, even sequentially.
[Figure: renderings in Safari and Firefox.]
• The simplest way to implement the spec seems to be to (mostly) flow the elements sequentially, in order.
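Flow: sequential layout in today's browsers
• Flow is guided by a cursor
• Cursor points to where the next element goes
[Figure: the rendered page ("hello"; "world ok ok"; "ok ok ok") next to the document tree (<body>; <p>, <p>; <img>, "hello", <b>, "ok ok ok ok"; "world"; "ok"), both annotated with Δ markers.]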
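To make the cursor idea concrete, here is a minimal sketch of cursor-driven flow, assuming fixed-size word boxes in a single fixed-width container. The names `Box` and `flow` and all the sizes are hypothetical and not from the talk; real browser flow also handles floats, margins, and nesting.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    w: int
    h: int
    x: int = 0
    y: int = 0


def flow(boxes, container_w):
    """Place boxes left to right, top to bottom, driven by a single cursor."""
    cur_x, cur_y = 0, 0   # the cursor: where the next box goes
    line_h = 0            # height of the tallest box on the current line
    for b in boxes:
        if cur_x > 0 and cur_x + b.w > container_w:
            # The box does not fit: move the cursor to the start of the next line.
            cur_x, cur_y = 0, cur_y + line_h
            line_h = 0
        b.x, b.y = cur_x, cur_y   # position depends on every box placed before
        cur_x += b.w
        line_h = max(line_h, b.h)
    return cur_y + line_h         # total height used


words = [Box("world", 30, 10)] + [Box("ok", 20, 10) for _ in range(4)]
height = flow(words, container_w=70)
positions = [(b.x, b.y) for b in words]
```

Each box's position is read off the cursor left behind by the previous box, which is the sequential dependence that the following slides set out to break.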
Flow's dependences
[Figure: the document tree (<body>; <p>, <p>; <img>, "hello", <b>, "ok ok ok ok"; "world"; "ok") annotated per node with given constraints and computed values (w, h, fs, x, y), e.g. w=100, fs=12; w=50, float=left; fs=50%; w=30, fs=12, x=50, y=10, h=10; h=40. Edges are labeled with the values passed between nodes: fs, Δ, w.]
Note: constraints are not specified if they are equalities (e.g., inherited) or intrinsic (e.g., default image size or aspect ratio).
Dependencies prevent parallelism
[Figure: the same annotated document tree (root: w=200, fs=12), with the values w=40, fs=6, x=0, y=0, h=10 and w=100, fs=12, x=0, y=10 called out; edges are labeled fs, Δ, w.]
Enable parallelism by doing part of the work
[Figure: the same annotated document tree (root: w=200, fs=12), with the value w=30, fs=12, x=50, y=10, h=10 called out; edges are labeled fs, Δ, w.]
Parallel Layout Solving: Five Phases
Extensive analysis led us to five phases. These enable parallelism.
1. font size, tentative widths
2. preferred widths: max, min
3. final widths: break cycles by over-specifying CSS
4. heights, relative x/y positions
5. absolute x/y positions
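The five phases can be sketched as alternating top-down and bottom-up tree passes. The model below is a deliberately small assumption, not the talk's algorithm: blocks stack vertically, text leaves hold word widths in units of font size, x stays 0, and floats and cycle breaking are left out. All names (`Node`, `phase1` to `phase5`, `layout`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    words: list = field(default_factory=list)  # word widths, in units of font size
    fs_scale: float = 1.0                      # font size relative to parent (1.0 = inherit)
    w_scale: float = 1.0                       # width relative to parent (1.0 = 100%)
    children: list = field(default_factory=list)
    fs: float = 0.0
    w: float = 0.0
    wp: float = 0.0   # preferred (max) width
    wm: float = 0.0   # minimum width
    h: float = 0.0
    rel_y: float = 0.0
    x: float = 0.0
    y: float = 0.0


def phase1(n, parent_fs, parent_w):
    """Top-down: font size and tentative width."""
    n.fs = parent_fs * n.fs_scale
    n.w = parent_w * n.w_scale
    for c in n.children:
        phase1(c, n.fs, n.w)


def phase2(n):
    """Bottom-up: preferred max and min widths."""
    for c in n.children:
        phase2(c)
    if n.words:
        sizes = [k * n.fs for k in n.words]
        n.wp, n.wm = sum(sizes), max(sizes)   # all on one line vs. widest word
    else:
        n.wp = max((c.wp for c in n.children), default=0.0)
        n.wm = max((c.wm for c in n.children), default=0.0)


def phase3(n, avail):
    """Top-down: final width, never narrower than the minimum width."""
    n.w = max(n.wm, min(n.w, avail))
    for c in n.children:
        phase3(c, n.w)


def phase4(n):
    """Bottom-up: heights and relative y positions."""
    if n.words:
        lines, used = 1, 0.0
        for k in n.words:
            ww = k * n.fs
            if used > 0 and used + ww > n.w:
                lines, used = lines + 1, 0.0
            used += ww
        n.h = lines * n.fs
    else:
        y = 0.0
        for c in n.children:
            phase4(c)
            c.rel_y = y
            y += c.h
        n.h = y


def phase5(n, parent_x, parent_y):
    """Top-down: absolute positions from relative ones."""
    n.x, n.y = parent_x, parent_y + n.rel_y
    for c in n.children:
        phase5(c, n.x, n.y)


def layout(root, viewport_w, base_fs):
    phase1(root, base_fs, viewport_w)
    phase2(root)
    phase3(root, viewport_w)
    phase4(root)
    phase5(root, 0.0, 0.0)


p1 = Node("p", words=[2, 2, 2])
p2 = Node("p", words=[5, 5, 5], fs_scale=0.5, w_scale=0.5)
body = Node("body", children=[p1, p2])
layout(body, viewport_w=100.0, base_fs=12.0)
```

Within any one pass a node only reads values from its parent (top-down) or its children (bottom-up), so sibling subtrees do not depend on each other during that pass.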
Each Phase Exhibits Tree Parallelism
[Figure: the document tree (<body>; <p>, <p>; <img>, "hello", <b>, "ok ok ok ok"; "world"; "ok") annotated with the values each phase computes: font sizes (fs=12; fs=6 where fs=50%), float=left, and preferred widths wp and wm (e.g. wp=80, wm=30; wp=80, wm=40; wp=40, wm=40).]
Phase 1: font size, temporary width
Phase 2: preferred max & min width
Phase 3: width
Phase 4: height, relative x/y position
Phase 5: absolute x/y position
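One way to picture tree parallelism in a top-down phase is a level-by-level traversal: once a level is done, every node on the next level can be visited independently. The sketch below applies that to font sizes only. It is an illustration under assumptions: the talk's implementation used Cilk++ task parallelism, while this uses Python threads purely to show the dependence structure (no speedup is expected in CPython), and the dict-based tree and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def set_font_size(parent, node, base_fs=12.0):
    """Phase-1 style visit: a node only reads its parent's already-computed fs."""
    inherited = base_fs if parent is None else parent["fs"]
    node["fs"] = inherited * node.get("fs_scale", 1.0)


def top_down_parallel(root, visit, max_workers=4):
    """Apply visit(parent, node) one tree level at a time.

    Nodes on the same level are independent of each other, so each level
    is handed to the pool as a batch of tasks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [(None, root)]
        while frontier:
            # list(...) forces the whole level to finish before descending.
            list(pool.map(lambda pc: visit(*pc), frontier))
            frontier = [(n, c) for _, n in frontier for c in n["children"]]


b = {"tag": "b", "children": []}
p1 = {"tag": "p", "children": []}
p2 = {"tag": "p", "fs_scale": 0.5, "children": [b]}
body = {"tag": "body", "children": [p1, p2]}
top_down_parallel(body, set_font_size)
```

A bottom-up phase would run the same idea in reverse, from the deepest level toward the root.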
Parallel Layout: Speculative Evaluation
• Did not break dependencies for floats
  – might stick out of paragraphs
• Speculate: assume no floats
• Check
• Patch up as needed
  – floats rare
  – we believe overflow is minimal
[Figure: renderings of the example content "hello" and "world ok ok ok ok ok".]
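The speculate / check / patch steps can be sketched on a toy model. Everything here is an assumption made for illustration: paragraphs hold equal-width words, a float sits at the top-left of its paragraph, and the names (`layout_par`, `speculative_layout`, the `fw`/`fh` keys) are hypothetical. Threads stand in for whatever parallel runtime is used.

```python
from concurrent.futures import ThreadPoolExecutor


def layout_par(p, width, in_w=0, in_h=0, word_w=20, line_h=10):
    """Height of a paragraph of equal-width words.

    The paragraph's own float (fw x fh) and an intruding float from above
    (in_w x in_h) both narrow the lines they overlap.
    """
    y, remaining = 0, p["n_words"]
    while remaining > 0:
        avail = width
        if y < p["fh"]:
            avail -= p["fw"]
        if y < in_h:
            avail -= in_w
        remaining -= max(1, avail // word_w)   # always place at least one word
        y += line_h
    return y


def speculative_layout(pars, width):
    # Speculate: lay out all paragraphs independently, assuming no float intrudes.
    with ThreadPoolExecutor() as pool:
        heights = list(pool.map(lambda p: layout_par(p, width), pars))
    # Check and patch: a cheap sequential scan for floats that stick out below.
    patched = []
    carry_w, carry_h = 0, 0
    for i, p in enumerate(pars):
        if carry_h > 0:                        # speculation was wrong here
            heights[i] = layout_par(p, width, carry_w, carry_h)
            patched.append(i)
        over_own = p["fh"] - heights[i]        # how far this float sticks out
        over_in = carry_h - heights[i]         # how far the intruder still reaches
        if over_own >= over_in:
            carry_w, carry_h = p["fw"], max(0, over_own)
        else:
            carry_h = max(0, over_in)
    return heights, patched


pars = [
    {"n_words": 2, "fw": 40, "fh": 30},   # short paragraph with a tall float
    {"n_words": 8, "fw": 0, "fh": 0},
    {"n_words": 5, "fw": 0, "fh": 0},
]
heights, patched = speculative_layout(pars, width=100)
```

Only the paragraph directly under the overflowing float is laid out again; if floats are rare, as the slide suggests, most speculative results are kept.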
Berkeley Style Sheet Layout Language
• Can compile essential CSS into it
• Refactored CSS to separate features
• Simplifies: correctness, parallelization, use
Analysis
• Model: sequential speed ~= Firefox speed
• Cilk++: 4x speedup, scales to 3 cores
• Need to SIMDize leaves
[Figure: Modeled Speedup w/ Cilk++. x-axis: # Hardware Threads (1 to 16); y-axis: Average Speedup (0 to 4). Series: Eight socket x 4 core AMD Opteron 2356 Barcelona, Sun X4600; Dual socket x 4 core AMD Opteron 2356 Barcelona, Sun X2200; Preproduction 2 socket x 4 core x 2 thread Intel Xeon Nehalem. Inset: document tree (<body>; <p>, <p>; <img>, "hellooo", <b>, "ok", "ok").]
Rule Matching: Problem Statement
• Matching
  – Tag path (img: <body> <p> <img>)
  – Rule selectors
  – For each tag path: which selectors are ~substrings?
• Rule resolution
  – Prioritize properties by rule order: lower overrides
selectors    p            img                        p img
properties   width=100%   width=100px, float=left    width=10px
[Figure: the document tree (<body>; <p>, <p>; <img>, "hello", <b>, "ok ok ok ok"; "world"; "ok") with matched properties shown on nodes, e.g. width=100px, float=left.]
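A minimal sketch of matching and resolution as stated above, assuming selectors are plain descendant chains of tag names and that "~substring" means an in-order subsequence of the tag path ending at the node itself. The function names are hypothetical, and real CSS also uses specificity, classes, and ids, which this ignores.

```python
def matches(selector, tag_path):
    """True if selector is an in-order subsequence of tag_path that ends at the node."""
    if not selector or selector[-1] != tag_path[-1]:
        return False
    ancestors = iter(tag_path[:-1])
    # "tag in ancestors" consumes the iterator, which enforces the ordering.
    return all(tag in ancestors for tag in selector[:-1])


def resolve(tag_path, rules):
    """Merge the properties of all matching rules; later (lower) rules override."""
    style = {}
    for selector, props in rules:
        if matches(selector, tag_path):
            style.update(props)
    return style


RULES = [
    (["p"], {"width": "100%"}),
    (["img"], {"width": "100px", "float": "left"}),
    (["p", "img"], {"width": "10px"}),
]

img_in_p = resolve(["body", "p", "img"], RULES)
img_alone = resolve(["body", "img"], RULES)
paragraph = resolve(["body", "p"], RULES)
```

For the `<img>` inside a `<p>`, both `img` and `p img` match, and the later rule's width wins while `float` survives from the earlier one.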
Rule Matching: Parallelization
• ~600 nodes, 1000s of rules
• Assign nodes to cores
  – load balancing: random assignment
• SIMDizable?
[Figure: the document tree and the selectors/properties table from the problem statement.]
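The assignment step can be sketched as shuffling the nodes and dealing them out to workers, then matching each bucket independently. This is an assumption-laden illustration: the helper names are hypothetical, selectors are plain descendant chains, and threads are used only to keep the example self-contained (the talk's Python prototype used processes, and CPython threads would not show a speedup here).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def is_match(selector, tag_path):
    """Selector is an in-order subsequence of tag_path ending at the node itself."""
    if not selector or selector[-1] != tag_path[-1]:
        return False
    ancestors = iter(tag_path[:-1])
    return all(tag in ancestors for tag in selector[:-1])


def assign_randomly(n_nodes, n_workers, seed=0):
    """Shuffle node indices and deal them round-robin into n_workers buckets."""
    order = list(range(n_nodes))
    random.Random(seed).shuffle(order)
    buckets = [[] for _ in range(n_workers)]
    for k, idx in enumerate(order):
        buckets[k % n_workers].append(idx)
    return buckets


def match_bucket(bucket, paths, selectors):
    """Match every node in one bucket against every selector."""
    return {
        i: [r for r, sel in enumerate(selectors) if is_match(sel, paths[i])]
        for i in bucket
    }


def parallel_match(paths, selectors, n_workers=4):
    buckets = assign_randomly(len(paths), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda b: match_bucket(b, paths, selectors), buckets))
    merged = {}
    for part in parts:
        merged.update(part)   # buckets are disjoint, so no conflicts
    return [merged[i] for i in range(len(paths))]


paths = [["body", "p"], ["body", "img"], ["body", "p", "img"], ["body", "p", "b"]]
selectors = [["p"], ["img"], ["p", "img"]]
matched = parallel_match(paths, selectors, n_workers=2)
```

Nodes never communicate during matching, so the only tuning question in this sketch is how evenly the work is spread, which is what the random assignment addresses.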
Analysis
• Results
  – perfect scaling: up to 10 cores
  – 20x speedup on 32 cores
  – …but with Python
    • interpreter overhead (sequential)
    • processes, not threads
• Future
  – C++ implementation
  – SIMD rule matching
[Figure: Speedup vs # Cores (w/ Python). x-axis: # Hardware Threads (1, 4, 8, 16, 32); y-axis: Average Speedup (0 to 32). Series: Slashdot, Rotten Tomatoes, Wikipedia, NY Times. Machine: 8 socket x 4 cores AMD Opteron 2356 Barcelona.]
Takeaways
• Artifacts
  – BSS/CSS specification & dependency decomposition
  – 4x solving speedup (untuned), 20x matching (in Python)
• Lessons
  – 4x << 100x: SIMDize low-level libraries (e.g., fonts)
  – motifs: low-latency tree ops, vectors, pixel blending
  – attribute grammars helped
• Next steps
  – tune tasks, SIMD kernels, bigger scope of model
  – implications for concurrent scripts using layout?
(questions?)