inside picnik: how we built picnik (and what we learned along the way)

Post on 16-Jul-2015

4.384 Views

Category:

Technology

3 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Inside Picnik: How we built Picnik and what

we learned along the way Mike Harrington

Justin Huff

Web2Expo,April3rd2009

Thanks! We couldn’t have done this without help!

Traffic, show scale of our problem

Our happy problem

Storage Render

MySQL

Apache

Python

Flex HTML

Data Center

Client

Amazon

Fun with Flex

Flex pros and cons

We Flash

•  Browser and OS independence •  Secure and Trusted •  Huge installed base •  Graphics and animation in your browser

Flex pros and cons

We Flash

•  Decent tool set •  Decent image support •  ActionScript

Flex pros and cons

Flash isn’t Perfect

•  Not Windows, Not Visual Studio •  ActionScript is not C++ •  SWF Size/Modularity •  Hardware isn’t accessible •  Adobe is our Frienemy •  Missing support for jpg Compression

Flex pros and cons

Flash is a control freak

Forecast: Cloudy with a chance of CPU

User Cloud

Your users are your free cloud!

•  Free CPU & Storage! •  Cuts down on server round trips •  More responsive (usually) •  3rd Party APIs Directly from your client •  Flash 10 enables direct load

User Cloud

•  Your user is a crappy sysadmin •  Hard to debug problems you can’t see •  Heterogeneous Hardware •  Some OS Differences

You get what you pay for.

User Cloud

Client Side Challenges

•  Hard to debug problems you can’t see •  Every 3rd party API has downtime •  You get blamed for everything •  Caching madness

Dealing with nasty problems

•  Logging •  Ask customers directly •  Find local customers •  Test, test, test

WouldweuseFlashagain?

Absolutely.

LAMP

Linux

YUP.

Apache

Check.(staGcfiles)

MySQL

Yeah,thattoo.

P

IsforPython

Python

90%ofPicnikisAPI.

CherryPy

•  WechoseCherryPy•  RESTishAPI•  CheetahforasmallamountoftemplaGng

Ok,thatwasboring.

Fun Stuff

VirtualizaGon

•  XEN•  DualQuadcoremachines•  32GBRAM•  Wevirtualizeeverything

– ExceptDBservers

VirtualizaGonPros

•  ParGGonfuncGons•  MatchingCPUboundandIOboundVMs•  HardwarestandardizaGon(intheory)

VirtualizaGonCons

•  Morecomplexity•  UsesmoreRAM•  IO/CPUinefficiencies

Storage

•  Westartedwithoneserver– Filesstoredlocally– Notscalable

Storage•  SwitchedtoMogileFS

– Workinggreat– Nodedicatedstoragemachines!

Storage

•  AddedS3intothemix

Storage–S3

•  S3isdeadsimpletoimplement.•  Forourusagepaberns,it'sexpensive•  Picnikgenerateslotsoftempfiles•  Problemskeepingupwithdeletes•  Madeadecisiontofocusonotherthingsbeforefixingdeletes

LoadBalancers

•  StartedusingPerlbal•  BoughttwoBigIPs

– Expensive,butgood.•  OutgrewtheBigIPs•  WentwithA10Networks

– Aliblebumpy– They'vehadgreatsupport.

CDNsRock

•  WewentwithLevel3•  Cutoutboundtrafficinhalf:

Rendering

•  Werenderimageswhenausersaves•  Renderjobsgointoaqueue•  Workerspullfromthequeue•  WorkersinboththeDCandEC2

Rendering•  Managerprocessmaintainenoughidleworkers

•  WorkersinDCareat~100%uGlizaGon

%$@#!!!

EC2•  OurprocessingiselasGcbetweenEC2andourownservers

•  Webuyserversinbatches

Monitoring

IfarackfallsintheDC,doesitmakeanysound?

Monitoring

•  NagiosforUp/Down–  IntegratedwithautomaGctools– ~120hosts– ~1100servicechecks

•  CacGforgeneralgraphing/trending– ~3700datasources– ~2800Graphs

•  Smokeping– Networklatency

Redundancy

•  Ilikesleepingatnight•  Makesmaintenancetaskseasier•  Ittakessomeworktobuild•  Webuiltearly(thanksFlickr!)

WarStories

ISPIssues

•  We'rehappywithbothofourproviders.•  But...

– Denialofserviceabacks– Routerproblems– Power

•  Haveseveralproviders•  Monitor&trend!

ISPIssues

•  Highlatencyonatransitlink:

•  Cause:HighCPUusageontheiraggregaGonrouter

•  SoluGon:Clearandreconfigourinterface

AmazonWebServices

•  AWSisprebysolid•  But,it'snot100%•  WhenAWSbreaks,we'redown

– Theworsttypeofoutagebecauseyoucan'tdoanything.

•  Watchoutforissuesthatneitherpartycontrols– TheInternetisn'tasreliablepointtopoint

EC2SafetyNet

•  BugcausedrenderGmestoincrease

EC2Safetynet

•  ThatsOK!

FlickrLaunch•  VeryslowFlickrAPIcallspost‐launch•  SpentanenGredayonphone/IMwithFlickrOps

•  FinallydiscoveredanissuewithNATandTCPGmestamps

•  Lessons:– Havetoolstobeabletodivedeep– Granularmonitoring

Firewalls•  WehadapairofWatchguardx1250efirewalls.

– Specs:“1.5Gbpsoffirewallthroughput”– Actualat100Mbps:

authorize.net

•  ConnecGvityissues•  Supportwasuseless•  EventuallygottotheirNOC•  WereroutedviamyhomeDSL.•  Lessons:

– Accesstotechnicalpeople– Handlefailureofservicesgracefully

Theend.

Thanks!

mike@picnik.comjusGn@picnik.com

Credits/ResourcesTalksthatinfluencedus:hbp://Gnyurl.com/scalingwebsites

Images:hbp://www.flickr.com/photos/kimberlyfaye/2785383504/

hbp://www.flickr.com/photos/piez/574804017/sizes/o/hbp://www.flickr.com/photos/unclejojo/3224365412/sizes/o/hbp://www.flickr.com/photos/phoenixdailyphoto/1467681879/sizes/l/hbp://abducGonlamp.com/hbp://www.flickr.com/photos/msig/2716038191/hbp://www.flickr.com/photos/touringcyclist/303577344/hbp://www.flickr.com/photos/davidt2006/1407837008/

top related