Apache Hadoop India Summit 2011 Keynote Talk: "Scaling Hadoop Applications" by Milind Bhandarkar


TRANSCRIPT


Scaling Hadoop Applications
Milind Bhandarkar, Distributed Data Systems Engineer
[[email protected]]

About Me (http://www.linkedin.com/in/milindb)
- Founding member of the Hadoop team at Yahoo! [2005-2010]
- Contributor to Apache Hadoop & Pig since version 0.1
- Built and led the Grid Solutions Team at Yahoo! [2007-2010]
- Parallel programming paradigms since 1989 (PhD, cs.illinois.edu)
- Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, PathScale Inc. (acquired by QLogic), Yahoo!, and LinkedIn

Outline
- Scalability of applications
- Causes of sublinear scalability
- Best practices
- Q & A

Importance of Scaling
- Zynga's CityVille: 0 to 100 million users in 43 days! (http://venturebeat.com/2011/01/14/zyngas-cityville-grows-to-100-million-users-in-43-days/)
- Facebook in 2009: from 150 million users to 350 million! (http://www.facebook.com/press/info.php?timeline)
- LinkedIn in 2010: adding a member per second! (http://money.cnn.com/2010/11/17/technology/linkedin_web2/index.htm)
- Public WWW size: 26M pages in 1998, 1B in 2000, 1T in 2008 (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html)
- Scalability allows dealing with success gracefully!

Explosion of Data

(Data-Rich Computing theme proposal, J. Campbell et al., 2007)

Hadoop
- Simplifies building parallel applications
- The programmer specifies record-level and group-level sequential computation
- The framework handles partitioning, scheduling, dispatch, execution, communication, failure handling, monitoring, reporting, etc.
- The user provides hints to control parallel execution

Scalability of Parallel Programs
- If one node processes k MB/s, then N nodes should process (k*N) MB/s
- If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes
- This is linear scalability

Reduce Latency
- Minimize program execution time

Increase Throughput
- Maximize data processed per unit time

Three Equations
- Amdahl's Law, Little's Law, and the message cost model (next slides)

Amdahl's Law
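(The equation on this slide was an image in the original deck and did not survive extraction. The standard statement, for a program in which a fraction f of the running time is sped up by a factor s, is:)

    Speedup = 1 / ((1 - f) + f/s)

As s grows without bound, the speedup approaches 1/(1 - f); even with f = 0.95, no amount of parallelism yields more than a 20x speedup.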

Multi-Phase Computations
- If a computation C is split into N different parts, C1..CN
- If partial computation Ci can be sped up by a factor of Si

Amdahl's Law, Restated
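(The restated equation was likewise an image. With fi denoting the fraction of the original running time spent in phase Ci, so that f1 + ... + fN = 1, the usual multi-phase form is:)

    Speedup = 1 / (f1/S1 + f2/S2 + ... + fN/SN)

Any phase left unaccelerated (Si = 1) bounds the whole computation; this is why a serial phase such as a single reducer caps job speedup no matter how many map slots are added.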

MapReduce Workflow
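(The workflow diagram on this slide is not preserved in the transcript. As a stand-in, here is a minimal word-count job against the Hadoop 0.20-era API, current at the time of this talk, showing the record-level map, the framework-managed shuffle, and the group-level reduce. Class and variable names are illustrative.)

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Record-level sequential computation: called once per input record.
        public static class TokenizeMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // handed to the shuffle
                }
            }
        }

        // Group-level sequential computation: called once per key after the shuffle.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizeMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Everything else named on the Hadoop slide (partitioning, scheduling, dispatch, failure handling, monitoring) happens in the framework between these two user-supplied functions.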

Little's Law
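(The formula shown here was also an image. Little's result relates the average number of items in flight L, the throughput rate λ, and the average time each item spends in the system W:)

    L = λ * W

Applied to Hadoop: if average task latency is W = 60 s and the cluster holds L = 600 tasks in flight, sustained throughput is at most λ = L/W = 10 tasks per second; to raise throughput you must either add concurrency or cut latency.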

Message Cost Model
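(The model on this slide, again lost as an image, is the standard alpha-beta cost for sending a message of m bytes:)

    T(m) = α + m/β

where α is the fixed per-message latency and β is the bandwidth. With the Gigabit Ethernet figures on the next slide (α = 300 μs, β = 100 MB/s): 100 messages of 10 KB cost 100 × (300 μs + 100 μs) = 40 ms, while 10 messages of 100 KB cost 10 × (300 μs + 1 ms) = 13 ms. Fewer, larger messages amortize α.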

Message Granularity
- For Gigabit Ethernet: α = 300 μs, β = 100 MB/s
- 100 messages of 10 KB each = 40 ms
- 10 messages of 100 KB each = 13 ms

Alpha-Beta
- Common mistake: assuming that α is constant
  - Scheduling latency for the responder
  - The JobTracker/TaskTracker time slice is inversely proportional to the number of concurrent tasks
- Common mistake: assuming that β is constant
  - Network congestion
  - TCP incast

Causes of Sublinear Scalability

Sequential Bottlenecks
- Mistake: a single reducer (the Hadoop default); fix via mapred.reduce.tasks or Pig's PARALLEL directive (see the configuration sketch below, after the Algorithmic Overheads slide)
- Mistake: a constant number of reducers independent of data size (default_parallel)

Load Imbalance
- Unequal partitioning of work
  - Imbalance = max partition size - min partition size
- Heterogeneous workers
  - Example: disk bandwidth; an empty disk delivers 100 MB/s, a 75%-full disk only 25 MB/s

Over-Partitioning
- For N workers, choose M partitions, with M >> N
- Adaptive scheduling: each worker picks the next unscheduled partition
- Load imbalance = Max(Sum(Wk)) - Min(Sum(Wk))
- For sufficiently large M, both Max and Min tend to Average(Wk), reducing load imbalance

Critical Paths
- Maximum distance between start and end nodes
- Long tail in map task executions: enable speculative execution
- Overlap communication and computation
- Asynchronous notifications
  - Example: deleting intermediate outputs after job completion

Algorithmic Overheads
- Repeating computation in place of adding communication
- Parallel exploration of the solution space results in wasted computation
- Need for a coordinator
- Share partial, unmaterialized results
- Consider memory-based, small, replicated K-V stores with eventual consistency
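(A minimal sketch of the knobs named on the Sequential Bottlenecks and Critical Paths slides, using 0.20-era property and method names; the reducer count is a caller-supplied placeholder that should scale with data size, not a constant.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelismKnobs {
        public static Job configure(int numReducers) throws Exception {
            Configuration conf = new Configuration();
            // Re-run straggling tasks on other nodes to shorten the critical path.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            Job job = new Job(conf, "scaled-job");
            // Equivalent to setting mapred.reduce.tasks; never leave this at the
            // single-reducer default (in Pig: PARALLEL clause or default_parallel).
            job.setNumReduceTasks(numReducers);
            return job;
        }
    }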

Synchronization
- The shuffle stage in Hadoop: M*R transfers
- The slowest system components are involved: network and disk
- Can you eliminate the reduce stage?
- Use combiners?
- Compress intermediate output?
- Launch reduce tasks before all maps finish
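(The remedies on this slide map onto a few job settings; a sketch with 0.20-era property names, reusing SumReducer from the word-count sketch above as a combiner, which is valid because summation is associative and commutative.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleTuning {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Compress map output so the M*R shuffle transfers move fewer bytes.
            conf.setBoolean("mapred.compress.map.output", true);
            // Start the reducers' copy phase once half the maps have finished,
            // overlapping shuffle communication with remaining map computation.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
            Job job = new Job(conf, "tuned-shuffle");
            // Local aggregation before the shuffle cuts data volume per transfer.
            job.setCombinerClass(WordCount.SumReducer.class);
            return job;
        }
    }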

Task Granularity
- Amortize task scheduling and launching overheads
- Centralized scheduler
  - Out-of-band TaskTracker heartbeats reduce latency, but may overwhelm the scheduler
- Increase task granularity
  - Combine multiple small files
  - Maximize split sizes
  - Combine computations per datum

Hadoop Best Practices
- Use higher-level languages (e.g. Pig)
- Coalesce small files into larger ones & use a bigger HDFS block size
- Tune buffer sizes for tasks to avoid spilling to disk
- Consume only one slot per task
- Use a combiner if local aggregation reduces size

Hadoop Best Practices (contd.)
- Use compression everywhere
  - CPU-efficient for intermediate data
  - Disk-efficient for output data
- Use the Distributed Cache to distribute small side-files (
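(The transcript breaks off mid-bullet. As a sketch of the Distributed Cache usage it was introducing, with the 0.20-era API; the HDFS path is a made-up example:)

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class SideFiles {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "side-file-lookup");
            // The side-file is localized once per node per job, instead of being
            // read from HDFS by every task; keep it small. Path is hypothetical.
            DistributedCache.addCacheFile(new URI("/shared/lookup.txt"),
                                          job.getConfiguration());
            return job;
        }
    }

Tasks then read the localized copy via DistributedCache.getLocalCacheFiles(context.getConfiguration()).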