Apache Hadoop India Summit 2011 keynote talk "Scaling Hadoop Applications" by Milind Bhandarkar
TRANSCRIPT
Scaling Hadoop Applications
Milind Bhandarkar, Distributed Data Systems Engineer [[email protected]]
[image: http://www.flickr.com/photos/theplanetdotcom/4878812549/]

About Me
http://www.linkedin.com/in/milindb
- Founding member of the Hadoop team at Yahoo! [2005-2010]
- Contributor to Apache Hadoop & Pig since version 0.1
- Built and led the Grid Solutions Team at Yahoo! [2007-2010]
- Parallel programming paradigms since 1989 (PhD, cs.illinois.edu)
- Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, and LinkedIn
[image: http://www.flickr.com/photos/tmartin/32010732/]

Outline
- Scalability of applications
- Causes of sublinear scalability
- Best practices
- Q & A
[image: http://www.flickr.com/photos/[email protected]/2202278303/]

Importance of Scaling
- Zynga's CityVille: 0 to 100 million users in 43 days! (http://venturebeat.com/2011/01/14/zyngas-cityville-grows-to-100-million-users-in-43-days/)
- Facebook in 2009: from 150 million users to 350 million! (http://www.facebook.com/press/info.php?timeline)
- LinkedIn in 2010: adding a member per second! (http://money.cnn.com/2010/11/17/technology/linkedin_web2/index.htm)
- Public WWW size: 26M in 1998, 1B in 2000, 1T in 2008 (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html)
- Scalability allows dealing with success gracefully!
[image: http://www.flickr.com/photos/palmdiscipline/2784992768/]

Explosion of Data
(Data-Rich Computing theme proposal, J. Campbell et al., 2007)
[image: http://www.flickr.com/photos/[email protected]/4128229979/]

Hadoop
- Simplifies building parallel applications
- The programmer specifies record-level and group-level sequential computation
- The framework handles partitioning, scheduling, dispatch, execution, communication, failure handling, monitoring, reporting, etc.
- The user provides hints to control parallel execution

Scalability of Parallel Programs
- If one node processes k MB/s, then N nodes should process (k*N) MB/s
- If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes

Linear Scalability
[image: http://www-03.ibm.com/innovation/us/watson/]

Reduce Latency
- Minimize program execution time
[image: http://www.flickr.com/photos/adhe55/2560649325/]

Increase Throughput
- Maximize data processed per unit time
[image: http://www.flickr.com/photos/mikebaird/3898801499/]

Three Equations
[image: http://www.flickr.com/photos/ucumari/321343385/]

Amdahl's Law
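The law itself appears only as an image in the deck, so as a standard reminder (reconstructed here, not copied from the slide): if a fraction f of a program's running time cannot be parallelized, the speedup on N processors is

    S(N) = \frac{1}{f + (1 - f)/N} \le \frac{1}{f}

so even a small sequential fraction caps the achievable speedup, no matter how many nodes are added.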
[image: http://www.computer.org/portal/web/awards/amdahl]

Multi-Phase Computations
- If computation C is split into N different parts, C1..CN
- If partial computation Ci can be sped up by a factor of Si
[image: http://www.computer.org/portal/web/awards/amdahl]

Amdahl's Law, Restated
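The restated form is also an image on the slide; using the notation of the previous slide, the standard generalization (again a reconstruction, not a copy of the slide) is: if part Ci accounts for a fraction f_i of the original running time (with \sum_i f_i = 1) and is sped up by a factor S_i, then the overall speedup is

    S = \frac{1}{\sum_{i=1}^{N} f_i / S_i}

The phases that are sped up the least (S_i close to 1) dominate the sum, and hence the overall running time.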
[image: http://www.computer.org/portal/web/awards/amdahl]

MapReduce Workflow
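The workflow slide is a diagram; to make the record-level (map) and group-level (reduce) computation from the earlier Hadoop slide concrete, here is a minimal word-count sketch against the 0.20-era org.apache.hadoop.mapreduce API. It is illustrative only; the class names are mine, not from the talk.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Record-level computation: one call per input record (one line of text).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Group-level computation: one call per key, with all of its values.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in later releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);  // local aggregation; see Best Practices later
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Everything between context.write() in the mapper and the grouped Iterable seen by the reducer is handled by the framework: partitioning, shuffling, sorting, retries, and reporting.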
[image: http://www.computer.org/portal/web/awards/amdahl]

Little's Law
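The slide shows only a portrait; the standard statement of the law (reconstructed, not from the slide) is

    L = \lambda W

where L is the average number of items in the system, \lambda the average arrival (or throughput) rate, and W the average time an item spends in the system. One common reading in this context: to sustain a throughput of \lambda tasks per second when each task takes W seconds, roughly \lambda \cdot W tasks must be in flight concurrently.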
[image: http://sloancf.mit.edu/photo/facstaff/little.gif]

Message Cost Model
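The cost model appears as a figure; in the usual alpha-beta form (a reconstruction, but consistent with the numbers on the next slide), the time to deliver a message of n bytes is

    T(n) = \alpha + \beta n

where \alpha is the fixed per-message latency and \beta is the per-byte transfer time (the inverse of bandwidth).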
[image: http://www.flickr.com/photos/mattimattila/4485665105/]

Message Granularity
- For Gigabit Ethernet, α ≈ 300 μs and bandwidth (1/β) ≈ 100 MB/s
- 100 messages of 10 KB each: 100 × 300 μs + 1 MB / (100 MB/s) = 30 ms + 10 ms = 40 ms
- 10 messages of 100 KB each: 10 × 300 μs + 1 MB / (100 MB/s) = 3 ms + 10 ms = 13 ms
[image: http://www.flickr.com/photos/philmciver/982442098/]

Alpha-Beta
- Common mistake: assuming that α is constant
  - Scheduling latency for the responder
  - Jobtracker/Tasktracker time slice is inversely proportional to the number of concurrent tasks
- Common mistake: assuming that β is constant
  - Network congestion
  - TCP incast

Causes of Sublinear Scalability
[image: http://www.flickr.com/photos/sabertasche2/2609405532/]

Sequential Bottlenecks
- Mistake: single reducer (the default); see the job-configuration sketch below
  - mapred.reduce.tasks (MapReduce)
  - PARALLEL directive (Pig)
- Mistake: constant number of reducers independent of data size
  - default_parallel (Pig)
[image: http://www.flickr.com/photos/bogenfreund/556656621/]

Load Imbalance
- Unequal partitioning of work
  - Imbalance = Max partition size - Min partition size
- Heterogeneous workers
  - Example: disk bandwidth is 100 MB/s for an empty disk but 25 MB/s for a 75%-full disk
[image: http://www.flickr.com/photos/kelpatolog/176424797/]

Over-Partitioning
- For N workers, choose M partitions with M >> N
- Adaptive scheduling: each worker picks the next unscheduled partition
- Load imbalance = Max(Sum(Wk)) - Min(Sum(Wk))
- For sufficiently large M, both Max and Min tend to Average(Wk), reducing load imbalance
[image: http://www.flickr.com/photos/infidelic/3238637031/]

Critical Paths
- Maximum distance between the start and end nodes
- Long tail in map task executions: enable speculative execution (see the job-configuration sketch below)
- Overlap communication and computation
- Asynchronous notifications
  - Example: deleting intermediate outputs after job completion
[image: http://www.flickr.com/photos/cdevers/4619306252/]

Algorithmic Overheads
- Repeating computation in place of adding communication
- Parallel exploration of the solution space results in wasted computation
- Need for a coordinator
- Share partial, unmaterialized results
- Consider memory-based, small, replicated K-V stores with eventual consistency
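As referenced on the "Sequential Bottlenecks" and "Critical Paths" slides above, here is a minimal job-configuration sketch using Hadoop 1.x-era property names. The class name and the value 64 are placeholders of mine, not recommendations from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobTuning {
  public static Job configure(Configuration conf) throws Exception {
    // Speculative execution re-runs straggler tasks to shorten the long tail
    // of map/reduce executions (enabled by default in this era; shown explicitly).
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

    Job job = new Job(conf, "tuned job");  // Job.getInstance(conf, ...) in later releases

    // Avoid the single-reducer default: scale reducer count with data size.
    // Equivalent to setting mapred.reduce.tasks; 64 is only a placeholder.
    job.setNumReduceTasks(64);
    return job;
  }
}
```

In Pig, the corresponding knobs are the PARALLEL clause on an operator and the script-level default_parallel setting named on the slide.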
[image: http://www.flickr.com/photos/sbprzd/140903299/]

Synchronization
- Shuffle stage in Hadoop: M*R transfers
- Involves the slowest system components: network and disk
- Can you eliminate the Reduce stage?
- Use combiners?
- Compress intermediate output?
- Launch reduce tasks before all maps finish (see the configuration sketch below)
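A minimal sketch of the knobs behind the last three bullets, again with Hadoop 1.x-era property names; the codec choice, the 0.80 slowstart value, and the reuse of the word-count reducer as a combiner are my assumptions, not prescriptions from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
  public static Job configure(Configuration conf) throws Exception {
    // Compress map output so the shuffle moves fewer bytes over disk and network
    // (a lighter codec such as LZO or Snappy is preferable when available).
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);

    // Start reduce tasks before all maps finish, overlapping the shuffle with
    // remaining map work (the default is 0.05; 0.80 is only an example).
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);

    Job job = new Job(conf, "shuffle-tuned job");

    // A combiner aggregates map output locally before the shuffle; reusing the
    // reducer (as in the word-count sketch earlier) works when the reduce
    // function is associative and commutative.
    job.setCombinerClass(WordCount.SumReducer.class);
    return job;
  }
}
```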
[image: http://www.flickr.com/photos/susan_d_p/2098849477/]

Task Granularity
- Amortize task scheduling and launching overheads
- Centralized scheduler
  - Out-of-band Tasktracker heartbeat reduces latency, but may overwhelm the scheduler
- Increase task granularity
  - Combine multiple small files
  - Maximize split sizes
  - Combine computations per datum
[image: http://www.flickr.com/photos/philmciver/982442098/]

Hadoop Best Practices
- Use higher-level languages (e.g. Pig)
- Coalesce small files into larger ones and use a bigger HDFS block size
- Tune buffer sizes for tasks to avoid spilling to disk
- Consume only one slot per task
- Use a combiner if local aggregation reduces size
[image: http://www.flickr.com/photos/yaffamedia/1388280476/]

Hadoop Best Practices (contd.)
- Use compression everywhere
  - CPU-efficient for intermediate data
  - Disk-efficient for output data
- Use the Distributed Cache to distribute small side-files (
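A minimal sketch of shipping a small side-file with the DistributedCache API named in the last bullet, using the Hadoop 1.x-era org.apache.hadoop.filecache package; the class name and the file path are hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class SideFileExample {
  public static Job configure(Configuration conf) throws Exception {
    // Register a small lookup file so the framework copies it once to each
    // tasktracker's local disk instead of every task reading it from HDFS.
    DistributedCache.addCacheFile(new URI("/apps/lookup/countries.tsv"), conf);

    // The Job copies the configuration at construction time, so the cache
    // file must be registered before this point.
    Job job = new Job(conf, "side-file join");
    return job;
  }
}
```

Tasks can then locate the local copy in their setup() method via DistributedCache.getLocalCacheFiles(conf).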