20090422 www
Wednesday, April 22, 2009

Upload: jeff-hammerbacher
Post on 29-Nov-2014
Category: Technology

TRANSCRIPT

Page 1: 20090422 Www

Wednesday, April 22, 2009

Page 2: 20090422 Www

Socializing Big Data: Lessons from the Hadoop Community

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
April 22, 2009

Page 3: 20090422 Www

My Background: Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led the Data team at Facebook
  ▪ Nearly 30 amazing engineers and data scientists
  ▪ Released Hive and Cassandra as open source projects
  ▪ Published research at conferences: SIGMOD, CHI, ICWSM
▪ Founder of Cloudera
  ▪ Building tools to make learning go faster, starting with Hadoop

Page 4: 20090422 Www

Presentation Outline
▪ What is Hadoop?
▪ Hadoop at Facebook
  ▪ Brief history of the Facebook Data team
  ▪ Summary of how we used Hadoop
  ▪ Reasons for choosing Hadoop
▪ How is software built and adopted?
  ▪ “Laboratory Life”
  ▪ Social Learning Theory
  ▪ Organizations and tools in open source development
▪ Moving from the “Age of Data” to the “Age of Learning”

Page 5: 20090422 Www

The Hadoop community is producing innovative, world class software for web scale data management and analysis.

By studying how software is built and adopted, we can enhance the rate at which data processing technologies evolve.

The Hadoop community is open to everyone and will play a central role in this evolution. You should join us!


Page 6: 20090422 Www

What is Hadoop? Not Just a Stuffed Elephant

▪ Open source project, written mostly in Java
▪ Most active Apache Software Foundation project
▪ Inspired by Google infrastructure
▪ Over one hundred production deployments
▪ Project structure
  ▪ Hadoop Distributed File System (HDFS)
  ▪ Hadoop MapReduce
  ▪ Hadoop Core: client libraries and management tools
  ▪ Other subprojects: Avro, HBase, Hive, Pig, Zookeeper

Page 7: 20090422 Www

Anatomy of a Hadoop Cluster
▪ Commodity servers
  ▪ 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
▪ Typically arranged in a 2-level architecture
  ▪ 40 nodes per rack
▪ Inexpensive to acquire and maintain

From “Commodity Hardware Cluster” (ApacheCon US 2008):
▪ Typically in a 2-level architecture
  ▪ Nodes are commodity Linux PCs
  ▪ 40 nodes/rack
  ▪ Uplink from rack is 8 gigabit
  ▪ Rack-internal is 1 gigabit all-to-all
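A rough back-of-envelope on what the hardware above adds up to, sketched in a few lines of Python (the numbers come from the slide; the 3x replication factor is HDFS's default and is an assumption here, not stated on this slide):

```python
# Per-rack sizing for the commodity cluster described above.
nodes_per_rack = 40
disks_per_node, tb_per_disk = 4, 1   # 4 x 1 TB SATA per node
replication = 3                      # assumed HDFS default replication

raw_tb = nodes_per_rack * disks_per_node * tb_per_disk
usable_tb = raw_tb / replication     # capacity after 3x replication

# Network oversubscription: 40 nodes at 1 GbE share an 8 Gb rack uplink.
uplink_gbps = 8
internal_gbps = nodes_per_rack * 1
oversubscription = internal_gbps / uplink_gbps

print(raw_tb, round(usable_tb, 1), oversubscription)  # 160 53.3 5.0
```

The 5:1 oversubscription ratio is one reason MapReduce tries to keep work on the node (or at least the rack) that holds the data.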

Page 8: 20090422 Www

HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
▪ Two main daemons: NameNode and DataNode
  ▪ NameNode manages filesystem metadata
  ▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with cluster size
▪ Access from Java, C, command line, FUSE, or Thrift
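A toy sketch of the block-and-replica scheme described above, in plain Python (this is not the real HDFS code; the round-robin placement is a simplification, since actual HDFS placement is rack-aware):

```python
import itertools

BLOCK_SIZE = 128 * 2**20   # 128 MB blocks, as on the slide
REPLICATION = 3            # HDFS's default replication factor

def split_into_blocks(file_size):
    """Number of blocks a file of file_size bytes occupies."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division

def place_replicas(num_blocks, datanodes):
    """Toy placement: each block gets REPLICATION distinct DataNodes,
    assigned round-robin across the cluster."""
    ring = itertools.cycle(datanodes)
    return {block: [next(ring) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

nodes = [f"datanode-{i}" for i in range(5)]
layout = place_replicas(split_into_blocks(1 * 2**30), nodes)  # a 1 GB file
print(len(layout))  # 8 blocks of 128 MB

# A block stays readable as long as at least one of its replicas survives.
failed = {"datanode-0", "datanode-1"}
readable = all(set(replicas) - failed for replicas in layout.values())
print(readable)  # True: the file survives two node failures
```

The NameNode plays the role of the `layout` dictionary here: it tracks which DataNodes hold each block, and re-replicates when a replica is lost.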

Page 9: 20090422 Www

HDFS: HDFS distributes file blocks among servers

Excerpt from the Cloudera whitepaper “What is Hadoop? Big Data in the Enterprise”:

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.

Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it manages.

Sidebar: Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Page 10: 20090422 Www

Hadoop MapReduce
▪ Fault-tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two main daemons: JobTracker and TaskTracker
▪ Three main phases: Map, Shuffle, and Reduce
▪ Growing sophistication for job and task scheduling
▪ Many client interfaces
  ▪ Java, C++, Streaming
  ▪ Pig, SQL (Hive QL)
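The three phases named above can be illustrated with a toy word count in plain Python (a sketch of the data flow only; in a real job the map and reduce functions run as distributed tasks under the JobTracker and TaskTracker daemons, and the shuffle is done by the framework):

```python
from collections import defaultdict

def map_phase(records):
    """Map: each input record emits (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop pushes work to the data", "the data stays put"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"], counts["data"])  # 2 2
```

The same key/value shape is what the Streaming interface exposes: a mapper and reducer can be any pair of executables reading lines on stdin and writing key/value lines on stdout.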

Page 11: 20090422 Www

MapReduce: MapReduce pushes work out to the data

Excerpt from the Cloudera whitepaper “What is Hadoop? Big Data in the Enterprise”:

Running the analysis on the nodes that actually store the data delivers much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server with a copy of the data.

Figure 2: Hadoop pushes work out to the data

Summary

Hadoop’s MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
http://www.cloudera.com/

Sidebar: Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.

Page 12: 20090422 Www

Hadoop Subprojects
▪ Avro
  ▪ Cross-language framework for RPC
▪ HBase
  ▪ Table storage above HDFS, modeled after Google’s BigTable
▪ Hive
  ▪ SQL interface to structured data stored in HDFS
▪ Pig
  ▪ Language for data flow programming
▪ Zookeeper
  ▪ Coordination service for distributed systems

Page 13: 20090422 Www

Hadoop at Yahoo!
▪ Jan 2006: Hired Doug Cutting
▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
▪ Mar 2008: Hadoop Summit attracted several hundred attendees
▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
▪ Aug 2008: Deployed 4,000 node Hadoop cluster
▪ Data Points
  ▪ Over 20,000 nodes running Hadoop
  ▪ Hundreds of thousands of jobs per day
  ▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity
  ▪ Largest shuffle is 450 TB

Page 14: 20090422 Www

Facebook Before Hadoop
Early 2006: The First Research Scientist

▪ Source data living on horizontally partitioned MySQL tier
  ▪ Intensive historical analysis difficult
  ▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging

Page 15: 20090422 Www

Facebook Data Infrastructure: 2007

[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]

Page 16: 20090422 Www

Facebook Data Infrastructure: 2008

[Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]

Page 17: 20090422 Www

Facebook Workloads
▪ Data collection
  ▪ server logs
  ▪ application databases
  ▪ web crawls
▪ Thousands of multi-stage processing pipelines
  ▪ Summaries consumed by external users
  ▪ Summaries for internal reporting
  ▪ Ad optimization pipeline
  ▪ Experimentation platform pipeline
▪ Ad hoc analyses

Page 18: 20090422 Www

Facebook Hadoop Statistics
▪ Over 700 servers running Hadoop in one data center
▪ 2.5 PB in largest Hadoop cluster
▪ 15 TB loaded into Hadoop cluster each day
▪ 4,000 MapReduce jobs with 800,000 tasks run per day
▪ 55 TB of data processed per day
▪ 15 TB of additional data produced from cluster activity per day
▪ Hadoop cluster not retiring data!

Page 19: 20090422 Www

Why Did Facebook Choose Hadoop?
1. Demonstrated effectiveness for primary workload
2. Proven ability to scale past any commercial vendor
3. Easy provisioning and capacity planning with commodity nodes
4. Data access for engineers and business analysts
5. Single system to manage XML, JSON, text, and relational data
6. No schemas enabled data collection without involving Data team
7. Cost of software: zero dollars
8. Deep commitment to continued development from Yahoo!
9. Active user and developer community
10. Apache-licensed open source code

Page 20: 20090422 Www

Hadoop Community Support: People Build Technology

▪ Most active Apache mailing lists
▪ Detailed official documentation per release
▪ Three books this year: O’Reilly, Apress, Manning
▪ Free training videos online
▪ Regular user group meetings in many cities
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Commercial training and support from Cloudera

Page 21: 20090422 Www

How Software is Built: Methodological Reflexivity

▪ Latour and Woolgar’s “Laboratory Life”
  ▪ Study scientists doing science
  ▪ Use “thick descriptions” and focus on “microconcerns”
▪ Some studies of closed and open source development exist
  ▪ “Mythical Man Month”, “Cathedral and the Bazaar”
  ▪ Hertel et al. surveyed 141 Linux kernel developers
▪ Focus on the people creating code
  ▪ Less religion, more empirical analyses
  ▪ Build tools to facilitate interaction and output

Page 22: 20090422 Www

Building Open Source Software: Structural Conditions for Success

▪ Moon and Sproull proposed some rules for successful projects
  ▪ Authority comes from competence
  ▪ Leaders have clear responsibilities and delegate often
  ▪ The code has a modular structure
  ▪ Establish a parallel release policy: stable and experimental
  ▪ Give credit to non-source contributions, e.g. documentation
  ▪ Communicate clear rules and norms for community online
  ▪ Use simple and reliable communication tools

Page 23: 20090422 Www

Building Software Faster: Consolidate Best Practices

▪ JavaScript frameworks starting to converge
  ▪ Many adopting jQuery’s selector syntax
  ▪ Significant benchmarks emerging
▪ Web frameworks push idioms into project structure
  ▪ What would be the Rails/Django equivalent for data storage?
  ▪ Reusable components also nice, e.g. log-structured merge trees
  ▪ Compare work on BOOM, RodentStore
▪ Debian distributes release note writing responsibility via “beats”

Page 24: 20090422 Www

Complications of Open Source
▪ Intellectual property
  ▪ Trademark, Copyright, Patent, and Trade Secret
  ▪ Litigation history
▪ Business models and foundations to ensure long-term support
  ▪ Direct support: Red Hat, MySQL
  ▪ Indirect support: LLVM, GSoC
  ▪ Foundations: Apache, Python, Django
▪ Diversity of licenses
  ▪ Licenses form communities
  ▪ Licenses change over time (cf. Rambus BSD incident)

Page 25: 20090422 Www

How Software is Adopted: Choosing the Right Tool for the Job

▪ Must be aware that a software project exists
  ▪ Tools like GitHub, Ohloh, Launchpad
  ▪ Sites like Reddit and Hacker News
▪ Existing example use cases are critical
  ▪ At Facebook, we studied motivations for content production
  ▪ Especially effective: Bandura’s “Social Learning Theory”
  ▪ Hadoop being run in production at Yahoo! and Facebook
▪ Active user communities and great documentation

Page 26: 20090422 Www

Open Learning: Open Data, Hypotheses and Workflows

▪ In science, data is generated once and analyzed many times
  ▪ IceCube
  ▪ LHC
▪ Lots of places where data and visualizations get shared
  ▪ data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts
▪ Record which hypotheses and workflows have been applied
  ▪ Increase diversity of questions asked and applications built
  ▪ Analysis skills unevenly distributed; send skills to the data!

Page 27: 20090422 Www

The Future of Data Processing: Hadoop, the Browser, and Collaboration

▪ “The Unreasonable Effectiveness of Data”, “MAD Skills”
▪ Single namespace for your organization’s bits
▪ Single engine for distributed data processing
▪ Materialization of structured subsets into optimized stores
▪ Browser as client interface with focus on user experience
▪ The system gets better over time using workload information
▪ Cloning and sharing of common libraries and workflows
▪ Global metadata store driving collection, analysis, and reporting
▪ Version control within and between sites, cf. Orchestra

Page 28: 20090422 Www

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0