hadoop @ ibmbigdata
DESCRIPTION
Eric Baldeschwieler's talk about Hadoop at Yahoo, given at the IBM Big Data Symposium
TRANSCRIPT
Eric Baldeschwieler, VP, Hadoop Software
YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!
AGENDA
• Brief Overview
• Hadoop @ Yahoo!
• Hadoop Momentum
• The Future of Hadoop
WHAT’S HAPPENING
- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical
Flickr : sub_lime79
TURNING DATA INTO INSIGHTS
machine learning time series
content clustering
factorization models
logistic regression
Flickr : NASA Goddard Photo and Video
algorithms user interest prediction
ad inventory modeling
MAKING YAHOO RELEVANT
Flickr : ogimogi
HADOOP: POWERING YAHOO!
science + big data + insight = personal relevance = VALUE
Flickr : DDFic
WHAT IS HADOOP?
Programming Languages: Pig, Hive
Computation: MapReduce
Storage: HDFS
Commodity
• Computers
• Network
Focus on
• Simplicity
• Redundancy
• Scale
• Availability
Transforms commodity equipment into a service that:
• HDFS: Stores petabytes of data reliably
• MapReduce: Allows huge distributed computations
Key Attributes
• Redundant and reliable: Doesn’t stop or lose data even as hardware fails
• Easy to program: Our rocket scientists use it directly!
• Very powerful: Allows the development of big data algorithms & tools
• Batch processing centric
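To make the MapReduce model above concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce steps, using word counting as the example. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are illustrative assumptions. On a real cluster the framework runs the map and reduce steps in parallel across many machines and handles the shuffle itself.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    logs = ["hadoop stores data", "hadoop processes data"]
    print(run_job(logs))
```

Because each map call sees only one record and each reduce call sees only one key, both steps can be spread across many commodity machines, which is what makes the model suitable for batch processing at scale.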
WHAT HADOOP ISN’T
• A replacement for relational and data warehouse systems
• A transactional / online / serving system
• A low latency or streaming solution
HADOOP IN THE ENTERPRISE
RDBMS, EDW, Data Marts
HADOOP CLUSTER(S)
Transactions, Structured Data
Business Applications
Web Logs, Server Logs, Social Media, etc.
Interactions, Semi-Structured or Unstructured Data
Business Intelligence Applications
HADOOP @ YAHOO!
HADOOP @ YAHOO! “Where Science meets Data”
HADOOP CLUSTERS
Tens of thousands of servers
PRODUCTS
• Data Analytics
• Content Optimization
• Content Enrichment
• Yahoo! Mail Anti-Spam
• Advertising Products
• Ad Optimization
• Ad Selection
• Big Data Processing & ETL
APPLIED SCIENCE
• User Interest Prediction
• Ad inventory prediction
• Machine learning - search ranking
• Machine learning - ad targeting
• Machine learning - spam filtering
10s of Petabytes
FROM PROJECT TO CORE PLATFORM
[Chart: thousands of servers and petabytes of storage by year, 2006 to 2010; phases: Research, Science Impact, Daily Production]
“Behind every click”
• 40K+ Servers
• 170 PB Storage
• 5M+ Monthly Jobs
HADOOP POWERS THE YAHOO! NETWORK
advertising optimization
ad selection
Yahoo! Homepage
machine learning search ranking
ad inventory prediction
Yahoo! Mail anti-spam
user interest prediction
audience, ad and search pipelines
advertising data systems
Content Optimization
data analytics
CASE STUDY YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
+160% clicks vs. one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected
Recommended links, News Interests, Top Searches
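Each of the click lifts above is measured against a different baseline. As a rough arithmetic sketch, a "+N%" lift can be turned into a ratio, and each baseline's click rate can then be expressed as a fraction of the personalized rate. This assumes the three baselines were measured on comparable traffic, which the slide does not state; the function names are illustrative.

```python
# Click lifts quoted on the slide: personalized vs. each baseline, in percent.
LIFTS = {"one size fits all": 160, "randomly selected": 79, "editor selected": 43}


def lift_to_ratio(lift_percent):
    """Convert a '+N%' lift into a multiplicative ratio (+100% means 2x)."""
    return 1 + lift_percent / 100


def baseline_share(lift_percent):
    """Baseline click rate as a fraction of the personalized click rate."""
    return 1 / lift_to_ratio(lift_percent)


if __name__ == "__main__":
    for name, lift in LIFTS.items():
        print(f"{name}: personalized is {lift_to_ratio(lift):.2f}x, "
              f"baseline is {baseline_share(lift):.0%} of personalized")
```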
CASE STUDY YAHOO! HOMEPAGE
SCIENCE HADOOP CLUSTER
» Machine learning to build ever better categorization models
• Weekly Categorization models
PRODUCTION HADOOP CLUSTER
» Identify user interests using Categorization models
• Five Minute Production
• Serving Maps: Users - Interests
SERVING SYSTEMS
Build customized home pages with latest data (thousands / second)
[Diagram labels: USER BEHAVIOR, CATEGORIZATION MODELS (weekly), SERVING MAPS (every 5 minutes), ENGAGED USERS]
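The division of labor on this slide (a slowly refreshed model, a frequently rebuilt serving map, and a fast lookup at page-build time) can be sketched as a toy Python example. Everything here is a hypothetical stand-in: the keyword-to-category `MODEL`, the click-log format, and the function names are assumptions made for illustration and do not describe Yahoo!'s actual models or systems.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a weekly categorization model: keyword -> category.
MODEL = {"election": "news", "playoffs": "sports", "earnings": "finance"}


def categorize(model, text):
    """Return the categories whose keywords appear in the text."""
    lowered = text.lower()
    return {category for keyword, category in model.items() if keyword in lowered}


def build_serving_map(model, click_log):
    """Frequent batch step: map each user to their most-clicked category."""
    counts = defaultdict(Counter)
    for user, clicked_text in click_log:
        for category in categorize(model, clicked_text):
            counts[user][category] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def pick_story(serving_map, user, stories, default="news"):
    """Serving step: a cheap lookup that picks a story for the user's interest."""
    interest = serving_map.get(user, default)
    return stories.get(interest, stories[default])


if __name__ == "__main__":
    log = [("u1", "Playoffs recap"), ("u1", "Playoffs schedule"), ("u2", "Earnings beat")]
    serving_map = build_serving_map(MODEL, log)
    stories = {"news": "Top headline", "sports": "Game report", "finance": "Market wrap"}
    print(pick_story(serving_map, "u1", stories))
```

The point of the split is that the expensive work (model building, interest computation) runs in batch, while the serving path only performs a dictionary lookup and so can run thousands of times per second.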
CASE STUDY YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mail boxes
• 5B+ deliveries/day
• Antispam models retrained every few hours on Hadoop
“40% less spam than Hotmail and 55% less spam than Gmail”
SCIENCE
PRODUCTION
YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date
Hadoop is not our business, but Hadoop is key to our business
• Yahoo! benefits from open source ecosystem around Hadoop
• Hadoop drives revenue at Yahoo! by making our core products better
We need Hadoop to be rock solid
• We invest heavily in core Hadoop development
• We focus on scalability, reliability, availability
We fix bugs before you see them
• We run very large clusters
• We have a large QA effort
• We run a huge variety of workloads
We are good Apache Hadoop citizens
• We contribute our work to Apache
• We share the exact code we run
HADOOP MOMENTUM
HADOOP IS GOING MAINSTREAM
[Timeline: 2007, 2008, 2009, 2010]
The Datagraph Blog
THE PLATFORM EFFECT
BIRTH OF AN ECOSYSTEM
and other Early Adopters
• Scale and productize Hadoop
Apache Hadoop
Orgs with Internet Scale Problems
• Add tools / frameworks, enhance Hadoop
Mainstream / Enterprise adoption
• Drive further development, enhancements
Enhance Hadoop Ecosystem
Service Providers
• Grow ecosystem - Training, support, enhancements
Virtuous Circle!
• Investment -> Adoption
• Adoption -> Investment
THE FUTURE OF HADOOP
WHAT’S NEXT
MAKING HADOOP ENTERPRISE-READY
Hadoop is far from “done”
• Current implementation is showing its age
• Need to address several deficiencies in scalability, flexibility, ease of use & performance
Yahoo! is working on Next Generation of Hadoop
• MapReduce: Rewrite to improve performance; pluggable support for new programming models
• HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
Apache should remain the hub of Hadoop ecosystem
• Yahoo! contributes all Hadoop changes back to Apache Hadoop
• Everyone benefits from shared neutral foundation
Questions?