Apache Hadoop India Summit 2011 talk "Feeds processing at Yahoo!" by Jean-Christophe...
Post on 24-May-2015
Feeds processing at Yahoo!
One Platform, One Hadoop, Two Systems
Yahoo! Inc.
Apache Hadoop India Summit
16th February 2011
Agenda
Pacman: Design, Contributions
The small feeds problem
Pepper: Requirements, Design
Production numbers
Cover the whole spectrum
Examples of processing
Conclusion
Pacman
Started in 2006 in Bangalore
Processes large feeds: millions of records in a few hours
Multi-tenant
Reliability, operability
Uses Hadoop M/R; one record is the unit of processing
Workflow semantics over Hadoop
Workflow defined by a DAG
Each node's result is stored in HDFS ‘Channels’
Feeds-processing-oriented API, abstracting M/R
High availability, cross-colo replication of HDFS data
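The workflow semantics above (a DAG of nodes, each node's result stored in a channel, one record as the unit of processing) can be illustrated with a small sketch. This is not the Pacman API: `run_workflow`, `per_record`, and the node names are hypothetical, and a plain dictionary stands in for the HDFS channels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def per_record(fn):
    """Lift a record-level function (record -> list of records) to a node step."""
    def step(inputs):
        return [out for records in inputs for rec in records for out in fn(rec)]
    return step


def run_workflow(dag, steps, feed):
    """Run every node once, in dependency order.

    dag   : {node: set of upstream nodes}
    steps : {node: function(list of upstream outputs) -> list of records}
    feed  : input records, handed to nodes that have no upstream node
    """
    channels = {}  # stand-in for HDFS 'Channels': node name -> stored result
    for node in TopologicalSorter(dag).static_order():
        upstream = dag.get(node, set())
        inputs = [channels[u] for u in sorted(upstream)] if upstream else [feed]
        channels[node] = steps[node](inputs)
    return channels


# Hypothetical three-node workflow: validate -> normalize -> categorize.
dag = {"validate": set(), "normalize": {"validate"}, "categorize": {"normalize"}}
steps = {
    "validate": per_record(lambda r: [r] if "title" in r else []),
    "normalize": per_record(lambda r: [{**r, "title": r["title"].strip().lower()}]),
    "categorize": per_record(
        lambda r: [{**r, "category": "autos" if "car" in r["title"] else "other"}]
    ),
}
feed = [{"title": "  Used CAR  "}, {"price": 10}, {"title": "Flat"}]
channels = run_workflow(dag, steps, feed)
```

Keeping every intermediate result in a channel means a downstream node only needs the names of its upstream nodes, which is what lets each node run as a separate job.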
Design
Notification
Asynchronous processing
One Job for each WF node
State in DB
Feed copied on the Grid
Reporting service exposes metrics and logs
[Architecture diagram]
Components: Feeds Archive, Receiver, Deployment Service, Workflow Executor, Core DB, Reporting Service, Pacman Grid, Hadoop, HDFS, Pebls/UDF, Admin User
Numbered steps shown in the diagram:
1 : Deploy WF, deploy native pkg
2 : Large feeds notify
3 : Store notification
4 : Launch WF
5 : Send job and wait for notification (for each WF node)
6 : Read feeds
7 : Send instrumentation data (for each WF node)
9 : Read logs
Contributions
Multiple Output files for a Job
Counters
Chaining of Maps
Led to open-sourced Oozie
The small feeds problem
More and more small feeds onboarded (NPC, OMG, Green…)
Overhead of Pacman is high (Hadoop, DB…)
Too many small files on HDFS
Solution: process the nodes of a workflow in a web server farm
Lack of isolation between executions
Native libraries management
Operability issues (provisioning, …)
Pepper requirements
Support all properties: News, Finance, Travel, …
Scalable (millions of feeds a day), Elastic
Isolation, Multiple Native Libraries versions
Low overhead (<5s)
Compatible with Pacman API
Reuse Pacman code/infrastructure as much as possible
Pepper
Servlet Model
Synchronous in-memory execution of the workflow (very fast)
No use of HDFS
Share Pacman API and infrastructure
Hadoop
Reporting, Deployment…
Cloud-like qualities
Elastic, Scalable
Isolation
Design
Embedded Jetty server runs in a Map task and registers with ZooKeeper
1 Hadoop job = 1 Map task = 1 Web Server = 1 WebApp = 1 Workflow
Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate Web Server
[Architecture diagram]
Components: Proxy Router, ZooKeeper, Hadoop JT, HDFS, Job Manager, Map Web Engine, Admin User
Numbered steps shown in the diagram:
1 : Register webapp
2 : Copy webapp
3 : Add webapp node
4 : Send job
5 : Create host entry
6 : Send request (synchronous)
7 : Read avail. entries
9 : Send request (synchronous)
10 : Copy logs
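The registration and lookup flow of this design can be sketched as follows. This is only an illustration under stated assumptions: `Registry` is an in-memory stand-in for ZooKeeper, `send` is a hypothetical transport function, the router forwards the request rather than issuing a redirect, and picking a server at random is an assumed policy that the slides do not specify.

```python
import random


class Registry:
    """In-memory stand-in for ZooKeeper: workflow name -> live web server entries."""

    def __init__(self):
        self._entries = {}

    def register(self, workflow, host):
        # Called when the web server embedded in a Map task starts up.
        self._entries.setdefault(workflow, []).append(host)

    def available(self, workflow):
        return list(self._entries.get(workflow, []))


class ProxyRouter:
    """Looks up the servers of a workflow and sends the request to one of them."""

    def __init__(self, registry, send, choose=random.choice):
        self.registry = registry
        self.send = send      # hypothetical: send(host, payload) -> response
        self.choose = choose  # assumed selection policy

    def handle(self, workflow, payload):
        hosts = self.registry.available(workflow)
        if not hosts:
            raise LookupError(f"no web server registered for {workflow!r}")
        return self.send(self.choose(hosts), payload)


# Usage with made-up names; the first registered host is chosen deterministically.
registry = Registry()
registry.register("news-ingest", "host-a:8080")
registry.register("news-ingest", "host-b:8080")
router = ProxyRouter(
    registry,
    send=lambda host, payload: f"{host} processed {payload}",
    choose=lambda hosts: hosts[0],
)
response = router.handle("news-ingest", "feed-1")
```

Because servers announce themselves in the registry, adding capacity for a workflow amounts to starting another job; the router needs no configuration change.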
Production numbers
System   Burst rate (requests/min)   Throughput (requests/day)   Platform latency (avg.)   Response time (avg.)
Pepper   2,000                       3 million                   75 ms                     4 s
Pacman   50                          10,000                      90 s                      120 s
Qualified with a simple workflow on a cluster of 3 Hadoop slaves
Production numbers
Pacman :
20+ solutions (Autos, Real Estate, Deals…)
150,000 feeds
250 requests/h
200 million listings processed/week
Pepper :
News, Finance, NPC
600,000 feeds
10,000 requests/h… for now
20-slave Hadoop cluster (x2 colos)
Cover the whole spectrum
Clever switch between the 2 systems
Choice can be made upfront
‘Sticky’ feeds go to Pacman
Feeds larger than 2 MB go to Pacman
Failed feeds in Pepper are redirected to Pacman
OutOfMemory
TimeOut
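The switching rules on this slide can be written down as a small routing function. This is a sketch under stated assumptions: the `Feed` type, the function names, and the use of Python's `MemoryError` and `TimeoutError` to model the OutOfMemory and TimeOut failures are all illustrative, and reading 2 MB as 2 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

MAX_PEPPER_BYTES = 2 * 1024 * 1024  # 2 MB threshold; binary megabytes are assumed


@dataclass
class Feed:
    name: str
    size_bytes: int
    sticky: bool = False


def choose_system(feed):
    """Upfront choice: sticky or large feeds go to Pacman, the rest to Pepper."""
    if feed.sticky or feed.size_bytes > MAX_PEPPER_BYTES:
        return "pacman"
    return "pepper"


def process(feed, run_pepper, run_pacman):
    """Try the chosen system; a Pepper failure is redirected to Pacman."""
    if choose_system(feed) == "pacman":
        return "pacman", run_pacman(feed)
    try:
        return "pepper", run_pepper(feed)
    except (MemoryError, TimeoutError):
        # Stand-ins for the OutOfMemory and TimeOut failures named on the slide.
        return "pacman", run_pacman(feed)


def always_times_out(feed):
    raise TimeoutError("simulated Pepper timeout")


small = Feed("small-news", 10_000)
large = Feed("large-listings", 5 * 1024 * 1024)
sticky = Feed("sticky-feed", 10_000, sticky=True)
fallback = process(small, always_times_out, lambda f: "done")
```

With this shape the caller never needs to know which system handled a feed: small feeds get the fast path, and everything else still completes in Pacman.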
Example of processing
Validation against schema
Filtering (Security), Image resizing
Send images to edge serving
Reformat to common model
Simple (in-line) enrichments
Categorization
Geocoding
Entity Recognition
Clustering
Conclusion
One common platform (Deployment, Reporting…)
Covers the whole spectrum of feeds
Share same Hadoop cluster
Very generic concepts
Pacman : Workflow engine
Pepper : Serving cloud on top of Hadoop
Pepper future work
On-demand allocation of servers
Async NIO between Proxy Router & Map Web Engine to increase scalability
Improving distribution of requests across web servers
Follow Hadoop roadmap
References
Oozie
http://yahoo.github.com/oozie/
http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/
Pepper
http://yahoo.github.com/pepper/ (new !!)
http://www.computer.org/portal/web/csdl/doi/10.1109/CloudCom.2010.39
http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/Pepper%20An%20Elastic%20Web%20Server%20Farm%20for%20Cloud%20based%20on%20Hadoop.pdf
Questions?