Hadoop: A View from the Trenches
Milind Bhandarkar, Chief Scientist, Pivotal
Twitter: @techmilind
© Copyright 2013 Pivotal. All rights reserved.
About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly EMC-Greenplum)
First, technology is good. Then it gets bad. Then it gets stable. - Alistair Croll (http://strata.oreilly.com/2013/01/data-warefare.html)
W-1-W
• WebMap: Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop
Lessons Learned
• Multi-Tenancy from ground-up
• Agility in lieu of Performance
• Provisioning vs Procurement
• “Weird” use cases as learning experience
• Academic collaboration
Big Data Landscape (June 2012): http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
Hadoop Ecosystem (January 2013): http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
Hadoop Economics is Game Changer
[Chart: Big Data Platform Price/TB, 2008 to 2013; y-axis from $0 to $80,000; series: Big Data DB, Hadoop]
“Typical” Hadoop Use-Case
• “User” Modeling
• Objective: Determine User-Interests by mining user-activities
• Large dimensionality of possible user activities
• Typical user has sparse activity vector
• Event attributes change over time
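
The sparsity point above can be illustrated with a minimal Python sketch. The activity names and the dictionary-based representation are assumptions made for illustration, not something taken from the talk: each user stores only the activities actually observed, so storage grows with the number of distinct activities per user rather than with the dimensionality of the full activity space.

```python
from collections import Counter

def activity_vector(events):
    """Build a sparse activity vector: only observed activities are stored."""
    return Counter(events)

def dot(u, v):
    """Sparse dot product: iterate over the smaller vector only."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v[key] for key, count in u.items() if key in v)

# Hypothetical event streams for two users (names are illustrative only)
alice = activity_vector(["click:sports", "click:sports", "search:shoes"])
bob = activity_vector(["click:sports", "purchase:shoes"])

similarity = dot(alice, bob)  # overlap on shared activities only
```

A dense vector over every possible activity would be mostly zeros for a typical user, which is why a keyed representation like this one is a common choice.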
Domain: Retail
• User = Customer
• Activities
  – Online: Purchase, Ad click, FB Likes
  – Offline: Brick-and-mortar purchases, returns, coupon clipping, gift cards
• Personalized Product Recommendation
Domain: IT Infrastructure
• “User” = HW & SW Components
• Activities
  – Log messages, Metrics, connectivity, communication events
• Goal: Proactive alerting of imminent failures
Domain: Healthcare
• User = Patient
• Activities
  – Doctor Visits, Medicine refills, Medical History
  – 3G/WiFi-enabled Pillbox...
• Goal: Prevent Hospital Readmissions
Domain: Telecom
• User = Subscriber
• Activities
  – Calls made, duration, calls dropped, locations, ...
  – “social” graph, status updates
• Goal: Reduce customer churn
Domain: Ad-Supported Web
• User = User :-)
• Activities
  – Clicks on content, Likes, Repost
  – Search Queries, Comments, Participation
• Goal: Increase Engagement, Increase Clicks on revenue-generating content (ads/premium content)
User-Modeling Pipeline
• Sessionization
• Feature and Target Generation
• Model Training
• Offline Scoring & Evaluation
• Batch Scoring & Upload to serving
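
As an illustration of the first pipeline stage, here is a minimal Python sketch of sessionization. The slides do not define how sessions are delimited, so the 30-minute inactivity timeout, the `(user_id, timestamp, activity)` event layout, and the function name are all assumptions for this example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the talk

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, activity) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user_id, ts, activity in events:
        by_user[user_id].append((ts, activity))

    sessions = defaultdict(list)
    for user_id, user_events in by_user.items():
        user_events.sort()  # order each user's events by timestamp
        current = [user_events[0]]
        for prev, curr in zip(user_events, user_events[1:]):
            if curr[0] - prev[0] > gap:
                sessions[user_id].append(current)  # close the session
                current = []
            current.append(curr)
        sessions[user_id].append(current)
    return dict(sessions)

# Hypothetical events: timestamps are seconds since an arbitrary epoch
events = [
    ("u1", 0, "search"),
    ("u1", 600, "click"),
    ("u1", 5000, "purchase"),
    ("u2", 100, "click"),
]
result = sessionize(events)
```

In a MapReduce setting, the per-user grouping would typically be handled by the shuffle with the user ID as the key, leaving the gap-splitting logic to the reducer; the in-memory grouping here only stands in for that step.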
Storage Wars
• HDFS
• KosmosFS, LocalFS, Quantcast FS, S3
• MapR
• GPFS, Isilon, Atmos, Swift, NetApp
• Lustre, Gluster, Ceph, PanFS, PVFS
• EMC ViPR
NoSQL = Not Yet SQL?
• Pivotal HAWQ
• Cloudera Impala
• Apache Drill, Spire (Drawn to Scale)
• Cascading Lingual, Optiq
• Hortonworks Stinger
• More to come...
Prepare for Convergence
• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks
• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency
• Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization
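
The slide does not say which sharding flavors are meant. As one possible reading, the sketch below contrasts two common ones, hash sharding and range sharding. The function names, the use of MD5 as the hash, and the split points are all assumptions chosen for illustration.

```python
import bisect
import hashlib

def hash_shard(key, num_shards):
    """Hash sharding: spreads keys evenly, but a range scan touches every shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(key, split_points):
    """Range sharding: keeps adjacent keys together; split_points must be sorted."""
    return bisect.bisect_right(split_points, key)

splits = ["g", "n", "t"]  # hypothetical boundaries, giving 4 shards

first = range_shard("apple", splits)   # sorts before "g"
second = range_shard("hadoop", splits)  # falls between "g" and "n"
last = range_shard("zebra", splits)     # sorts after "t"
```

Hash sharding trades locality for balance, while range sharding does the reverse and can develop hot spots when keys cluster; which one fits depends on the access pattern.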
Convergence
• Resource Allocation, Scheduling, Lifecycle Management
• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs
• Authentication & Authorization, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering
Hadoop As A Service
• Hadoop Platform-As-A-Service
  – EMR competitor proliferation
  – OpenStack, CloudStack, Joyent...
• Application-As-A-Service (Hadoop Inside)
  – Cetas, Continuuity, Causata, Claritics, Tresata, Wibidata, ...
• Pivotal One
  – CloudFoundry, Hadoop, HAWQ, Analytics
  – Spring, Redis, RabbitMQ
New Hardware Platforms
• Mellanox: Hadoop Acceleration through Network Levitated Merge
• RoCE: Brocade, Cisco, Extreme, Arista...
• ARM: Low power Hadoop servers
• SSD: Velobit, Violin, FusionIO, Samsung...
• Niche: Compression, Encryption...
IaaS as the new Hardware
• AWS, GCE, Azure
• vSphere, OpenStack
• Easy Provisioning
• Scalable
• Elastic
• Ubiquitous
• Needs bundling with Data & Analytics as Services
Big Data Platform of Future?
[Diagram: deploy to Public Cloud, Private Cloud, On Premise]