Machine Learning and Hadoop
Present and Future
Josh Wills
Cloudera Data Science Team
February 7th, 2012
Today’s Speaker – Josh Wills
• [email protected]
• Formerly of Google (2008 – 2011)
  • Worked on the ad auction
  • Led the team that built the data infrastructure for Google+
• Before that: a bunch of startups
  • Sometimes as a software engineer, sometimes as a statistician
• Math degree from Duke and a half-finished PhD from The University of Texas at Austin
• Now: Director of Data Science at Cloudera
Copyright 2012 Cloudera Inc. All rights reserved
High Availability for Data Scientists
NIPS
Outline
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time
Industrial Machine Learning
Delta One: Model Evaluation
• Machine Learning is One Piece of a Complex System
  • Well-defined objective functions are the exception
    • Multiple, often conflicting goals
    • Weights are fuzzy and shift with business priorities
    • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
  • Examples
    • Computational advertising
    • Friend recommendations on social networks
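The "Pareto optimization" bullet can be made concrete with a small sketch. It assumes each candidate model has been scored on several goals where higher is better; the model names and the two metrics (revenue lift, user satisfaction) are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, ...]  # one score per goal; higher is better (assumption)


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is at least as good as b on every goal and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Metrics]) -> List[str]:
    """Return the sorted names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    )


# Hypothetical candidate models scored on (revenue lift, user satisfaction).
models = {
    "A": (0.10, 0.70),
    "B": (0.12, 0.65),
    "C": (0.09, 0.60),  # worse than A on both goals, so it drops out
    "D": (0.08, 0.80),
}
front = pareto_front(models)
```

Choosing among the models left on the front still requires a business decision, but no fuzzy weighting is needed to rule out the dominated ones.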
Delta Two: Systems Precede Algorithms
• Greenfield Projects Hardly Ever Happen
  • (and don’t usually launch)
• Industrial Computational Infrastructure
  • General-purpose
  • Cheap
  • Shared
• Constraints Drive Innovation
  • Vowpal Wabbit Hashing Trick
  • SETI @ Google
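As a rough illustration of the hashing trick mentioned above, the sketch below maps string features into a fixed-length vector without keeping a vocabulary in memory. It is not Vowpal Wabbit's implementation: the md5-based hash, the bucket count, the signed-hash variant, and the feature strings are all assumptions chosen for the example.

```python
import hashlib
from typing import Iterable, List


def _stable_hash(token: str) -> int:
    """Deterministic 64-bit hash of a token (md5 is a stand-in; assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens: Iterable[str], num_buckets: int = 16) -> List[float]:
    """Map tokens into a fixed-length vector without storing a vocabulary."""
    vec = [0.0] * num_buckets
    for token in tokens:
        h = _stable_hash(token)
        index = h % num_buckets  # bucket the token lands in
        # Use the top bit as a sign so colliding tokens tend to cancel
        # rather than pile up (a common variant, assumed here).
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[index] += sign
    return vec


# Hypothetical categorical features for one training example.
x = hash_features(["user=42", "ad=shoes", "hour=13"], num_buckets=16)
```

Memory use is bounded by `num_buckets` regardless of how many distinct features appear, which is the constraint-driven benefit the slide points to.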
Delta Three: Workflow
Practice Over Theory Blog
Delta Three: Workflow
• Optimize the Overall Process
  • Model fitting is a small piece of the overall flow time
  • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
  • Common Feature Extraction Logic
  • Servable Models
• Validation as Sanity Checking
  • Deploy to a small subset of real data and evaluate
Outline
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time
“Hadoop. It’s Where The Data Is.”
Hadoop Platform: Substrate
• Commodity servers
  • Open Compute
• Open source operating system
  • Linux
• Open source configuration management
  • Puppet
  • Chef
• Coordination service
  • ZooKeeper
Hadoop Platform: Storage
• Distributed schema-less storage
  • HDFS
  • Ceph
• Append-only storage formats and metadata
  • Avro
  • RCFile
  • HCatalog
• Mutable key-value storage and metadata
  • HBase
Hadoop Platform: Integration
• Tool Access
  • FUSE
  • JDBC
  • ODBC
• Data Ingestion
  • Flume
  • Sqoop
ML and Hadoop: The State of the World
Computation: Plain Old MapReduce
• Great for:
  • Data Preparation
  • Feature Engineering
  • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
  • Recommendation Systems
  • Expectation Maximization
  • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
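To show why plain MapReduce suits feature engineering, here is a minimal in-memory sketch of the map, shuffle, and reduce phases computing a per-ad click-through-rate feature. It is plain Python and does not use the Hadoop API; the log schema and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]   # (ad_id, clicked): a hypothetical log schema
Counts = Tuple[int, int]   # (clicks, impressions)


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, Counts]]:
    """Emit (key, (clicks, impressions)) for each log record."""
    for ad_id, clicked in records:
        yield ad_id, (clicked, 1)


def shuffle(pairs: Iterable[Tuple[str, Counts]]) -> Dict[str, List[Counts]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[Counts]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[Counts]]) -> Dict[str, float]:
    """Sum clicks and impressions per key and emit a click-through rate."""
    features = {}
    for key, values in groups.items():
        clicks = sum(c for c, _ in values)
        impressions = sum(n for _, n in values)
        features[key] = clicks / impressions
    return features


log = [("ad1", 1), ("ad2", 0), ("ad1", 0), ("ad1", 1), ("ad2", 0)]
ctr = reduce_phase(shuffle(map_phase(log)))
```

Each phase makes a single pass over the data, which fits aggregation-style features well; iterative or online learners would need to repeat the whole cycle per update, which is the limitation the slide notes.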
Tools for Data Preparation/Feature Engineering
• Languages/Environments
  • PigLatin
  • HiveQL
  • Need to deal with mismatch between offline/online feature generation
• Java/Scala APIs
  • Crunch (Cloudera)
  • Scoobi (NICTA)
  • Cascading (Concurrent)
  • Jaql (IBM)
Apache Mahout
• The starting place for MapReduce-based machine learning algorithms
  • Not machine-learning-in-a-box
  • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
  • Recommendations
  • Clustering
  • Classification
  • Frequent Itemset Mining
Apache Mahout (cont.)
• Best Library: Taste Recommender
  • Oldest project, most widely-deployed in production
  • SVD implementation is particularly active
• Good Libraries: Online SGD
  • Does not use MapReduce
  • Vowpal Wabbit is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
  • “Secret sauce” effect
  • Delta between Mahout and the cutting edge in ML
More Machine Learning Interfaces for Hadoop
• Based on MapReduce
  • SystemML (IBM)
• R-Based Systems (Augment MapReduce with R)
  • Segue
  • RHIPE
  • RHadoop
  • Ricardo (IBM)
ML and Hadoop: Where Things are Headed
MRv2 and YARN
• Eliminates JobTracker bottleneck
  • Separate Resource Manager/Scheduler
  • Individual jobs have their own task masters
  • No more map slots and reduce slots
• Moves MapReduce into user-land
  • Hadoop clusters can run all sorts of jobs
• Will also allow fine-grained resource allocation
  • CPU
  • Memory
  • Disk
YARN Job Flows
The Contenders
AllReduce
• Developed at Yahoo! Research
• Defines the allreduce operation
  • N machines each have a number => each machine has the sum of the numbers
• At the heart of Vowpal Wabbit’s performance
• Implemented in C++
• Can be patched into Apache Hadoop and used today
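The allreduce operation described above can be simulated on a single machine. The sketch below reduces values up a binary tree and then broadcasts the total back down; the array-ordered tree layout is an assumption for illustration and is not claimed to match the Vowpal Wabbit implementation.

```python
from typing import List


def allreduce_sum(values: List[float]) -> List[float]:
    """Simulate allreduce: every node ends up holding the sum of all inputs.

    Nodes form a binary tree in array order (node i has children 2i+1 and
    2i+2); this layout is an assumption made for the sketch.
    """
    n = len(values)
    partial = list(values)
    # Reduce phase: walk from the leaves to the root, adding each node's
    # partial sum into its parent. Children always have higher indices than
    # their parent, so each subtree is complete before it is passed up.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]
    # Broadcast phase: the root holds the total; copy it down the tree.
    result = list(partial)
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


local_values = [1.0, 2.0, 3.0, 4.0, 5.0]  # one number per simulated machine
after = allreduce_sum(local_values)        # each machine now holds 15.0
```

In a learning setting the "number" is typically a gradient or weight vector, so every machine can take the same update step without routing results through a separate reduce job.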
Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS
GraphLab
• Developed at CMU
• Lower-level primitives
  • (but higher than MPI)
• Map/Reduce => Update/Sync
• Flexible, allows for asynchronous computations*
• C++/Java/Python/Matlab
Outline
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time
Offline vs. Online Learning
Batch vs. Real-Time: The CAP Theorem
• Impossible for a distributed computer system to simultaneously provide all three of:
  • Consistency
  • Availability
  • Partition Tolerance
• Instead, we end up with BASE
  • Basically Available, Soft state, Eventual consistency
  • High availability
  • Cleanup mechanism for providing consistency (eventually)
Nathan Marz: Beating the CAP Theorem
Models as Queries
Collapsing Distinctions
Systems Drive Algorithms, Redux
Questions?
Want A [email protected]