seattle data geeks: hadoop and beyond
TRANSCRIPT
Paco Nathan – liber118.com/pxn/
@pacoid
“Hadoop and Beyond”
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Saturday, 13 July 13
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
First Principles
we are taught to think of computing resources in terms of Von Neumann architecture
in other words, we characterize the computing resources by CPU, RAM, I/O
First Principles
back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions to a Von Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
First Principles
a generation of computer scientists has been taught to think “relational” – data on a DB server
RDBMSs made sense, with their indexes, b-trees, normal forms, etc.
Q: need to query bigger data? A: simple, buy or lease a bigger DB server
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Q3 1997: inflection point
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack emerged from this
[diagram: RDBMS · Stakeholder · SQL Query / result sets · Excel pivot tables, PowerPoint slide decks · Web App · Customers · transactions · Product / strategy · Engineering / requirements · BI Analysts · optimized code]
Circa 1996: pre-inflection point
“throw it over the wall”
[diagram: RDBMS · SQL Query / result sets · recommenders + classifiers · Web Apps · customer transactions · Algorithmic Modeling · Logs · event history · aggregation · dashboards · Product · Engineering · UX · Stakeholder · Customers · DW · ETL · Middleware · servlets · models]
Circa 2001: post big-ecommerce successes
“data products”
Primary Sources
Amazon – “Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay – “The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search) – “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google – “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab – “Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Machine Data
Three broad categories of data – Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
Now let’s add IoT:
• A/D conversion for sensors
Data Products
Data Jujitsu – DJ Patil – O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams – DJ Patil – O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
[diagram: Workflow · RDBMS · near time / batch · services · transactions, content · social interactions · Web Apps, Mobile, etc. · History · Data Products · Customers · Log Events · In-Memory Data Grid · Hadoop, etc. · Cluster Scheduler · Prod · Eng · DW · Use Cases Across Topologies · s/w dev · data science · discovery + modeling · Planner · Ops · dashboard metrics · business process · optimized capacity taps · Data Scientist · App Dev · Domain Expert · introduced capability · existing SDLC]
Circa 2013: clusters everywhere
“optimize topologies”
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
Modeling
back in the day, we worked with practices based on data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
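That classical recipe can be sketched in a few lines – a toy with a hypothetical population, Python stdlib only:

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

# pretend this is "all the data": a hypothetical population of measurements
population = [random.gauss(100, 15) for _ in range(100_000)]

# 1. sample the data
sample = random.sample(population, 1_000)

# 2. fit the sample to a known distribution (here: a normal)
mu = statistics.mean(sample)
sigma = statistics.stdev(sample)
model = NormalDist(mu, sigma)

# 3. ignore the rest of the data
# 4. infer, based on that fitted distribution
p_above_130 = 1.0 - model.cdf(130)
print(f"mu={mu:.1f} sigma={sigma:.1f} P(X > 130)={p_above_130:.4f}")
```

The fitted model stands in for all the data that was thrown away – which is exactly the assumption that stops working once the data no longer fits one machine or one distribution.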
circa late 1990s: machine data, aggregation, clusters, etc. – algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
Two Cultures
“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures – Leo Breiman, 2001 – bit.ly/eUTh9L
this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Algorithmic Modeling
“The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
an overall history of data science: forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
Why Do Ensembles Matter?
[figure: “The World… per Data Modeling” vs. “The World…”]
Ensemblers of Fortune
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
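The intuition can be sketched with a toy regression ensemble (hypothetical models, not from the Netflix Prize): each model carries its own systematic bias, and averaging many of them cancels most of it:

```python
import random

random.seed(1)

def truth(x):
    return 3.0 * x + 2.0

def make_model(bias, noise):
    # each "model" tracks the truth plus its own systematic bias and noise
    def predict(x):
        return truth(x) + bias + random.gauss(0, noise)
    return predict

models = [make_model(random.gauss(0, 2), 1.0) for _ in range(100)]
xs = [i / 10.0 for i in range(100)]

def mse(predict):
    # mean squared error against the true function
    return sum((predict(x) - truth(x)) ** 2 for x in xs) / len(xs)

single_err = mse(models[0])
ensemble_err = mse(lambda x: sum(m(x) for m in models) / len(models))
print(f"single model mse={single_err:.3f}  ensemble mse={ensemble_err:.3f}")
```

The per-model biases are drawn around zero, so their average shrinks toward zero while the noise averages down as well – the "multiplicity of data models" Breiman described.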
Ensemble Learning: Better Predictions Through Diversity – Todd Holloway – ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensembler’s Tale – Lester Mackey – National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Q: Can I simply hire one rockstar data scientist to cover all this kind of work?
A: No, multi-disciplinary work requires teams.
A: Hire leads who speak the lingo of each domain.
A: Hire people who cover 2+ roles, when possible.
approximately 80% of the costs for data-related projects is spent on data preparation – mostly on cleaning up data quality issues in ETL, log files, etc., generally by socializing the problem
unfortunately, data-related budgets tend to go into frameworks which can only be used after that clean-up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
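A tiny sketch of the first skill – programmable data prep over a hypothetical messy CSV export, stdlib only:

```python
import csv
import io
from datetime import datetime

# hypothetical messy export: mixed date formats, stray whitespace, blank rows
raw = """user, signup_date , amount
alice , 2013-07-13, 19.99
bob,07/13/2013 ,5
 , ,
carol,2013-7-1,  12.50
"""

def parse_date(s):
    """Try each known format; return None for unparseable dates."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

reader = csv.reader(io.StringIO(raw))
header = [h.strip() for h in next(reader)]

rows = []
for rec in reader:
    rec = dict(zip(header, (field.strip() for field in rec)))
    if not rec.get("user"):
        continue                 # drop blank/partial rows
    when = parse_date(rec["signup_date"])
    if when is None:
        continue                 # or route to a failure trap for review
    rows.append((rec["user"], when, float(rec["amount"])))

print(rows)
```

The point is that the clean-up is code – versioned, testable, repeatable – rather than a one-off manual pass in a spreadsheet.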
d3js.org
What is needed most?
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way… however, both systems engineers and data scientists must
[diagram: Process · Variation · Data · Tools]
Statistical Thinking
Team Composition: Needs × Roles
[diagram: needs – business process / stakeholder; data prep, discovery, modeling, etc.; software engineering, automation; systems engineering, access – crossed with roles – Domain Expert, Data Scientist, App Dev, Ops – spanning discovery, modeling, integration, apps, systems; introduced capability]
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Culture
Notes from the Mystery Machine Bus – Steve Yegge, Google – goo.gl/SeRZa
consider these perspectives in light of Conway’s Law…
“conservatism” | “liberalism”
(mostly) Enterprise | (mostly) Start-Up
risk management | customer experiments
assurance | flexibility
well-defined schema | schema follows code
explicit configuration | convention
type-checking compiler | interpreted scripts
wants no surprises | wants no impediments
Java, Scala, Clojure, etc. | PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc. | Hive, Pig, Hadoop Streaming, etc.
Two Avenues to the App Layer…
[chart axes: scale ➞ vs. complexity ➞]
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Learning Curves
difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…
that approach leads to enormous teams and low ROI
ultimately, the challenge is about managing learning curves within a social context
Learning Curves vs. Technology Selections
ultimately, the challenge is about managing learning curves within a social context
[chart: est. cost of individual learning, initial impl (x-axis) vs. est. cost of team re-learning, lifecycle (y-axis)]
some technologies constrain the need to learn, while others accelerate re-learning of prior business logic… choose the latter, FTW!
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
Big Data?
we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1 m-resolution satellites – skyboximaging.com
• open resource monitoring – reddmetrics.com
• Sensing XChallenge – nokiasensingxchallenge.org
consider the implications of Nike, Jawbone, etc., plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than… the way we have Nagios instrumenting our web servers right now
technologyreview.com/...
Internet of Things
Business Disruption
Geoffrey Moore – Mohr Davidow Ventures, author of Crossing the Chasm – Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise
Michael Stonebraker – INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. – XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
Larry Page – CEO, Google – Wired, 2013: create products and services that are 10 times better than the competition… thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process
A Thought Exercise
consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network
they will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
that’s a $50B company, in a market segment worth $250B
upcoming: tractors as drones – guided by complex, distributed data apps
Operations Research – crunching amazing amounts of data
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Languages
JVM-based languages became popular for Big Data open source technologies:
• partly because YHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data practices, plus keep in mind that Google uses lots of C++
Functional Thinking – Neal Ford – youtu.be/plSZIkLodDM
Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.: functional programming to help reduce costs over time
technical debt? this is how an organization builds a culture to avoid it
Conway's Law corollary: model teams and communication based on properties of the desired architecture
“Out of the Tar Pit” – Moseley & Marks, 2006 – goo.gl/SKspn
“A relational model of data for large shared data banks” – Edgar Codd, 1970 – dl.acm.org/citation.cfm?id=362685
Rich Hickey – infoq.com/presentations/Simple-Made-Easy
Pattern Language
structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
[flow diagram: employee + quarterly sales → Join → Count → leads · bonus allocation · PMML classifier · Failure Traps]
A Pattern Language – Christopher Alexander, et al. – amazon.com/dp/0195019199
[conceptual flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M/R boundary shown between map and reduce; 1 map, 1 reduce; 18 lines of code – gist.github.com/3900702]
WordCount – conceptual flow diagram
cascading.org/category/impatient
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
WordCount – generated flow diagram
[generated DOT graph: [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt'] → Each('token')[RegexSplitGenerator[decl:'token'][args:1]] → (map/reduce boundary) → GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc'] → [tail]]
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String =>
      text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce software engineering costs at scale, over time
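For a rough comparison (my sketch, not from the talk), the same WordCount with token scrubbing also fits in a few lines of Python:

```python
import re
from collections import Counter

def scrub(raw):
    """Strip punctuation and lowercase; return None for empty tokens."""
    token = re.sub(r"[\[\]\(\),.]", "", raw).lower().strip()
    return token or None

def wordcount(lines):
    counts = Counter()
    for line in lines:
        for raw in line.split():
            token = scrub(raw)
            if token:
                counts[token] += 1
    return counts

doc = ["rain shadow, rain gauge.", "A rain (shadow)"]
wc = wordcount(doc)
print(wc.most_common(2))  # → [('rain', 3), ('shadow', 2)]
```

The line-count argument is about composition: the scrub step is an ordinary function that slots into the pipeline, rather than a UDF bolted onto a query language.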
Case Studies: LinkedIn and eBay
“Scalable and Flexible Machine Learning With Scala @ LinkedIn” – Vitaly Gordon, LinkedIn; Chris Severs, eBay
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
…be sure to read slides 8-16 !!
Lambda Architecture
Big Data – Nathan Marz, James Warren – Manning, 2013 – manning.com/marz
• batch layer (immutable data, idempotent ops)
• serving layer (to query batch)
• speed layer (transient, cached “real-time”)
• combining results
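A minimal sketch of how those layers combine, with hypothetical page-view events and an in-memory stand-in for each layer:

```python
# hypothetical page-view events: (url, epoch_seconds)
events = [("/a", 100), ("/b", 101), ("/a", 102), ("/a", 3600), ("/b", 3601)]

BATCH_HORIZON = 1000   # pretend the last batch run covered events with t < 1000

# batch layer: recomputed from scratch over the immutable master data
# (idempotent – rerunning it always yields the same result)
batch_views = {}
for url, t in events:
    if t < BATCH_HORIZON:
        batch_views[url] = batch_views.get(url, 0) + 1

# speed layer: transient incremental counts for events the batch
# has not absorbed yet; discarded after the next batch run
speed_views = {}
for url, t in events:
    if t >= BATCH_HORIZON:
        speed_views[url] = speed_views.get(url, 0) + 1

# serving layer: a query combines results from both layers
def views(url):
    return batch_views.get(url, 0) + speed_views.get(url, 0)

print(views("/a"), views("/b"))  # → 3 2
```

The division of labor is the point: the batch layer can be slow but exactly recomputable, while the speed layer only has to be right about the recent tail.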
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Where To Start?
having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale
along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities…
linear algebra – e.g., crunching algorithms efficiently for large-scale apps
graph theory – e.g., representation of problems in a calculable language
abstract algebra – e.g., probabilistic data structures in streaming analytics
topology – e.g., determining the underlying structure of the data
operations research – e.g., techniques for optimization… in other words, ROI
in a nutshell, most of what we do is to…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate dimension and complexity
‣ make use of learning theory
+ collaborate with DevOps, Stakeholders
+ reduce our work into cron entries
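The first two bullets, sketched with hypothetical log-derived counts (using a binomial approximation for the variance):

```python
import math

# hypothetical daily counts from a log-aggregation job: (purchases, sessions)
daily = [(12, 480), (19, 510), (9, 450), (15, 500), (11, 470)]

# estimate probability: pooled conversion rate
purchases = sum(p for p, _ in daily)
sessions = sum(s for _, s in daily)
p_hat = purchases / sessions

# calculate analytic variance of that estimate (binomial approximation),
# and a ~95% confidence interval to report alongside the point estimate
var = p_hat * (1 - p_hat) / sessions
ci = (p_hat - 1.96 * math.sqrt(var), p_hat + 1.96 * math.sqrt(var))
print(f"p_hat={p_hat:.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting the interval, not just the rate, is what “estimate the confidence for reported results” means in practice.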
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsMode
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
Where is the Science in Data Science?
techniques for manipulating order complexity:
dimensional reduction… with clustering as a common case
e.g., you may have 100 million HTML docs, but only ~10K useful keywords within them
low-dimensional structure, PCA, etc.
linear algebra techniques: eigenvalues, matrix factorization, etc.
this is an area ripe for much advancement in algorithms research, near-term
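A toy sketch of the idea: 2-D data that is really ~1-D, where the first principal component (computed here from the 2×2 covariance matrix, stdlib only) captures nearly all the variance:

```python
import math
import random

random.seed(7)

# hypothetical 2-D data that is really ~1-D: y ≈ 2x plus small noise
xs = [random.uniform(-1, 1) for _ in range(500)]
pts = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in xs]

# center the data
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
c = [(x - mx, y - my) for x, y in pts]

# 2x2 covariance matrix
n = len(c)
cxx = sum(x * x for x, _ in c) / n
cyy = sum(y * y for _, y in c) / n
cxy = sum(x * y for x, y in c) / n

# eigenvalues of [[cxx, cxy], [cxy, cyy]] via the quadratic formula
tr, det = cxx + cyy, cxx * cyy - cxy * cxy
disc = math.sqrt(tr * tr / 4.0 - det)
lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc

# fraction of variance captured by the first principal component
explained = lam1 / (lam1 + lam2)
print(f"variance explained by PC1: {explained:.4f}")
```

The 100-million-docs / ~10K-keywords case is the same picture in far more dimensions: most directions carry almost no variance and can be dropped.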
Dimension and Complexity
in general, apps alternate between learning patterns/rules and retrieving similar things…
statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future
machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis
supervised vs. unsupervised
arguably, optimization is a parent category
once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
Learning Theory
Algorithms
many algorithm libraries used today are based on implementations from back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough? – Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated algorithms work – as Breiman suggested in 2001 – which may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
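One of those probabilistic data structures, a Bloom filter, can be sketched in a few lines (toy parameters, SHA-256 as a stand-in hash family):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: constant-space set membership with
    false positives possible, false negatives impossible."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k bit positions from SHA-256 of (salt, item)
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for word in ("hadoop", "cascading", "scalding"):
    bf.add(word)

print("hadoop" in bf)  # → True (added items always report present)
```

The trade is exactly the streaming one: a fixed, tiny memory footprint in exchange for a tunable false-positive rate.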
Make It Sparse…
also, take a moment to check this out… (IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale, e.g., PCA, SVD, etc.
• numerically stable with efficient implementation on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
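The “tall-and-skinny” trick (TSQR) can be sketched in miniature: QR-factor each block of rows independently (the steps that would run as parallel map tasks), then QR the stacked R factors – the result matches a direct QR of the full matrix. A dense, single-machine toy, not the Hadoop implementation:

```python
import math
import random

def qr(rows):
    """Thin QR via modified Gram-Schmidt; rows is a list of lists."""
    m, n = len(rows), len(rows[0])
    Q = [row[:] for row in rows]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j):
            R[k][j] = sum(Q[i][k] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[k][j] * Q[i][k]
        R[j][j] = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] /= R[j][j]
    return Q, R

random.seed(0)
m, n = 1000, 3                      # "tall and skinny": many rows, few columns
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

# TSQR: QR each block of rows independently (these would be the parallel
# map tasks on a cluster), then QR the stacked R factors
stacked = []
for start in range(0, m, 250):
    _, R_block = qr(A[start:start + 250])
    stacked.extend(R_block)
_, R_tsqr = qr(stacked)

# compare with R from a direct QR of the full matrix: both are the
# unique upper-triangular factor with positive diagonal
_, R_direct = qr(A)
err = max(abs(R_tsqr[i][j] - R_direct[i][j])
          for i in range(n) for j in range(n))
print(err)
```

Only the tiny n×n R factors ever need to be shuffled between tasks, which is why the scheme maps so cleanly onto Hadoop.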
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
Sparse Matrix Collection
for when you really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection – cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida – cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research – www2.research.att.com/~yifanhu/
A Winning Approach…
consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… that works much, much better than sampling!
1. real-world data ⇒
2. graph theory for representation ⇒
3. sparse matrix factorization for production work ⇒
4. cost-effective parallel processing for machine learning app at scale
Suggested Reading
A Few Useful Things to Know about Machine Learning – Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Probabilistic Data Structures for Web Analytics and Data Mining – Ilya Katsov, Grid Dynamics
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[diagram: data sources → ETL → data prep → predictive model → end uses]
ANSI SQL for ETL and SAS for predictive models account for most of the licensing costs…
J2EE for business logic accounts for most of the project costs…
Anatomy of an Enterprise app
[diagram: source taps for Cassandra, JDBC, Splunk, etc. → ETL (Lingual: DW → ANSI SQL) → data prep → predictive model (Pattern: SAS, R, etc. → PMML) → end uses; business logic in Java, Clojure, Scala, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all… cascading.org
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
80Saturday, 13 July 13
a compiler sees it all…
Anatomy of an Enterprise app
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
81Saturday, 13 July 13
cascading.org
Anatomy of an Enterprise app
visual collaboration for the business logic is a great way to improve how teams work together
Failure Traps
bonus allocation
employee
PMML classifier
quarterly sales
Join
Count
leads
82Saturday, 13 July 13
Anatomy of an Enterprise app
multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster…
business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
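To make the "common space (DAG)" idea concrete, here is a minimal sketch – plain Java, not the Cascading API – of a workflow held as a DAG and the kind of global view a flow planner gets: a topological ordering of the steps. The step names mirror the slide's bonus-allocation example and are purely illustrative.

```java
import java.util.*;

public class FlowSketch {
    // Record an edge in the flow graph: step "from" feeds step "to".
    static void edge(Map<String, List<String>> g, String from, String to) {
        g.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        g.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Kahn's algorithm: order steps so every step runs after its inputs.
    static List<String> topoSort(Map<String, List<String>> g) {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        for (String k : g.keySet()) indegree.put(k, 0);
        for (List<String> outs : g.values())
            for (String t : outs) indegree.merge(t, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((k, d) -> { if (d == 0) ready.add(k); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String t : g.get(n))
                if (indegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        return order;
    }

    // Build the slide's example flow and plan it.
    static List<String> plan() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        edge(g, "employee", "join");
        edge(g, "quarterly sales", "join");
        edge(g, "join", "count");
        edge(g, "count", "bonus allocation");
        return topoSort(g);
    }

    public static void main(String[] args) {
        // → [employee, quarterly sales, join, count, bonus allocation]
        System.out.println(plan());
    }
}
```

Because the whole flow lives in one graph, a planner can do global work – reorder, fuse, or optimize steps, attach failure traps, audit the whole pipeline – which is exactly what per-department scripts glued together by cron cannot offer.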
cascading.org
83Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
84Saturday, 13 July 13
Clusters
a little secret: people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached, for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler to manage, but terrible for utilization
leveraging VMs and various notions of “cloud” helps
Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” ⇒ All your workloads are belong to us
regardless of how architectures change, death and taxes will endure: servers fail, and data must move
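A toy calculation (illustrative numbers, not from the talk) shows why one workload class per cluster is terrible for utilization: each dedicated cluster must be provisioned for its own peak, while a shared cluster only needs capacity for the peak of the combined load – and workloads whose peaks don't coincide (web by day, batch by night) leave most dedicated capacity idle.

```java
public class Utilization {
    // Hourly CPU demand (arbitrary units) for three workloads whose
    // peaks don't coincide – e.g., web traffic by day, batch at night.
    static int[] rails     = {60, 80, 90, 70, 40, 20};
    static int[] memcached = {30, 40, 50, 40, 20, 10};
    static int[] hadoop    = {10, 10, 20, 40, 90, 95};

    static int peak(int[] xs) {
        int m = 0;
        for (int x : xs) m = Math.max(m, x);
        return m;
    }

    // Dedicated clusters: provision each one for its own peak.
    static int dedicatedCapacity() {
        return peak(rails) + peak(memcached) + peak(hadoop);
    }

    // Shared cluster: provision for the peak of the combined load.
    static int sharedCapacity() {
        int[] combined = new int[rails.length];
        for (int t = 0; t < combined.length; t++)
            combined[t] = rails[t] + memcached[t] + hadoop[t];
        return peak(combined);
    }

    public static void main(String[] args) {
        System.out.println("dedicated: " + dedicatedCapacity()); // 235
        System.out.println("shared:    " + sharedCapacity());    // 160
    }
}
```

With these made-up numbers, mixing the workloads on one scheduler needs roughly a third less capacity – the economics behind Borg and Mesos, discussed a few slides ahead.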
Google Data Center, Fox News
~2002
85Saturday, 13 July 13
Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q: what kinds of evolution in topologies could this imply?
86Saturday, 13 July 13
Topologies
Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
87Saturday, 13 July 13
Some Topologies Beyond Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
Greenplum (MPP)
SciDB (array database)
88Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
89Saturday, 13 July 13
Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers
John Wilkes, et al. – Borg/Omega: “10x” secret sauce – youtu.be/0ZFMlO98Jkc
[Figure: CPU load over time for Rails, Memcached, and Hadoop on dedicated clusters, versus the combined CPU load (Rails, Memcached, Hadoop) on a shared cluster]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg – goo.gl/jPtTP
90Saturday, 13 July 13
[Diagram: use cases across topologies – RDBMS, Log Events, In-Memory Data Grid, Hadoop etc., and a Cluster Scheduler; Web Apps/Mobile deliver Data Products to Customers; Prod, Eng, DW teams and Data Scientist, App Dev, Ops, Domain Expert roles span data science (discovery + modeling), s/w dev, dashboard metrics, and business process]
Circa 2013: clusters everywhere – Four-Part Harmony
91Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
92Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
93Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing middleware, e.g., Cascading
94Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
4. Distributed OS, e.g., Mesos
95Saturday, 13 July 13
Enterprise Data Workflows with Cascading
O’Reilly, 2013 – shop.oreilly.com/product/0636920028536.do
Further study…
workshops and newsletter updates:
liber118.com/pxn/
96Saturday, 13 July 13