seattle data geeks: hadoop and beyond
TRANSCRIPT
Paco Nathan – liber118.com/pxn/
@pacoid
“Hadoop and Beyond”
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Saturday, 13 July 13
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
First Principles
we are taught to think of computing resources in terms of Von Neumann architecture
in other words, we characterize the computing resources by CPU, RAM, I/O
First Principles
back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions to a Von Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
First Principles
a generation of computer scientists has been taught to think “relational” – data on a DB server
RDBMSs made sense, with their indexes, b-trees, normal forms, etc.
Q: need to query bigger data? A: simple, buy or lease a bigger DB server
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Q3 1997: inflection point
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack emerged from this
[diagram: RDBMS · Stakeholder · SQL Query / result sets · Excel pivot tables, PowerPoint slide decks · Web App · Customers · transactions · Product / strategy · Engineering / requirements · BI Analysts · optimized code]
Circa 1996: pre-inflection point
“throw it over the wall”
[diagram: RDBMS · SQL Query / result sets · recommenders + classifiers · Web Apps · customer transactions · Algorithmic Modeling · Logs · event history · aggregation · dashboards · Product · Engineering · UX · Stakeholder · Customers · DW · ETL · Middleware · servlets · models]
Circa 2001: post big-ecommerce successes
“data products”
Primary Sources
Amazon – “Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay – “The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search) – “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google – “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab – “Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Machine Data
Three broad categories of data – Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
Now let’s add IoT:
• A/D conversion for sensors
Data Products
Data Jujitsu – DJ Patil – O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams – DJ Patil – O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
[diagram: Workflow · RDBMS · near time / batch · services · transactions, content · social interactions · Web Apps, Mobile, etc. · History · Data Products · Customers · Log Events · In-Memory Data Grid · Hadoop, etc. · Cluster Scheduler · Prod · Eng · DW · Use Cases Across Topologies · s/w dev · data science · discovery + modeling · Planner · Ops · dashboard metrics · business process · optimized capacity taps · Data Scientist · App Dev · Domain Expert · introduced capability · existing SDLC]
Circa 2013: clusters everywhere
“optimize topologies”
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
Modeling
back in the day, we worked with practices based on data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
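That classical recipe can be sketched in a few lines – a toy with a hypothetical population, Python stdlib only:

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

# pretend this is "all the data": a hypothetical population of measurements
population = [random.gauss(100, 15) for _ in range(100_000)]

# 1. sample the data
sample = random.sample(population, 1_000)

# 2. fit the sample to a known distribution (here: a normal)
mu = statistics.mean(sample)
sigma = statistics.stdev(sample)
model = NormalDist(mu, sigma)

# 3. ignore the rest of the data
# 4. infer, based on that fitted distribution
p_above_130 = 1.0 - model.cdf(130)
print(f"mu={mu:.1f} sigma={sigma:.1f} P(X > 130)={p_above_130:.4f}")
```

The fitted model stands in for all the data that was thrown away – which is exactly the assumption that stops working once the data no longer fits one machine or one distribution.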
circa late 1990s: machine data, aggregation, clusters, etc. – algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
Two Cultures
“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures – Leo Breiman, 2001 – bit.ly/eUTh9L
this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Algorithmic Modeling
“The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
an overall history of data science: forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
Why Do Ensembles Matter?
[figure: “The World… per Data Modeling” vs. “The World…”]
Ensemblers of Fortune
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
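The intuition can be sketched with a toy regression ensemble (hypothetical models, not from the Netflix Prize): each model carries its own systematic bias, and averaging many of them cancels most of it:

```python
import random

random.seed(1)

def truth(x):
    return 3.0 * x + 2.0

def make_model(bias, noise):
    # each "model" tracks the truth plus its own systematic bias and noise
    def predict(x):
        return truth(x) + bias + random.gauss(0, noise)
    return predict

models = [make_model(random.gauss(0, 2), 1.0) for _ in range(100)]
xs = [i / 10.0 for i in range(100)]

def mse(predict):
    # mean squared error against the true function
    return sum((predict(x) - truth(x)) ** 2 for x in xs) / len(xs)

single_err = mse(models[0])
ensemble_err = mse(lambda x: sum(m(x) for m in models) / len(models))
print(f"single model mse={single_err:.3f}  ensemble mse={ensemble_err:.3f}")
```

The per-model biases are drawn around zero, so their average shrinks toward zero while the noise averages down as well – the "multiplicity of data models" Breiman described.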
Ensemble Learning: Better Predictions Through Diversity – Todd Holloway – ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensembler’s Tale – Lester Mackey – National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Q: Can I simply hire one rockstar data scientist to cover all this kind of work?
A: No, multi-disciplinary work requires teams.
A: Hire leads who speak the lingo of each domain.
A: Hire people who cover 2+ roles, when possible.
approximately 80% of the costs for data-related projects is spent on data preparation – mostly on cleaning up data quality issues in ETL, log files, etc., generally by socializing the problem
unfortunately, data-related budgets tend to go into frameworks which can only be used after that clean-up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
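A tiny sketch of the first skill – programmable data prep over a hypothetical messy CSV export, stdlib only:

```python
import csv
import io
from datetime import datetime

# hypothetical messy export: mixed date formats, stray whitespace, blank rows
raw = """user, signup_date , amount
alice , 2013-07-13, 19.99
bob,07/13/2013 ,5
 , ,
carol,2013-7-1,  12.50
"""

def parse_date(s):
    """Try each known format; return None for unparseable dates."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

reader = csv.reader(io.StringIO(raw))
header = [h.strip() for h in next(reader)]

rows = []
for rec in reader:
    rec = dict(zip(header, (field.strip() for field in rec)))
    if not rec.get("user"):
        continue                 # drop blank/partial rows
    when = parse_date(rec["signup_date"])
    if when is None:
        continue                 # or route to a failure trap for review
    rows.append((rec["user"], when, float(rec["amount"])))

print(rows)
```

The point is that the clean-up is code – versioned, testable, repeatable – rather than a one-off manual pass in a spreadsheet.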
d3js.org
What is needed most?
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way… however, both systems engineers and data scientists must
[diagram: Process · Variation · Data · Tools]
Statistical Thinking
Team Composition: Needs × Roles
[diagram: needs – business process / stakeholder; data prep, discovery, modeling, etc.; software engineering, automation; systems engineering, access – crossed with roles – Domain Expert, Data Scientist, App Dev, Ops – spanning discovery, modeling, integration, apps, systems; introduced capability]
issues confronted:
“Data becomes too complex for ONE computer, ONE model, ONE expert…”
trends observed:
“Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…”
“The management problem is about multi-disciplinary teams and learning curves…”
Culture
Notes from the Mystery Machine Bus – Steve Yegge, Google – goo.gl/SeRZa
consider these perspectives in light of Conway’s Law…
“conservatism” | “liberalism”
(mostly) Enterprise | (mostly) Start-Up
risk management | customer experiments
assurance | flexibility
well-defined schema | schema follows code
explicit configuration | convention
type-checking compiler | interpreted scripts
wants no surprises | wants no impediments
Java, Scala, Clojure, etc. | PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc. | Hive, Pig, Hadoop Streaming, etc.
Two Avenues to the App Layer…
[chart axes: scale ➞ vs. complexity ➞]
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Learning Curves
difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…
that approach leads to enormous teams and low ROI
ultimately, the challenge is about managing learning curves within a social context
Learning Curves vs. Technology Selections
ultimately, the challenge is about managing learning curves within a social context
[chart: est. cost of individual learning, initial impl (x-axis) vs. est. cost of team re-learning, lifecycle (y-axis)]
some technologies constrain the need to learn, while others accelerate re-learning of prior business logic… choose the latter, FTW!
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
Big Data?
we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1 m-resolution satellites – skyboximaging.com
• open resource monitoring – reddmetrics.com
• Sensing XChallenge – nokiasensingxchallenge.org
consider the implications of Nike, Jawbone, etc., plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than… the way we have Nagios instrumenting our web servers right now
technologyreview.com/...
Internet of Things
Business Disruption
Geoffrey Moore – Mohr Davidow Ventures, author of Crossing the Chasm – Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise
Michael Stonebraker – INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. – XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
Larry Page – CEO, Google – Wired, 2013: create products and services that are 10 times better than the competition… thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process
A Thought Exercise
consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network
they will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
that’s a $50B company, in a market segment worth $250B
upcoming: tractors as drones – guided by complex, distributed data apps
Operations Research – crunching amazing amounts of data
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Languages
JVM-based languages became popular for Big Data open source technologies:
• partly because YHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data practices, plus keep in mind that Google uses lots of C++
Functional Thinking – Neal Ford – youtu.be/plSZIkLodDM
Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.: functional programming to help reduce costs over time
technical debt? this is how an organization builds a culture to avoid it
Conway's Law corollary: model teams and communication based on properties of the desired architecture
“Out of the Tar Pit” – Moseley & Marks, 2006 – goo.gl/SKspn
“A relational model of data for large shared data banks” – Edgar Codd, 1970 – dl.acm.org/citation.cfm?id=362685
Rich Hickey – infoq.com/presentations/Simple-Made-Easy
Pattern Language
structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
[flow diagram: employee + quarterly sales → Join → Count → leads · bonus allocation · PMML classifier · Failure Traps]
A Pattern Language – Christopher Alexander, et al. – amazon.com/dp/0195019199
[conceptual flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M/R boundary shown between map and reduce; 1 map, 1 reduce; 18 lines of code – gist.github.com/3900702]
WordCount – conceptual flow diagram
cascading.org/category/impatient
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
WordCount – generated flow diagram
[generated DOT graph: [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt'] → Each('token')[RegexSplitGenerator[decl:'token'][args:1]] → (map/reduce boundary) → GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc'] → [tail]]
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String =>
      text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce software engineering costs at scale, over time
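For a rough comparison (my sketch, not from the talk), the same WordCount with token scrubbing also fits in a few lines of Python:

```python
import re
from collections import Counter

def scrub(raw):
    """Strip punctuation and lowercase; return None for empty tokens."""
    token = re.sub(r"[\[\]\(\),.]", "", raw).lower().strip()
    return token or None

def wordcount(lines):
    counts = Counter()
    for line in lines:
        for raw in line.split():
            token = scrub(raw)
            if token:
                counts[token] += 1
    return counts

doc = ["rain shadow, rain gauge.", "A rain (shadow)"]
wc = wordcount(doc)
print(wc.most_common(2))  # → [('rain', 3), ('shadow', 2)]
```

The line-count argument is about composition: the scrub step is an ordinary function that slots into the pipeline, rather than a UDF bolted onto a query language.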
Case Studies: LinkedIn and eBay
“Scalable and Flexible Machine Learning With Scala @ LinkedIn” – Vitaly Gordon, LinkedIn; Chris Severs, eBay
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
…be sure to read slides 8-16 !!
Lambda Architecture
Big Data – Nathan Marz, James Warren – Manning, 2013 – manning.com/marz
• batch layer (immutable data, idempotent ops)
• serving layer (to query batch)
• speed layer (transient, cached “real-time”)
• combining results
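A minimal sketch of how those layers combine, with hypothetical page-view events and an in-memory stand-in for each layer:

```python
# hypothetical page-view events: (url, epoch_seconds)
events = [("/a", 100), ("/b", 101), ("/a", 102), ("/a", 3600), ("/b", 3601)]

BATCH_HORIZON = 1000   # pretend the last batch run covered events with t < 1000

# batch layer: recomputed from scratch over the immutable master data
# (idempotent – rerunning it always yields the same result)
batch_views = {}
for url, t in events:
    if t < BATCH_HORIZON:
        batch_views[url] = batch_views.get(url, 0) + 1

# speed layer: transient incremental counts for events the batch
# has not absorbed yet; discarded after the next batch run
speed_views = {}
for url, t in events:
    if t >= BATCH_HORIZON:
        speed_views[url] = speed_views.get(url, 0) + 1

# serving layer: a query combines results from both layers
def views(url):
    return batch_views.get(url, 0) + speed_views.get(url, 0)

print(views("/a"), views("/b"))  # → 3 2
```

The division of labor is the point: the batch layer can be slow but exactly recomputable, while the speed layer only has to be right about the recent tail.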
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Where To Start?
having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale
along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities…
linear algebra – e.g., crunching algorithms efficiently for large-scale apps
graph theory – e.g., representation of problems in a calculable language
abstract algebra – e.g., probabilistic data structures in streaming analytics
topology – e.g., determining the underlying structure of the data
operations research – e.g., techniques for optimization… in other words, ROI
in a nutshell, most of what we do is to…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate dimension and complexity
‣ make use of learning theory
+ collaborate with DevOps, Stakeholders
+ reduce our work into cron entries
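The first two bullets, sketched with hypothetical log-derived counts (using a binomial approximation for the variance):

```python
import math

# hypothetical daily counts from a log-aggregation job: (purchases, sessions)
daily = [(12, 480), (19, 510), (9, 450), (15, 500), (11, 470)]

# estimate probability: pooled conversion rate
purchases = sum(p for p, _ in daily)
sessions = sum(s for _, s in daily)
p_hat = purchases / sessions

# calculate analytic variance of that estimate (binomial approximation),
# and a ~95% confidence interval to report alongside the point estimate
var = p_hat * (1 - p_hat) / sessions
ci = (p_hat - 1.96 * math.sqrt(var), p_hat + 1.96 * math.sqrt(var))
print(f"p_hat={p_hat:.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting the interval, not just the rate, is what “estimate the confidence for reported results” means in practice.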
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsMode
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
Where is the Science in Data Science?
techniques for manipulating order complexity:
dimensional reduction… with clustering as a common case
e.g., you may have 100 million HTML docs, but only ~10K useful keywords within them
low-dimensional structure, PCA, etc.
linear algebra techniques: eigenvalues, matrix factorization, etc.
this is an area ripe for much advancement in algorithms research, near-term
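A toy sketch of the idea: 2-D data that is really ~1-D, where the first principal component (computed here from the 2×2 covariance matrix, stdlib only) captures nearly all the variance:

```python
import math
import random

random.seed(7)

# hypothetical 2-D data that is really ~1-D: y ≈ 2x plus small noise
xs = [random.uniform(-1, 1) for _ in range(500)]
pts = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in xs]

# center the data
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
c = [(x - mx, y - my) for x, y in pts]

# 2x2 covariance matrix
n = len(c)
cxx = sum(x * x for x, _ in c) / n
cyy = sum(y * y for _, y in c) / n
cxy = sum(x * y for x, y in c) / n

# eigenvalues of [[cxx, cxy], [cxy, cyy]] via the quadratic formula
tr, det = cxx + cyy, cxx * cyy - cxy * cxy
disc = math.sqrt(tr * tr / 4.0 - det)
lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc

# fraction of variance captured by the first principal component
explained = lam1 / (lam1 + lam2)
print(f"variance explained by PC1: {explained:.4f}")
```

The 100-million-docs / ~10K-keywords case is the same picture in far more dimensions: most directions carry almost no variance and can be dropped.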
Dimension and Complexity
in general, apps alternate between learning patterns/rules and retrieving similar things…
statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future
machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis
supervised vs. unsupervised
arguably, optimization is a parent category
once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
Learning Theory
Algorithms
many algorithm libraries used today are based on implementations from back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough? – Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated algorithms work – as Breiman suggested in 2001 – which may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
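One of those probabilistic data structures, a Bloom filter, can be sketched in a few lines (toy parameters, SHA-256 as a stand-in hash family):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: constant-space set membership with
    false positives possible, false negatives impossible."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k bit positions from SHA-256 of (salt, item)
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for word in ("hadoop", "cascading", "scalding"):
    bf.add(word)

print("hadoop" in bf)  # → True (added items always report present)
```

The trade is exactly the streaming one: a fixed, tiny memory footprint in exchange for a tunable false-positive rate.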
Make It Sparse…
also, take a moment to check this out… (IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale, e.g., PCA, SVD, etc.
• numerically stable with efficient implementation on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
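The “tall-and-skinny” trick (TSQR) can be sketched in miniature: QR-factor each block of rows independently (the steps that would run as parallel map tasks), then QR the stacked R factors – the result matches a direct QR of the full matrix. A dense, single-machine toy, not the Hadoop implementation:

```python
import math
import random

def qr(rows):
    """Thin QR via modified Gram-Schmidt; rows is a list of lists."""
    m, n = len(rows), len(rows[0])
    Q = [row[:] for row in rows]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j):
            R[k][j] = sum(Q[i][k] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[k][j] * Q[i][k]
        R[j][j] = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] /= R[j][j]
    return Q, R

random.seed(0)
m, n = 1000, 3                      # "tall and skinny": many rows, few columns
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

# TSQR: QR each block of rows independently (these would be the parallel
# map tasks on a cluster), then QR the stacked R factors
stacked = []
for start in range(0, m, 250):
    _, R_block = qr(A[start:start + 250])
    stacked.extend(R_block)
_, R_tsqr = qr(stacked)

# compare with R from a direct QR of the full matrix: both are the
# unique upper-triangular factor with positive diagonal
_, R_direct = qr(A)
err = max(abs(R_tsqr[i][j] - R_direct[i][j])
          for i in range(n) for j in range(n))
print(err)
```

Only the tiny n×n R factors ever need to be shuffled between tasks, which is why the scheme maps so cleanly onto Hadoop.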
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
Sparse Matrix Collection
for when you really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection – cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida – cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research – www2.research.att.com/~yifanhu/
A Winning Approach…
consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… that works much, much better than sampling!
1. real-world data ⇒
2. graph theory for representation ⇒
3. sparse matrix factorization for production work ⇒
4. cost-effective parallel processing for machine learning app at scale
Suggested Reading
A Few Useful Things to Know about Machine Learning – Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Probabilistic Data Structures for Web Analytics and Data Mining – Ilya Katsov, Grid Dynamics
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[diagram: data sources → ETL → data prep → predictive model → end uses]
ANSI SQL for ETL and SAS for predictive models account for most of the licensing costs…
J2EE for business logic accounts for most of the project costs…
Anatomy of an Enterprise app
[diagram: source taps for Cassandra, JDBC, Splunk, etc. → ETL (Lingual: DW → ANSI SQL) → data prep → predictive model (Pattern: SAS, R, etc. → PMML) → end uses; business logic in Java, Clojure, Scala, etc.; sink taps for Memcached, HBase, MongoDB, etc.]
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all… cascading.org
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
80Saturday, 13 July 13
a compiler sees it all…
Anatomy of an Enterprise app
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
81Saturday, 13 July 13
cascading.org
Anatomy of an Enterprise app
visual collaboration for the business logic is a great way to improve how teams work together
Failure Traps
bonus allocation
employee
PMML classifier
quarterly sales
Join
Count
leads
82Saturday, 13 July 13
Anatomy of an Enterprise app
multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster…
business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
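To make the "common space (DAG)" idea concrete, here is a minimal sketch – plain Java, not the Cascading API – of a workflow held as a DAG and the kind of global view a flow planner gets: a topological ordering of the steps. The step names mirror the slide's bonus-allocation example and are purely illustrative.

```java
import java.util.*;

public class FlowSketch {
    // Record an edge in the flow graph: step "from" feeds step "to".
    static void edge(Map<String, List<String>> g, String from, String to) {
        g.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        g.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Kahn's algorithm: order steps so every step runs after its inputs.
    static List<String> topoSort(Map<String, List<String>> g) {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        for (String k : g.keySet()) indegree.put(k, 0);
        for (List<String> outs : g.values())
            for (String t : outs) indegree.merge(t, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((k, d) -> { if (d == 0) ready.add(k); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String t : g.get(n))
                if (indegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        return order;
    }

    // Build the slide's example flow and plan it.
    static List<String> plan() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        edge(g, "employee", "join");
        edge(g, "quarterly sales", "join");
        edge(g, "join", "count");
        edge(g, "count", "bonus allocation");
        return topoSort(g);
    }

    public static void main(String[] args) {
        // → [employee, quarterly sales, join, count, bonus allocation]
        System.out.println(plan());
    }
}
```

Because the whole flow lives in one graph, a planner can do global work – reorder, fuse, or optimize steps, attach failure traps, audit the whole pipeline – which is exactly what per-department scripts glued together by cron cannot offer.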
cascading.org
83Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
84Saturday, 13 July 13
Clusters
a little secret: people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached, for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler to manage, but terrible for utilization
leveraging VMs and various notions of “cloud” helps
Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” ⇒ All your workloads are belong to us
regardless of how architectures change, death and taxes will endure: servers fail, and data must move
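A toy calculation (illustrative numbers, not from the talk) shows why one workload class per cluster is terrible for utilization: each dedicated cluster must be provisioned for its own peak, while a shared cluster only needs capacity for the peak of the combined load – and workloads whose peaks don't coincide (web by day, batch by night) leave most dedicated capacity idle.

```java
public class Utilization {
    // Hourly CPU demand (arbitrary units) for three workloads whose
    // peaks don't coincide – e.g., web traffic by day, batch at night.
    static int[] rails     = {60, 80, 90, 70, 40, 20};
    static int[] memcached = {30, 40, 50, 40, 20, 10};
    static int[] hadoop    = {10, 10, 20, 40, 90, 95};

    static int peak(int[] xs) {
        int m = 0;
        for (int x : xs) m = Math.max(m, x);
        return m;
    }

    // Dedicated clusters: provision each one for its own peak.
    static int dedicatedCapacity() {
        return peak(rails) + peak(memcached) + peak(hadoop);
    }

    // Shared cluster: provision for the peak of the combined load.
    static int sharedCapacity() {
        int[] combined = new int[rails.length];
        for (int t = 0; t < combined.length; t++)
            combined[t] = rails[t] + memcached[t] + hadoop[t];
        return peak(combined);
    }

    public static void main(String[] args) {
        System.out.println("dedicated: " + dedicatedCapacity()); // 235
        System.out.println("shared:    " + sharedCapacity());    // 160
    }
}
```

With these made-up numbers, mixing the workloads on one scheduler needs roughly a third less capacity – the economics behind Borg and Mesos, discussed a few slides ahead.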
Google Data Center, Fox News
~2002
85Saturday, 13 July 13
Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q: what kinds of evolution in topologies could this imply?
86Saturday, 13 July 13
Topologies
Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
87Saturday, 13 July 13
Some Topologies Beyond Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
Greenplum (MPP)
SciDB (array database)
88Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more complexity and variety, widespread disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
89Saturday, 13 July 13
Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers
John Wilkes, et al. – Borg/Omega: “10x” secret sauce – youtu.be/0ZFMlO98Jkc
[Figure: CPU load over time for Rails, Memcached, and Hadoop on dedicated clusters, versus the combined CPU load (Rails, Memcached, Hadoop) on a shared cluster]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg – goo.gl/jPtTP
90Saturday, 13 July 13
[Diagram: use cases across topologies – RDBMS, Log Events, In-Memory Data Grid, Hadoop etc., and a Cluster Scheduler; Web Apps/Mobile deliver Data Products to Customers; Prod, Eng, DW teams and Data Scientist, App Dev, Ops, Domain Expert roles span data science (discovery + modeling), s/w dev, dashboard metrics, and business process]
Circa 2013: clusters everywhere – Four-Part Harmony
91Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
92Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
93Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing middleware, e.g., Cascading
94Saturday, 13 July 13
Circa 2013: clusters everywhere – Four-Part Harmony
4. Distributed OS, e.g., Mesos
95Saturday, 13 July 13
Enterprise Data Workflows with Cascading
O’Reilly, 2013 – shop.oreilly.com/product/0636920028536.do
Further study…
workshops and newsletter updates:
liber118.com/pxn/
96Saturday, 13 July 13