predictive-analytics-san-diego-2013-02-21
DESCRIPTION
The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? I explain what is needed for three important use cases.
TRANSCRIPT
![Page 1: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/1.jpg)
1©MapR Technologies - Confidential
Remembering the Future
![Page 2: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/2.jpg)
My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
![Page 3: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/3.jpg)
MapR Technologies
Silicon Valley Startup
– Top investors
– Top technical and management team
  • Google, Microsoft, EMC, NetApp, Oracle
Enterprise quality distribution for Hadoop
Many extensions to basic Hadoop function
Strong supporter of Apache Drill
![Page 4: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/4.jpg)
Philosophy First
What is History?
![Page 5: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/5.jpg)
The study of the past
(what came before now)
![Page 6: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/6.jpg)
What is the future?
(it comes after now)
![Page 7: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/7.jpg)
![Page 8: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/8.jpg)
![Page 9: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/9.jpg)
![Page 10: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/10.jpg)
But the future also has a past!
![Page 11: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/11.jpg)
Do you remember the future?
![Page 12: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/12.jpg)
![Page 13: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/13.jpg)
![Page 14: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/14.jpg)
![Page 15: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/15.jpg)
![Page 16: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/16.jpg)
![Page 17: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/17.jpg)
Some things turned out as expected
![Page 18: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/18.jpg)
Guys wearing Fedoras
![Page 19: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/19.jpg)
Many things are different!
![Page 20: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/20.jpg)
Hadoop has a history
![Page 21: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/21.jpg)
Hadoop also has a future
![Page 22: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/22.jpg)
The Old Future of Hadoop
Map-reduce and HDFS
– more and more, but not really different
Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query
Stands apart from other computing
– Required by HDFS and other limitations
![Page 23: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/23.jpg)
The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should interface directly
Fast and flexible computation
– Drill logical plan language
![Page 24: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/24.jpg)
Example #1: Search Abuse
![Page 25: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/25.jpg)
History matrix
One row per user
One column per thing
![Page 26: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/26.jpg)
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
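As a rough illustration of the two matrices described on these slides, the following sketch builds raw cooccurrence counts from a small user history. The sample users and items are made up, the history matrix is stored sparsely as one set of items per user, and the diagonal is left out; a production system would typically also weight or filter the raw counts.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count how often each pair of items appears in the same user's history.

    history maps each user (one row of the history matrix) to the set of
    things that user interacted with (the non-zero columns of that row).
    The result maps item -> item -> count, i.e. one row and one column
    per thing.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Hypothetical sample data, not taken from the talk.
history = {
    "u1": {"apple", "banana"},
    "u2": {"apple", "banana", "cherry"},
    "u3": {"banana", "cherry"},
}
cooc = cooccurrence(history)
```

Here `cooc["apple"]["banana"]` counts the users who have both items in their history, which is the item-item mapping a recommender can then consult.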
![Page 27: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/27.jpg)
Cooccurrence matrix can also be implemented as a search index
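One way to picture this idea is to store each item as a document whose indexed terms are the items it cooccurs with, and then to use a user's history as the query. The sketch below is a toy inverted index in plain Python standing in for a real search engine; it does not use the Solr API, and the function names, sample data, and simple match-count scoring are all assumptions made for illustration.

```python
def build_index(indicators):
    """Build an inverted index: term -> set of items whose document contains it.

    indicators maps each item to the set of items it cooccurs with; those
    cooccurring items act as the indexed terms of the item's document.
    """
    index = {}
    for item, terms in indicators.items():
        for term in terms:
            index.setdefault(term, set()).add(item)
    return index


def recommend(index, user_history, limit=3):
    """Query the index with the user's history and rank items by match count."""
    scores = {}
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:  # do not recommend what the user already has
                scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=lambda i: (-scores[i], i))[:limit]


# Hypothetical cooccurrence indicators, not taken from the talk.
indicators = {
    "apple": {"banana"},
    "banana": {"apple", "cherry"},
    "cherry": {"banana", "date"},
    "date": {"cherry"},
}
index = build_index(indicators)
recs = recommend(index, {"apple", "cherry"})
```

A real search engine would replace the match count with its own relevance scoring, but the shape of the lookup stays the same: the user's recent history is the query, and the hits are the recommendations.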
![Page 28: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/28.jpg)
Diagram labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr indexing (Solr indexers); Index shards
![Page 29: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/29.jpg)
Diagram labels: User history; Web tier; Item meta-data; Solr search (Solr indexers); Index shards
![Page 30: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/30.jpg)
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
![Page 31: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/31.jpg)
Example #2: Web Technology
![Page 32: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/32.jpg)
Fast analysis (Storm)
Analytic output
Real-time data
Raw logs
![Page 33: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/33.jpg)
Large analysis (map-reduce)
Analytic output
Raw logs
![Page 34: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/34.jpg)
Presentation tier (d3 + node.js)
Analytic output
Browser query
Raw logs
![Page 35: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/35.jpg)
Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
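A minimal sketch of what "seamless" could mean at query time: the presentation tier reads the long-time (batch) output and the real-time output from the same place and adds them together. The page names and numbers are invented, and the sketch assumes the two outputs cover disjoint time ranges so that nothing is counted twice.

```python
from collections import Counter


def merged_counts(batch_counts, realtime_counts):
    """Combine long-time (batch) and real-time counts into one view.

    Assumes the batch output covers everything up to some cutoff and the
    real-time output covers only events after that cutoff.
    """
    total = Counter(batch_counts)
    total.update(realtime_counts)  # adds counts key by key
    return total


# Hypothetical analytic outputs, not taken from the talk.
batch = {"/home": 1000, "/search": 400}   # e.g. written by a map-reduce job
recent = {"/home": 12, "/checkout": 3}    # e.g. written by a streaming job
view = merged_counts(batch, recent)
```

Because both outputs are assumed to live on the same cluster, the merge is a read-time operation and no data has to be copied between systems first.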
![Page 36: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/36.jpg)
Example #3: Apache Drill
![Page 37: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/37.jpg)
Big Data Processing – Hadoop

| | Batch processing |
|---|---|
| Query runtime | Minutes to hours |
| Data volume | TBs to PBs |
| Programming model | MapReduce |
| Users | Developers |
| Google project | MapReduce |
| Open source project | Hadoop MapReduce |
![Page 38: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/38.jpg)
Big Data Processing – Hadoop and Storm

| | Batch processing | Stream processing |
|---|---|---|
| Query runtime | Minutes to hours | Never-ending |
| Data volume | TBs to PBs | Continuous stream |
| Programming model | MapReduce | DAG (pre-programmed) |
| Users | Developers | Developers |
| Google project | MapReduce | |
| Open source project | Hadoop MapReduce | Storm or Apache S4 |
![Page 39: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/39.jpg)
Big Data Processing – The missing part

| | Batch processing | Interactive analysis | Stream processing |
|---|---|---|---|
| Query runtime | Minutes to hours | | Never-ending |
| Data volume | TBs to PBs | | Continuous stream |
| Programming model | MapReduce | | DAG (pre-programmed) |
| Users | Developers | | Developers |
| Google project | MapReduce | | |
| Open source project | Hadoop MapReduce | | Storm and S4 |
![Page 40: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/40.jpg)
Big Data Processing – The missing part

| | Batch processing | Interactive analysis | Stream processing |
|---|---|---|---|
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries (ad hoc) | DAG (pre-programmed) |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | | |
| Open source project | Hadoop MapReduce | | Storm and S4 |
![Page 41: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/41.jpg)
Big Data Processing

| | Batch processing | Interactive analysis | Stream processing |
|---|---|---|---|
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries | DAG |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | Dremel | |
| Open source project | Hadoop MapReduce | | Storm and S4 |
![Page 42: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/42.jpg)
Big Data Processing

| | Batch processing | Interactive analysis | Stream processing |
|---|---|---|---|
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries | DAG |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | Dremel | |
| Open source project | Hadoop MapReduce | Apache Drill | Storm and S4 |
![Page 43: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/43.jpg)
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
![Page 44: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/44.jpg)
Simple Architecture
![Page 45: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/45.jpg)
Standard Interfaces
![Page 46: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/46.jpg)
Logical Plan Syntax:
query: [
  {
    op: "sequence",
    do: [
      { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
      { op: "transform", transforms: [ { ref: "donuts.quanity", expr: "donuts.sales" } ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      …
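To make the scan, transform, and filter steps of this plan more concrete, here is a toy interpreter for the three operators visible on the slide. It is only a sketch under loud assumptions: it is not Drill's engine or API, the `SOURCES` data is invented, `run_sequence` and `field` are hypothetical helper names, and the only filter expression it understands is the `<field> < <number>` form used in the example.

```python
# Hypothetical in-memory stand-in for the "local-logs" source; the records
# are made up for illustration.
SOURCES = {
    "local-logs": [
        {"ppu": 0.55, "sales": 35},
        {"ppu": 1.25, "sales": 12},
        {"ppu": 0.90, "sales": 7},
    ]
}


def field(name, ref):
    """Strip the record reference prefix: 'donuts.sales' -> 'sales'."""
    prefix = ref + "."
    return name[len(prefix):] if name.startswith(prefix) else name


def run_sequence(steps):
    """Apply the steps of a 'sequence' in order, each feeding the next."""
    rows, ref = [], None
    for step in steps:
        op = step["op"]
        if op == "scan":
            ref = step["ref"]
            rows = [dict(r) for r in SOURCES[step["source"]]]
        elif op == "transform":
            # Assumed meaning: evaluate expr and store it under ref.
            for t in step["transforms"]:
                src, dst = field(t["expr"], ref), field(t["ref"], ref)
                for r in rows:
                    r[dst] = r[src]
        elif op == "filter":
            left, cmp, right = step["expr"].split()
            if cmp != "<":
                raise ValueError("only '<' is supported in this sketch")
            key, limit = field(left, ref), float(right)
            rows = [r for r in rows if r[key] < limit]
        else:
            raise ValueError("unsupported op: " + op)
    return rows


plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quanity", "expr": "donuts.sales"}]},
    {"op": "filter", "expr": "donuts.ppu < 1.00"},
]
result = run_sequence(plan)
```

The point of the sketch is the shape of the plan: an ordered list of small, declarative operators that an engine is free to execute however it likes.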
![Page 47: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/47.jpg)
Logical Streaming Example
{
  @id: <refnum>,
  op: "window-frame",
  input: <input>,
  keys: [ <name>, ... ],
  ref: <name>,
  before: 2,
  after: here
}
Input: 0 1 2 3 4
Window frames: [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
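The window-frame behaviour can be sketched in a few lines. The function below is an illustration rather than Drill code: `window_frames` is a hypothetical name, and it treats `after: here` as "no records after the current one", which is an assumption that matches the slide's five-element input 0 1 2 3 4.

```python
def window_frames(values, before=2, after=0):
    """Return one frame per input record.

    Each frame holds up to `before` records preceding the current one,
    the current record itself, and up to `after` records following it.
    """
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)
        end = min(len(values), i + after + 1)
        frames.append(values[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2, after=0)
# frames == [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Frames near the start of the stream are shorter because fewer than two earlier records exist yet.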
![Page 48: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/48.jpg)
Logical Plan
![Page 49: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/49.jpg)
Execution Plan
![Page 50: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/50.jpg)
Representing a DAG
{
  @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [ {ref: <name>, expr: <aggexpr> },... ]
}
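The `@id` and `input` fields are what turn a flat list of operators into a graph: each operator names the operator that feeds it. The sketch below shows how such references could be resolved into an execution order with Python's standard `graphlib` module (Python 3.9 or later). Operators 17 and 18, and the `execution_order` helper, are hypothetical; only operator 19 appears on the slide.

```python
from graphlib import TopologicalSorter


def execution_order(ops):
    """Order operators so that every operator runs after its inputs."""
    graph = {}
    for op in ops:
        inputs = op.get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]  # allow a single id or a list of ids
        graph[op["@id"]] = set(inputs)  # node -> the nodes it depends on
    return list(TopologicalSorter(graph).static_order())


ops = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "filter", "input": 17},  # hypothetical upstream operator
    {"@id": 17, "op": "scan"},                 # hypothetical source operator
]
order = execution_order(ops)
```

Because the edges live in the data rather than in nesting, an operator's output can feed several downstream operators, which is what makes the plan a DAG rather than a tree.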
![Page 51: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/51.jpg)
Non-SQL queries
![Page 52: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/52.jpg)
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
![Page 53: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/53.jpg)
The future is not what we thought it would be
![Page 54: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/54.jpg)
It is better!
![Page 55: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/55.jpg)
Get Involved!
Tweet: #hcj13w #mapr
@ted_dunning
![Page 56: predictive-analytics-san-diego-2013-02-21](https://reader036.vdocument.in/reader036/viewer/2022070304/54c6579d4a795965328b45e1/html5/thumbnails/56.jpg)
Get Involved!
Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013
Join the Drill project
– [email protected]
– #apachedrill
Contact me:
– [email protected]
– [email protected]
– @ted_dunning
Join MapR (in Japan!)
– [email protected]