© 2016 MapR Technologies
Evolving Beyond the Data Lake
A Story of Wind and Rain
Jim Scott, @kingmesal, #strataconf
Industry Leaders Are Investing in Disruptive Technology Now
Innovating and reducing costs at the same time
Source: IDC, Gartner; Analysis & Estimates: MapR
Next-gen consists of cloud, big data, software and hardware related expenses
[Chart: Investment in Next-Gen vs. Legacy Technologies for Data, 2013 to 2020, in billions of dollars. Series: Total $ Growth of IT Market; Next-Gen Growth; Legacy Market Growth/Shrink in $.]
90% of data is on next-gen technology
in just four years
Application Development and Deployment
[Diagram: the application stack in 2005 compared with today]
2005:
• SQL Server
• IIS, ASP.NET
• Desktop Browser (JavaScript, jQuery)
• Microsoft Reporting Service
• Connections labeled: SQL; HTML, CSS, JS
Today:
• Oracle, SQL Server, NoSQL, Graph DB, Insights DB
• Bulk Load, Data Lake
• Machine Learning, Predictive Modeling, BI / Reporting, Customer Insights
• Events (Kafka)
• Microservice (.NET), Microservice (NodeJS), Microservice (Java)
• Backend for Frontend (Java)
• Desktop Browser (JavaScript, 20+ Frameworks), Tablet, Native Android, Native iOS
• Connections labeled: JSON; JSON, CSS, HTML, JS
Messaging platforms
What’s a Stream?
A stream is an unbounded sequence of events carried from a set of producers to a set of consumers.
Producers and consumers don’t have to be aware of each other; instead, they participate in shared topics. This is called publish/subscribe.
[Diagram: Producers, /Events:Topic, Consumers]
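As a rough illustration of publish/subscribe, here is a minimal in-memory sketch in Python. The `TopicBroker` class and its methods are hypothetical names made up for this example, not part of Kafka or MapR Streams. A real stream also persists events, which this toy does not.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class TopicBroker:
    """Toy in-memory broker: producers and consumers only share a topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = TopicBroker()
dashboard_events: List[dict] = []
alert_events: List[dict] = []

# Two independent consumers subscribe to the same topic.
broker.subscribe("/Events:Topic", dashboard_events.append)
broker.subscribe("/Events:Topic", alert_events.append)

# The producer knows only the topic name, not who is listening.
delivered = broker.publish("/Events:Topic", {"sensor": "s1", "reading": 42})
```

The producer never references either consumer, so consumers can be added or removed without touching producer code.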
Publishers and Subscribers (pub-sub)
Publishers:
• Social Platforms
• Servers (Logs, Metrics)
• Sensors
• Mobile Apps
• Other Apps & Microservices
Topic: /Events:Topic
Subscribers (analytics consumers, stream processors):
• Alerting Systems
• Stream Processing Frameworks
• Databases & Search Engines
• Dashboards
• Other Apps & Microservices
Considering a Messaging Platform
• 50-100k messages per second used to be good
  – Not really good enough to handle decoupled communication between services
• Kafka model is BLAZING fast
  – Kafka 0.9 API with message sizes at 200 bytes
  – MapR Streams on a 5 node cluster sustained 18 million events / sec
  – Throughput of 3.5GB/s and over 1.5 trillion events / day
• Manual sharding is not a “great” solution
  – Adding more servers should be easy and foolproof, not painful
  – Yes, I have lived through this
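The throughput figures can be sanity-checked with simple arithmetic. This sketch uses only the numbers quoted on the slide (18 million events per second, 200-byte messages) and assumes every message is exactly 200 bytes.

```python
EVENTS_PER_SECOND = 18_000_000   # sustained rate quoted on the slide
MESSAGE_BYTES = 200              # message size quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = EVENTS_PER_SECOND * MESSAGE_BYTES
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY

print(f"{bytes_per_second / 1e9:.1f} GB/s")                # 3.6 GB/s
print(f"{events_per_day / 1e12:.2f} trillion events/day")  # 1.56 trillion events/day
```

Both results are in line with the quoted "3.5GB/s" and "over 1.5 trillion events / day".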
Goals
• Real-time or near-time
  – Includes situations with deadlines
  – Also includes situations where delay is simply undesirable
  – Even includes situations where delay is just fine
• Microservices
  – Streaming is a convenient idiom for design
  – Microservices … you know we wanted it
  – Service isolation is a key requirement
Advantages of Messaging and Real-time Enablement
• Fewer moving parts
  – Fewer things to go wrong
• Better resource utilization
  – Scale any application up or down on demand
• Common deployment model (new isolation model)
  – Repeatability between environments (dev, qa, production)
• Improved integration testing
  – Listen to production streams in dev and qa (** this is a BIG DEAL! **)
• Shared file system
  – Get at the data anywhere in the cluster
  – Simplifies business continuity
A microservice is loosely coupled with bounded context
How to Couple Services and Break micro-ness
• Shared schemas, relational stores
• Ad hoc communication between services
• Enterprise service buses
• Brittle protocols
• Poor protocol versioning
Don’t do this!
How to Decouple Services
• Use self-describing data
• Private databases
• Infrastructural communication between services
• Use modern protocols
• Adopt future-proof protocol practices
• Use shared storage where necessary due to scale
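One way to read "self-describing data" and "future-proof protocol practices" is a JSON envelope that names its own type and version, paired with a reader that ignores fields it does not know. The sketch below is an illustration under that assumption. The envelope layout, the `click` event, and the field names are all hypothetical.

```python
import json


def encode_event(event_type: str, version: int, payload: dict) -> str:
    """Wrap a payload in an envelope that names its own type and version."""
    return json.dumps({"type": event_type, "version": version, "payload": payload})


def read_click(message: str) -> dict:
    """Tolerant reader: take the fields this service needs, ignore the rest."""
    envelope = json.loads(message)
    if envelope.get("type") != "click":
        raise ValueError(f"unexpected event type: {envelope.get('type')!r}")
    payload = envelope["payload"]
    return {
        "user": payload["user"],
        "url": payload["url"],
        # Hypothetical field added in a later version; default when absent.
        "referrer": payload.get("referrer", ""),
    }


# An older producer and a newer producer can coexist on the same topic.
old_message = encode_event("click", 1, {"user": "u1", "url": "/home"})
new_message = encode_event(
    "click", 2,
    {"user": "u1", "url": "/home", "referrer": "/search", "ab_bucket": "B"},
)
```

Because the reader tolerates both missing and extra fields, producers and consumers can be upgraded independently, which is the opposite of the shared-schema coupling warned about earlier.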
Decoupled Architecture
[Diagram. Labels: Producer (three instances), Activity Handler, Historical, Interesting Data, Real-time Analysis, Results Dashboard, Anomaly Detection]
Mechanisms for Decoupling
• Traditional message queues?
  – Message queues are the classic answer
  – Key feature/flaw is out-of-order acknowledgement
  – Many implementations
  – You pay a huge performance hit for persistence
• Kafka-esque Logs?
  – Logs are like queues, but with ordering
  – Out-of-order consumption is possible, acknowledgement not so much
  – Canonical base implementation is Kafka
  – Performance plus persistence
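The log model can be sketched as an append-only list plus one read offset per consumer group. This is a toy illustration in Python, not Kafka's implementation. `EventLog`, `poll`, and `seek` are names chosen for this example.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log: events keep their order and reads never remove them."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}   # consumer group -> next offset to read

    def append(self, event: str) -> int:
        """Append an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group: str, max_events: int = 10) -> List[str]:
        """Return the next events for this group and advance only its offset."""
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind or skip ahead: the consumer decides where to read from."""
        self._offsets[group] = offset


log = EventLog()
for name in ["a", "b", "c"]:
    log.append(name)

production = log.poll("production")      # reads everything, in order
dev = log.poll("dev", max_events=2)      # a second group reads independently
log.seek("dev", 0)                       # replay from the beginning
replayed = log.poll("dev")
```

Since each group tracks only its own offset, a dev or qa group can read the same events as production without affecting it, which is what makes listening to production streams in test environments practical.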
Shared Resources
Fraud Detection
Traditional Solution
What Happens Next?
How to Get Service Isolation
New Uses of Data
Scaling Through Isolation
Use Cases
Event-based Data Drives Applications
• Failure Alerts
• Real-time application & network monitoring
• Trending now
• Web Personalized Offers
• Real-time Fraud Detection
• Ad optimization
• Supply Chain Optimization
Fighting Fraudulent Web Traffic
[Diagram labels]
• Streams: Activity Stream, Click Stream, Session Alteration Stream
• Classifiers: Known Bad Classifier, All OK Classifier
• Inputs: Deviation from Normal, Blacklist Activities, Whitelist Activities, User Activity Profile
• Output: Notify Security
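The classifier stage might be sketched roughly as follows. Everything here is an assumption made for illustration: the `Profile` fields, the blacklist and whitelist contents, the threshold logic, and the three labels are hypothetical, not the classifiers from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Profile:
    """Hypothetical per-user activity profile: typical clicks per minute."""
    mean_clicks_per_min: float
    tolerance: float   # allowed absolute deviation from the mean


BLACKLIST: Set[str] = {"/admin/export-all"}   # illustrative values only
WHITELIST: Set[str] = {"/home", "/search"}


def classify(user: str, path: str, clicks_per_min: float,
             profiles: Dict[str, Profile]) -> str:
    """Return 'known_bad', 'all_ok', or 'deviation' for one activity event."""
    if path in BLACKLIST:
        return "known_bad"
    profile = profiles.get(user)
    if profile is not None:
        deviation = abs(clicks_per_min - profile.mean_clicks_per_min)
        if deviation > profile.tolerance:
            return "deviation"
    if path in WHITELIST:
        return "all_ok"
    # Unknown activity is treated as suspicious in this sketch.
    return "deviation"


profiles = {"u1": Profile(mean_clicks_per_min=5.0, tolerance=10.0)}
```

In a streaming setup, events labeled `known_bad` or `deviation` would be published to a downstream topic (for example, one that alters the session or notifies security) rather than handled inline.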
Similarities between Marketing and Fraud?
Customer 360:
• Build a user profile
  – What type of content do they like
• Build “segmented” profiles
  – Company affiliation
• Dynamically alter website
  – Show alternate content
• Kick-off external workflows
  – Nurture emails
Website Fraud:
• Build a user profile
  – What are their normal usage patterns
• Build “segmented” profiles
  – What do real users normally do
• Dynamically alter website
  – Prevent user functionality
• Kick-off external workflows
  – Notify security team
Legacy Business Platforms
• IT must integrate all the products
• Inability to operationalize the insight rapidly
• Can’t deal with high speed data ingestion and processing
• Scale up architecture leads to high cost
[Diagram labels]
• Operational Applications: Message Bus, J2EE AppServer, Relational Database, Specialized Storage
• Analytical Applications: Analytic Database, ETL Tool, BI Tool, Specialized Storage
Converged Data Platform
Converged Applications: Complete Access to Real-time and Historical Data in One Platform
• Analytical Applications (Bottom Line Initiatives)
  – Analysts Creating BI Reports and KPIs on Data Warehouse
  – Historical Data
• Operational Applications (Top Line Initiatives)
  – Developers Creating Database and Event Based Applications
  – Current Data
MapR Platform Services: Open API Architecture
Assures Interoperability, Avoids Lock-in
• Web-Scale Storage: MapR-FS (HDFS API, POSIX, NFS)
• Database: MapR-DB (SQL, HBase API, JSON API)
• Event Streaming: MapR Streams (Kafka API)
• Platform services: Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability
Converged Application Benefits
• Consumers scale horizontally with partitions
• 1:1 mapping between consumer and partition
• Enables predictable scaling as production needs grow
• Data can be seamlessly replicated to another cluster
• Enables HA with zero code changes
• Data is indexed dynamically according to receivers, senders
• Scales beyond the capabilities of Kafka
• Snapshots can be taken to capture state
• Enables faster testing and deployment of applications
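The partition-based scaling in the first bullets can be illustrated with a small Python sketch. It assumes a stable hash of the message key picks the partition (CRC32 here as a stand-in; it is not the partitioner Kafka or MapR Streams actually uses) and a simple round-robin assignment of partitions to consumers. The function names are hypothetical.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash so the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin partitions over consumers; with equal counts this is 1:1."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Four partitions, four consumers: each consumer owns exactly one partition.
one_to_one = assign(4, ["c0", "c1", "c2", "c3"])

# Four partitions, two consumers: each consumer owns two partitions.
shared = assign(4, ["c0", "c1"])
```

Adding consumers up to the partition count spreads the load evenly, which is why the partition count sets the upper bound for parallel consumption.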
Not All Data Platforms are the Same