© Copyright 2015 EMC Corporation. All rights reserved.
EMC REDEFINING BIG DATA ALEXANDER ERMAKOV - PIVOTAL
Traditional Data Architecture
[Diagram. Components shown: Front Ends, DBMS instances, ETL, DWH, then BI, Data Mining, OLAP …; latency labels along the path: 100ms, 3 sec, 1 day, 3-4 days]
The path from end users
to business decisions
takes 1 day minimum
and 3-4 days typically
Advanced Data Architecture – ELT
[Diagram: two pipelines. ETL: DBMS sources, ETL, DDS, then Data Marts, Reports, Aggregates, OLAP. ELT: DBMS sources, ODS, ELT, DDS, then Data Marts, Reports, Aggregates, OLAP]
ELT arose 10 years ago
Driven by:
– Storage cost reduction
– Introduction of MPP
– Pushdown optimization in ETL tools
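The difference between ETL and ELT can be illustrated with a minimal sketch: the raw rows are loaded unchanged into a staging table, and the transformation then runs as SQL inside the database engine. This is only an illustration under stated assumptions: SQLite stands in for an MPP database, and the table names `ods_orders` and `mart_revenue`, the function `run_elt`, and the sample rows are hypothetical.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows unchanged, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for an MPP database (assumption)

    # Extract + Load: land the source rows as-is in a staging (ODS-like) table.
    conn.execute("CREATE TABLE ods_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO ods_orders VALUES (?, ?)", rows)

    # Transform: cleansing and aggregation are pushed down to the SQL engine.
    conn.execute(
        """
        CREATE TABLE mart_revenue AS
        SELECT UPPER(TRIM(customer)) AS customer,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM ods_orders
        GROUP BY UPPER(TRIM(customer))
        """
    )
    result = conn.execute(
        "SELECT customer, revenue FROM mart_revenue ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


# Hypothetical raw input: inconsistent names, amounts stored as text.
raw = [(" acme ", "10.5"), ("ACME", "4.5"), ("globex", "7")]
mart = run_elt(raw)
```

In an ETL flow the cleansing would happen in the ETL tool before loading; here the raw copy stays available in the staging table and can be transformed again later.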
Modern Data Architecture – Data Lake
The Data Lake concept was introduced 4 years ago by James Dixon
Data Lake idea: integrate a Hadoop solution into a typical enterprise architecture to improve customer analytics capabilities
A Data Lake usually combines the following approaches:
– Using Hadoop to store and process unstructured data
– Using Hadoop as a staging platform for all input data, and to store all the source data loaded into the customer platform
– Offloading historical data to Hadoop and using it as cold data storage
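The third approach, historical data offload, amounts to a tiering rule. The sketch below shows one possible form of it, assuming a simple date cutoff; the function `split_hot_cold`, the record layout, and the cutoff value are hypothetical and only illustrate the idea.

```python
from datetime import date


def split_hot_cold(records, cutoff):
    """Split records into hot (stays in the warehouse) and cold (offloaded)."""
    hot = [r for r in records if r["day"] >= cutoff]
    cold = [r for r in records if r["day"] < cutoff]  # candidates for Hadoop
    return hot, cold


# Hypothetical records; only the date decides the tier in this sketch.
records = [
    {"id": 1, "day": date(2013, 5, 1)},
    {"id": 2, "day": date(2015, 2, 1)},
    {"id": 3, "day": date(2014, 12, 31)},
]
hot, cold = split_hot_cold(records, cutoff=date(2015, 1, 1))
```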
Modern Data Architecture – Data Lake
[Diagram. Components shown: DBMS sources, CDC, ODS, ELT, DDS, Data Marts, Aggregates, Reports, OLAP, DWH, Hadoop, ODS, UDS, Analytical Archives, BI, Data Mining, OLAP, SQL-on-Hadoop, Data Mining At Scale]
Modern Data Architecture – Lambda
Lambda Architecture was introduced by Nathan Marz 2 years ago
The goal is to build a robust, scalable, fault-tolerant data processing architecture that is easily extensible and requires minimal maintenance
Combines near-real-time and batch processing in a single data processing approach
Based on the functional approach:
query = function(all data)
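The formula above can be read literally: a query is a pure function applied to the complete, append-only dataset. A minimal sketch, assuming page-view events as the data; the function `pageviews_per_url` and the event layout are hypothetical.

```python
from collections import Counter


def pageviews_per_url(all_data):
    """query = function(all data): recompute the answer from every raw event."""
    return Counter(event["url"] for event in all_data)


# The dataset is only ever appended to; raw facts are never updated in place.
master_dataset = [{"url": "/home"}, {"url": "/pricing"}, {"url": "/home"}]
views = pageviews_per_url(master_dataset)

# A new fact arrives; the same function over all data gives the new answer.
master_dataset.append({"url": "/home"})
views_after = pageviews_per_url(master_dataset)
```

Because the function never mutates its input, any result can be thrown away and recomputed from the raw data.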
Modern Data Architecture – Lambda
Source data is loaded into both the Speed and Batch layers
The Master Dataset is maintained in the Batch Layer; it contains all the raw input data and is the basis for any recalculation needed in the system
The Speed layer handles only a small part of the latest data, discarding all older data entries
A query merges the results from both the Batch and Speed layers
[Diagram: Source Data feeds the Speed Layer and the Batch Layer; the Batch Layer holds the Master Dataset; the Serving Layer holds Batch Views; the Speed Layer holds Real-time Views; Queries read from both kinds of view]
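The interaction of the layers described on this slide can be sketched in a few lines. This is a simplified, single-process illustration, not a definitive implementation: the class `LambdaCounts` and its methods are hypothetical, and it assumes the batch recomputation runs instantly, whereas in a real system it runs in the background and the speed layer drops only the entries the finished batch run has absorbed.

```python
from collections import Counter


class LambdaCounts:
    """Toy page-view counter with batch, speed and serving roles."""

    def __init__(self):
        self.master = []                # Batch layer: all raw input data
        self.batch_view = Counter()     # Serving layer: precomputed batch view
        self.realtime_view = Counter()  # Speed layer: only the latest data

    def ingest(self, event):
        # Source data is loaded to both layers.
        self.master.append(event)
        self.realtime_view[event["url"]] += 1

    def run_batch(self):
        # Recompute the batch view from the whole master dataset ...
        self.batch_view = Counter(e["url"] for e in self.master)
        # ... after which the speed layer can discard the older entries.
        self.realtime_view.clear()

    def query(self, url):
        # A query merges the results from both layers.
        return self.batch_view[url] + self.realtime_view[url]


counts = LambdaCounts()
for url in ["/home", "/home", "/pricing"]:
    counts.ingest({"url": url})
before_batch = counts.query("/home")   # served by the speed layer only

counts.run_batch()
counts.ingest({"url": "/home"})        # arrives after the batch run
after_batch = counts.query("/home")    # batch view plus real-time view
```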
Modern Data Architecture – Streaming
The design is based on event stream processing
Uses a message queue as the main data hub
Born from the introduction of reactive programming; emerged with the introduction of Spark Streaming, Storm and Samza
Not to be confused with “real-time processing”:
– Not just a web service with RPC: no “response” exists in this design
– Not necessarily real-time: the stream can be saved and reprocessed on demand
– Event stream processing instead of batch extraction of the data
– The same event stream is used for both OLTP and OLAP systems
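The points above can be sketched with an in-memory stand-in for the message queue. This is an illustration under stated assumptions, not how any particular broker works: the classes `EventLog` and `Consumer`, the handlers, and the event layout are all hypothetical. Producers publish without waiting for a response, each consumer tracks its own offset, and a consumer attached later reprocesses the saved stream from the start.

```python
class EventLog:
    """Append-only event stream acting as the main data hub."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        # Fire-and-forget: the producer gets no "response".
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Reads the stream at its own pace, remembering its offset."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self):
        for event in self.log.read_from(self.offset):
            self.handler(event)
            self.offset += 1


log = EventLog()

# OLTP-style consumer: keeps current account balances.
balances = {}


def update_balance(event):
    account = event["account"]
    balances[account] = balances.get(account, 0) + event["amount"]


oltp = Consumer(log, update_balance)

log.publish({"account": "a", "amount": 100})
log.publish({"account": "a", "amount": -30})
log.publish({"account": "b", "amount": 5})
oltp.poll()

# OLAP-style consumer attached later: reprocesses the saved stream on demand.
amounts = []
olap = Consumer(log, lambda event: amounts.append(event["amount"]))
olap.poll()
```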
Modern Data Architecture – Streaming
[Diagram. Components shown: FE apps, HTTP, BE services (Srv), SOAP, OLTP (SP, JDBC, Log, Table), CDC, copy, Parse, Batch ETL, cp, load, ODS, DDS, Data Mart, DWH, JDBC, BI]
Modern Data Architecture – Streaming
[Diagram: Introducing Queue. The previous diagram is shown next to a variant with these components: FE apps, HTTP, BE services (Srv), SOAP, OLTP (SP, JDBC, Table), ETL, ODS, DDS, Data Mart, DWH, BI, Queue, Batch App, STG, RTI App, Hadoop, HDFS, SQL On Hadoop, ES]
Pivotal and Modern Data Architecture
[Diagram. Components shown: BI; Pivotal Cloud Foundry; HTTP; FE apps; Queue; BE apps; Pivotal GemFire app; Spring XD streaming; streaming data; Pivotal HD; Pivotal HAWQ; Data Mart; Pivotal Greenplum; ES; ODS; DDS; Data Mart; OLTP (SP, Table); ETL]
Pivotal and Modern Data Architecture
Pivotal Labs – agile software development for next-generation applications
Pivotal Cloud Foundry – PaaS for customer applications
RabbitMQ – distributed message queue service on top of PCF
Spring IO – foundation platform for modern applications
Pivotal and Modern Data Architecture
Pivotal GemFire – in-memory data grid enabling real-time
data processing and real-time decision making for enterprises
Pivotal and Modern Data Architecture
Spring XD – unified, distributed and extensible framework for
data pipelining: ingesting, batching, processing and exporting
Pivotal and Modern Data Architecture
Pivotal HD – leading Hadoop distribution based on ODP
Pivotal HAWQ – brings the power of MPP to the Hadoop cluster; best-in-class SQL-on-Hadoop solution
Apache Spark – component of the Pivotal HD distribution; modern framework for distributed data processing
Pivotal and Modern Data Architecture
Pivotal Greenplum – leading analytical MPP database; foundation for enterprise data warehousing systems and advanced analytics
Pivotal and Modern Data Architecture
[Diagram: the same Pivotal architecture, with part of it labelled "Data Lake"]
Pivotal and Modern Data Architecture
[Diagram: the same Pivotal architecture, with part of it labelled "Lambda Architecture"]
Pivotal and Modern Data Architecture
[Diagram: the same Pivotal architecture, with part of it labelled "Streaming"]