Ceph Days 2014 Paul Evans Slide Deck
![Page 1: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/1.jpg)
BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID
Paul Evans principal architect
daystrom technology group [email protected]
san jose 2014
ceph days
![Page 2: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/2.jpg)
Why build a data grid (or data lake) ?
…because we have a data FLOOD in process
![Page 3: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/3.jpg)
indeed, we love data…
we’re good at generating more and more, but…
( we never seem to throw any of it out )
too FAST
too many VARIANTS
too MUCH
![Page 4: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/4.jpg)
IS THE ANSWER TO ALL OF THIS… “WE NEED LESS DATA!”
are you crazy? we live to store things!
we just need better tools… (and more storage)
![Page 5: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/5.jpg)
DATA AUTOMATION
Workflow Automation
Wildly-Scalable Storage
Data Lake Data Grid
STACK
![Page 6: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/6.jpg)
DATA LAKE
“a storage repository that holds a vast amount of raw data in its native format until it is needed”
![Page 7: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/7.jpg)
DATA LAKE - ORIGINS
First use credited to James Dixon, CTO at Pentaho, circa 2010
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…”
“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
![Page 8: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/8.jpg)
DATA LAKE - EXPLAINED
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
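The flat layout described on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not anything shown in the talk: the `FlatLake` class, its method names, and the sample tags are all hypothetical. It shows the three ideas from the slide: no folder hierarchy, a unique identifier per element, and a query over metadata tags that returns a smaller working set.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element: opaque bytes plus free-form metadata tags."""
    data: bytes
    tags: dict = field(default_factory=dict)


class FlatLake:
    """Hypothetical in-memory stand-in for a flat object store."""

    def __init__(self):
        # unique id -> LakeObject; there is no folder hierarchy
        self._objects = {}

    def put(self, data, **tags):
        """Store raw bytes under a fresh unique identifier and return it."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(data, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]

    def get(self, object_id):
        return self._objects[object_id].data


lake = FlatLake()
log_id = lake.put(b"GET /index.html 200", source="weblog", year=2014)
lake.put(b"\x89PNG...", source="camera", year=2014)
lake.put(b"GET /about.html 404", source="weblog", year=2013)

# A "business question" becomes a tag query that narrows the lake
# down to the subset worth analyzing.
matches = lake.query(source="weblog", year=2014)
```

In a Ceph-backed lake the dictionary would presumably be replaced by an object pool, with tags kept as object metadata; that mapping is an assumption here, not something the slide states.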
![Page 9: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/9.jpg)
DATA LAKE - WHY ???
?
![Page 10: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/10.jpg)
DATA LAKE CHARACTER
Unwashed Data: schema-on-read from RAW source
Flexible Processing: batch, interactive, online, search
MetaData Dependent: tag it or lose it
Common Access: hdfs-centric toolset
…in other words: this is not a glass-house Data Mart
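As a rough illustration of "schema-on-read from RAW source", the sketch below keeps records in whatever format they arrived in and applies a schema only when a reader asks for them. The record layout, the `read_with_schema` helper, and the `usage_schema` mapping are hypothetical names chosen for this example.

```python
import csv
import io
import json

# Raw records land in the lake untouched; the only thing added at
# ingest time is a metadata tag naming the format ("tag it or lose it").
raw_store = [
    {"format": "json", "raw": '{"user": "ann", "bytes": "512"}'},
    {"format": "csv", "raw": "user,bytes\nbob,2048\n"},
]


def read_with_schema(record, schema):
    """Parse one raw record and coerce its fields to the reader's schema."""
    if record["format"] == "json":
        rows = [json.loads(record["raw"])]
    elif record["format"] == "csv":
        rows = list(csv.DictReader(io.StringIO(record["raw"])))
    else:
        raise ValueError(f"no parser for format {record['format']!r}")
    # The schema is applied here, at read time, not when the data was stored.
    return [{name: cast(row[name]) for name, cast in schema.items()} for row in rows]


usage_schema = {"user": str, "bytes": int}
rows = [row for rec in raw_store for row in read_with_schema(rec, usage_schema)]
```

A different reader could pass a different schema over the same raw records, which is the flexibility a pre-cleansed data mart gives up.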
![Page 11: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/11.jpg)
A REFERENCE ‘LAKE’ ARCHITECTURE
OPERATIONS
SECURITY
DATA ACCESS
GOVERNANCE
INTEGRATION
DATA MANAGEMENT
![Page 12: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/12.jpg)
A CEPHALOPOD IN THE LAKE?
| If this is important… | Use this… |
| --- | --- |
| Hadoop-native | HDFS |
| Locality-aware | HDFS |
| Distributed Name Svc | Ceph |
| Native Erasure Coding | Ceph |
| 20% Faster * | Ceph |

* on Terasort benchmark over IB, Mar 2014
![Page 13: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/13.jpg)
(LAKE) DREDGERS
daystrom technology group
![Page 14: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/14.jpg)
DATA GRID
“the unifying layer to how content and data are stored, protected, located and accessed”
![Page 15: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/15.jpg)
DATA GRID - ORIGINS
The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.
![Page 16: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/16.jpg)
DATA GRID - EXPLAINED
Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport.
Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
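One way to picture small, reusable tasks bound to policies is a rule list that a control loop evaluates against every object in the namespace. The sketch below is an assumption-laden toy, not the grid software used in the talk: `GridObject`, `Policy`, `enforce`, the tier names, and the 90-day threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GridObject:
    name: str
    tier: str               # illustrative tier label, e.g. "hispeed" or "cold"
    days_since_access: int


@dataclass
class Policy:
    name: str
    applies: Callable[[GridObject], bool]   # condition that triggers the policy
    action: Callable[[GridObject], None]    # small task run when it triggers


def move_to_cold(obj):
    """Micro-service style task: retag the object as cold storage."""
    obj.tier = "cold"


policies = [
    Policy(
        name="archive-idle",
        # Hypothetical rule: fast-tier data untouched for 90+ days goes cold.
        applies=lambda o: o.tier == "hispeed" and o.days_since_access > 90,
        action=move_to_cold,
    ),
]


def enforce(objects, policies):
    """Apply every matching policy and report which ones fired."""
    fired = []
    for obj in objects:
        for policy in policies:
            if policy.applies(obj):
                policy.action(obj)
                fired.append((policy.name, obj.name))
    return fired


catalog = [
    GridObject("run-001.dat", "hispeed", 200),
    GridObject("run-002.dat", "hispeed", 3),
]
fired = enforce(catalog, policies)
```

In a real grid this loop would run on many nodes at once rather than in one process; the single loop here only shows the condition-plus-action shape of a policy.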
![Page 17: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/17.jpg)
DATA GRID - WHY ???
![Page 18: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/18.jpg)
DATA GRID - ATTRIBUTES
Data Virtualization: common presentation of all content
Universe-size Namespace: for files, objects & metadata
Automation of Data Operations: distributed, scalable
Policy Mgmt/Reporting: data valuation & action triggers
![Page 19: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/19.jpg)
CEPH MEETS GRID
implemented:
[Diagram: DATA GRID unified namespace over a HiSpeed Tier, Cold Storage, Archive, and Remote Cloud. Connection labels: CephFS & RBD, Ceph libRADOS, Direct Link, LIBRADOS + Ceph, RBD]
![Page 21: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/21.jpg)
TIME 2 SUMMARIZE…
We are in the midst of a Data Explosion
We need robust, expandable, yet simple solutions to store data
We also need effective, de-centralized ways to care for the data
![Page 22: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/22.jpg)
DATA AUTOMATION
STACK
Workflow Automation
Wildly-Scalable Storage
Ceph+
the SMART approach
Data Lake Data Grid
![Page 23: Ceph Days 2014 Paul Evans Slide Deck](https://reader034.vdocument.in/reader034/viewer/2022052507/5589e7ffd8b42ad6378b4596/html5/thumbnails/23.jpg)
thank you!
Paul Evans principal architect
daystrom technology group
san jose ceph days