data lakes - past, present and future


Upload: karl-seiler

Post on 17-Jan-2017


Category:

Software



TRANSCRIPT


DATA LAKES: PAST, PRESENT, FUTURE

Welcome. Intros for Dr. Eric Little, VP Data Science for OSTHUS, and Ben Park, director of software development at Thales. I worked with both of these very bright, energetic and innovative people for a stint at Modus Operandi.

Tonight we talk about data lakes, a very popular approach to big data.

I'll set up the topic. Eric will present, then Ben. I'll ask a few questions to kick it off and open the floor for questions and discussion.


DATA LAKES: PAST, PRESENT, FUTURE

Data lakes: from whence they sprang, what they are, and what's possibly out over the horizon on this trendline.

Watch out!

words

it's sort of a long story

stories

writing

lists / tables

Let's go back and rewalk history. We started with short phrases that symbolized something that mattered. A major leap forward was stories, myths and shared ideas at scale. At some point we needed to retain bigger and bigger stories and lists, so we invented writing. This naturally led to the construct of a table: rows of information that clumped together, and columns that described the contents of a row in a consistent way. Lists and tables really caught on.

Who made a list today?

ton-o-tables

Databases, Spreadsheets, Hierarchical, Relational, AV pairs

Rare, Expensive, Big (in size), Small in storage

Data warehouses, Data marts, Data stores, Hub/spoke, Stars

Common, Affordable, Smaller (in size), Bigger in storage

Long vector stores, Blogs, NoSQL, Triples & Quads

Anywhere, Cheap, Tiny (in size), Massive in storage

Tables led to more and more tables, stacks of tables. This got unwieldy. Computers came along; they were rare, expensive and physically big, and storage was at a premium. Databases and spreadsheets came along, and we experimented with many forms. Data warehouses, marts and data stores were invented to allow multiple tables to be cross-linked by keys or indexes. Joins across tables allowed indirect references. Computing marched along and

DATA LAKE

Storing BIG data

Variant structural forms, usually object blobs

Raw source data, native formats & processed results

Agile, nimble, as-needed, transient transforms

Massive, low-cost (?)

Data Lake is almost the prototypical use-case for the Hadoop technology stack.

Hype alerts are sounding.

So now, in the era of cloud and Hadoop clusters, we see the rise of data lakes as the quintessential, prototypical use case for the Hadoop stack.

Big storage on clusters, commodity hardware, elastic expansion: start small and scale. Heterogeneous data typing.

Base, source, raw data plus transformed, processed data, together in the same lake, created as needed, when needed; we called this late binding. Lighter weight than DBA-centric data warehousing.
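To make late binding concrete, here is a minimal Python sketch, not anything from the talk itself. It assumes a toy in-memory list standing in for the lake; the names `lake`, `read_records` and `temperatures`, and the sensor data, are hypothetical. Raw payloads stay in their native formats, and a schema is applied only when someone asks a question.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": raw payloads kept in their native formats.
lake = [
    {"format": "json", "raw": '{"sensor": "a1", "temp_c": 21.5}'},
    {"format": "csv", "raw": "sensor,temp_c\nb2,19.0\nb3,22.25"},
]


def read_records(entry):
    """Parse one raw entry only at read time (late binding)."""
    if entry["format"] == "json":
        yield json.loads(entry["raw"])
    elif entry["format"] == "csv":
        for row in csv.DictReader(io.StringIO(entry["raw"])):
            yield row
    else:
        raise ValueError(f"unknown format: {entry['format']}")


def temperatures(lake_entries):
    """Bind a schema (sensor -> temp_c as float) when the question is asked."""
    result = {}
    for entry in lake_entries:
        for rec in read_records(entry):
            result[rec["sensor"]] = float(rec["temp_c"])
    return result


readings = temperatures(lake)

# Processed results can be written back into the same lake, next to the raw data.
lake.append({"format": "json", "raw": json.dumps(readings)})
```

Nothing was modeled up front: the raw entries were stored as-is, and the transform exists only because a consumer needed it. A warehouse would have required the target table design before any data could be loaded.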

More agile, nimble

A lower-cost approach based on assembly of open source parts, home-grown or out of the box. Time will tell on true TCO.


"If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

James Dixon, Pentaho chief technology officer

DATAMARTS & DATAWAREHOUSES: subsets, silos, structured, purpose-fit

What are data lakes

The bottled water analogy

Data warehouses are, in contrast, subsets of data: siloed, purpose-fit and structured, with a good degree of management required to keep them current and useful for the business.

Everything generates data (IoT++)

Nothing is tossed: raw, interim, results

Reengineered on the fly for streaming, analysis, learning

Where is porous

Self-optimizing, self-describing with autonomous derivations

New equilibriums between centralized and edge/fog

Ownership and threads by BlockChain

DNA


Karl Seiler | [email protected] | @pivitguru
