data lakes - past, present and future
TRANSCRIPT
DATA LAKES: PAST, PRESENT, FUTURE
Welcome. Intros for Dr. Eric Little, VP Data Science for OSTHUS, and Ben Park, Dir. of S/W Dev at Thales. I worked with both of these very bright, energetic and innovative people for a stint at Modus Operandi.
Tonight we talk about data lakes, a very popular approach to big data.
I'll set up the topic. Eric will present, then Ben. I'll ask a few questions to kick it off, then open the floor for questions and discussion.
1
DATA LAKES: PAST, PRESENT, FUTURE
Data Lakes
From whence they sprang, what they are, and what's possibly out over the horizon on this trendline.
2
Watch out! It's sort of a long story.
words
stories
writing
lists / tables
Let's go back and rewalk history. We started with short phrases that symbolized something that mattered. A major leap forward was stories, myths and shared ideas at scale. At some point we needed to retain bigger and bigger stories and lists, so we invented writing. This naturally led to the construct of a table: rows of information that clumped together, and columns that described the contents of a row in a consistent way. Lists and tables really caught on.
Who made a list today?
3
ton-o-tables
Databases, Spreadsheets, Hierarchical, Relational, AV pairs: Rare, Expensive, Big (in size), Small in storage
Datawarehouses, DataMarts, DataStores, Hub/spoke, Stars: Common, Affordable, Smaller (in size), Bigger in storage
Long vector stores, Blogs, NoSQL, Triples & Quads: Anywhere, Cheap, Tiny (in size), Massive in storage
Tables led to more and more tables, stacks of tables. This got unwieldy. Computers came along; they were rare, expensive and physically big, and storage was at a premium. Databases and spreadsheets came along, and we experimented with many forms. Datawarehouses, marts and datastores were invented to allow multiple tables to be cross-linked by keys or indexes. Joins across tables allowed indirect references. Computing marched along and
4
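As a minimal sketch of the cross-linking mentioned in the notes above, here is how two tables tied together by a key can be joined, using Python's built-in sqlite3 module. The table and column names (customers, orders, customer_id) and the sample rows are made up for illustration and are not from the talk.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tables linked by a key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),  -- the cross-link
            item TEXT
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES
            (10, 1, 'ledger'), (11, 2, 'abacus'), (12, 1, 'quill');
        """
    )
    return conn


def items_by_customer(conn):
    """Join orders to customers so each order row can name its customer."""
    return conn.execute(
        """
        SELECT c.name, o.item
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
        """
    ).fetchall()


if __name__ == "__main__":
    for name, item in items_by_customer(build_demo_db()):
        print(name, item)
```

The orders table never stores a customer name; it reaches the name indirectly through the key, which is the kind of indirect reference the join provides.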
DATA LAKE
storing BIG data
variant structural forms, usually object blobs
Raw source data, native formats & processed results
Agile, nimble, as-needed, transient transforms
massive, low-cost (?)
Data Lake is almost the prototypical use-case for the Hadoop technology stack.
Hype alerts are sounding.
So now, in the era of cloud and Hadoop clusters, we see the rise of data lakes as the quintessential, prototypical use case for the Hadoop technology stack.
Big storage on clusters, commodity hardware, elastic expansion: start small and scale. Heterogeneous data typing.
Base, source, raw data plus transformed, processed data, together in the same lake, created as needed, when needed. We called this late binding. Lighter weight than DBA-centric data warehousing.
More agile, nimble
A lower-cost approach based on assembling open source parts, whether home-grown or out of the box. Time will tell on the true TCO.
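To make the late binding idea in these notes concrete, here is a small sketch in Python: raw records are kept exactly as they arrived, and a structure is applied only when someone reads them. The record layout, field names (device, temp_c, temperature, unit) and the helper functions are hypothetical and chosen only for illustration; a real lake would sit on cluster storage rather than an in-memory list.

```python
import json

# Raw events are stored as-is, in whatever shape each source emitted them.
RAW_EVENTS = [
    '{"device": "pump-1", "temp_c": "21.5"}',
    '{"device": "pump-2", "temperature": 68.0, "unit": "F"}',
    "not even json",
]


def read_temperature_c(raw_line):
    """Apply a schema at read time; return (device, temp_c) or None."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # unusable for this view, but the raw line is kept
    if not isinstance(record, dict) or "device" not in record:
        return None
    if "temp_c" in record:
        return record["device"], float(record["temp_c"])
    if record.get("unit") == "F" and "temperature" in record:
        celsius = (float(record["temperature"]) - 32) * 5 / 9
        return record["device"], round(celsius, 1)
    return None


def late_bound_view(raw_lines):
    """Build a transient, as-needed view without altering the raw data."""
    parsed = (read_temperature_c(line) for line in raw_lines)
    return [row for row in parsed if row is not None]


if __name__ == "__main__":
    print(late_bound_view(RAW_EVENTS))
```

Nothing is cleaned or rejected on the way in; a different reader could bind a different schema to the same raw lines later, which is the contrast with up-front, DBA-managed modeling.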
5
"If you think of a datamart as a store of bottled water cleansed and packaged and structured for easy consumption the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
James Dixon, Pentaho chief technology officer
DATAMARTS & DATAWAREHOUSES
subsets
silos
structured
purpose-fit
What are data lakes?
The bottled water analogy
Datawarehouses, in contrast, are subsets of data: siloed, purpose-fit and structured, with a good degree of management required to keep them current and useful for the business.
6
Everything generates data (IoT++)
Nothing is tossed: raw, interim, results
Reengineered on the fly for streaming, analysis, learning
Where is porous
Self-optimizing, self-describing with autonomous derivations
New equilibriums between centralized and edge/fog
Ownership and threads by BlockChain DNA
7
Karl Seiler | [email protected] | @pivitguru
8