Introducing Data Lakes Pravin Singh

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Introducing Data Lakes

Introducing Data Lakes

Pravin Singh

Page 2: Introducing Data Lakes

Why?

• Once upon a time, there was a Data Warehouse
– Data pre-categorized at the point of entry
– Data well organized, but in silos
– Common, predetermined data model for “optimal” analysis
– Upfront DB modelling and ETL effort
– A single-source-of-truth, but at the cost of flexibility
– Complex system with low tolerance for human error, IT help required for even the smallest enhancements
– Not to forget, the high costs
• Then came the Big Bang, of Information!
• Data Lake to the Rescue

Page 3: Introducing Data Lakes

What?

Source: PwC

Page 4: Introducing Data Lakes

Benefits

• Breaks the silos
• Flexible Data Model (Schema on Read)
• Data Provenance
• No upfront modeling and data cleansing
• Low cost of ownership
• Focused on exploration, not on operations
• Can work as staging area for ETL
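To make "Schema on Read" concrete, here is a minimal sketch in plain Python. It is not tied to any particular data lake product; the event records, field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration. Raw records are stored exactly as they arrive, and each consumer applies its own schema at read time.

```python
import json

# Hypothetical raw events, stored as they arrived (no upfront modeling or cleansing).
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": 7, "coupon": "SPRING"}',
    '{"user": "carol"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Fields missing from a
    record come back as None; fields not in the schema are ignored.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers read the same raw data through different schemas.
billing_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
marketing_view = read_with_schema(
    RAW_EVENTS, {"user": str, "channel": str, "coupon": str}
)
```

The point of the sketch is that neither view required changing how the data was stored; a warehouse-style design would instead have fixed one schema before loading.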

Page 5: Introducing Data Lakes

Pitfalls and Challenges

• Data Lake as Data Graveyard
• Metadata
• Governance
• Information Lifecycle Management (ILM)
• Security and Privacy
• Training
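One way to picture how metadata and lifecycle rules help keep a lake from turning into a graveyard is a small catalog that is audited regularly. The sketch below is an assumption-laden illustration, not a real catalog API: the `DatasetEntry` fields, the paths, the owners, and the `audit` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    path: str
    owner: str
    source: str
    ingested_on: date
    retention_days: int
    description: str = ""

    def expires_on(self):
        # Simple lifecycle rule: data expires a fixed number of days after ingestion.
        return self.ingested_on + timedelta(days=self.retention_days)


def audit(catalog, today):
    """Flag datasets with no description and datasets past their retention date."""
    undocumented = [e.path for e in catalog if not e.description.strip()]
    expired = [e.path for e in catalog if e.expires_on() < today]
    return {"undocumented": undocumented, "expired": expired}


# Illustrative entries only; paths and owners are made up.
CATALOG = [
    DatasetEntry("/lake/raw/clickstream", "web-team", "cdn-logs",
                 date(2017, 1, 10), 365, "Raw page-view events"),
    DatasetEntry("/lake/raw/tmp_export", "unknown", "manual-upload",
                 date(2016, 1, 5), 90),
]

report = audit(CATALOG, today=date(2017, 4, 16))
```

Here the second dataset is flagged on both counts, which is the kind of signal a governance process could act on (document it, archive it, or delete it).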

Page 6: Introducing Data Lakes

Lake Maturity

Source: PwC

Page 7: Introducing Data Lakes

Four Stages of Data Lake Adoption

1: Life Before Hadoop

– Applications stand alone with their databases
– Some applications contribute data to a data warehouse
– Analysts run reporting and analytics in data warehouse

Page 8: Introducing Data Lakes

Four Stages of Data Lake Adoption

2: Hadoop is Introduced

– Applications contribute data to Hadoop
– Hadoop runs batch MapReduce jobs
– Hadoop used for ETL into warehouse or analytic databases
– Hadoop data reintroduced into applications
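For readers unfamiliar with the batch MapReduce model mentioned above, the following is a single-process sketch of its three phases (map, shuffle, reduce) in plain Python. It does not use Hadoop or any Hadoop API; the log format, application names, and function names are hypothetical, and a real job would run these phases distributed across a cluster.

```python
from collections import defaultdict

# Hypothetical log lines contributed by applications: "<app> <event>".
LOG_LINES = [
    "billing invoice_created",
    "crm contact_updated",
    "billing payment_received",
    "crm contact_updated",
]


def map_phase(line):
    """Emit one (key, value) pair per input line: (application name, 1)."""
    app, _event = line.split(" ", 1)
    yield app, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


events_per_app = run_job(LOG_LINES)
```

The output of such a job (here, event counts per application) is the kind of aggregate that could then be loaded into a warehouse or fed back into applications.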

Page 9: Introducing Data Lakes

Four Stages of Data Lake Adoption

3: Growing the Data Lake

– Newly built systems center around Hadoop by default
– Applications use each other’s data via Hadoop
– Hadoop becomes a default data destination; governance and metadata become important
– Data warehouse use becomes the exception, where legacy or special requirements dictate

Page 10: Introducing Data Lakes

Four Stages of Data Lake Adoption

4: Data Lake and Application Cloud

– New applications are built on a Hadoop application platform around the data lake
– Hadoop matures as an elastic distributed data computing platform
– Data lake adds security and governance layers
– Data availability increases, application deployment time decreases
– Some apps still have special or legacy needs and execute independently

Page 11: Introducing Data Lakes

Questions?