Building a Data Lake: An App Dev’s Perspective
GeekNight Hyderabad - March 8th 2017
Geetha Balasundaram
© 2017 ThoughtWorks Technologies Pvt. Limited
ABOUT ME
Developer @ ThoughtWorks
Building a data lake in the enterprise ecosystem
Helping a retail business make sense of it (data-guided org)
Been part of the web development space (enterprise rewrite)
As startled as everyone else by the data engineering space
Sharing know-hows and do-hows from our team’s experience
AGENDA
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem…
How to make effective use of a data lake: technology+process+people
Cluster Administration tool - Cloudera Manager
Pitfalls to avoid
Question ???
How did R.Ashwin perform in the last Test match?
HIGH LEVEL
PROBLEM STATEMENT
COMPLEX HISTORICAL DATA
Why?
Exploit and derive as many new insights as possible
Match Made
Enterprise systems produce this kind of complexity
DATA WAREHOUSE
https://martinfowler.com/articles/microservices.html
ETL
DID MICROSERVICES CAUSE THIS PROBLEM ?
Decentralised Data
https://martinfowler.com/articles/microservices.html
MICROSERVICES HELPED
Break down business units
Break down complexity
Understand the nature of data
Question ???
R.Ashwin performed well ( 6/41 ) in yesterday’s match!
Complex historical data can quantify how well he has performed
Can we say why he did well in this particular match? What factors contributed to his improved performance?
FACT is a FACT
… even when we don’t know how it can be used
KEY DIFFERENCE
https://martinfowler.com/bliki/DataLake.html
What is a data lake?
LAKE is...
.. a large body of water in a more natural state.
The contents of the lake, stream in from a source to fill the lake,
and various users of the lake can come to examine, dive in, or
take samples
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
DATA LAKE is...
.. a large body of data facts in a more natural state.
The contents of the lake stream in from a source to fill the lake,
and various users of the lake can come to analyse, build models,
or use a subset for specific use cases
KEY DIFFERENCE
https://martinfowler.com/bliki/DataLake.html
Implementation
OUR IMPLEMENTATION - TECH STACK
DATA SOURCE
DATA INGESTION
DATA LAKE
DATA MARTS
DATA ANALYSIS
Staging / Queue
How to make effective use of a data lake:
technology+process+people
Functionality Vs Reality
I need a feature so that I can do this action…..
to
I need this insight so that I can take this action….
e.g.: I need functionality to order items any time before or during a promotion…
to
I need to know in time whether I have to order items before or during a promotion…
so that I can improve promotion sales
People
Start Simple
There is no data lake yet…
Carve out portions of data that are easy wins yet critical to
arriving at the insight stated earlier.
Set up the infrastructure and pipeline
Get your hands dirty..
e.g.: Sales is an important factor when analysing or predicting anything in the retail space.
Technology
How much should I know about the data?
As a consumer of data (read ‘not a consumer of service’)
How much should I know about it?
Schema ⇔ Contracts
Nature of the data:
versioned vs latest
transactional vs reference
facts vs aggregates
frequency of change
…..
Technology
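One way to read "Schema ⇔ Contracts" is that a data consumer checks incoming records against an agreed schema, much as a service consumer relies on an API contract. The sketch below is only an illustration: the field names in SALES_CONTRACT and the helper contract_violations are hypothetical and not taken from the talk.

```python
# Hypothetical contract for a sales record; field names are illustrative only.
SALES_CONTRACT = {
    "store_id": str,
    "item_id": str,
    "quantity": int,
    "sold_at": str,  # ISO-8601 timestamp kept as text in the raw record
}


def contract_violations(record, contract):
    """Return readable violations; an empty list means the record conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems
```

A check like this could run at ingestion time, so that a schema change at the source shows up as a visible violation rather than as silently wrong analysis later.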
DATA INSIGHT - Part 1
Incrementally add new data to the lake
Serve data for analysis
e.g.: What promotion-related data do I need to bring into the data lake?
Sales → improve promotion sales
Technology
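To make "incrementally add, then serve" concrete, here is a minimal sketch that appends each batch as a new file under a date partition and reads the whole dataset back for analysis. It uses the local filesystem and JSON lines purely for illustration; the directory layout and the names append_batch and read_dataset are assumptions, not the stack described in the talk.

```python
import json
from pathlib import Path


def append_batch(lake_root, dataset, ingest_date, records):
    """Write one batch as a new file under a date partition.

    Files already in the partition are never rewritten, so earlier
    facts stay exactly as they arrived.
    """
    partition = Path(lake_root) / dataset / f"ingest_date={ingest_date}"
    partition.mkdir(parents=True, exist_ok=True)
    batch_number = len(list(partition.glob("batch-*.jsonl")))
    batch_file = partition / f"batch-{batch_number:05d}.jsonl"
    with batch_file.open("w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return batch_file


def read_dataset(lake_root, dataset):
    """Serve every record of a dataset, oldest partition first."""
    rows = []
    pattern = "ingest_date=*/batch-*.jsonl"
    for path in sorted((Path(lake_root) / dataset).glob(pattern)):
        with path.open(encoding="utf-8") as src:
            rows.extend(json.loads(line) for line in src)
    return rows
```

Keeping ingestion append-only means a new source (promotions, for example) can be added as another dataset without touching the sales data already in the lake.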
DATA INSIGHT - Part 2
Sales + Promotions → improve promotion sales
How does adding more data to the lake help in arriving at new insights?
history of past promotion sales = how much to order for this promotion
history of past promotion sales + ‘X’ = how much to order for this promotion
history of past promotion sales + ‘X’ + ‘Y’ …… = how much to order for this promotion
e.g.: seasonality has a strong correlation with sales
history of past promotion sales + ‘X’ + ‘Y’ …… + ‘A’ = how much to order for this promotion after the start
People
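The progression "history + ‘X’ + ‘Y’ …" can be sketched as a baseline estimate that is refined as each new factor lands in the lake. The slides do not specify a model, so the version below assumes, for illustration only, that the baseline is the mean of past promotion sales and that each extra factor (such as seasonality) acts as a simple multiplier. The function name estimate_order_quantity is hypothetical.

```python
from statistics import mean


def estimate_order_quantity(past_promotion_sales, factors=()):
    """Estimate units to order for a promotion.

    Baseline: mean units sold across past promotions.
    Each entry in `factors` is an assumed multiplicative adjustment,
    e.g. 1.25 for a season in which sales run 25% higher.
    """
    if not past_promotion_sales:
        raise ValueError("need at least one past promotion")
    estimate = mean(past_promotion_sales)
    for multiplier in factors:
        estimate *= multiplier
    return round(estimate)
```

With only sales history, estimate_order_quantity([100, 120, 140]) gives the plain average; passing factors=(1.25,) shows how one additional dataset changes the answer without changing the pipeline.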
Think Agile
Sales + Promotions + X factor → improve promotion sales
Near perfect list of parameters
Progressive set of parameters
Sales + Promotions → is the quantity arrived at from these factors (known to the business) ordered on time?
Process
DataMarts
... as a store of bottled water – cleansed and packaged and
structured for easy consumption
DataMarts
... as a store of data subset - curated from meaningful facts
bundled into logical groups for arriving at useful insights
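As a rough illustration of curating a subset of facts into a logical group, the sketch below builds a tiny "promotion sales" mart from raw sales facts. The record shapes (item_id, date, quantity, promo_id, start, end) and the function build_promotion_sales_mart are assumptions made for this example; dates are ISO-8601 strings so that plain string comparison orders them correctly.

```python
from collections import defaultdict


def build_promotion_sales_mart(sales_facts, promotions):
    """Total the units sold per promotion.

    Only sales of a promoted item that fall inside the promotion
    window are kept; every other fact stays in the lake untouched.
    """
    totals = defaultdict(int)
    for sale in sales_facts:
        for promo in promotions:
            same_item = sale["item_id"] == promo["item_id"]
            in_window = promo["start"] <= sale["date"] <= promo["end"]
            if same_item and in_window:
                totals[promo["promo_id"]] += sale["quantity"]
    return dict(totals)
```

The mart is derived data: it can be dropped and rebuilt from the lake whenever the grouping logic changes.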
Easy Insight
Sales + Promotions →
is the quantity arrived at from these factors (known to the business) ordered on time?
System: tells me the quantity that is supposed to be ordered for this promotion
System: tells me in real time the quantity that has been ordered
Technology
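The two "System" statements amount to comparing an expected quantity with the quantity ordered so far. A minimal sketch of that comparison follows; the function order_status and its output keys are hypothetical, and where the two numbers come from is left out.

```python
def order_status(expected_quantity, ordered_quantity):
    """Compare what should be ordered with what has been ordered so far."""
    shortfall = max(expected_quantity - ordered_quantity, 0)
    return {
        "expected": expected_quantity,
        "ordered": ordered_quantity,
        "shortfall": shortfall,
        "on_track": shortfall == 0,
    }
```

A non-zero shortfall is the actionable insight: it tells the business that more stock should be ordered while there is still time.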
Cluster Administration Tool
Cloudera Manager
Think DevOps
Scale | Performance | Memory | Resource Contention | Optimization | Stability
Need for an ecosystem - to monitor how well the different tools
play together without chaos
Tools
QUICK RECAP
What is data in the true sense…
Data Warehouse in an enterprise ecosystem...
What is a data lake...
Data lake implementation in an enterprise ecosystem...
How to make effective use of a data lake…
Cluster Administration tool - Cloudera Manager
PITFALLS TO AVOID
Data envy - Ref: https://martinfowler.com/bliki/Datensparsamkeit.html
Tool envy
Reliable data is a luxury
Understanding the nature of data is a must
Dialogue with the data scientist
Treating the data lake like an RDBMS
Keeping the business involved
Data flow state visibility