Hadoop for the Masses
General use and the Battle of Big Data
Presented by Amandeep Modgil (@amandeepmodgil) and David Hamilton (@analyticsanvil)
Date: 1 September 2016
Amandeep Modgil & David Hamilton – 1 September 2016
We’ll share our experience rolling out a Hadoop-based data lake to a self-service audience within
a corporate environment.
Agenda
1 About us
2 Birth of a Data Lake
3 Security
4 Governance
5 Change management
6 Learnings for making Hadoop work in the enterprise
1 About us
About us: Our background
2 Birth of a Data Lake
Birth of a data lake: Background
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
  » Accessibility
  » Accuracy
  » Performance
Birth of a data lake: Project initiation
Timeline
› Q4-2014: Feasibility
› Q1-2015: Kick off
› Q2-2015: Infra go live
› Q3-2015: Data ingestion
› Q2-2016: Go live
Feasibility (Q4-2014)
› Technical and business requirements
› Architecture design and roadmap
› Decision to implement Hadoop
› POCs (functionality, integration)
Kick off (Q1-2015)
Birth of a data lake: Data landscape (conceptual diagram)
[Diagram labels: Source Systems; RDBMS Application; SAP Application; API; Database Replication*; Windows Azure storage; Data Lake* (Hortonworks HDP); EDW; ODS; Analytical Systems]
* New components
Birth of a data lake: Project initiation
Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use-case
› Aims:
  » Multiple legacy sources → Unified data lake
  » Batch bottlenecks → Parallel, scalable
  » ETL heavy landscape → Schema on read, unstructured data
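The "schema on read" aim can be illustrated with a small sketch. This is not the presenters' code and does not use Hive; it only shows the idea in plain Python: raw data is stored untouched and a schema is applied when the data is read. The sample data, column names, and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw file, landed in the lake as-is (no schema enforced on write).
RAW = "1001,2016-08-30,42.5\n1002,2016-08-31,n/a\n"

# Hypothetical schema, applied only at read time: (column name, parser).
SCHEMA = [("record_id", int), ("event_date", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text, applying the schema while reading.

    A value that does not fit its declared type becomes None instead of
    failing the whole read, so bad records do not block ingestion.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, parse), value in zip(schema, fields):
            try:
                row[name] = parse(value)
            except ValueError:
                row[name] = None  # unparseable value surfaces as a null
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

The same raw text could later be read with a different schema without reloading it, which is the contrast with an ETL-heavy landscape where the structure is fixed before loading.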
Taming the elephant
Challenges in the enterprise…
› Security
› Governance
› Change Management
3 Security
Security: Challenges in the enterprise
› Data security
› Secure infrastructure
› Provisioning access
Security: Our experience
› Filesystem security is essential
  » Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
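As a rough illustration of what group-based access provisioning for Hive tables involves, the sketch below models default-deny policies in plain Python. It is not Ranger's API or policy format, and the `Policy` class, group names, and database names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which groups may run which actions on which tables."""
    database: str
    table: str          # "*" matches every table in the database
    groups: frozenset
    actions: frozenset


# Hypothetical policies, for illustration only.
POLICIES = [
    Policy("finance", "*", frozenset({"finance_analysts"}), frozenset({"select"})),
    Policy("sandbox", "*", frozenset({"finance_analysts", "data_engineers"}),
           frozenset({"select", "create", "drop"})),
]


def is_allowed(user_groups, database, table, action, policies=POLICIES):
    """Default deny: allow only if a policy matches resource, group, and action."""
    for policy in policies:
        if policy.database != database:
            continue
        if policy.table not in ("*", table):
            continue
        if action in policy.actions and policy.groups & set(user_groups):
            return True
    return False
```

In this sketch, `is_allowed({"finance_analysts"}, "finance", "invoices", "select")` is permitted, while the same group asking to `drop` that table is refused. Tying policies to directory groups rather than individual users is one way to keep provisioning manageable for a self-service audience.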
4 Governance
Governance: Challenges in the enterprise
› Platform reliability
› Data quality
› Keeping the lake “clean”
Governance: Our experience
› Naming standards essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
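Naming standards are easier to keep when they can be checked automatically. The sketch below assumes a hypothetical convention of `<zone>_<source>_<entity>` in lower case. The talk does not state the actual standard, so the pattern and the zone names are illustrative only.

```python
import re

# Hypothetical convention: <zone>_<source>_<entity>, lower case,
# for example raw_sap_invoice. The zones are assumptions, not from the talk.
NAME_PATTERN = re.compile(r"^(raw|clean|mart)_[a-z0-9]+_[a-z][a-z0-9_]*$")


def check_table_names(names):
    """Return the table names that break the naming convention."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


violations = check_table_names(
    ["raw_sap_invoice", "Temp_Table1", "mart_edw_sales_daily"]
)
```

A check like this could run as part of code management, so that non-conforming tables are flagged before they reach the shared lake.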
5 Change management
Change management: Challenges in the enterprise
› Requirements gathering
› User education
› Expectation management
Change management: Our experience
› Explain platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences with existing platforms
  » Performance
  » Functionality
› Anticipate different user groups
6 Learnings for making Hadoop work in the enterprise
Learnings for making Hadoop work in the enterprise: Understand the scale of the challenge
[Chart: perceived difficulty / effort plotted against complexity. Labels: Deploying a new tool; Understanding parallel concepts; Deploying for the enterprise; Security integration; Building and governing for general use]
Learnings for making Hadoop work in the enterprise: Our experience
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration worthwhile, smoothness not guaranteed with all tools
  » Other data platforms
  » ETL tools
  » Front-end tools
Learnings for making Hadoop work in the enterprise: Strengths and opportunities
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
Questions?
Contact us
› https://au.linkedin.com/in/amandeep-modgil
› https://au.linkedin.com/in/davidhamiltonau
Image credits
› ‘img_9646’ by Leonid Mamchenkov, https://www.flickr.com/photos/mamchenkov/2955225736, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Bicycle Security’ by Sean MacEntee, https://www.flickr.com/photos/smemon/9565907428, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Traffic Cop’ by Eric Chan, https://www.flickr.com/photos/maveric2003/27022816, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘restoration’ by zoetnet, https://www.flickr.com/photos/zoetnet/5944551574, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.