2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
DESCRIPTION
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
TRANSCRIPT
![Page 1: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/1.jpg)
ELEPHANT AT THE DOOR: MODERN DATA ARCHITECTURE
Adam Muise – Solution Architect, Hortonworks
![Page 2: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/2.jpg)
Who am I?
![Page 3: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/3.jpg)
Who is Hortonworks?
![Page 4: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/4.jpg)
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
![Page 5: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/5.jpg)
We do Hadoop successfully.
Support
Professional Services
Training
![Page 6: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/6.jpg)
What is Hadoop? What is everyone talking about?
![Page 7: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/7.jpg)
Data
![Page 8: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/8.jpg)
“Big Data” is the marketing term of the decade in IT
![Page 9: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/9.jpg)
What lurks behind the hype is the democratization of Data.
![Page 10: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/10.jpg)
You need data.
![Page 11: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/11.jpg)
Data fuels analytics. Analytics fuels business decisions.
![Page 12: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/12.jpg)
So we save the data because we think we need it, but often we really don’t know what to do with it.
![Page 13: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/13.jpg)
We put away data, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
![Page 14: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/14.jpg)
You need value from your data. You need to make decisions from your data.
![Page 15: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/15.jpg)
So what are the problems with Big Data?
![Page 16: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/16.jpg)
Let’s talk challenges…
![Page 17: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/17.jpg)
Volume
![Page 18: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/18.jpg)
Volume (the word repeated across the slide)
![Page 19: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/19.jpg)
Volume (the word repeated many times across the slide)
![Page 20: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/20.jpg)
Volume (the word repeated many more times across the slide)
![Page 21: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/21.jpg)
Storage, Management, Processing all become challenges with Data at Volume
![Page 22: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/22.jpg)
Traditional technologies adopt a divide, drop, and conquer approach
![Page 23: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/23.jpg)
The solution?
Diagram labels: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW (each holding blocks of Data)
![Page 24: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/24.jpg)
Ummm…you dropped something
Diagram labels: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW, plus many more unlabeled Data blocks
![Page 25: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/25.jpg)
Analyzing the data usually raises more interesting questions…
![Page 26: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/26.jpg)
…which leads to more data
![Page 27: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/27.jpg)
Wait, you’ve seen this before.
Diagram labels: many Data blocks, Analytics Sausage Factory, more Data
![Page 28: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/28.jpg)
Data begets Data.
![Page 29: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/29.jpg)
What keeps us from our Data?
![Page 30: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/30.jpg)
“Prices, Stupid passwords, and Boring Statistics.” - Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
![Page 31: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/31.jpg)
Your data silos are lonely places.
Diagram labels: EDW, Accounts, Customers, Web Properties
![Page 32: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/32.jpg)
… Data likes to be together.
Diagram labels: EDW, Accounts, Customers, Web Properties
![Page 33: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/33.jpg)
Data likes to socialize too.
Diagram labels: EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, CDR, Weather Data
![Page 34: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/34.jpg)
New types of data don’t quite fit into your pristine view of the world.
Diagram labels: My Little Data Empire, Logs, Machine Data, and question marks
![Page 35: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/35.jpg)
To resolve this, some people take hints from Lord Of The Rings...
![Page 36: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/36.jpg)
…and create One-Schema-To-Rule-Them-All…
Diagram labels: EDW, Schema
![Page 37: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/37.jpg)
…but that has its problems too.
Diagram labels: EDW, Schema, ETL
![Page 38: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/38.jpg)
What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Data Lake.
Diagram labels: Data Sources, Process, Data Lake, Schema, EDW, BI & Analytics
![Page 39: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/39.jpg)
A Data Lake Architecture enables:
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
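Landing data without forcing a single schema is often called schema-on-read. A minimal Python sketch of the idea, assuming hypothetical record formats and field names that are not from the deck: raw records land untouched, and each consumer applies its own schema when it reads.

```python
import json

# Hypothetical raw records landed as-is; the formats and field names are
# illustrative assumptions, not taken from the deck.
RAW_LANDING_ZONE = [
    '{"source": "web", "user": "u1", "page": "/home"}',
    '{"source": "cdr", "caller": "u1", "seconds": 42}',
    '{"source": "web", "user": "u2", "page": "/pricing"}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply a schema at read time: keep records from one source and
    project only the fields this consumer cares about."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({name: record.get(name) for name in fields})
    return rows


# Two consumers read the same landed data through different schemas.
web_views = read_with_schema(RAW_LANDING_ZONE, "web", ["user", "page"])
call_records = read_with_schema(RAW_LANDING_ZONE, "cdr", ["caller", "seconds"])
```

Nothing is rejected at landing time; a new consumer with a new schema only needs another `read_with_schema` call.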
![Page 40: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/40.jpg)
In most cases, more data is better. Work with the population, not just a sample.
![Page 41: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/41.jpg)
Your view of a client today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
![Page 42: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/42.jpg)
Your view with more data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations
Tea Party Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate drunk Amazon shopper
Gene Expression for Risk Taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
![Page 43: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/43.jpg)
Pick up all of that data that was prohibitively expensive to store and use.
![Page 44: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/44.jpg)
Why do viewer surveys…
![Page 45: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/45.jpg)
…when raw data can tell you what button on the remote was pressed during what commercial for the entire viewer population?
![Page 46: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/46.jpg)
Why make separate risk assessments in separate data silos…
![Page 47: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/47.jpg)
…when you can do a risk assessment on the entire data footprint of the client?
![Page 48: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/48.jpg)
To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
![Page 49: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/49.jpg)
So what is the answer?
![Page 50: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/50.jpg)
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
![Page 51: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/51.jpg)
Hadoop was created because traditional technologies never cut it for the Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn
![Page 52: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/52.jpg)
Traditional architecture didn’t scale enough…
Diagram labels: App, DB, SAN
![Page 53: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/53.jpg)
Databases can become bloated and useless
![Page 54: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/54.jpg)
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
![Page 55: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/55.jpg)
So what is the answer?
![Page 56: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/56.jpg)
If you could design a system that would handle this, what would it look like?
![Page 57: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/57.jpg)
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
Diagram labels: Storage
![Page 58: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/58.jpg)
It would probably need a completely parallel processing framework that took tasks to the data…
Diagram labels: Storage, Processing
![Page 59: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/59.jpg)
It would probably run on commodity hardware, virtualized machines, and common OS platforms
Diagram labels: Storage, Processing
![Page 60: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/60.jpg)
It would probably be open source so innovation could happen as quickly as possible
![Page 61: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/61.jpg)
It would need a critical mass of users
![Page 62: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/62.jpg)
{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
![Page 63: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/63.jpg)
HDFS stores data in blocks and replicates those blocks
Diagram: Storage and Processing nodes holding three copies each of block1, block2, and block3
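A toy Python model of splitting a file into blocks and placing three copies of each, matching the three copies in the diagram. The tiny block size, the node names, and the round-robin placement are assumptions for illustration; this is not HDFS's actual placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a simplifying assumption for illustration;
    it is not the placement policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with a 4-byte block size yields three blocks.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
```

Because every block lives on three different nodes, losing any single node leaves at least two copies of each block.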
![Page 64: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/64.jpg)
If a block fails, HDFS still has the other copies and heals itself
Diagram: Storage and Processing nodes holding three copies each of block1, block2, and block3, with an X marking a failure
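The healing step can be sketched in the same toy style: when a node disappears, any block that has fallen below its target replica count is copied to a live node that does not hold it yet. The cluster state, node names, and the `heal` function are hypothetical, not HDFS internals.

```python
def heal(placement: dict, live_nodes: list, replication: int):
    """Drop replicas on dead nodes, then copy under-replicated blocks
    to live nodes that do not yet hold them."""
    healed = {}
    for block_id, holders in placement.items():
        survivors = [n for n in holders if n in live_nodes]
        if not survivors:
            # Replication protects against partial loss only.
            raise RuntimeError(f"block {block_id} lost: no replica survived")
        candidates = [n for n in live_nodes if n not in survivors]
        missing = replication - len(survivors)
        healed[block_id] = survivors + candidates[:max(missing, 0)]
    return healed


# Hypothetical cluster state: node2 has just failed.
placement = {
    "block1": ["node1", "node2", "node3"],
    "block2": ["node2", "node3", "node4"],
    "block3": ["node3", "node4", "node1"],
}
healed = heal(placement, live_nodes=["node1", "node3", "node4"], replication=3)
```

The `RuntimeError` branch marks the limit of the guarantee: if every copy of a block is lost at once, there is nothing left to copy from.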
![Page 65: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/65.jpg)
MapReduce is a programming paradigm that is completely parallel
Diagram labels: Data, Mapper, Reducer
![Page 66: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/66.jpg)
MapReduce has three phases: Map, Sort/Shuffle, Reduce
Diagram labels: Mapper, Reducer, Key, Value
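The three phases can be walked through in plain Python. Word count is used here as the example problem (an assumption; the deck does not name one), and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, value) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort/Shuffle: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)


lines = ["Data begets Data", "data likes data"]
# Each line could be mapped on a different machine; here it is a simple loop.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped))
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a real framework spread them across many machines.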
![Page 67: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/67.jpg)
MapReduce applies to a lot of data processing problems
Diagram labels: Data, Mapper, Reducer
![Page 68: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/68.jpg)
MapReduce goes a long way, but not all data processing and analytics are solved the same way
![Page 69: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/69.jpg)
Sometimes your data application needs parallel processing and inter-process communication
Diagram labels: Data, Process
![Page 70: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/70.jpg)
…like Complex Event Processing in Apache Storm
![Page 71: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/71.jpg)
Sometimes your machine learning data application needs to process in memory and iterate
Diagram labels: Data, Process
![Page 72: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/72.jpg)
…like Machine Learning in Spark
![Page 73: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/73.jpg)
Introducing Tez
![Page 74: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/74.jpg)
Tez is a YARN application, like MapReduce is a YARN application
![Page 75: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/75.jpg)
Tez is the Lego set for your data application
![Page 76: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/76.jpg)
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
![Page 77: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/77.jpg)
Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by
Diagram labels: Data, TezMap, TezReduce
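The shape of a Map – Reduce – Reduce chain can be sketched in plain Python. This is not the Tez API; the table, column names, and stage functions are hypothetical. With plain MapReduce, a group by followed by an order by would typically run as two separate jobs, with the intermediate result written out in between.

```python
from collections import Counter

# Hypothetical input rows for a query shaped like:
#   SELECT page, COUNT(*) AS views FROM clicks GROUP BY page ORDER BY views DESC
CLICKS = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/pricing"},
    {"user": "u3", "page": "/home"},
]


def project(rows):
    """Map stage: projection, keep only the column the query needs."""
    return [row["page"] for row in rows]


def group_by_count(pages):
    """First reduce stage: GROUP BY page with COUNT(*)."""
    return list(Counter(pages).items())


def order_by_views(grouped):
    """Second reduce stage: ORDER BY views DESC (page name breaks ties)."""
    return sorted(grouped, key=lambda item: (-item[1], item[0]))


def run_chain(rows, stages):
    """Feed each stage's output straight into the next stage, as one job."""
    result = rows
    for stage in stages:
        result = stage(result)
    return result


ranked = run_chain(CLICKS, [project, group_by_count, order_by_views])
```

The stages are ordinary functions passed in a list, so a different query could reuse the same `run_chain` with a different list of stages.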
![Page 78: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/78.jpg)
Tez can provide long-running containers for applications like Hive to side-step batch process startups you would have with MapReduce
![Page 79: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/79.jpg)
Introducing YARN
![Page 80: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/80.jpg)
YARN: Yeah, we did that too.
hortonworks.com/yarn/
![Page 81: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/81.jpg)
YARN = Yet Another Resource Negotiator
![Page 82: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/82.jpg)
Resource Manager + Node Managers = YARN
Diagram labels: Resource Manager, Scheduler, Node Manager, AppMaster, Container, MapReduce, Storm, Pig
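A toy Python model of the resource manager and node manager split. The class shapes, memory sizes, and first-fit scheduling are assumptions for illustration, and the AppMaster role is left out; this is not YARN's API or any of its real schedulers.

```python
class NodeManager:
    """Tracks the memory still free on one worker node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Hands out containers; first-fit scheduling is an assumption here,
    not one of YARN's actual schedulers."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.containers = []  # (application, node name, memory) tuples

    def allocate(self, application, memory_mb):
        for node in self.node_managers:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                container = (application, node.name, memory_mb)
                self.containers.append(container)
                return container
        return None  # no node has enough free memory right now


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
granted = [
    rm.allocate("MapReduce", 3072),
    rm.allocate("Storm", 2048),
    rm.allocate("Pig", 2048),
    rm.allocate("MapReduce", 1024),
]
```

The manager only sees memory requests, not what runs inside a container, which is why MapReduce, Storm, and Pig can share the same nodes.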
![Page 83: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/83.jpg)
YARN abstracts resource management so you can run more than just MapReduce
Diagram labels: HDFS2, YARN, MapReduce V2, MapReduce V?, STORM, MPI, Giraph, HBase, Tez, Spark, … and more
![Page 84: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/84.jpg)
Hadoop has other open source projects…
![Page 85: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/85.jpg)
Hive = {SQL -> Tez || MapReduce}
SQL-IN-HADOOP
![Page 86: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/86.jpg)
Pig = {Pig Latin -> Tez || MapReduce}
![Page 87: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/87.jpg)
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types
![Page 88: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/88.jpg)
Oozie = Job::{Task, Task, if Task, then Task, final Task}
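The control flow in that shorthand (tasks in sequence, one conditional task, a final task) can be sketched in plain Python. Oozie itself defines workflows in XML; the task names and the `run_workflow` helper below are hypothetical and only mirror the ordering logic.

```python
def run_workflow(tasks, context):
    """Run tasks in order; a task runs only if its condition (if any) holds."""
    executed = []
    for name, condition, action in tasks:
        if condition is None or condition(context):
            action(context)
            executed.append(name)
    return executed


# Hypothetical tasks: (name, optional condition, action). Illustrative only.
workflow = [
    ("ingest", None, lambda ctx: ctx.update(rows=120)),
    ("validate", None, lambda ctx: ctx.update(valid=ctx["rows"] > 0)),
    ("aggregate", lambda ctx: ctx["valid"], lambda ctx: ctx.update(total=ctx["rows"] * 2)),
    ("cleanup", None, lambda ctx: ctx.update(done=True)),
]

context = {}
executed = run_workflow(workflow, context)
```

If `validate` had set `valid` to false, `aggregate` would be skipped while `cleanup` would still run as the final task.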
![Page 89: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/89.jpg)
Falcon
Diagram labels: Hadoop, Data Set, Replication, Late Data Arrival, Lineage, Retention Policy, Process Management, Audit, Monitoring, Archival
![Page 90: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/90.jpg)
Knox
Diagram labels: REST Client, Knox Gateway, Hadoop Cluster, Enterprise LDAP
![Page 91: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/91.jpg)
Flume
Diagram labels: JMS, Weblogs, Events, Files, Flume, Hadoop
![Page 92: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/92.jpg)
Sqoop
Diagram labels: DB, Sqoop, Hadoop
![Page 93: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/93.jpg)
Ambari = {install, manage, monitor}
![Page 94: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/94.jpg)
HBase = {real-time, distributed-map, big-tables}
![Page 95: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/95.jpg)
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
![Page 96: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/96.jpg)
Apache Hadoop
Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez
![Page 97: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/97.jpg)
Hortonworks Data Platform
Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez
![Page 98: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/98.jpg)
What else are we working on?
hortonworks.com/labs/
![Page 99: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/99.jpg)
Hadoop is the new Modern Data Architecture for the Enterprise
![Page 100: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture](https://reader033.vdocument.in/reader033/viewer/2022052820/54c6ba904a7959a6418b45f1/html5/thumbnails/100.jpg)
© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation