![Page 1: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/1.jpg)
1
Headline Goes Here Speaker Name or Subhead Goes Here
DO NOT USE PUBLICLY PRIOR TO 10/23/12
ApplicaAon Architectures with Hadoop Mark Grover | SoGware Engineer Jonathan Seidman | SoluAons Architect, Partner Engineering April 1, 2014
©2014 Cloudera, Inc. All Rights Reserved.
![Page 2: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/2.jpg)
About Us • Mark
• CommiOer on Apache Bigtop, commiOer and PPMC member on Apache Sentry (incubaAng).
• Contributor to Hadoop, Hive, Spark, Sqoop, Flume. • @mark_grover
• Jonathan • SoluAons Architect, Partner Engineering Team. • Co-‐founder of Chicago Hadoop User Group and Chicago Big Data. • [email protected] • @jseidman
2 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 3: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/3.jpg)
Co-‐authoring O’Reilly book
• Titled ‘Hadoop ApplicaAon Architectures’ • How to build end-‐to-‐end soluAons using Apache Hadoop and related tools • Updates on TwiOer: @hadooparchbook • hOp://www.hadooparchitecturebook.com
©2014 Cloudera, Inc. All Rights Reserved. 3
![Page 4: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/4.jpg)
Challenges of Hadoop ImplementaAon
4 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 5: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/5.jpg)
Challenges of Hadoop ImplementaAon
5 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 6: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/6.jpg)
6
Click Stream Analysis
Case Study
©2014 Cloudera, Inc. All Rights Reserved.
![Page 7: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/7.jpg)
Click Stream Analysis
7
Log Files
DWH X
©2014 Cloudera, Inc. All Rights Reserved.
![Page 8: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/8.jpg)
Web Log Example
©2014 Cloudera, Inc. All Rights Reserved. 8
[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038®ion=8&userType=1” [2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488®ion=4&userType=1"
![Page 9: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/9.jpg)
Hadoop Architectural ConsideraAons
• Storage managers? • HDFS? HBase?
• Data storage and modeling: • File formats? Compression? Schema design?
• Data movement • How do we actually get the data into Hadoop? How do we get it out?
• Metadata • How do we manage data about the data?
• Data access and processing • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• OrchestraAon • How do we manage the workflow for all of this?
9 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 10: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/10.jpg)
10
Data Storage and Modeling
©2014 Cloudera, Inc. All Rights Reserved.
![Page 11: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/11.jpg)
Data Storage – Storage Manager consideraAons
• Popular storage managers for Hadoop • Hadoop Distributed File System (HDFS) • HBase
11 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 12: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/12.jpg)
Data Storage – HDFS vs HBase
HDFS
• Stores data directly as files • Fast scans • Poor random reads/writes
HBase
• Stores data as Hfiles on HDFS • Slow scans • Fast random reads/writes
12 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 13: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/13.jpg)
Data Storage – Storage Manager consideraAons
• We choose HDFS • AnalyAcal needs in this case served beOer by fast scans.
13 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 14: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/14.jpg)
14
Data Storage Format
©2014 Cloudera, Inc. All Rights Reserved.
![Page 15: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/15.jpg)
Data Storage – Format ConsideraAons
• Store as plain text? • Sure, well supported by Hadoop. • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But… • Will begin to consume lots of space in HDFS. • May not be opAmal for processing by tools in the Hadoop ecosystem.
15 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 16: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/16.jpg)
Data Storage – Format ConsideraAons
• But, we can compress the text files… • Gzip – supported by Hadoop, but not spliOable. • Bzip2 – hey, spliOable! Great compression! But decompression is slooowww.
• LZO – spliOable (with some work), good compress/de-‐compress performance. Good choice for storing text files on Hadoop.
• Snappy – provides a good tradeoff between size and speed.
16 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 17: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/17.jpg)
Data Storage – More About Snappy
• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not spliOable though, which brings us to…
17 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 18: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/18.jpg)
SequenceFile
• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables spliOability with non-‐spliOable compression.
18 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 19: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/19.jpg)
Avro
• Kinda SequenceFile on Steroids.
• Self-‐documenAng – stores schema in header.
• Provides very efficient storage.
• Supports spliOable compression.
19 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 20: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/20.jpg)
Our Format Choices…
• Avro with Snappy • Snappy provides opAmized compression. • Avro provides compact storage, self-‐documenAng files, and supports schema evoluAon.
• Avro also provides beOer failure handling than other choices. • SequenceFiles would also be a good choice, and are directly supported by ingesAon tools in the ecosystem.
• But only supports Java.
20 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 21: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/21.jpg)
21
HDFS Schema Design
©2014 Cloudera, Inc. All Rights Reserved.
![Page 22: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/22.jpg)
Recommended HDFS Schema Design
• How to lay out data on HDFS?
22 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 23: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/23.jpg)
Recommended HDFS Schema Design
/user/<username> -‐ User specific data, jars, conf files /etl – Data in various stages of ETL workflow /tmp – temp data from tools or shared between users /data – shared data for the enAre organizaAon /app – Everything but data: UDF jars, HQL files, Oozie workflows
23 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 24: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/24.jpg)
24
Advanced HDFS Schema Design
©2014 Cloudera, Inc. All Rights Reserved.
![Page 25: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/25.jpg)
What is ParAAoning?
25
dataset col=val1/file.txt col=val2/file.txt . . . col=valn/file.txt
dataset file1.txt file2.txt . . . filen.txt
Un-‐parAAoned HDFS directory structure
ParAAoned HDFS directory structure
©2014 Cloudera, Inc. All Rights Reserved.
![Page 26: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/26.jpg)
What is ParAAoning?
26
clicks dt=2014-‐01-‐01/clicks.txt dt=2014-‐01-‐02/clicks.txt . . . dt=2014-‐03-‐31/clicks.txt
clicks clicks-‐2014-‐01-‐01.txt clicks-‐2014-‐01-‐02.txt . . . clicks-‐2014-‐03-‐31.txt
Un-‐parAAoned HDFS directory structure
ParAAoned HDFS directory structure
©2014 Cloudera, Inc. All Rights Reserved.
![Page 27: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/27.jpg)
ParAAoning
• Split the dataset into smaller consumable chunks • Rudimentary form of “indexing” • <data set name>/<parAAon_column_name=parAAon_column_value>/{files}
27 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 28: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/28.jpg)
ParAAoning consideraAons
• What column to bucket by? • HDFS is append only. • Don’t have too many parAAons (<10,000) • Don’t have too many small files in the parAAons (more than block size generally)
• We decided to parAAon by 1mestamp
28 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 29: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/29.jpg)
What is buckeAng?
29
clicks dt=2014-‐01-‐01/clicks.txt dt=2014-‐01-‐02/clicks.txt
Un-‐bucketed HDFS directory structure
clicks dt=2014-‐01-‐01/file0.txt dt=2014-‐01-‐01/file1.txt dt=2014-‐01-‐01/file2.txt dt=2014-‐01-‐01/file3.txt dt=2014-‐01-‐02/file0.txt dt=2014-‐01-‐02/file1.txt dt=2014-‐01-‐02/file2.txt dt=2014-‐01-‐02/file3.txt
Bucketed HDFS directory structure
©2014 Cloudera, Inc. All Rights Reserved.
![Page 30: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/30.jpg)
BuckeAng
• Hash-‐bucketed files within each parAAon based on a parAcular column
• Useful when sampling • In some joins, pre-‐reqs:
• Datasets bucketed on the same key as the join key • Number of buckets are the same or one is a mulAple of the other
30 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 31: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/31.jpg)
BuckeAng consideraAons?
• Which column to bucket on? • How many buckets? • We decided to bucket based on cookie
31 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 32: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/32.jpg)
De-‐normalizing consideraAons
• In general, big data joins are expensive • When to de-‐normalize?
• Decided to join the smaller dimension tables • Big fact tables are sAll joined
32 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 33: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/33.jpg)
33
Data IngesAon
©2014 Cloudera, Inc. All Rights Reserved.
![Page 34: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/34.jpg)
File Transfers
• “hadoop fs –put <file>” • Reliable, but not resilient to failure.
• Other opAons are mountable HDFS, for example NFSv3.
34 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 35: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/35.jpg)
Streaming IngesAon
• Flume • Reliable, distributed, and available system for efficient collecAon, aggregaAon and movement of streaming data, e.g. logs.
• Ka{a • Reliable and distributed publish-‐subscribe messaging system.
35 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 36: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/36.jpg)
Flume vs. Ka{a
• Purpose built for Hadoop data ingest.
• Pre-‐built sinks for HDFS, HBase, etc.
• Supports transformaAon of data in-‐flight.
• General pub-‐sub messaging framework.
• Hadoop not supported, requires 3rd-‐party component (Camus).
• Just a message transport (a very fast one).
36 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 37: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/37.jpg)
Flume vs. Ka{a
• BoOom line: • Flume very well integrated with Hadoop ecosystem, well suited to ingesAon of sources such as log files.
• Ka{a is a highly reliable and scalable enterprise messaging system, and great for scaling out to mulAple consumers.
37 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 38: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/38.jpg)
A Quick IntroducAon to Flume
38
Flume Agent
Source Channel Sink DesAnaAon External Source
Web Server TwiOer JMS System logs …
Consumes events and forwards to channels
Stores events unAl consumed by sinks – file, memory, JDBC
Removes event from channel and puts into external desAnaAon
JVM process hosAng components
©2014 Cloudera, Inc. All Rights Reserved.
![Page 39: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/39.jpg)
A Quick IntroducAon to Flume
• Reliable – events are stored in channel unAl delivered to next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure.
39
Flume Agent
Source Channel Sink DesAnaAon
©2014 Cloudera, Inc. All Rights Reserved.
![Page 40: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/40.jpg)
A Quick IntroducAon to Flume
• DeclaraAve • No coding required. • ConfiguraAon specifies how components are wired together.
40 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 41: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/41.jpg)
A Brief Discussion of Flume PaOerns – Fan-‐in
• Flume agent runs on each of our servers.
• These agents send data to mulAple agents to provide reliability.
• Flume provides support for load balancing.
41 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 42: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/42.jpg)
A Brief Discussion of Flume PaOerns – Spli~ng
• Common need is to split data on ingest.
• For example: • Sending data to mulAple clusters for DR.
• To mulAple desAnaAons. • Flume also supports parAAoning, which is key to our implementaAon.
42 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 43: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/43.jpg)
Sqoop Overview
• Apache project designed to ease import and export of data between Hadoop and external data stores such as relaAonal databases.
• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesAng event based data.
©2014 Cloudera, Inc. All Rights Reserved. 43
![Page 44: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/44.jpg)
IngesAon Decisions
• Historical Data • Smaller files: file transfer • Larger files: Flume with spooling directory source.
• Incoming Data • Flume with the spooling directory source.
44 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 45: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/45.jpg)
45
Data Processing and Access
©2014 Cloudera, Inc. All Rights Reserved.
![Page 46: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/46.jpg)
Data flow
46
Raw data
ParAAoned clickstream
data
Other data (Financial, CRM, etc.)
Aggregated dataset #2
Aggregated dataset #1
©2014 Cloudera, Inc. All Rights Reserved.
![Page 47: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/47.jpg)
Data processing tools
47
• Hive • Impala • Pig, etc.
©2014 Cloudera, Inc. All Rights Reserved.
![Page 48: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/48.jpg)
Hive
48
• Open source data warehouse system for Hadoop • Converts SQL-‐like queries to MapReduce jobs
• Work is being done to move this away from MR • Stores metadata in Hive metastore • Can create tables over HDFS or HBase data • Access available via JDBC/ODBC
©2014 Cloudera, Inc. All Rights Reserved.
![Page 49: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/49.jpg)
Impala
49
• Real-‐Ame open source SQL query engine for Hadoop • Doesn’t build on MapReduce • WriOen in C++, uses LLVM for run-‐Ame code generaAon • Can create tables over HDFS or HBase data • Accesses Hive metastore for metadata • Access available via JDBC/ODBC
©2014 Cloudera, Inc. All Rights Reserved.
![Page 50: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/50.jpg)
Pig
50
• Higher level abstracAon over MapReduce (like Hive) • Write transformaAons in scripAng language – Pig LaAn • Can access Hive metastore via HCatalog for metadata
©2014 Cloudera, Inc. All Rights Reserved.
![Page 51: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/51.jpg)
Data Processing consideraAons
51
• We chose Hive for ETL and Impala for interac1ve BI.
©2014 Cloudera, Inc. All Rights Reserved.
![Page 52: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/52.jpg)
52
Metadata Management
©2014 Cloudera, Inc. All Rights Reserved.
![Page 53: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/53.jpg)
What is Metadata?
53
• Metadata is data about the data • Format in which data is stored • Compression codec • LocaAon of the data • Is the data parAAoned/bucketed/sorted?
©2014 Cloudera, Inc. All Rights Reserved.
![Page 54: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/54.jpg)
Metadata in Hive
54
Hive Metastore
©2014 Cloudera, Inc. All Rights Reserved.
![Page 55: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/55.jpg)
Metadata
55
• Hive metastore has become the de-‐facto metadata repository • HCatalog makes Hive metastore accessible to other applicaAons (Pig, MapReduce, custom apps, etc.)
©2014 Cloudera, Inc. All Rights Reserved.
![Page 56: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/56.jpg)
Hive + HCatalog
56 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 57: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/57.jpg)
57
OrchestraAon
©2014 Cloudera, Inc. All Rights Reserved.
![Page 58: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/58.jpg)
OrchestraAon
• Once the data is in Hadoop, we need a way to manage workflows in our architecture.
• Scheduling and tracking MapReduce jobs, Hive jobs, etc. • Several opAons here:
• Cron • Oozie, Azkaban • 3rd-‐party tools, Talend, Pentaho, InformaAca, enterprise schedulers.
58 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 59: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/59.jpg)
Oozie
• Supports defining and execuAng a sequence of jobs.
• Can trigger jobs based on external dependencies or schedules.
59 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 60: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/60.jpg)
60
Final Architecture
©2014 Cloudera, Inc. All Rights Reserved.
![Page 61: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/61.jpg)
Final Architecture – High Level Overview
61
Data Sources IngesAon
Data Storage/Processing
Data ReporAng/Analysis
©2014 Cloudera, Inc. All Rights Reserved.
![Page 62: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/62.jpg)
Final Architecture – High Level Overview
62
Data Sources IngesAon
Data Storage/Processing
Data ReporAng/Analysis
©2014 Cloudera, Inc. All Rights Reserved.
![Page 63: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/63.jpg)
Final Architecture – IngesAon
63
Web App Avro Agent Web App Avro Agent
Web App Avro Agent Web App Avro Agent
Web App Avro Agent Web App Avro Agent
Web App Avro Agent Web App Avro Agent
Flume Agent
Flume Agent
Flume Agent
Flume Agent
Fan-‐in PaOern
MulA Agents for Failover and rolling restarts
HDFS
©2014 Cloudera, Inc. All Rights Reserved.
![Page 64: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/64.jpg)
Final Architecture – High Level Overview
64
Data Sources IngesAon
Data Storage/Processing
Data ReporAng/Analysis
©2014 Cloudera, Inc. All Rights Reserved.
![Page 65: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/65.jpg)
Final Architecture – Storage and Processing
65
/etl/weblogs/20140331/ /etl/weblogs/20140401/ …
Data Processing /data/markeAng/clickstream/bouncerate/ /data/markeAng/clickstream/aOribuAon/ …
©2014 Cloudera, Inc. All Rights Reserved.
![Page 66: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/66.jpg)
Final Architecture – High Level Overview
66
Data Sources IngesAon
Data Storage/Processing
Data ReporAng/Analysis
©2014 Cloudera, Inc. All Rights Reserved.
![Page 67: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/67.jpg)
Final Architecture – Data Access
67
Hive/Impala
BI/AnalyAcs Tools
DWH Sqoop
Local Disk
R, etc.
DB import tool
JDBC/ODBC
©2014 Cloudera, Inc. All Rights Reserved.
![Page 68: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/68.jpg)
Contact info • Mark Grover
• @mark_grover • www.linkedin.com/in/grovermark
• Jonathan Seidman • [email protected] • @jseidman • hOps://www.linkedin.com/pub/jonathan-‐seidman/1/26a/959 • hOp://www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook
68 ©2014 Cloudera, Inc. All Rights Reserved.
![Page 69: Application architectures with Hadoop – Big Data TechCon 2014](https://reader034.vdocument.in/reader034/viewer/2022051608/53fe2feb8d7f72db2d8b463b/html5/thumbnails/69.jpg)
69 ©2014 Cloudera, Inc. All Rights Reserved.