The Fundamentals Guide to HDP and HDInsight
TRANSCRIPT
![Page 1: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/1.jpg)
![Page 2: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/2.jpg)
Let us know how you feel about this session! Give your feedback via www.techdaysapp.nl for a chance to win one of 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge.
* All results are final; prizes shown are examples.
![Page 3: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/3.jpg)
The Fundamentals Guide to HDP and HDInsight
Gert Drapers (#DataDude)
Principal Software Design Engineer
![Page 4: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/4.jpg)
http://www.economist.com/node/15579717?Story_ID=15579717
Copyright © The Economist Newspaper Limited 2012. All rights reserved
![Page 5: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/5.jpg)
The 4Vs of Big Data: Volume, Velocity, Variability, & Variety
Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
![Page 6: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/6.jpg)
![Page 7: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/7.jpg)
New In Hadoop 2
• YARN
  • ResourceManager
  • NodeManager
  • ApplicationMaster
• HDFS 2
  • NameNode HA
  • Snapshots
  • Federation
![Page 8: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/8.jpg)
Hortonworks Data Platform For Windows
• Leverages work from Hortonworks and Microsoft
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (faster queries)
• Only distribution available on Windows Server
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
On-Premise Self-Deploy (Hadoop)
See: http://hortonworks.com/products/releases/hdp-2-windows/
![Page 9: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/9.jpg)
Microsoft Azure HDInsight 3.0
• Microsoft's cloud Hadoop offering
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (faster queries)
• Up and running in minutes with no hardware to deploy
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
Cloud, Hadoop
Microsoft Azure
See: http://www.windowsazure.com/en-us/solutions/big-data/
![Page 10: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/10.jpg)
Stinger Phase 2 in Hive 0.12
• Query optimizer (QO) improvements
• Predicate pushdown
• ORC file improvements
http://hortonworks.com/labs/stinger/
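Predicate pushdown and the ORC file format work together: ORC keeps min/max statistics for each column per stripe, so a reader can skip stripes that cannot satisfy a filter. The following is a toy Python model of that idea, not Hive's implementation. The `Stripe` class, the `scan_greater_than` function and the sample status codes are all hypothetical, and the statistics are recomputed on demand here, whereas a real file format stores them at write time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for a columnar stripe holding one integer column."""
    rows: List[int]

    @property
    def min_value(self) -> int:
        # A real format stores this statistic when the stripe is written.
        return min(self.rows)

    @property
    def max_value(self) -> int:
        return max(self.rows)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the values above threshold and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Pushdown: check the statistics first and skip stripes that cannot match.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


# Hypothetical sc_status values spread over three stripes.
stripes = [Stripe([200, 204, 301]), Stripe([404, 500, 503]), Stripe([200, 200, 302])]
matches, stripes_read = scan_greater_than(stripes, 400)
print(matches, stripes_read)
```

Only one of the three stripes is read for the filter `sc_status > 400`; the other two are ruled out by their maximum value alone.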
![Page 11: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/11.jpg)
Demo: Getting Started with Hadoop 2 in Azure with HDInsight
![Page 12: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/12.jpg)
HDFS
![Page 13: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/13.jpg)
HDFS Architecture
• Block based (64 MB default in Hadoop 1.x; Apache Hadoop 2.x defaults to 128 MB)
• Hierarchical file organization of directories and files
• Write once, read many
• Highly portable
• Optimized for small numbers of very large files
Distributed Fault Tolerant File System
Source: http://hortonworks.com/hadoop/hdfs/
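A little arithmetic shows why a block-based file system favors a small number of very large files. The sketch below is plain Python, not an HDFS API. The `block_count` helper is a hypothetical name, and the 64 MB block size is the figure quoted on the slide, passed as a parameter so other values can be tried.

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int = 64 * MB) -> int:
    """Number of blocks needed to hold a file of the given size (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# The same 1 GB of data, stored two different ways.
one_large_file = block_count(1024 * MB)            # a single 1 GB file
many_small_files = 1024 * block_count(1 * MB)      # 1,024 files of 1 MB each
print(one_large_file, many_small_files)
```

The single large file needs 16 blocks, while the small files need 1,024 blocks plus 1,024 file entries. The NameNode tracks every file and block in memory, so the second layout costs far more metadata for the same volume of data.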
![Page 14: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/14.jpg)
YARN
![Page 15: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/15.jpg)
A long time ago, in a data center far, far away…
![Page 16: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/16.jpg)
Episode IV
There was Map Reduce
![Page 17: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/17.jpg)
Introduction to Map/Reduce
Functionally
Map: f(k1, v1) → list(k2, v2)
Reduce: f(k2, list(v2)) → (k2, v3)
In Practice, WordCount
The quick brown fox jumps over the lazy dog
Map
(the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)
Shuffle
(the,(1,1)), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
Reduce
(the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
In Code
Then, scale to TB/PB of data over 10s, 100s or 1000s of nodes
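The code shown on the slide did not survive in this transcript. As a stand-in, here is a minimal single-process Python sketch of the three WordCount phases. It does not use the Hadoop API, and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are illustrative assumptions. Lowercasing each word is also an assumption, made so that "The" and "the" land on the same key.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


line = "The quick brown fox jumps over the lazy dog"
counts = reduce_phase(shuffle_phase(map_phase(line)))
print(counts)
```

On a cluster the map and reduce functions keep this shape; the framework runs many copies of each in parallel and performs the shuffle across the network.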
![Page 18: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/18.jpg)
![Page 19: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/19.jpg)
And Map Reduce was… good?
![Page 20: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/20.jpg)
Episode V
Then came the abstractions
![Page 21: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/21.jpg)
A pig who eats everything
![Page 22: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/22.jpg)
logs = LOAD 'wasb://[email protected]/weblogs'
    USING PigStorage(' ') AS (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray, cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray, cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray, sc_substatus:chararray, sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 5;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
referrer_logs = GROUP filtered_logs BY cs_Referer;
summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_referrer BY COUNT DESC;
limit_summary = LIMIT sorted_summary 25;
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 25;
STORE filtered_logs INTO 'wasb://[email protected]/tmp/results5/forhive' USING PigStorage('\t');
STORE limited_summary INTO 'wasb://[email protected]/tmp/results5/stemstats' USING PigStorage('\t');
STORE limit_summary INTO 'wasb://[email protected]/tmp/results5/referer_logs' USING PigStorage('\t');
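For readers new to Pig, the GROUP / FOREACH ... GENERATE / ORDER / LIMIT pipeline above can be mirrored in a few lines of plain Python. This is only a sketch of the semantics on in-memory data: the `summarize` function and the sample rows are hypothetical, and null handling, which Pig's COUNT, SUM and AVG treat specially, is ignored.

```python
from collections import defaultdict

# Hypothetical sample rows: (cs_uri_stem, sc_bytes, time_taken)
rows = [
    ("/index.html", 1000, 30),
    ("/about.html", 500, 10),
    ("/index.html", 3000, 50),
    ("/index.html", 2000, 40),
    ("/about.html", 700, 30),
]


def summarize(rows, limit=25):
    """Per-stem request count, total egress and average time taken, busiest first."""
    groups = defaultdict(list)
    for stem, sc_bytes, time_taken in rows:           # GROUP ... BY cs_uri_stem
        groups[stem].append((sc_bytes, time_taken))
    summary = [
        (
            stem,
            len(items),                                # COUNT($1)
            sum(b for b, _ in items),                  # SUM(sc_bytes)
            sum(t for _, t in items) / len(items),     # AVG(time_taken)
        )
        for stem, items in groups.items()
    ]
    summary.sort(key=lambda r: r[1], reverse=True)     # ORDER ... BY NumberOfRequests DESC
    return summary[:limit]                             # LIMIT 25


result = summarize(rows)
for row in result:
    print(row)
```

Pig compiles the same logical steps into distributed jobs, so the script stays this short no matter how large the log directory is.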
![Page 23: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/23.jpg)
Hive for those who know SQL
![Page 24: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/24.jpg)
CREATE EXTERNAL TABLE websites_logs_raw (
    datereq STRING, timereq STRING, s_sitename STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING, s_port STRING, cs_username STRING, c_ip STRING, cs_User_Agent STRING, cs_Cookie STRING, cs_Referer STRING, cs_host STRING, sc_status INT, sc_substatus STRING, sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://[email protected]/weblogs2'
TBLPROPERTIES ("skip.header.line.count"="1");
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(*) FROM websites_logs_raw;
![Page 25: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/25.jpg)
Cascading/Scalding to bring a modern JVM API for analytics
![Page 27: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/27.jpg)
But the abstractions all shared one thing… Map Reduce
![Page 28: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/28.jpg)
WordCount in Scalding…
See: https://github.com/twitter/scalding
Map Phase
Reduce Phase
![Page 29: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/29.jpg)
Map/Reduce v1 Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/MRArch.png
![Page 30: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/30.jpg)
Episode VI
One YARN to rule them all
![Page 31: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/31.jpg)
Compute Model != Resource Model
![Page 32: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/32.jpg)
YARN Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/YARNArch.png
• Removes the contention of a single JobTracker doing everything (resource management plus job scheduling and monitoring)
• More resilient to ResourceManager (RM) failures
• Scales better with the number of active jobs
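The split between resource model and compute model can be illustrated with a toy Python model. This is not the YARN API: the class names only echo the YARN roles, the methods `allocate`, `release` and `run` are hypothetical, and NodeManagers, which actually launch containers, are left out entirely.

```python
class ResourceManager:
    """Toy cluster-wide arbiter: it only tracks and grants containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Toy per-application master: it owns the application's own scheduling logic."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run(self, rm, want):
        while self.pending:
            granted = rm.allocate(min(want, len(self.pending)))
            if granted == 0:
                raise RuntimeError("no containers available")
            # The application, not the ResourceManager, decides what runs where.
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(f"{self.name}:{task}" for task in batch)
            rm.release(granted)


rm = ResourceManager(total_containers=4)
wordcount = ApplicationMaster("wordcount", ["map0", "map1", "map2", "reduce0", "reduce1"])
wordcount.run(rm, want=3)
print(wordcount.done, rm.free)
```

Because the ResourceManager knows nothing about maps or reduces, the same arbiter could serve an application with a completely different compute model.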
![Page 33: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/33.jpg)
![Page 34: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/34.jpg)
Other Interesting YARN projects
![Page 35: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/35.jpg)
Some Existing YARN apps
• Storm on YARN
• HBase on YARN
• Spark
• Giraph
• Hamster (MPI on YARN)
• Memcached
• Dryad
![Page 36: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/36.jpg)
Writing your own YARN app for fun and profit…
![Page 38: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/38.jpg)
Yikes…
![Page 39: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/39.jpg)
See Slide 20 – Enter Abstractions
![Page 41: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/41.jpg)
REEF
http://www.reef-project.org/
![Page 42: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/42.jpg)
Kitten
https://github.com/cloudera/kitten
http://www.lua.org/manual/5.1
![Page 43: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/43.jpg)
What about .NET?
![Page 44: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/44.jpg)
Dryad on YARN
sources
background
![Page 45: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/45.jpg)
The Microsoft Data Platform
![Page 46: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/46.jpg)
Resources
• All about HDInsight
• Getting Started with HDInsight
• Windows HDP 2.0
• Hadoop project
• HadoopSDK Codeplex project
• Getting Started with YARN blog series
• YARN book
![Page 47: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/47.jpg)
![Page 48: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/48.jpg)
![Page 49: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/49.jpg)
Backup Slides
![Page 50: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/50.jpg)
Moving Data Between Stores
• Sqoop
  • Data in or out of relational store
• Pig
  • Set of Storage & Loaders (JDBC, Mongo, etc.)
• Hive
  • Table formats (Mongo, Azure Tables)
![Page 51: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/51.jpg)
Website log processing, Pig, Hive
![Page 52: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/52.jpg)
logs = LOAD 'wasb://[email protected]/weblogs' USING PigStorage(' ') AS
    (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray, cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray, cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray, sc_substatus:chararray, sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 100;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 1000;
--STORE limited_summary INTO 'wasb://[email protected]/build2014/stats' USING PigStorage('\t');
![Page 53: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/53.jpg)
CREATE EXTERNAL TABLE websites_logs_raw (
    datereq STRING, timereq STRING, s_sitename STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING, s_port STRING, cs_username STRING, c_ip STRING, cs_User_Agent STRING, cs_Cookie STRING, cs_Referer STRING, cs_host STRING, sc_status INT, sc_substatus STRING, sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://[email protected]/weblogs2'
TBLPROPERTIES ("skip.header.line.count"="1");
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(*) FROM websites_logs_raw;
![Page 54: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/54.jpg)
Interacting with SQL DB
![Page 55: The Fundamentals Guide to HDP and HDInsight](https://reader033.vdocument.in/reader033/viewer/2022052117/58ecfabe1a28ab745f8b4613/html5/thumbnails/55.jpg)
bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales
New-AzureHDInsightSqoopJobDefinition -Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'
REGISTER lib/piggybank.jar;
REGISTER c:\apps\dist\sqljdbc_3.0\enu\sqljdbc4.jar;
STORE limited_summary INTO '/doesnotmatter' USING org.apache.pig.piggybank.storage.DBStorage('com.microsoft.sqlserver.jdbc.SQLServerDriver', 'jdbc:sqlserver://[yourserver].database.windows.net;database=AdventureWorks2012;user=[username];password=[password]', 'INSERT INTO OutputFromPig(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) VALUES (?,?,?,?)');