Introduction to Azure HDInsight
Introduction to HDInsight
Stéphane Fréchette
Saturday, February 7, 2015
Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: [email protected]
Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A
“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…”
- Wikipedia
What is Big Data?
[Figure: data sources plotted by volume against velocity and variety]
Labels: Many Options, Variability
Axes: Volume; Velocity - Variety
Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
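To put the storage prices from this slide in perspective (1980: $190,000 per GB; 1990: $9,000; 2000: $15; 2010: $0.07), here is a small arithmetic sketch of what one terabyte would have cost at each price. It assumes decimal units (1 TB = 1,000 GB), and the names `COST_PER_GB` and `cost_of_one_terabyte` are illustrative only.

```python
# Cost per gigabyte by year, as listed on the slide (US dollars).
COST_PER_GB = {1980: 190_000, 1990: 9_000, 2000: 15, 2010: 0.07}

GB_PER_TB = 1_000  # assumption: decimal units, 1 TB = 1,000 GB


def cost_of_one_terabyte(year):
    """Return the cost in dollars of storing 1 TB at that year's price."""
    return COST_PER_GB[year] * GB_PER_TB


if __name__ == "__main__":
    for year in sorted(COST_PER_GB):
        print(f"{year}: ${cost_of_one_terabyte(year):,.2f} per TB")
```

At these prices a terabyte goes from roughly $190 million in 1980 to about $70 in 2010, which is one reason keeping "everything" became practical.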
What is Big Data?
Common Scenarios
What is Big Data?
Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage
[Table: Traditional RDBMS vs. Hadoop, compared on Data Size, Access, Updates, Structure, Integrity, Scaling and DBA Ratio]
Hadoop
HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, reliable data storage and is designed to span large clusters of commodity servers.
HDFS ≠ Database
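A toy sketch may help show how a file system can span a cluster: a file is cut into fixed-size blocks and each block is copied to several machines. The 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults; the round-robin placement and the `place_blocks` function are assumptions made for illustration, since real HDFS placement is rack-aware and more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size in Hadoop 2.x
REPLICATION = 3                 # the default HDFS replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Assumption: real HDFS placement is rack-aware; round-robin is used here
    only to keep the illustration short.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = math.ceil(file_size / block_size)
    placement = {}
    for block in range(block_count):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    one_gib = 1024 * 1024 * 1024
    for block, replicas in place_blocks(one_gib, ["n1", "n2", "n3", "n4"]).items():
        print(block, replicas)
```

Losing any single node in this sketch still leaves at least two copies of every block, which is the intuition behind "reliable" storage on commodity servers.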
MapReduce
• MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Processing function:
- Mapping
- Reducing
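The two processing functions can be sketched with the classic word count. This is a single-process illustration in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and on a real cluster the framework would run the map and reduce steps on many nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducing: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread across machines without the functions changing.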
How it works?
[Diagram: Runtime; Server, Server, Server, Server]
How it works?
[Diagram: Hadoop ecosystem components]
• Distributed Storage (HDFS)
• Query (Hive)
• Distributed Processing (MapReduce)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Pipeline / workflow (Oozie)
• Azure Storage Vault (ASV)
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
Hadoop Ecosystem
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
Storage
Azure Storage (Blob)
File System
Two choices
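When Azure Blob storage is the chosen store, jobs usually refer to files with a `wasb://` URI (or `wasbs://` over TLS) that names the container and the storage account. The helper below is a minimal sketch of that URI shape; `wasb_uri` is a hypothetical function, and the container and account names in the example are placeholders.

```python
def wasb_uri(container, account, path, secure=False):
    """Build a wasb:// (or wasbs://) URI for a blob path.

    Assumption: the storage account uses the public Azure endpoint
    suffix blob.core.windows.net.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # "data" and "mystore" are placeholder names, not real resources.
    print(wasb_uri("data", "mystore", "/logs/2015/"))
```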
Demo [Spinning up an HDInsight Cluster ;-)]
Now what?
Working with your HDInsight cluster - running jobs, import/export data, viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides a SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
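To give a feel for how SQL-like HiveQL is, the sketch below holds two HiveQL statements as Python strings and wraps one in a `hive -e` command line. The `weblogs` table, its columns and the `/example/weblogs` location are invented for illustration, and `build_hive_command` is a hypothetical helper; nothing is executed unless you run the command on a machine with the Hive CLI.

```python
# HiveQL: describe delimited text files already sitting in storage as a table.
CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
    log_date STRING,
    url STRING,
    status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/example/weblogs'
"""

# HiveQL: a summarization query that reads like ordinary SQL.
TOP_URLS = """
SELECT url, COUNT(*) AS hits
FROM weblogs
WHERE status = 200
GROUP BY url
ORDER BY hits DESC
LIMIT 10
"""


def build_hive_command(query):
    """Return an argv list that passes the query to the Hive CLI via -e."""
    return ["hive", "-e", query.strip()]


if __name__ == "__main__":
    # Printed rather than executed; on a cluster node this list could be
    # handed to subprocess.run(...).
    print(build_hive_command(TOP_URLS))
```

Hive compiles a query like `TOP_URLS` into distributed jobs, so the author writes SQL-style summarization rather than map and reduce functions.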
What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
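A Pig Latin script is a sequence of named data-flow steps. The sketch below mirrors a small load, filter, group and count flow in plain Python, with the corresponding Pig Latin statement shown in a comment above each step. The input layout (date, URL, status) and the `count_hits` function are assumptions for illustration, and the Python runs locally rather than as MapReduce jobs.

```python
from collections import Counter


def count_hits(rows):
    """Count successful requests per URL from (log_date, url, status) rows."""
    # Pig Latin: logs = LOAD '/example/weblogs' USING PigStorage(',')
    #                   AS (log_date:chararray, url:chararray, status:int);
    logs = list(rows)

    # Pig Latin: ok = FILTER logs BY status == 200;
    ok = [row for row in logs if row[2] == 200]

    # Pig Latin: by_url = GROUP ok BY url;
    #            hits = FOREACH by_url GENERATE group AS url, COUNT(ok) AS hits;
    return Counter(url for _, url, _ in ok)


if __name__ == "__main__":
    sample = [
        ("2015-02-07", "/a", 200),
        ("2015-02-07", "/a", 404),
        ("2015-02-07", "/b", 200),
        ("2015-02-08", "/a", 200),
    ]
    print(count_hits(sample))
```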
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop and relational datastores
http://sqoop.apache.org
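Because Sqoop is driven from the command line, a transfer is mostly a matter of assembling the right arguments. The sketch below builds a `sqoop import` argument list that copies one relational table into Hadoop storage. The server, database, table and path values are placeholders and `build_sqoop_import` is a hypothetical helper; the command is only printed, not run.

```python
def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, mappers=1):
    """Return an argv list for importing one relational table with Sqoop."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table to copy
        "--target-dir", target_dir,        # destination directory in Hadoop
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    command = build_sqoop_import(
        jdbc_url="jdbc:sqlserver://example-server:1433;databaseName=sales",
        username="loader",
        password_file="/user/loader/.password",
        table="orders",
        target_dir="/example/orders",
        mappers=4,
    )
    print(" ".join(command))
```

Moving data the other way, from Hadoop back to a relational table, uses `sqoop export` with a similar set of connection arguments.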
Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]
Hadoop
Data Analytics
Data Flow
Demo [Self-Service BI with Hive and Excel…]
Machine Learning
Graph Processing
Distributed Compute
Extract Load Transform
Predictive Analysis
Capabilities
Data Knowledge Action
Summary
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F
What Questions Do You Have?
Thank You
For attending this session