TRANSCRIPT
Spark the future.
May 4 – 8, 2015
Chicago, IL
Advanced Analytics: Navigating Your Way There
Andrew J. Brust
Senior Director, Technical Product Marketing and Evangelism, Datameer
BRK2567
Meet Andrew
Senior Director, Technical Product Marketing and Evangelism
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair, Visual Studio Live!
“Redmond Review” columnist for Visual Studio Magazine
Twitter: @andrewbrust
Read all about it!
Using Relational Database Technologies?
Using SQL?
Creating Line-of-Business/“CRUD” applications?
Using production reporting?
Using Analysis Services and MDX?
Worried about Big Data, Machine Learning and Analytics?
Are you?
Analytics-curious?
Hadoop?
Cloud data warehousing?
Machine learning?
NoSQL?
Why the new generation of technology?
Ecosystems around open source projects are very active
Basis in commodity hardware
Scale out, and cloud
Change in economics of computing power
Change in economics of storage
SQL Server Analysis Services (SSAS): Multidimensional, Tabular
StreamInsight
SQL Server Integration Services (SSIS)
SQL Server Analysis Services Data Mining
Analytics Platform System (APS)
Excel Power Pivot/Query/View/Map
The Microsoft Analytics Stack – On-Premises
HDInsight
Azure Machine Learning (Azure ML)
Azure Stream Analytics
Azure Data Factory (ADF)
Azure Data Lake (ADL)
Azure Data Warehouse
Power BI 2.0
DocumentDB
Revolution R
The Microsoft Analytics Stack - Cloud
SQL Server Enterprise
Analytics Platform System (APS): SQL Server Parallel Data Warehouse (PDW), PolyBase
Azure Data Warehouse
Data Warehousing Options
Microsoft Analytics Platform System (APS) includes SQL Server Parallel Data Warehouse (PDW), the Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server.
MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets.
PDW includes “PolyBase,” a component that allows PDW to query data in Hadoop directly.
PolyBase bypasses MapReduce; it addresses data nodes directly and orchestrates parallelism itself.
APS is available with or without a built-in, PolyBase-configured HDInsight “region.”
PolyBase
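The divide-and-conquer idea behind MPP can be sketched in a few lines of Python. This is a toy illustration only, not PDW's actual API: the "nodes" are plain lists, and the helper names (`distribute`, `local_sum`, `combine`), the sample `sales` rows and the choice of `store_id` as the distribution column are all assumptions made for the example.

```python
# Toy sketch of MPP-style divide-and-conquer aggregation.
# Hypothetical helpers; nothing here reflects PDW's real interfaces.
from collections import defaultdict


def distribute(rows, node_count, key):
    """Hash-distribute rows across nodes by a distribution column."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[hash(row[key]) % node_count].append(row)
    return nodes


def local_sum(node_rows, group_col, value_col):
    """Each node aggregates only the rows it holds."""
    partial = defaultdict(int)
    for row in node_rows:
        partial[row[group_col]] += row[value_col]
    return dict(partial)


def combine(partials):
    """A coordinator merges the per-node partial results."""
    total = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)


# Made-up sample data.
sales = [
    {"store_id": 1, "region": "East", "amount": 100},
    {"store_id": 2, "region": "West", "amount": 250},
    {"store_id": 3, "region": "East", "amount": 50},
    {"store_id": 4, "region": "West", "amount": 25},
    {"store_id": 5, "region": "East", "amount": 75},
]

nodes = distribute(sales, node_count=3, key="store_id")
partials = [local_sum(node, "region", "amount") for node in nodes]
totals = combine(partials)
print(totals)
```

In a real appliance the per-node step would run in parallel on separate servers; here it runs sequentially, which is enough to show that the final totals do not depend on how rows were distributed.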
A mashup of Azure SQL v12 and PDW
Uses Azure Data Lake for storage*
Features PolyBase: can connect to Azure HDInsight, but also to Cloudera and Hortonworks clusters, in cloud or on-prem
Compute and storage scale separately (compare to Amazon Redshift)
Integrations: Azure Data Factory, Power BI
Azure Data Warehouse
Column store engines
Columnstore indexes
Tables and columns versus dimensions
MOLAP, ROLAP
DirectQuery
Columnar Technology
Column-Oriented Stores
Imagine if instead of:
Employee ID   Age   Income
1             43    9000
2             38    10000
3             35    10000
You have:
Employee ID   1      2      3
Age           43     38     35
Income        9000   10000  10000
• Perf: values you wish to aggregate are adjacent
• Efficiency: great compression from identical or nearly-identical values in proximity
• Fast aggregation and high compression mean huge volumes of data can be stored and processed in RAM
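The two layouts, and why the columnar one helps, can be sketched in Python using the slide's own three employees. Run-length encoding is used here as one simple example of compressing adjacent identical values; it is an assumption for illustration, not a claim about which scheme any particular column store engine uses.

```python
# Row-oriented layout: one record per employee (the "instead of" table).
rows = [
    {"employee_id": 1, "age": 43, "income": 9000},
    {"employee_id": 2, "age": 38, "income": 10000},
    {"employee_id": 3, "age": 35, "income": 10000},
]

# Column-oriented layout: one contiguous list per column (the "you have" table).
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of identical adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Aggregation touches a single contiguous list rather than every record.
total_income = sum(columns["income"])

# Identical neighbouring values compress well.
encoded_income = run_length_encode(columns["income"])

print(total_income)    # 29000
print(encoded_income)  # [[9000, 1], [10000, 2]]
```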
Demo: PowerQuery, PowerView, Power BI
Definition
MapReduce
Hadoop 1.0
Hadoop 2.0: YARN
Big Data Concepts
“Big” data input accepted in file form
Data is partitioned and sent to mappers (nodes in cluster)
Mappers pre-process data into key-value (KV) pairs, then all output for a given key (or keys) goes to a reducer
Reducers aggregate; one line of output per unique key, with one value
Map and Reduce code natively written as Java functions
What’s MapReduce?
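The flow described above can be imitated in a single Python process. This is a minimal sketch of the pattern only: real Hadoop jobs are natively written in Java and run across a cluster, and the function names, the word-count task and the sample `partitions` are assumptions chosen for illustration.

```python
# In-process imitation of the map -> shuffle -> reduce flow.
from collections import defaultdict


def mapper(line):
    """Pre-process one input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Route all values for a given key to the same place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate: one output per unique key, with one value."""
    return key, sum(values)


# Made-up input, already split into partitions (one per mapper).
partitions = [
    ["big data big clusters"],
    ["data lake"],
]

mapped = [pair for part in partitions for line in part for pair in mapper(line)]
grouped = shuffle(mapped)
results = dict(reducer(key, values) for key, values in grouped.items())
print(results)
```

Each unique word ends up with exactly one count, which mirrors the "one line of output per unique key" rule.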
MapReduce, in a Diagram
[Diagram: Input → mapper (×6) → keys K1, K2, K3 → reducer (×3) → Output]
The Hadoop 1.0 Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
All contributed back to open source Apache Project
Linux version in preview
On-premises options:
Single node emulator runs on Windows client
Hortonworks HDP for Windows on Windows Server (or Azure VMs)
Also HDInsight with Analytics Platform System
Azure HDInsight
Demo: HDInsight
Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
Officially HiveQL, or HQL
Works on own tables, but also on HBase
Query generates MapReduce job, output of which becomes result set
Microsoft has Hive ODBC driver
Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
Hive
An initiative led by Hortonworks, with major participation from Microsoft
Make Hive interactive
Combined with Apache Tez: SET hive.execution.engine=tez;
Add “writeback”
Add vector processing
Project is ongoing
“Stinger” Project
Demo: Hive and Power BI 2.0
Apache Drill: SQL to, and between, everything, without needing schema
Apache Spark: in-memory processing, machine learning, streaming data, graph processing (although you can add it, using a script step)
Hue: web UI for major Hadoop stack components
What is HDInsight Missing?
The Hadoop 2.0 Stack
HDFS, YARN
Database
Interactive SQL
MapReduce Abstractions
Machine Learning
Streaming Data
Impala
+
Kafka
Search
Algorithms
Models
Predictions
Web Services
Data Mining/Machine Learning
Experiments
Workflows
Algorithms
R and Python code
Deployments
Gallery
Azure Machine Learning
Demo: Azure Machine Learning
Uses Azure Machine Learning
Pattern recognition via regression
Give it a photo, it will guess gender and age
Integrated with Bing images to make finding photos simple
How-Old.net
Azure Data Factory
Data Lake/Enterprise Data Hub
Store raw data, centrally in HDFS
Use different processing engines for different analyses
Data Lake
What is it? A cloud-based, HDFS-compatible file/storage system
What’s it got?
Geo-distribution
Low latency because it’s parallelism-compatible
No limits on file size or account size
Compatibility
Azure Active Directory access control
In addition to HDFS, ADL is compatible with Azure Storage APIs
Azure Data Lake
StreamInsight
Storm on HDInsight
Azure Stream Analytics
Streaming Options
Azure Stream Analytics
HBase on HDInsight: column family store
DocumentDB (JSON): document store, adjustable consistency, query with SQL
NoSQL Options
Machine learning becomes integrated into BI (separate = silo)
Streaming data gets easier
Hive keeps getting better
Hadoop becomes more embedded
Hadoop gets used for small data, too
PolyBase comes down to SQL Server Enterprise
Spark puts up or shuts up
What’s Next?
© 2015 Microsoft Corporation. All rights reserved.