3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Spark the future. May 4 – 8, 2015, Chicago, IL

Upload: rosaline-randall

Post on 19-Dec-2015


TRANSCRIPT

Page 1: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Spark the future.

May 4 – 8, 2015

Chicago, IL

Page 2: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Advanced Analytics: Navigating Your Way There

Andrew J. Brust

Senior Director, Technical Product Marketing and Evangelism, Datameer

BRK2567

Page 3: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Meet Andrew

Senior Director, Technical Product Marketing and Evangelism

Big Data blogger for ZDNet

Microsoft Regional Director, MVP

Co-chair, Visual Studio Live!

“Redmond Review” columnist for Visual Studio Magazine

Twitter: @andrewbrust


Page 4: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Andrew’s New/Old Blog (bit.ly/bigondata)

Page 5: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Read all about it!

Page 6: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Using Relational Database Technologies?

Using SQL?

Creating Line-of-Business/“CRUD” applications

Using production reporting?

Using Analysis Services and MDX?

Worried about Big Data, Machine Learning and Analytics?

Are you?

Page 7: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Analytics-curious?

Hadoop?

Cloud data warehousing?

Machine learning?

NoSQL?

Page 8: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Why the new generation of technology?

Ecosystems around open source projects are very active

Basis in commodity hardware

Scale out, and cloud

Change in economics of computing power

Change in economics of storage

Page 9: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

SQL Server Analysis Services (SSAS): Multidimensional, Tabular

StreamInsight

SQL Server Integration Services (SSIS)

SQL Server Analysis Services Data Mining

Analytics Platform System (APS)

Excel Power Pivot/Query/View/Map

The Microsoft Analytics Stack – On-Premises

Page 10: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

HDInsight

Azure Machine Learning (Azure ML)

Azure Stream Analytics

Azure Data Factory (ADF)

Azure Data Lake (ADL)

Azure Data Warehouse

Power BI 2.0

DocumentDB

Revolution R

The Microsoft Analytics Stack - Cloud

Page 11: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

SQL Server Enterprise

Analytics Platform System (APS): SQL Server Parallel Data Warehouse (PDW), PolyBase

Azure Data Warehouse

Data Warehousing Options

Page 12: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Microsoft Analytics Platform System (APS) includes SQL Server Parallel Data Warehouse (PDW), the Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server.

MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets.

PDW includes “PolyBase,” a component that allows PDW to query data in Hadoop directly. It bypasses MapReduce, addresses data nodes directly, and orchestrates parallelism itself.

APS is available with or without a built-in, PolyBase-configured HDInsight “region.”

PolyBase
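
The divide-and-conquer idea behind MPP can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how PDW is implemented: the "nodes" are threads, rows are distributed by a simple modulo on a made-up `id` column, and the function names (`partition_rows`, `node_partial_sum`, `mpp_average`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, node_count):
    """Spread rows across nodes using a modulo on the (hypothetical) id column."""
    partitions = [[] for _ in range(node_count)]
    for row in rows:
        partitions[row["id"] % node_count].append(row)
    return partitions


def node_partial_sum(partition):
    """Work one node does on its local rows only: a partial sum and a row count."""
    return sum(r["amount"] for r in partition), len(partition)


def mpp_average(rows, node_count=4):
    """Fan the work out to every node, then combine the partial results."""
    partitions = partition_rows(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up sample data: 100 rows with amounts 10, 20, ..., 1000.
rows = [{"id": i, "amount": i * 10} for i in range(1, 101)]
average = mpp_average(rows)
```

Each node only ever touches its own partition, and the coordinator combines small partial results, which is why the answer does not depend on how many nodes are used.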

Page 13: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

A mashup of Azure SQL v12 and PDW

Uses Azure Data Lake for storage*

Features PolyBase

Can connect to Azure HDInsight

But also to Cloudera and Hortonworks clusters, in cloud or on-prem

Compute and storage scale separately

Compare to Amazon Redshift

Integrations: Azure Data Factory, Power BI

Azure Data Warehouse

Page 14: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Column store engines

Columnstore indexes

Tables and columns versus dimensions

MOLAP, ROLAP

DirectQuery

Columnar Technology

Page 15: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Column-Oriented Stores

Imagine if instead of:

Employee ID   Age   Income
1             43    9000
2             38    10000
3             35    10000

You have:

Employee ID   1      2      3
Age           43     38     35
Income        9000   10000  10000

• Perf: values you wish to aggregate are adjacent

• Efficiency: great compression from identical or nearly-identical values in proximity

• Fast aggregation and high compression mean huge volumes of data can be stored and processed, in RAM
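
A small Python sketch of the same idea, using the three employees from the slide. The pivot helper `to_columns` and the use of run-length encoding are assumptions made for illustration; the slide does not say which compression scheme any particular engine uses.

```python
from itertools import groupby

# Row-oriented layout: one record per employee (values from the slide).
rows = [
    {"employee_id": 1, "age": 43, "income": 9000},
    {"employee_id": 2, "age": 38, "income": 10000},
    {"employee_id": 3, "age": 35, "income": 10000},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of identical adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Aggregation reads one contiguous list instead of touching every record.
total_income = sum(columns["income"])

# Adjacent identical values compress well.
encoded_income = run_length_encode(columns["income"])
```

Here `total_income` is 29000 and `encoded_income` is `[(9000, 1), (10000, 2)]`: the two identical incomes collapse into a single pair, which hints at why columns with many repeated values compress so well.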

Page 16: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Demo: PowerQuery, PowerView, Power BI

Page 17: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Definition

MapReduce

Hadoop 1.0

Hadoop 2.0

YARN

Big Data Concepts

Page 18: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

“Big” data input accepted in file form

Data is partitioned and sent to mappers (nodes in cluster)

Mappers pre-process data into KV pairs, then all output for (a) given key(s) goes to a reducer

Reducers aggregate; one line of output per unique key, with one value

Map and Reduce code natively written as Java functions

What’s MapReduce?
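
The flow described on this slide can be sketched as a single-process word count. Hadoop's native MapReduce API is Java, so the Python below is only a teaching sketch; the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample input are assumptions, and nothing here is distributed.

```python
from collections import defaultdict


def mapper(line):
    """Pre-process one input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(mapped_pairs):
    """Group values by key so each key's output reaches exactly one reducer call."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate all values for one key into a single output value."""
    return key, sum(values)


def run_job(partitions):
    """Map every line of every partition, shuffle by key, then reduce."""
    mapped = [
        pair
        for partition in partitions
        for line in partition
        for pair in mapper(line)
    ]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Made-up input, already split into two partitions (one per mapper).
partitions = [
    ["hadoop spark hadoop"],
    ["spark hive"],
]
counts = run_job(partitions)
```

The result has one entry per unique key, each with one value: `{"hadoop": 2, "hive": 1, "spark": 2}`.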

Page 19: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

MapReduce, in a Diagram

[Diagram: input splits feed six mappers; mapper output is grouped by key (K1, K2, K3) and routed to three reducers, each of which writes its own output.]

Page 20: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

The Hadoop 1.0 Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

Page 21: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows

All contributed back to open source Apache Project

Linux version in preview

On-premises options:

Single node emulator runs on Windows client

Hortonworks HDP for Windows on Windows Server (or Azure VMs)

Also HDInsight with Analytics Platform System

Azure HDInsight

Page 22: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Demo: HDInsight

Page 23: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Used by most BI products which connect to Hadoop

Provides a SQL-like abstraction over Hadoop

Officially HiveQL, or HQL

Works on own tables, but also on HBase

Query generates MapReduce job, output of which becomes result set

Microsoft has Hive ODBC driver

Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)

Hive
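
To make "query generates MapReduce job, output of which becomes result set" concrete, here is a rough Python sketch of how a simple grouped aggregate could be expressed as a map phase and a reduce phase. It is not how Hive actually compiles queries; the query, the `orders` table, and its columns are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query being illustrated (table and column names are made up):
#   SELECT region, SUM(sales) AS total_sales FROM orders GROUP BY region

orders = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 50},
]


def map_phase(rows):
    """Emit the GROUP BY column as the key and the aggregated column as the value."""
    for row in rows:
        yield row["region"], row["sales"]


def reduce_phase(pairs):
    """Sum the values per key and shape the output as result-set rows."""
    totals = defaultdict(int)
    for region, sales in pairs:
        totals[region] += sales
    return [
        {"region": region, "total_sales": total}
        for region, total in sorted(totals.items())
    ]


result_set = reduce_phase(map_phase(orders))
```

The GROUP BY column plays the role of the MapReduce key, and the reducer output is what a BI tool connected through the ODBC driver would see as rows.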

Page 24: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

An initiative led by Hortonworks

With major participation from Microsoft

Make Hive interactive

Combined with Apache Tez

SET hive.execution.engine=tez;

Add “writeback”

Add vector processing

Project is ongoing

“Stinger” Project

Page 25: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Demo: Hive and Power BI 2.0

Page 26: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Apache Drill: SQL to, and between, everything, without needing schema

Apache Spark: in-memory processing, machine learning, streaming data, graph processing (although you can add it, using a script step)

Hue: Web UI for major Hadoop stack components

What is HDInsight Missing?

Page 27: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

The Hadoop 2.0 Stack

HDFS, YARN

Database

Interactive SQL

MapReduce Abstractions

Machine Learning

Streaming Data

Impala

+

Kafka

Search

Page 28: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Algorithms

Models

Predictions

Web Services

Data Mining/Machine Learning

Page 29: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Experiments

Workflows

Algorithms

R and Python code

Deployments

Gallery

Azure Machine Learning

Page 30: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Demo: Azure Machine Learning

Page 31: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Uses Azure Machine Learning

Pattern recognition via regression

Give it a photo, it will guess gender and age

Integrated with Bing images to make finding photos simple

How-Old.net

Page 32: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Azure Data Factory

Page 33: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Data Lake/Enterprise Data Hub

Store raw data, centrally in HDFS

Use different processing engines for different analyses

Data Lake

Page 34: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

What is it? A cloud-based, HDFS-compatible file/storage system

What’s it got?

Geo-distribution

Low latency because it’s parallelism-compatible

No limits on file size or account size

Compatibility

Azure Active Directory access control

In addition to HDFS, ADL is compatible with Azure Storage APIs

Azure Data Lake

Page 35: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

StreamInsight

Storm on HDInsight

Azure Stream Analytics

Streaming Options

Page 36: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Azure Stream Analytics

Page 37: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

HBase on HDInsight: column family store

DocumentDB (JSON): document store, adjustable consistency, query with SQL

NoSQL Options

Page 38: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Machine learning becomes integrated into BI (separate = silo)

Streaming data gets easier

Hive keeps getting better

Hadoop becomes more embedded

Hadoop gets used for small data, too

PolyBase comes down to SQL Server Enterprise

Spark puts up or shuts up

What’s Next?

Page 39: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Ignite Azure Challenge Sweepstakes

Attend Azure sessions and activities, track your progress online, win raffle tickets for great prizes!

Aka.ms/MyAzureChallenge

Enter this session code online: BRK2567

NO PURCHASE NECESSARY. Open only to event attendees. Winners must be present to win. Game ends May 9th, 2015. For Official Rules, see The Cloud and Enterprise Lounge or myignite.com/challenge

Page 40: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

Visit Myignite at http://myignite.microsoft.com or download and use the Ignite Mobile App with the QR code above.

Please evaluate this session

Your feedback is important to us!

Page 41: 3 Hadoop? Cloud data warehousing? Machine learning? NoSQL?

© 2015 Microsoft Corporation. All rights reserved.