Microsoft’s implementation of Big Data - HDInsight

Upload: gunder-biten

Post on 18-Aug-2015

Gunder Bitén

IT Architect

Knowit Stockholm

30 years as IT Consultant

5 years with Azure

+46 72 553 94 81

https://www.linkedin.com/in/gunderbiten

Agenda

• Big Data Basics

• Microsoft Cloud Offer

• HDInsight Cluster creation

• HBase RDP

• HBase processing from C#

• Query Jobs

• Machine Learning – Predictive Analysis

• Internet of Things – Event Hubs & Storm Cluster

Big Data

Big data is high-volume, high-velocity and high-variety

information assets that demand cost-effective, innovative

forms of information processing for enhanced insight and

decision making.

Gartner

A new set of Questions

Microsoft Cloud Offer

• Office 365 including SharePoint Online & Dynamics CRM Online

• Virtual Machines

• Websites

• Service Bus

• Virtual Networks

• AD

• HDInsight

• Machine Learning

• Event Hub

• More coming next week

DEMO Azure Portal

Apache Hadoop (named after the toy elephant of creator Doug Cutting’s son)

• Apache Open Source Project

• Java

• De facto standard

• MapReduce (widely used until now)

• Programming model and implementation for processing and generating large data sets

• Map() procedure that performs filtering and sorting

• Reduce() procedure that performs a summary operation

• Used by

• Facebook & Google

• Microsoft Azure & Amazon EC2

• Almost all of Fortune 100 enterprises

• The foundation for numerous other Apache projects
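
The Map() and Reduce() bullets above can be illustrated with a minimal single-process word count in Python. This is only a sketch of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real Hadoop job distributes each phase across the nodes of a cluster.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(lines: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Map(): emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key and sort the keys, as the framework does
    between the map and reduce phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce(): summarise each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

Because each mapper only sees its own lines and each reducer only sees one key's values, both phases can run in parallel on separate machines, which is what makes the model suitable for large data sets.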

Open Source Community

• Apache Hadoop: we consume code, we contribute code; core code is the same across distributions

• Hortonworks: Microsoft partner (HDP for Windows); heavy contributors to open source Hadoop; trusted by the community

• HDInsight: HDInsight Service and HDInsight Server are built on the Hortonworks platform, with additional functionality

HDInsight and Hadoop Ecosystem

Hadoop Architecture

Subscription cost

• Per node: $0.32 per hour (~$239 per month) for 4 CPUs and 7 GB RAM
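
The per-node price above can be turned into a rough cluster estimate. The sketch below uses the $0.32 hourly rate from the slide; the 31-day month, the node count and the running hours are assumptions for illustration, and the helper name `cluster_cost` is hypothetical. Storage, data transfer and any other charges are not included.

```python
HOURLY_RATE_USD = 0.32      # per node, taken from the slide
HOURS_PER_MONTH = 24 * 31   # assumption: a 31-day month, running non-stop


def cluster_cost(nodes: int, hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Compute cost in USD of running `nodes` nodes for `hours` hours."""
    return round(nodes * hours * rate, 2)


if __name__ == "__main__":
    # One node left running for a whole month (compare with the slide's ~$239).
    print(cluster_cost(nodes=1, hours=HOURS_PER_MONTH))
    # Hypothetical 4-node cluster that only exists for an 8-hour working day.
    print(cluster_cost(nodes=4, hours=8))
```

The second call shows why on-demand clusters are attractive: a cluster that is deleted when the job is done is billed only for the hours it existed.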

Cluster Types

• Hadoop – Work directly with unstructured data in OS files

• HBase – NoSQL database that allows online transactional processing of big data

• Includes larger ZooKeeper instances that cost money

• Offers the same response time whether a table holds 100 MB or 100 PB of data

• Storm - System for processing streams of data (Preview as of mid October)

• Works well with Event Hub in the Service Bus

DEMO Cluster creation, table actions from both the cluster command line and Visual Studio

Hive Architecture

DEMO Hive create and query table

Predictive Analytics – Psychohistory Come True

• Science Fiction novelist Isaac Asimov’s first book in the Foundation Trilogy was published in 1951

• The whole story is based upon the creation of the science Psychohistory by Hari Seldon

• Psychohistory depends on the idea that, while one cannot foresee the actions of a particular individual, the laws of statistics as applied to large groups of people could predict the general flow of future events

• Today Psychohistory is reality but with another name – Predictive Analytics
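
The statistical idea behind the Psychohistory bullets (single individuals are unpredictable, large groups are not) can be shown with a small simulation. This is a toy sketch, not a predictive model: the 30% probability, the population sizes and the function names are assumptions chosen for illustration.

```python
import random


def simulate_population(size: int, probability: float, seed: int = 42) -> list[bool]:
    """Each simulated person independently takes some action with the given probability."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.random() < probability for _ in range(size)]


def observed_rate(outcomes: list[bool]) -> float:
    """Fraction of the group that took the action."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Small groups fluctuate a lot; large groups settle close to the underlying 30%.
    for size in (10, 1_000, 100_000):
        rate = observed_rate(simulate_population(size, probability=0.3))
        print(f"group of {size:>7}: {rate:.3f}")
```

No single entry in the list can be predicted, yet the observed rate for the largest group lands very close to the underlying probability. Predictive analytics relies on the same effect, using rates learned from historical data instead of a known probability.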

Machine Learning – Predictive Analysis with Mahout

Internet of Things – Event Hub & Storm Cluster

• Streams – unbounded sequences of tuples

• Spouts – sources of streams in a computation (e.g. the Twitter API)

• Bolts – process input streams (tuples) and produce output streams

• Nimbus node – master node, similar to the Hadoop JobTracker

• ZooKeeper nodes – coordinate the Storm cluster

• Supervisor nodes – start and stop workers according to signals from Nimbus
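
The stream, spout and bolt concepts above can be sketched with Python generators. This is a single-process analogy only and does not use the Storm API: the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and a short finite list stands in for what would be an unbounded stream in a real topology.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[tuple[str]]:
    """Spout: the source of the stream, emitting one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[tuple[str]]) -> Iterator[tuple[str]]:
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream: Iterable[tuple[str]]) -> Iterator[tuple[str, int]]:
    """Bolt: keeps a running count and emits (word, count) after each input tuple."""
    counts: Counter[str] = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


if __name__ == "__main__":
    # Wire spout -> bolt -> bolt, the way a topology chains its components.
    topology = count_bolt(split_bolt(sentence_spout(["storm is fast", "Storm scales"])))
    for output_tuple in topology:
        print(output_tuple)
```

Each stage handles one tuple at a time and never waits for the input to end, which is the property that lets a real topology run continuously on an unbounded stream. In a Storm cluster, the Supervisor nodes would run many parallel instances of each spout and bolt.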

Event Hub

Analyze real-time Twitter sentiment with HBase

Visual analysis with Excel Power Query

The world of NoSQL

• MapReduce on the way out !?!

• Spark SQL

• Impala

• Hive – getting faster and better (MapReduce internally)

• SQL look-alikes in other Azure products

Why Azure and not HDP for Windows

• Compute capacity expensive – spin up/remove on demand

• Complex infrastructure – use human resources for analytics, not infrastructure

• Get access to everything else in Azure

• Start now – not in six months

• Let Microsoft worry about the 99.9% SLA

From Hype to Commodity

• On Gartner’s Top 10 Strategic Technology Trends three years ago

1. Internet of Things

2. Next Generation Analytics

3. Big Data

4. Cloud Computing

• Two years ago I started to hope Microsoft would implement 1, 2 & 3 in 4

• In Azure today

• Event Hub & Storm

• Machine Learning

• HDInsight

• Hype or commodity?

References

• http://azure.microsoft.com - Azure

• http://azure.microsoft.com/en-us/services/hdinsight/ - HDInsight

• http://hadoop.apache.org/ - Hadoop

• http://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/ - Analytics Platform

• http://aka.ms/IntroHDInsight/PDF - Introducing Microsoft Azure HDInsight E-Book