

Big data analytics

Rafal Lukawiecki

Strategic Consultant

Project Botticelli Ltd

[email protected] @rafaldotnet

Objectives

Explain big data analytics

Introduce data mining, Hadoop, and PDW

The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.

Portions © 2014 Project Botticelli Ltd & entire material © 2014 Microsoft Corp unless noted otherwise. Some slides contain quotations from copyrighted materials by other authors, as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.

Big data, or just complex data?

velocity

variety

complexity

volume

Data

interpreting

preparing

Domain: Common big data scenarios

Financial services: Modeling true risk; Threat analysis and fraud detection; Trade surveillance; Credit scoring and analysis

Media & Entertainment: Recommendation engines; Ad targeting; Search quality; Abuse and click fraud detection

Retail: Point of sales transaction analysis; Customer churn analysis; Sentiment analysis

Telecommunications: Customer churn prevention; Network performance optimization; Call Detail Record (CDR) analysis; Network failure prediction

Government: Cyber security (botnets, fraud); Traffic congestion and re-routing; Environmental monitoring; Antisocial monitoring via social media

Healthcare: Genomics research; Cancer research; Health pandemics early detection; Air quality monitoring

Which big data?

Traditional, relational

SSAS Data Mining

PDW

Non-traditional

HDInsight

PDW

Massively Parallel Processing (MPP) for queries

In-memory columnstore

Multiple nodes with dedicated CPU, memory, storage

Incrementally extensible

Scale from terabytes to multi-petabytes

PDW principles

Low latency

Sub-second processing of large event streams

Continuous insight through historical data mining

Event targets

Event sources

PDW: near real-time insights

Real-time with complex event processing

Advanced analytics

Descriptive & predictive

Clustering, neural nets, decision trees, time series, naïve Bayes, sequence clustering, linear and logistic regression

Semantic search

Conceptual similarities

Geospatial

Geometry and geography

Big data

Hadoop, Mahout

Demo: Data mining with Excel and SQL

Parallel Data Warehouse

HDP

Windows Azure

Apache Hadoop distribution

Developed by Hortonworks & Microsoft

Integrated with Microsoft BI

Microsoft HDInsight

Big data + traditional BI = powerful + easy

[Diagram labels: Big, fast, or complex data; Microsoft HDInsight; PDW + Polybase; SQL; OLAP; Tabular; Interaction, exploration, reporting, visualisation]

Hadoop principles

Practical method for massive parallelisation of analytical data processing

Distributed data

Distributed processing

Analytics engine of Microsoft, Yahoo, Google, Facebook, Netflix, Klout…

Demo: Big data recommendation engine

Part 1: the job

Hadoop data

HDFS (Hadoop Distributed File System)

Network rack aware to minimise transfers

Access like normal files

Query Hive, like a data warehouse, using HiveQL

Hadoop MapReduce

Your processing logic is split between map and reduce functions

Map your problem into smaller sub-problems (divide)

Reduce results into higher-level aggregates (conquer)

MapReduce is like divide-and-conquer
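The divide-and-conquer idea can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop's API: the function names, the "user,page" log format, and the page-view counting job are all assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into one higher-level aggregate."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical job: count page views per user from "user,page" log lines.
def view_mapper(line):
    user, _page = line.split(",")
    yield user, 1


def sum_reducer(_key, values):
    return sum(values)


def run_job(records):
    pairs = map_phase(records, view_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)


log = ["ann,home", "bob,home", "ann,cart", "ann,pay"]
print(run_job(log))  # {'ann': 3, 'bob': 1}
```

On a real cluster the map calls run in parallel next to the data blocks, and the shuffle moves each key's values to the node that reduces them; here everything runs in one process so the three steps are easy to follow.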

Distributed data

Distributed processing

Hadoop cluster

Yahoo! Hadoop cluster, about 2007. Source: http://developer.yahoo.com. Picture used with permission.

Hadoop cluster

Distributed data

Distributed processing

Buster Cluster, an early research project by Miles Osborne, University of Edinburgh, School of Informatics. Picture used with permission. http://homepages.inf.ed.ac.uk/miles/

Hadoop cluster

Cloud rent-a-Hadoop-cluster, or: “Supercomputer for cents”

Windows Azure HDInsight

Distributed data

Distributed processing

Processing logic in HDInsight 1.6, 2.1, 3.0

Write MapReduce jobs in Java, or any Windows language, using stdin-stdout

Pig Latin with User-Defined Functions (UDFs) in Python, JS, C#, Java, and .NET

Low-level, fast, harder

Easy, massively parallel
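The stdin-stdout style of job can be sketched as follows, assuming Python is available on the cluster nodes. The script reads lines on standard input and writes tab-separated key/value lines on standard output, and the reducer relies on its input arriving sorted by key, which is what Hadoop Streaming provides. The "user,item" input format and the `map`/`reduce` command-line argument are assumptions made for this illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'item<TAB>1' for each 'user,item' input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _user, item = line.split(",")
        yield f"{item}\t1"


def reducer(lines):
    """Sum the counts of consecutive lines that share a key.

    Assumes the input is already sorted by key.
    """
    pairs = (line.strip().split("\t") for line in lines if line.strip())
    for item, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _item, count in group)
        yield f"{item}\t{total}"


# Run as "script map" or "script reduce"; with no argument nothing is read.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same logic can be tried locally, without a cluster, by piping a file through the map step, a sort, and the reduce step.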

Processing logic in HDInsight 3.0

Hadoop 2.2

Middleware between data and processing

MapReduce and Pig, or:

Tez (interactive)

HBase (online)

streaming, graph, in-memory, search...

YARN Apps

Hadoop data science

Mahout 0.9 (not HDInsight 3.0 yet)

Machine learning

Scalable data mining

Collaborative filtering, recommenders, clustering, singular value decomposition, parallel frequent pattern mining, naive Bayes, decision tree
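To make the collaborative filtering entry concrete, here is a minimal item co-occurrence recommender in plain Python. It is not Mahout's API and not the code used in the demo; the `baskets` data, the function names, and the simple "count how often items appear together" scoring are assumptions chosen to show the idea at toy scale.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, baskets, top_n=3):
    """Score unseen items by their co-occurrence with the user's own items."""
    counts = cooccurrence(baskets)
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _score in ranked[:top_n]]


# Hypothetical data: which items each user has chosen.
baskets = {
    "ann": ["A", "B", "C"],
    "bob": ["A", "B"],
    "cat": ["B", "C", "D"],
    "dan": ["A"],
}
print(recommend("dan", baskets))  # ['B', 'C']
```

The co-occurrence counting is the part that benefits from a cluster: each user's basket can be mapped to item pairs independently, and the pair counts can then be summed in a reduce step.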

Demo: Big data recommendation engine

Part 2: the results

Summary

Big data = too complex for traditional methods

HDInsight + PDW for your big data opportunity

projectbotticelli.com

BI video tutorials, PPTs, and articles

15% Off: 2014SWISS15

Valid in March 2014 only

Follow: @rafaldotnet

Email: [email protected]: rafal.net
