![Page 1: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/1.jpg)
Microsoft on Big Data
Thursday, 28 May 2015
![Page 2: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/2.jpg)
Before we start:
We are live on Meerkat today
![Page 3: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/3.jpg)
Agenda
What is Big Data?
How it works and approaches
Microsoft architecture
Hadoop and MapReduce
Pig
![Page 4: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/4.jpg)
The 3 Vs
Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
![Page 5: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/5.jpg)
What is Big Data?
![Page 6: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/6.jpg)
What is Big Data?
![Page 7: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/7.jpg)
Why Big Data?
2008: Google processes 20 PB a day
2009: Facebook has 2.5 PB user data + 15 TB/day
2009: eBay has 6.5 PB user data + 50 TB/day
2011: Yahoo! has 180-200 PB of data
2012: Facebook ingests 500 TB/day
![Page 8: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/8.jpg)
The next major data supplier
![Page 9: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/9.jpg)
How it works and approaches
![Page 10: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/10.jpg)
How to store data?
Data storage is not trivial
Data volumes are massive
Reliably storing PBs of data is challenging
Disk/hardware/network failures
Probability of a failure event increases with the number of machines
For example: 1,000 hosts, each with 10 disks; if a disk lasts 3 years, how many failures occur per day?
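The question on this slide can be answered with a back-of-the-envelope calculation. The sketch below assumes failures are spread evenly over a disk's lifetime and that a year has 365 days; neither assumption is stated on the slide, and the function name `expectedFailuresPerDay` is made up for illustration.

```javascript
// Rough estimate only: assumes failures are spread evenly over the
// disk lifetime (a simplification, not a reliability model).
function expectedFailuresPerDay(hosts, disksPerHost, lifetimeYears) {
  const totalDisks = hosts * disksPerHost;   // 1,000 * 10 = 10,000 disks
  const lifetimeDays = lifetimeYears * 365;  // 3 years = 1,095 days
  return totalDisks / lifetimeDays;
}

const failuresPerDay = expectedFailuresPerDay(1000, 10, 3);
console.log(failuresPerDay.toFixed(1) + " disk failures per day");
// prints: 9.1 disk failures per day
```

Under these assumptions a cluster of this size loses roughly nine disks every day, which is why the storage layer has to treat failure as the normal case.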
![Page 11: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/11.jpg)
Historical basics
Hadoop is an open-source implementation based on GFS and MapReduce from Google
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
![Page 12: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/12.jpg)
Classic Big Data architecture: Hadoop
![Page 13: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/13.jpg)
The Hadoop Distributed File System
Characteristics and Features
Distributed file system
Redundant storage
Designed to reliably store data using commodity hardware
Designed to expect hardware failures
Intended for large files
Designed for batch inserts
![Page 14: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/14.jpg)
HDFS - files and blocks
Files are stored as a collection of blocks
Blocks are 64 MB chunks of a file (the Hadoop 1.x default, 128 MB in Hadoop 2.x; configurable)
Blocks are replicated on 3 nodes (configurable)
The NameNode (NN) manages metadata about files and blocks
The SecondaryNameNode (SNN) periodically checkpoints the NN metadata by merging the edit log into the fsimage; it is not a hot standby
DataNodes (DN) store and serve blocks
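To make the block and replica numbers concrete, here is a small sketch that computes how many blocks a file occupies and how much raw disk space its replicas use. The helper `blockLayout` is hypothetical and not part of any Hadoop API; its defaults are the 64 MB block size and replication factor 3 from the slide.

```javascript
// Hypothetical helper, not a Hadoop API. Defaults follow the slide.
const MB = 1024 * 1024;

function blockLayout(fileSizeBytes, blockSizeBytes = 64 * MB, replication = 3) {
  const blocks = Math.ceil(fileSizeBytes / blockSizeBytes);
  return {
    blocks: blocks,
    blockReplicas: blocks * replication,
    // The last block only occupies as much disk space as it holds data,
    // so raw usage is the file size times the replication factor.
    rawBytes: fileSizeBytes * replication,
  };
}

const layout = blockLayout(1000 * MB);
console.log(layout.blocks, layout.blockReplicas, layout.rawBytes / MB);
// prints: 16 48 3000
```

A 1,000 MB file therefore becomes 16 blocks and 48 block replicas that the NameNode has to track, which is one reason HDFS is intended for large files rather than many small ones.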
![Page 15: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/15.jpg)
Replication
Multiple copies of a block are stored
Replication strategy:
Copy #1 on another node on same rack
Copy #2 on another node on different rack
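The placement rule on this slide can be sketched as follows. This models only the rule as stated here and is not the actual HDFS placement code, which also takes randomness and node load into account and whose default policy has varied between Hadoop versions. The function `placeReplicas` and all node and rack names are invented for illustration.

```javascript
// Sketch of the slide's rule only; node and rack names are invented.
function placeReplicas(writer, nodes) {
  // Copy #1: another node on the same rack as the writer.
  const sameRack = nodes.find(
    (n) => n.rack === writer.rack && n.name !== writer.name
  );
  // Copy #2: any node on a different rack.
  const otherRack = nodes.find((n) => n.rack !== writer.rack);
  // The original block stays on the writer's node.
  return [writer, sameRack, otherRack].filter((n) => n !== undefined);
}

const cluster = [
  { name: "dn1", rack: "rack-a" },
  { name: "dn2", rack: "rack-a" },
  { name: "dn3", rack: "rack-b" },
  { name: "dn4", rack: "rack-b" },
];

const placement = placeReplicas(cluster[0], cluster);
console.log(placement.map((n) => n.name + "@" + n.rack).join(", "));
// prints: dn1@rack-a, dn2@rack-a, dn3@rack-b
```

Spreading copies over two racks means the block survives the loss of a single node as well as the loss of a whole rack.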
![Page 16: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/16.jpg)
Failure: DataNode
DNs check in with the NN to report health (heartbeats)
Upon failure, the NN orders DNs to replicate under-replicated blocks
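The recovery step can be illustrated with a small model. In the sketch below the NameNode's view is reduced to a map from block id to the DataNodes holding that block; the function `findUnderReplicated`, the block ids, and the node names are all hypothetical and do not correspond to real HDFS internals.

```javascript
// Illustrative model only: blockMap maps a block id to the DataNodes
// that hold a copy. All names are hypothetical.
function findUnderReplicated(blockMap, liveNodes, targetReplication = 3) {
  const result = [];
  for (const [blockId, holders] of Object.entries(blockMap)) {
    const alive = holders.filter((dn) => liveNodes.has(dn));
    if (alive.length < targetReplication) {
      result.push({
        blockId: blockId,
        sources: alive, // surviving copies to replicate from
        missing: targetReplication - alive.length,
      });
    }
  }
  return result;
}

const blockMap = {
  blk_1: ["dn1", "dn2", "dn3"],
  blk_2: ["dn2", "dn3", "dn4"],
};
// dn1 stopped checking in, so it is no longer in the set of live nodes.
const live = new Set(["dn2", "dn3", "dn4"]);
console.log(JSON.stringify(findUnderReplicated(blockMap, live)));
// prints: [{"blockId":"blk_1","sources":["dn2","dn3"],"missing":1}]
```

Only `blk_1` lost a copy, so only that block needs one additional replica, copied from one of the surviving DataNodes.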
![Page 17: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/17.jpg)
Microsoft
![Page 18: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/18.jpg)
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS)
Query (Hive)
Distributed Processing (MapReduce)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Event Pipeline (Flume)
Active Directory (Security)
Monitoring & Deployment (System Center)
C#, F#, .NET
JavaScript
Pipeline / workflow (Oozie)
Azure Storage Vault (ASV)
PDW Polybase
Business Intelligence (Excel, Power View, SSAS)
World's Data (Azure Data Marketplace)
Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
![Page 19: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/19.jpg)
How does Hadoop work?
![Page 20: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/20.jpg)
Hadoop Distributed Architecture
![Page 21: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/21.jpg)
So How Does It Work?
FIRST, STORE THE DATA
(Diagram: Files are stored across several Servers)
![Page 22: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/22.jpg)
So How Does It Work?
SECOND, TAKE THE PROCESSING TO THE DATA
    // Map Reduce function in JavaScript
    // map: split a line of text into words and emit (word, 1) for each word
    var map = function (key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") {
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };

    // reduce: sum all counts emitted for one word
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next(), 10);
        }
        context.write(key, sum);
    };
(Diagram: the RUNTIME ships the Code to the Servers that hold the data)
![Page 23: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/23.jpg)
MapReduce – Workflow
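The workflow can be tried out locally with a minimal single-process stand-in for the runtime. The sketch below runs a map phase, groups the emitted values by key (the shuffle), and then calls the reducer once per key. The `context.write`, `values.hasNext`, and `values.next` shapes mirror the word count code shown earlier in the deck; `runJob` and `makeIterator` are hypothetical helpers, not part of Hadoop or HDInsight.

```javascript
// Minimal single-process stand-in for the MapReduce runtime.
// runJob and makeIterator are hypothetical helpers.
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

function runJob(lines, map, reduce) {
  // Map phase: every emitted (key, value) pair is grouped by key,
  // which plays the role of the shuffle step.
  const groups = new Map();
  const mapContext = {
    write: (k, v) => {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    },
  };
  lines.forEach((line, offset) => map(offset, line, mapContext));

  // Reduce phase: one call per key, in sorted key order.
  const output = {};
  const reduceContext = { write: (k, v) => { output[k] = v; } };
  for (const key of [...groups.keys()].sort()) {
    reduce(key, makeIterator(groups.get(key)), reduceContext);
  }
  return output;
}

// Word count with the same logic as the deck's example.
const wcMap = (key, value, context) => {
  for (const word of value.split(/[^a-zA-Z]/)) {
    if (word !== "") context.write(word.toLowerCase(), 1);
  }
};
const wcReduce = (key, values, context) => {
  let sum = 0;
  while (values.hasNext()) sum += parseInt(values.next(), 10);
  context.write(key, sum);
};

const wordCounts = runJob(["Big data is big", "Data is data"], wcMap, wcReduce);
console.log(JSON.stringify(wordCounts));
// prints: {"big":2,"data":3,"is":2}
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network; here everything happens in one process so the three phases are easy to follow.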
![Page 24: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/24.jpg)
Programming Models
Pig: data scripting language
Hive: SQL-like set-oriented language
Pegasus, Giraph: graph processing
![Page 25: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/25.jpg)
Demo
![Page 26: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/26.jpg)
Example Video Streams
![Page 27: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/27.jpg)
Meerkat API
![Page 28: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/28.jpg)
Approach
Goal: distribution of streams over day and user
C# service collects the data
Persistence in Azure
Preparation and analysis with Hive
Analysis in Excel
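The aggregation behind this goal can be sketched on a handful of records. The record shape (`user`, `startedAt`) and the sample data are invented for illustration, since the deck does not show the collected Meerkat data, and "day" is assumed to mean calendar day. In the demo this kind of grouping is done with Hive; the sketch only shows the logic on a small in-memory array.

```javascript
// Hypothetical record shape: { user: string, startedAt: ISO-8601 UTC string }.
// countBy groups records by a key and counts them, the same kind of
// aggregation a GROUP BY query would compute.
function countBy(records, keyOf) {
  const counts = {};
  for (const r of records) {
    const k = keyOf(r);
    counts[k] = (counts[k] || 0) + 1;
  }
  return counts;
}

const streams = [
  { user: "alice", startedAt: "2015-05-27T18:05:00Z" },
  { user: "bob", startedAt: "2015-05-27T21:40:00Z" },
  { user: "alice", startedAt: "2015-05-28T09:15:00Z" },
];

// The first 10 characters of an ISO-8601 timestamp are the date.
const streamsPerDay = countBy(streams, (s) => s.startedAt.slice(0, 10));
const streamsPerUser = countBy(streams, (s) => s.user);
console.log(JSON.stringify(streamsPerDay), JSON.stringify(streamsPerUser));
// prints: {"2015-05-27":2,"2015-05-28":1} {"alice":2,"bob":1}
```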
![Page 29: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/29.jpg)
Expected result
![Page 30: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/30.jpg)
Further examples
![Page 31: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/31.jpg)
Example: Social media analysis
Evaluation of social networks
• Study of media consumption behaviour
• Quantitative statistical evaluation of communication content
• Detection of trends, influencers, and competitor activity
• Use of Facebook, Twitter, and other social networks as data sources
• High data growth
• Semi-structured data formats
• Frequent changes to the data structures
![Page 32: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/32.jpg)
Source: Facebook Graph API
![Page 33: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/33.jpg)
Analysis of the results with Excel
![Page 34: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/34.jpg)
Custom MapReduce tasks
![Page 35: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/35.jpg)
Example: Analysis of free text
Text analysis of meeting minutes
• Discovery of meaning structures in unstructured or weakly structured text data
• Fast identification of the key information in the processed texts
• Detection of previously unknown relationships
• Generating, testing, and incrementally refining hypotheses
• Extraction of attitudes towards a topic using semantic algorithms
• High data growth
![Page 36: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/36.jpg)
Source: plenary minutes of the Bundestag
![Page 37: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/37.jpg)
Processing the data with Hadoop
![Page 38: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/38.jpg)
Analysis of the results with Excel
![Page 39: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/39.jpg)
DocumentDB
![Page 40: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/40.jpg)
What is Azure DocumentDB?
It is a fully managed, highly scalable, queryable, schema-free document database, delivered as a service, for modern applications.
Query against Schema-Free JSON
Multi-Document transactions
Tunable, High Performance
Designed for cloud first
![Page 41: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/41.jpg)
Azure DocumentDB Resources
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-introduction/
![Page 42: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/42.jpg)
DocumentDB data model
![Page 43: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/43.jpg)
Management in Azure
![Page 44: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/44.jpg)
Display as a web page
![Page 45: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/45.jpg)
Traditional RDBMS vs. MapReduce

| | Traditional RDBMS | MapReduce |
|---|---|---|
| Data size | Gigabytes (Terabytes) | Petabytes (Exabytes) |
| Access | Interactive and batch | Batch |
| Updates | Read / write many times | Write once, read many times |
| Structure | Static schema | Dynamic schema |
| Integrity | High (ACID) | Low |
| Scaling | Nonlinear | Linear |
| DBA ratio | 1:40 | 1:3000 |

Reference: Tom White's Hadoop: The Definitive Guide
![Page 46: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/46.jpg)
Do I really need Hadoop?
(Chart: technologies positioned by Velocity, from Batch to Realtime, and by Variety, from Highly Structured to Poly Structured. Regions shown: Generalized NoSQL, Hadoop, Standard SQL or MPP Appliances, Specialized NoSQL, Streaming, In-Memory Analytics)
![Page 47: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/47.jpg)
Outlook: data management processes
Goal: combine Big Data pipelines
Control and administer services
Product: Azure Data Factory
![Page 48: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/48.jpg)
Data Factory Concepts
Data Set: collection of files, DB table, etc.
Activity: a processing step (Hadoop job, custom code, ML model, etc.)
Pipeline: a sequence of activities (logical group)
(Diagram: Data Sources → Ingest → Transform & Analyze → Publish. Labels: On Premises Data Mart, Call Log Files, Customer Table, Azure Blob Storage, Azure DB, Customer Call Details, Customers Likely to Churn, Customer Churn Table, Visualize. Steps: Move; Transform, Combine, etc.; Analyze)
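The three concepts (data set, activity, pipeline) can be modelled in a few lines. The sketch below is a conceptual model only: it is not the Azure Data Factory API or its definition format, and all function names, sample rows, and the churn threshold are invented. A data set is an array of rows, an activity is a function from data sets to a data set, and a pipeline is an ordered list of activities.

```javascript
// Conceptual model only, not the Azure Data Factory API.
// All names, rows, and thresholds are invented for illustration.
function runPipeline(activities, input) {
  // Feed the output of each activity into the next one.
  return activities.reduce((data, activity) => activity(data), input);
}

// "Transform, Combine" step: attach the number of calls to each customer.
const joinCallsWithCustomers = ({ calls, customers }) =>
  customers.map((c) => ({
    ...c,
    callCount: calls.filter((call) => call.customerId === c.id).length,
  }));

// Stand-in for the "Analyze" step: a fixed threshold instead of an ML model.
const flagLikelyChurn = (rows) =>
  rows.map((r) => ({ ...r, likelyToChurn: r.callCount >= 3 }));

const churnTable = runPipeline([joinCallsWithCustomers, flagLikelyChurn], {
  calls: [
    { customerId: 1 }, { customerId: 1 }, { customerId: 1 }, { customerId: 2 },
  ],
  customers: [{ id: 1, name: "A" }, { id: 2, name: "B" }],
});
console.log(JSON.stringify(churnTable));
// prints: [{"id":1,"name":"A","callCount":3,"likelyToChurn":true},{"id":2,"name":"B","callCount":1,"likelyToChurn":false}]
```

Because each activity only depends on the data set handed to it, steps can be swapped or reordered without touching the others, which is the point of grouping them into a pipeline.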
![Page 49: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/49.jpg)
Summary
Data analysis is changing
Weigh up technologies (JSON in Integration Services)
Data analysts are not becoming redundant
The toolset has to grow
Cool lecture for further study: http://blogs.ischool.berkeley.edu/i290-abdt-s12/
![Page 50: Microsoft on big data](https://reader030.vdocument.in/reader030/viewer/2022032506/55c9dad4bb61eb1d4d8b45a5/html5/thumbnails/50.jpg)
Thank you!