Building Big Data Solutions on Azure
Post on 16-Apr-2017
© Copyright SELA Software & Education Labs Ltd. | 14-18 Baruch Hirsch St Bnei Brak, 51202 Israel | www.selagroup.com
Eyal Ben Ivri
Building Big Data Solutions on Azure
About me
Eyal Ben Ivri
Big Data & Cloud Architect, Sela Group
Focus on the Hadoop ecosystem and Big Data + NoSQL solutions
Modern Data – The Big Picture
IoT
User Data
Media Files
Documents
Machine Data
Log Files
The Light Rail problem – TLV Railway
Imagine the new light rail maintenance company
IoT – Internet of Trains (and cameras, and cash registers and carts and rails and more…)
Analyze data in stream and in batch
Dashboards
Alerts
The perfect problem
What We Need
An integrated data solution that will be:
Able to process events from external sources
Able to walk data through different pipelines
Fast and responsive
Big-Data ready
In Other Words
Consume: BI, Dashboards, Applications
Process: ETL, Aggregations, Computation, Analysis, Querying
Persist: Hadoop, SQL, NoSQL
Ingest: IoT, Structured Data, Unstructured Data
Microsoft Azure Services for IoT and BigData
Devices
Device Connectivity: Event Hubs, Service Bus, External Data Sources
Storage: SQL Database, Table/Blob Storage, DocumentDB, Data Lake Store, External Data Sources
Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory, Data Lake Analytics
Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services
Event Hub
Messages at scale
Why not throw it into a queue, and have a listener at the backend?
Scaling limits, because of the architecture of queues and topics of a standard Service Bus
Event Hub uses a partition model
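The partition model can be illustrated with a small sketch. This is not the Event Hubs implementation: the hash function, the `partition_for` helper, the partition count and the train names are all assumptions made for illustration. The only point is that events sharing a partition key land in the same partition, in order, so each partition can be read by its own consumer.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 4  # hypothetical value; the service lets you pick the count


def partition_for(partition_key: str, partition_count: int = PARTITION_COUNT) -> int:
    """Map a partition key to a partition index (illustrative hash only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(events):
    """Group (key, payload) events by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key)].append(payload)
    return dict(partitions)


# Hypothetical telemetry: two trains sending events.
events = [
    ("train-7", "door open"),
    ("train-9", "speed 40"),
    ("train-7", "door closed"),
]
routed = route(events)
for index, payloads in sorted(routed.items()):
    print(index, payloads)
```

Because a queue has a single ordered stream, adding listeners does not add parallel read capacity in the same way; with partitions, throughput grows with the number of partitions and consumers.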
Getting Started
Easy to set up
Two configurations:
Partition Count – depends on the number of consumers (2-32)
Message Retention – between 1 and 7 days
Secured using SAS policies
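A sender authenticates with a token derived from a SAS policy. The sketch below assumes the token shape documented for Service Bus and Event Hubs (an HMAC-SHA256 signature over the URL-encoded resource URI and an expiry time). The namespace, hub name, policy name and key are hypothetical placeholders, and `make_sas_token` is an illustrative helper, not an SDK function.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def make_sas_token(resource_uri: str, policy_name: str, policy_key: str,
                   ttl_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Build a SharedAccessSignature token string for the given resource."""
    issued_at = time.time() if now is None else now
    expiry = str(int(issued_at + ttl_seconds))
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The signed string is the encoded URI and the expiry, separated by a newline.
    string_to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    digest = hmac.new(policy_key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return "SharedAccessSignature sr={}&sig={}&se={}&skn={}".format(
        encoded_uri, signature, expiry, policy_name)


token = make_sas_token(
    "https://example-ns.servicebus.windows.net/trains",  # hypothetical namespace and hub
    policy_name="send-only",                              # hypothetical policy name
    policy_key="not-a-real-key",                          # placeholder key
)
print(token)
```

The token is sent in the `Authorization` header of an HTTP request; a policy that grants only send rights limits what a compromised device can do.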
IoT with Event Hubs
[Diagram labels: Devices (RTOS, Linux, Windows, Android, iOS); Field Gateway; Protocol Adaptation; Cloud Gateway: Event Hubs; Device Connectivity & Management]
Event Hubs
• High scale telemetry ingestion service
• HTTP/AMQP protocol support
• Each Event Hub supports
  • 1 million publishers
  • 1 GB/s ingress
• Generally available worldwide
• 18 billion messages per day
• 60+ TB ingested per day
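To put the daily figures above in per-second terms, here is a short back-of-the-envelope calculation. It assumes decimal terabytes and treats "60+ TB" as exactly 60 TB, so the results are rough lower bounds and averages, not peak rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 18_000_000_000  # from the slide
bytes_per_day = 60 * 10**12        # "60+ TB", assuming 1 TB = 10**12 bytes

messages_per_second = messages_per_day / SECONDS_PER_DAY
megabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6
average_message_bytes = bytes_per_day / messages_per_day

print(f"{messages_per_second:,.0f} messages/s on average")
print(f"{megabytes_per_second:,.0f} MB/s on average")
print(f"{average_message_bytes:,.0f} bytes per message on average")
```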
IoT & Data Processing Patterns
Device Connectivity & Management
Devices: RTOS, Linux, Windows, Android, iOS
Field Gateway
Protocol Adaptation
Cloud Gateway: Event Hubs & IoT Hub
Analytics & Operationalized Insights
Batch Analytics & Visualizations: Azure HDInsight, Azure ML, Power BI, Azure Data Factory
Hot Path Analytics: Azure Stream Analytics, Azure HDInsight Storm
Hot Path Business Logic: Service Fabric & Actor Framework
IoT Scale Object Models & Business Logic
Find insights to
• Power new services
• Improve your “things”
Operationalize your insights in real time
TLV Railway
Can now ingest millions of messages each second
These messages carry data from:
Devices
End-machines
Servers
Next, we need to use this data to create real-time alerts when something goes wrong
Azure Stream Analytics
Mission critical reliability and scale
• Automatic recovery
• Monitoring and alerting
• Scale on demand
Fully managed real-time analytics
• Managed cloud service
• Each unit handles 1 MB/s
• Can scale up to 1 GB/s
Enables rapid development
• SQL-like language
• Temporal windowing semantics
• Support for reference data
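The per-unit throughput figures above allow a rough capacity estimate. The sketch below assumes throughput scales linearly with the number of units and takes 1 GB as 1000 MB; the event rate and event size in the example are hypothetical, and `units_needed` is an illustrative helper.

```python
import math

MB_PER_SECOND_PER_UNIT = 1  # from the slide: each unit handles 1 MB/s
MAX_MB_PER_SECOND = 1000    # from the slide: up to 1 GB/s (assuming 1 GB = 1000 MB)


def units_needed(events_per_second: int, average_event_bytes: int) -> int:
    """Estimate how many units a stream needs, assuming linear scaling."""
    mb_per_second = events_per_second * average_event_bytes / 1_000_000
    if mb_per_second > MAX_MB_PER_SECOND:
        raise ValueError("load exceeds the 1 GB/s ceiling quoted on the slide")
    return max(1, math.ceil(mb_per_second / MB_PER_SECOND_PER_UNIT))


# Hypothetical load: 50,000 events per second at 200 bytes each (10 MB/s).
print(units_needed(50_000, 200))
```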
Stream Analytics – Main Concepts
Inputs
• Can be stream or reference data (metadata)
• Stream data sources can be Event Hub, Blob Storage (using blobs with timestamps) or IoT Hub (preview)
• Supported serialization types are CSV, JSON, and Avro
Query
• A SQL query that selects from input(s) and writes results to output(s)
Output
• Can be Blob, SQL, Event Hub (notification), Power BI (preview), Table storage, Service Bus or DocumentDB
Tumbling Windows
How many trains entered each station every 5 minutes?
SELECT TrainId, COUNT(*) FROM EntryStream GROUP BY TrainId, TumblingWindow(minute,5)
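What the query above computes can be mimicked in plain Python. This is only a sketch of the windowing idea, not how Stream Analytics runs it: the sample events are made up, and windows are treated as half-open intervals [start, start + 5 minutes) labeled by their start time, which is a simplification of the service's own boundary rules.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    """Return the start of the 5-minute tumbling window containing ts."""
    return EPOCH + ((ts - EPOCH) // WINDOW) * WINDOW


def tumbling_counts(events):
    """Count events per (TrainId, window), like GROUP BY TrainId, TumblingWindow."""
    counts = Counter()
    for train_id, ts in events:
        counts[(train_id, window_start(ts))] += 1
    return counts


# Hypothetical entry events: (TrainId, event time).
events = [
    ("T1", datetime(2017, 4, 16, 8, 1)),
    ("T1", datetime(2017, 4, 16, 8, 4)),
    ("T2", datetime(2017, 4, 16, 8, 4)),
    ("T1", datetime(2017, 4, 16, 8, 6)),
]
counts = tumbling_counts(events)
for (train_id, start), count in sorted(counts.items()):
    print(train_id, start.time(), count)
```

Each event falls into exactly one window, which is what distinguishes tumbling windows from the overlapping kinds.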
Temporal Windows
Tumbling Window
A series of fixed-sized, non-overlapping and contiguous time intervals
Hopping Window
Scheduled overlapping windows
Sliding Window
Outputs events only for those points in time when the content of the window actually changes
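The difference between tumbling and hopping windows is easiest to see by asking which windows a single event belongs to. The sketch below is an illustration under simplifying assumptions: windows are half-open intervals [start, start + size), aligned to the Unix epoch, and `hopping_window_starts` is a hypothetical helper rather than anything provided by Stream Analytics.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def hopping_window_starts(ts: datetime, size: timedelta, hop: timedelta):
    """Return the start of every window [start, start + size) that contains ts."""
    start = EPOCH + ((ts - EPOCH) // hop) * hop  # latest window start <= ts
    starts = []
    while start + size > ts:
        starts.append(start)
        start -= hop
    return sorted(starts)


event_time = datetime(2017, 4, 16, 8, 7)

# 10-minute windows that start every 5 minutes overlap, so the event is in two.
overlapping = hopping_window_starts(event_time, timedelta(minutes=10), timedelta(minutes=5))

# When the hop equals the size, the windows no longer overlap: a tumbling window.
tumbling = hopping_window_starts(event_time, timedelta(minutes=5), timedelta(minutes=5))

print([s.time() for s in overlapping])
print([s.time() for s in tumbling])
```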
TLV Railway
Can now respond in near-real-time to events as they happen
Track and maintain malfunctioning equipment
Receive real-time data regarding customers entering and leaving stations
Data can now be processed, so we need a place to save it, preferably at scale.
DocumentDB and Azure Data Services
A fully managed, scalable, queryable, schema-free JSON document database service for modern applications
Fully featured RDBMS: transactional processing, rich query, managed as a service
Elastic scale, internet accessible (HTTP/REST), schema-free data model, arbitrary data formats
DocumentDB features
JSON documents
SQL support
LINQ support
REST API support
JS support (triggers, UDFs, stored procedures)
Automatic indexing
Multi-document transactions
Tunable consistency
DocumentDB Key Concept
Collection
A collection of documents
Not a table (different entities can go into the same collection)
Collections = Partitions
Not just logical containers, but physical ones
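The "not a table" point can be shown without any database at all. In this sketch a plain Python list stands in for a collection, and the documents, field names and the `type` discriminator are hypothetical; it does not use the DocumentDB SDK or its query language.

```python
import json

# Two kinds of entity with different shapes living side by side.
collection = [
    {"id": "sensor-msg-1", "type": "sensorMessage", "trainId": "T1", "speedKmh": 42},
    {"id": "station-3", "type": "station", "name": "Station 3", "platforms": 2},
    {"id": "sensor-msg-2", "type": "sensorMessage", "trainId": "T2", "speedKmh": 0},
]


def query(docs, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


stopped = query(collection, type="sensorMessage", speedKmh=0)
print(json.dumps(stopped, indent=2))
```

No schema had to be declared for either entity; a document that lacks a queried field simply does not match.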
Demo
TLV Railway – Part 1
TLV Railway
Can now store its data in a highly scalable store
Great for interactive querying of any data:
Messages from sensors
Reference data
But this data (and other data) needs to move to other places (SQL, batch processing, ML). How?
What is Azure Data Factory?
Azure Data Factory is a managed service to produce trusted information from data stored in the cloud and on-premises. Easily create, orchestrate and schedule highly available, fault-tolerant workflows to move and transform your data at scale.
Evolving Approaches to Analytics
[Diagram, first flow: OLTP, ERP, LOB, … → Extract (original data) → Transform in an ETL tool (SSIS, etc.) → Load (transformed data) into an EDW (SQL Server, Teradata, etc.) → BI tools]
[Diagram, second flow: devices, web, sensors, social, streaming data → Ingest (original data) into scale-out storage & compute (HDFS, Blob Storage, etc.), the data lake(s) → Transform & load → data marts → dashboards, apps]
Data Factory – Main concepts
Data Store
A data source/sink component
SQL (Azure or on-premises), Storage, DocumentDB and more
Data Set
A defined data set that is contained inside a data store
One data store can have many data sets
Compute
A service for computation
HDInsight, Azure Batch, Data Lake Analytics, Azure ML
Data Factory – Main concepts
Pipeline
A set of instructions
“Take data from data set A and move to compute, then store results in data set B”
Slices
Everything is time sliced
A data set (source) can declare the time intervals on which its data can be sliced, and the pipeline is activated when a new slice is ready
JSON
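The pipeline and slice concepts can be sketched in a few lines of Python. This is a toy model, not the Data Factory engine or its JSON definitions: the `slices` and `run_pipeline` helpers, the in-memory source and sink, and the passenger counts are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def slices(start: datetime, end: datetime, interval: timedelta):
    """Yield (slice_start, slice_end) pairs covering [start, end)."""
    current = start
    while current < end:
        yield current, min(current + interval, end)
        current += interval


def run_pipeline(source, transform, sink, start, end, interval):
    """Run transform once per time slice of source and store the result in sink."""
    for slice_start, slice_end in slices(start, end, interval):
        rows = [r for r in source if slice_start <= r["ts"] < slice_end]
        sink[slice_start] = transform(rows)


# Hypothetical data set A: passenger counts reported by stations.
source = [
    {"ts": datetime(2017, 4, 16, 8, 10), "passengers": 12},
    {"ts": datetime(2017, 4, 16, 8, 50), "passengers": 7},
    {"ts": datetime(2017, 4, 16, 9, 5), "passengers": 3},
]
sink = {}  # stands in for data set B

run_pipeline(
    source,
    transform=lambda rows: sum(r["passengers"] for r in rows),
    sink=sink,
    start=datetime(2017, 4, 16, 8),
    end=datetime(2017, 4, 16, 10),
    interval=timedelta(hours=1),
)
for slice_start, total in sorted(sink.items()):
    print(slice_start, total)
```

Because each slice is processed independently, a failed slice can be rerun without touching the others.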
TLV Railway
Can now integrate different services and different data sources
Move data with ease and as little hassle as possible
What about aggregations and deeper dives into the data, for more complex analysis?
HDInsight
Hadoop-as-a-Service
Based on the Hortonworks distribution
A few flavors:
Hadoop (Windows + Linux)
Storm (Windows + Linux)
HBase (Windows + Linux)
Spark (Windows + Linux)
Hadoop vs. Relational DB
[Comparison table; only the row labels survive: Data size, Access, Updates, Structure, Integrity, Scaling]
Demo
TLV Railway – Part 2
TLV Railway - Summary
Can now perform advanced analytics on top of large amounts of data, in a variety of formats (not just structured, boring data)
Can integrate all the loose ends of data coming in with data generated in “old-school” data platforms like SQL, collected from line-of-business applications
We’ve covered data ingestion, responding in real time, querying, storing and processing
Azure Stack
Hadoop and OSS vs. Azure IoT and BigData Ecosystem
Azure Ecosystem        OSS
Event Hubs             Kafka
Stream Analytics       Storm
HDInsight              Hadoop
Map Reduce             Map Reduce
Hive                   Hive
Spark                  Spark
HBase                  HBase
Azure ML               Mahout
Data Factory           Pig
DocumentDB             MongoDB / Couchbase
Data Lake (preview)
Is “TLV Railway” fake?
London did it first
Summary
Event production: applications, legacy IoT (custom protocols), devices: IP-capable devices (Windows/Linux), low-power devices (RTOS)
Collection: field gateways, cloud gateways (web APIs)
Ingestion: Event Hubs
Stream analysis: Stream Analytics
Storage and batch analysis: Storage, SQL DB, DocumentDB, HDInsight, Data Factory, Machine Learning
Presentation and action: search and query, data analytics (Power BI), web/thick client dashboards, devices to take action
Get started today at http://azure.microsoft.com
Questions