PARALLEL & SCALABLE MACHINE LEARNING & DEEP LEARNING
Prof. Dr.-Ing. Morris Riedel
Adjunct Associated Professor
School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany
Data Streaming Tools & Applications
November 27th, 2018
Room V02-138
Cloud Computing & Big Data
SHORT LECTURE 15
Review of Lecture 14 – Online Social Networking & Graphs
Online Social Networking (OSN)
Short Lecture 15 – Data Streaming Tools & Applications
Graph Theory & Applications
[1] Top 15 most popular social networking sites
(‘big data’ from OSN services is accumulating in an inhomogeneous & highly unstructured way)
[2] www.mashable.com
(entities & relationships characteristics)
(use of ‘social graph‘ databases & traversal methods)
Online advertising uses many machine learning algorithms
[3] Distributed & Cloud Computing Book
(e.g. associations & objects – TAO graph database)
Outline of the Course
1. Cloud Computing & Big Data
2. Machine Learning Models in Clouds
3. Apache Spark for Cloud Applications
4. Virtualization & Data Center Design
5. Map-Reduce Computing Paradigm
6. Deep Learning driven by Big Data
7. Deep Learning Applications in Clouds
8. Infrastructure-As-A-Service (IAAS)
9. Platform-As-A-Service (PAAS)
10. Software-As-A-Service (SAAS)
11. Data Analytics & Cloud Data Mining
12. Docker & Container Management
13. OpenStack Cloud Operating System
14. Online Social Networking & Graphs
15. Data Streaming Tools & Applications
16. Epilogue
+ additional practical lectures for our
hands-on exercises in context
Practical Topics
Theoretical / Conceptual Topics
Outline
This is only a short lecture
Goal is to provide a few pointers to other advanced related university courses/topics
‘Data streaming tools’ & their applications need a full course & substantial tutorial
Links previous Lectures & Practical Lectures with further material to study & research
Data Streaming Tools & Applications
Data Streams & Interactive Access & Visual Analytics
Apache Flume Data Streams Analysis & Interceptors
Apache Spark Streaming Library & AWS Cloud Products
Apache Flink In-Memory Approaches & Libraries
Online Learning Approach in Machine Learning
What are Data Streams?
Data streams stand for ‘real-time data’ with ‘high velocity’ exchanged between n systems
Data streams originate from the fine-granular logs of ‘actions/events per user’ from Web pages
Data streams can also originate from large measurement devices or large computational systems
‘Web Actions’
Data stream with data ‘hidden’ in logs
‘Measurement Device’
Data stream with measured data
‘Computational Simulation’
Data stream with data of HPC/HTC simulation
Data Sources
Data Sinks
n optional filters
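The flow from data sources through n optional filters to data sinks can be sketched with plain Python generators. This is only a minimal illustration, not taken from any streaming tool: the record layout, the function names (`web_action_source`, `only_clicks`, `number_events`, `run_pipeline`) and the comma-separated log format are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Filter = Callable[[Iterator[Record]], Iterator[Record]]


def web_action_source(log_lines: Iterable[str]) -> Iterator[Record]:
    """Data source: turn raw log lines ("user, action") into records, one at a time."""
    for line in log_lines:
        user, action = line.split(",", 1)
        yield {"user": user.strip(), "action": action.strip()}


def only_clicks(records: Iterator[Record]) -> Iterator[Record]:
    """Optional filter: keep only click events."""
    return (r for r in records if r["action"] == "click")


def number_events(records: Iterator[Record]) -> Iterator[Record]:
    """Optional filter: attach a running sequence number to each record."""
    for seq, record in enumerate(records):
        yield {**record, "seq": seq}


def run_pipeline(source: Iterator[Record], filters: List[Filter], sink: List[Record]) -> List[Record]:
    """Chain the n filters lazily and push every surviving record into the sink."""
    stream = source
    for apply_filter in filters:
        stream = apply_filter(stream)
    for record in stream:
        sink.append(record)
    return sink
```

Because every stage is a generator, records are processed one by one as they arrive instead of being collected first, which is the property that distinguishes a stream from a batch.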
Crowdsourcing
Usual Citizens / ‘Citizen Scientist’
Data streams with data (low trust)
Individuals with domain as Hobby
Data streams with data (moderate trust)
Scientific/Engineering Domain Experts
Data streams with data (high trust)
Data volumes: TBs, PBs, EBs
Crowdsourcing is the practice of obtaining data (streams) by soliciting contributions from a large group of people (e.g. online community) rather than from traditional employees/experts
The large amount of data (streams) obtained from crowdsourcing can originate from a wide variety of technical sources (e.g. handheld devices, phone, desktop computer, automatic sensor, etc.)
modified from [4] Wikipedia on ‘Crowdsourcing‘
Data-Stream Challenges – Quality Control
Streams of (big) data represent a difficult challenge for many software systems, because of their ‘high volume & velocity(!)’
Different data formats and data from different sources make storing and potentially ‘ad-hoc’ analysis/transformation hard
Missing data sets or error-prone measurements are hard to detect or to correct due to the high velocity of data streams
Quality Control Options (i.e. getting trust in data)
‘Correlate and/or validate’ data set with data from n different sources (e.g. overlap in measurement regions from devices)
‘Missing values or Outlier detection’ (e.g. by using statistical data mining methods or visualizations)
Outliers are data objects that have characteristics that are different from most of the other data objects in the data set (e.g. are clearly visible in visualizations, boxplot, etc.)
Outliers can also be values of an attribute that are unusual with respect to the typical values
modified from [5] Data-Mining Book
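One simple statistical option for outlier detection on a stream is the boxplot rule (values outside 1.5 times the interquartile range) applied to a sliding window of recent values. The sketch below is an illustration under assumptions, not a method prescribed by the lecture: the function name `iqr_outliers`, the window size, and the choice to keep flagged values out of the history are all made up for the example.

```python
from collections import deque
from statistics import quantiles
from typing import Iterable, Iterator


def iqr_outliers(stream: Iterable[float], window: int = 20, k: float = 1.5,
                 min_history: int = 8) -> Iterator[float]:
    """Yield values that fall outside the boxplot whiskers of the recent window."""
    history = deque(maxlen=window)  # only the last `window` accepted values are kept
    for value in stream:
        if len(history) >= min_history:
            q1, _median, q3 = quantiles(history, n=4)
            iqr = q3 - q1
            if value < q1 - k * iqr or value > q3 + k * iqr:
                yield value
                continue  # assumption: flagged values do not enter the history
        history.append(value)
```

Working on a bounded window keeps memory constant regardless of stream length, which matters given the high velocity mentioned above; the price is that slow drifts in the data shift the whiskers along with them.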
Data-Stream Challenges – Persistent Identifiers (PIDs)
The nature of data streams often creates ‘unfinished datasets’ or ‘open time series data’ that make it hard to assign persistent identifiers (PIDs) to them
Trade-off between a huge number of PIDs (millions, one per observation) or PIDs for collections (coarse)
How to reference real-time data streams?
‘Measurement Device’
Data stream with measured data
‘Big Instrument’
Permanently storing data (of a subset)
Storage
How can we update unfinished datasets and add missing values later?
PID Assignment
Filter
Enabling Interactive Access
Interactive access enables real-time access and often some form of remote control of a device
Goal is to influence at the source its outgoing n data (objects) that are part of the data streams
Interactive access may be hindered due to security implications or scheduling conflicts (i.e. batch)
‘Web Actions’
Data stream with data ‘hidden’ in logs
‘Measurement Device’
Data stream with measured data
‘Computational Simulation’
Data stream with data of HPC/HTC simulation
Data Sources
Data Sinks
Changes of parameters for Web applications (e.g. video timeline click)
Steer measurement device parameters (e.g. change angles, resolution)
Steer HPC simulation parameters on the fly (e.g. particle positions)
Computational Steering on Large-scale HPC Resources
Application Type Examples
Mechanic simulation element (e.g. open/shut doors in air ventilation flow)
Guide computational application to regions of interest (e.g. black hole)
Application Software exists to ‘instrument parallel code’
COVS Package, SciRun, etc.
The automated computation of results is influenced periodically with computational steering
Computational steering requires a ‘bi-directional data-stream channel’ to transport data from HPC resources to the visualization & steering parameters from the visualization to the HPC resources
The computational steering ‘method’ is used for parameter space exploration in MPI applications
‘Computational Simulation (e.g. using MPI)’
Data stream with data objects of HPC simulation
Steer HPC simulation parameters on the fly (e.g. particle positions)
HPC Resource Visualization
[12] M. Riedel et al., COVS [13] University of Utah, SciRun
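The bi-directional channel behind computational steering can be illustrated with a toy, single-process loop. This is a sketch under assumptions and has nothing to do with the actual COVS or SciRun interfaces: the generator output plays the role of the data stream towards the visualization, a `queue.Queue` plays the role of the steering channel back to the simulation, and the one-particle "simulation" and all names are hypothetical.

```python
import queue
from typing import Dict, Iterator, List


def simulation(steps: int, steering_in: "queue.Queue[Dict[str, float]]",
               dt: float = 0.5) -> Iterator[Dict[str, float]]:
    """Toy simulation: one particle moving with a velocity that can be steered."""
    position, velocity = 0.0, 1.0
    for step in range(steps):
        # Apply all steering parameters that arrived since the previous step.
        while True:
            try:
                update = steering_in.get_nowait()
            except queue.Empty:
                break
            velocity = update.get("velocity", velocity)
        position += velocity * dt
        # Stream the current status out to the (hypothetical) visualization.
        yield {"step": step, "position": position, "velocity": velocity}


def demo() -> List[float]:
    """Visualization side: watch the status and steer after the second step."""
    steering = queue.Queue()
    positions = []
    for status in simulation(4, steering):
        positions.append(status["position"])
        if status["step"] == 1:
            steering.put({"velocity": 3.0})  # change a parameter on the fly
    return positions
```

In a real MPI application the steering parameters would additionally have to be distributed to all ranks at a well-defined point of the time-step loop; the sketch only shows the periodic check-and-apply pattern.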
HPC Computational Steering Applications
Scientific case: research star cluster dynamics in astrophysics
Parallel computing application using nbody6++ parallel code
Steering: changing parameters during the run-time of simulation
[12] M. Riedel et al., computational steering, 2007
change parameters interactively
visualize status
MPI
Visual Analytics in Human Brain Research – Example
Interactive access is a ‘key ingredient’ for advanced visual analytics to enable cross updates of data and ad-hoc processing
Change of data focus in a display might recompute data in another display
[11] T. Kuhlen, ‘Visual Analysis of Human Brain Simulations’
Apache Flume Data Stream Tool in Clouds
Apache Flume as ‘open source tool’
Efficiently collects, aggregates, and moves large amounts of ‘log data’
Provides a distributed, reliable, and available service (in context of HDFS, cf. Lecture 5)
Apache Flume ‘data flow model’ to process data streams
[6] Apache Flume
‘Flume events’ are units of data flow having a ‘byte payload’ and an optional set of ‘string attributes’
External sources (e.g. Web servers) send ‘Flume events’ in a format that is recognized by the target Flume source
Modified from [6] Apache Flume
Apache Flume uses ‘channels‘ to process ‘event data streams‘
‘Multi-hop flows’ enable events to travel through multiple agents
Sets of events are ‘reliably’ passed from point to point (transaction)
Modify/drop events in-flight via various ‘interceptors’ (e.g. filters)
Apache Flume Data Stream Tool – Channels & Interceptors
Modified from [6] Apache Flume
When a Source receives an event, the event is ‘stored into n channels’
Channels are ‘passive stores’ that keep the events until they are consumed by a sink
Example: ‘File Channel’ stores on local filesystem until the ‘HDFS Flume Sink’ puts the file into HDFS
(interceptors)
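The event, interceptor, channel and sink concepts can be modelled in a few lines of Python to make the data flow concrete. This is a conceptual sketch only and not Flume's API (Flume agents are implemented in Java and wired together through a properties file); the class and function names, the `host` header value and the in-memory deque used as a channel are all assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Optional


@dataclass
class Event:
    """Unit of data flow: a byte payload plus optional string attributes."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


Interceptor = Callable[[Event], Optional[Event]]


def add_host(event: Event) -> Optional[Event]:
    """Interceptor that modifies an event in flight (hypothetical host name)."""
    return Event(event.body, {**event.headers, "host": "web-01"})


def drop_debug(event: Event) -> Optional[Event]:
    """Interceptor that drops an event in flight by returning None."""
    return None if event.body.startswith(b"DEBUG") else event


class Source:
    """Receives events, runs the interceptors, stores survivors into n channels."""

    def __init__(self, interceptors: List[Interceptor], channels: List[Deque[Event]]):
        self.interceptors = interceptors
        self.channels = channels

    def receive(self, event: Event) -> None:
        current: Optional[Event] = event
        for intercept in self.interceptors:
            current = intercept(current)
            if current is None:
                return  # event was dropped
        for channel in self.channels:
            channel.append(current)


def drain(channel: Deque[Event], sink: List[Event]) -> None:
    """Sink side: the passive channel keeps events until the sink consumes them."""
    while channel:
        sink.append(channel.popleft())
```

The sketch leaves out what makes the real tool reliable, namely the transactional hand-over between channel and sink and durable channels such as the File Channel.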
Apache Spark Streaming Library – Revisited (cf. Lecture 3)
[7] Apache Spark
The Apache Spark Streaming library lets users write streaming jobs the same way they would write batch jobs, or combine streaming with batch and interactive queries or filters
Usage example
Combine streaming with batch and interactive queries or using ‘windows’
E.g. find words with higher frequency than historic data
Recovers lost work without any extra code for users to write (e.g. cf. RDD fault tolerance features)
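The ‘windows’ idea and the "words with higher frequency than historic data" example can be illustrated without a cluster. The following is plain Python, not the Spark Streaming API: each string stands in for one micro-batch, and the function name `trending_words`, the `historic_rate` table (expected occurrences of a word per batch) and the `factor` threshold are assumptions chosen for the sketch.

```python
from collections import Counter, deque
from typing import Dict, Iterable, Iterator, List


def trending_words(batches: Iterable[str], historic_rate: Dict[str, float],
                   window: int = 2, factor: float = 2.0) -> Iterator[List[str]]:
    """For each micro-batch, list words that are unusually frequent in the window."""
    recent = deque(maxlen=window)  # word counts of the last `window` batches
    for batch in batches:
        recent.append(Counter(batch.lower().split()))
        totals = sum(recent, Counter())  # combined counts over the sliding window
        yield sorted(
            word for word, count in totals.items()
            # Compare against what the historic rate predicts for this many batches.
            if count > factor * historic_rate.get(word, 0.0) * len(recent)
        )
```

The historic table is where streaming and batch meet: it would typically be computed by a batch job over stored data, while the window counts come from the live stream.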
AWS Cloud Service Portfolio – Analytics (cf. Lecture 5)
Multiple analytics products
Extracting insights and actionable information from data requires technologies like analytics & machine learning
Products & Usage
Amazon Athena: Serverless Query Service
Amazon Elastic MapReduce: Hadoop
Amazon Elasticsearch Service: Elasticsearch on AWS
Amazon Kinesis: Streaming Data
Amazon QuickSight: Business Analytics
Amazon Redshift: Data Warehouse
…
[8] AWS Web page
Amazon Kinesis offers multiple analytics products that support extracting insights and actionable information from data using technologies like data analysis & machine learning
Amazon Kinesis consists of the Kinesis Video Streams, Kinesis Data Streams, and Kinesis Data Firehose products that re-use other AWS products (e.g. SageMaker) or TensorFlow
[9] Amazon Kinesis
AWS Cloud Service Portfolio – Amazon Kinesis Example
Multiple analytics products
Combined together depending on application demands: Video Streams, Data Streams, Data Firehose
Re-use of other AWS services: e.g. Amazon SageMaker, Spark on Elastic MapReduce, TensorFlow / MXNet
E.g. application analyzing text data streams obtained from Twitter API
[9] Amazon Kinesis
[8] AWS Web page
(cf. Lecture 5)
(cf. Lecture 7 & 10)
AWS Sagemaker Example – Revisited (cf. Lecture 10)
AWS Cloud – Amazon SageMaker
Fully managed service that enables quick & easy machine & deep learning applications (cf. installation overheads of many required frameworks)
[15] AWS – Amazon SageMaker
AWS Amazon SageMaker is a SAAS oriented service that provides fully managed instances running Jupyter notebooks that include examples for training & tuning various machine learning models
[16] Jupyter Web page
In-Memory Technology: Apache Flink
New Features
Open source platform
Enables distributed stream and batch processing
Data streaming core runtime
Experience from Practice
Apache projects need to be carefully reviewed w.r.t. functionality promised & stability
[10] Apache Flink
Apache projects evolve with enormous speed, and while they offer good functionality and performance gains for certain applications, their stability can sometimes be problematic
The maturity of projects like Apache Mahout, Apache Spark, or Apache Flink cannot be compared to the MPI/OpenMP standards, whose implementations evolved over decades
[10] YouTube, Apache Spark
Apache Flink – Streaming & In-Memory
Flink/Storm different approach to distributed computing
Provides a streaming dataflow engine approach
Offers fast techniques for data distribution, communication, and fault tolerance for distributed computing using data streams
Apache Flink libraries and APIs
DataStream API: streams embedded in Java/Scala
DataSet API: static data embedded in Java/Scala/Python
Table API: SQL-enabled language embedded in Java/Scala
Complex event processing (CEP) library
Machine Learning Library
Gelly as graph processing API and library
[10] Apache Flink
One general criterion used to categorize machine learning systems is whether or not the model and/or system can learn incrementally from a stream of incoming ‘big data’
The ‘online learning’ approach trains a system incrementally by feeding it data instances sequentially – either individually or by small groups that are called ‘mini-batches’
The goal of ‘online learning’ is to be fast and cheap so that the system can learn about new data elements on the fly as they arrive in data streams or other sources
Online Learning Approach in Machine Learning
modified from [17] Book ‘Hands-On Machine Learning with Scikit-Learn & TensorFlow‘
(‘online learning’ is also referred to as ‘incremental learning’)
(‘batch learning’ is trained using all available data, but not incrementally – all in one ‘batch’)
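A minimal sketch of online learning with mini-batches is a linear model that takes one gradient step per mini-batch and then discards the data. This is an illustration under assumptions, not code from the referenced book: the class name `OnlineLinearRegressor`, the `partial_fit` method name (borrowed as a naming convention), the learning rate and the noise-free synthetic stream y = 2x + 1 are all chosen for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[float, float]


def synthetic_stream(n: int) -> Iterator[Sample]:
    """Hypothetical data stream: x cycles over [-1, 1], target is y = 2x + 1."""
    for i in range(n):
        x = ((i * 8) % 21 - 10) / 10
        yield x, 2 * x + 1


def mini_batches(stream: Iterable[Sample], size: int) -> Iterator[List[Sample]]:
    """Group a stream into small lists without ever materializing the whole stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class OnlineLinearRegressor:
    """Model y = w * x + b, updated incrementally with gradient steps."""

    def __init__(self, learning_rate: float = 0.5):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def partial_fit(self, batch: List[Sample]) -> None:
        """One gradient step on the mean squared error of a single mini-batch."""
        errors = [(self.predict(x) - y, x) for x, y in batch]
        grad_w = sum(err * x for err, x in errors) / len(batch)
        grad_b = sum(err for err, _ in errors) / len(batch)
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b
```

A usage loop is simply `for batch in mini_batches(synthetic_stream(2000), 4): model.partial_fit(batch)`. Each step costs time proportional to the mini-batch size only, which is what makes the approach fast and cheap; a batch learner would instead need all samples at once.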
[Video] EPOS Use Cases
[14] YouTube, EPOS Use Cases
Lecture Bibliography
Lecture Bibliography (1)
[1] Top 15 Most Popular Social Networking Sites, Online: http://www.ebizmba.com/articles/social-networking-websites
[2] www.mashable.com, ‘Graph Databases: The New Way to Access Super Fast Social Data’, Online: http://mashable.com/2012/09/26/graph-databases/
[3] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
[4] Wikipedia on ‘Crowdsourcing‘, Online: http://en.wikipedia.org/wiki/Crowdsourcing
[5] P-N Tan, M. Steinbach, and V. Kumar, ‘Introduction to Data Mining’, Pearson International Edition, ISBN 0-321-42052-7
[6] Apache Flume Web Page, Online: http://flume.apache.org/index.html
[7] Apache Spark Web Page, Online: http://spark.apache.org/
[8] Amazon Web Services Web Page, Online: https://aws.amazon.com
[9] Amazon Kinesis Web Page, Online: https://aws.amazon.com/kinesis
[10] Apache Flink, Online: https://flink.apache.org/
Lecture Bibliography (2)
[11] T. Kuhlen, ‘Visual Analysis of Human Brain Simulations’, Changes Workshop 2013, Online: http://www.ncsa.illinois.edu/Conferences/CHANGES2013/agenda.html
[12] M. Riedel, Th. Eickermann, S. Habbinga, W. Frings, P. Gibbon, D. Mallmann, F. Wolf, A. Streit, Th. Lippert, Felix Wolf, Wolfram Schiffmann, Andreas Ernst, Rainer Spurzem, Wolfgang E. Nagel, "Computational Steering and Online Visualization of Scientific Applications on Large-Scale HPC Systems within e-Science Infrastructures," e-science, pp.483-490, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), 2007
[13] SciRun Computational Steering, Online: http://www.sci.utah.edu/cibc/software/106-scirun.html
[14] EPOS - European Plate Observing System – Use Cases, Online: http://www.youtube.com/watch?v=a4MLbZpHdvE
[15] Amazon Web Services – Amazon SageMaker, Online: https://aws.amazon.com/sagemaker/
[16] Jupyter Web page, Online: http://jupyter.org/
[17] Aurélien Géron, ‘Hands-On Machine Learning with Scikit-Learn & TensorFlow’, O’Reilly Book, ISBN 9781491962282, 574 pages, 2017