
PARALLEL & SCALABLE MACHINE LEARNING & DEEP LEARNING

Prof. Dr.-Ing. Morris Riedel
Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany

Data Streaming Tools & Applications
November 27th, 2018
Room V02-138

Cloud Computing & Big Data

SHORT LECTURE 15

Review of Lecture 14 – Online Social Networking & Graphs

Online Social Networking (OSN)

Graph Theory & Applications

[1] Top 15 most popular social networking sites

(‘big data‘ from OSN services is accumulating in an inhomogeneous & highly unstructured way)

[2] www.mashable.com

(entities & relationships characteristics)

(use of ‘social graph‘ databases & traversal methods)

Online advertising uses many machine learning algorithms

[3] Distributed & Cloud Computing Book

(e.g. associations & objects – TAO graph database)


Outline of the Course

1. Cloud Computing & Big Data

2. Machine Learning Models in Clouds

3. Apache Spark for Cloud Applications

4. Virtualization & Data Center Design

5. Map-Reduce Computing Paradigm

6. Deep Learning driven by Big Data

7. Deep Learning Applications in Clouds

8. Infrastructure-As-A-Service (IAAS)

9. Platform-As-A-Service (PAAS)

10. Software-As-A-Service (SAAS)


11. Data Analytics & Cloud Data Mining

12. Docker & Container Management

13. OpenStack Cloud Operating System

14. Online Social Networking & Graphs

15. Data Streaming Tools & Applications

16. Epilogue

+ additional practical lectures for our hands-on exercises in context

Practical Topics

Theoretical / Conceptual Topics


Outline

This is only a Short Lecture

Goal is to provide a few pointers to other advanced related university courses/topics

‘Data streaming tools’ & their applications need a full course & substantial tutorial

Links previous Lectures & Practical Lectures with further material to study & research

Data Streaming Tools & Applications

Data Streams & Interactive Access & Visual Analytics

Apache Flume Data Streams Analysis & Interceptors

Apache Spark Streaming Library & AWS Cloud Products

Apache Flink In-Memory Approaches & Libraries

Online Learning Approach in Machine Learning


What are Data Streams?


Data streams stand for ‘real-time data‘ with ‘high velocity‘ exchanged between n systems

Data streams originate from the fine-granular logs of ‘actions/events per user‘ from Web pages

Data streams can also originate from large measurement devices or large computational systems

‘Web Actions’

Data stream with data ‘hidden’ in logs

‘Measurement Device’

Data stream with measured data

‘Computational Simulation’

Data stream with data of HPC/HTC simulation

[Figure: data sources, n optional filters, data sinks]
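
The source, filter, and sink roles named on this slide can be illustrated with a minimal sketch using Python generators. All names here (`source`, `apply_filters`, `sink`, `run_pipeline`, the event dictionary layout, and the label `device-1`) are assumptions made for illustration; they do not come from any specific streaming tool.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]  # hypothetical event shape: {"source": str, "value": float}


def source(readings: Iterable[float], name: str) -> Iterator[Event]:
    """Turn raw readings into a stream of events, one at a time."""
    for value in readings:
        yield {"source": name, "value": value}


def apply_filters(stream: Iterable[Event],
                  filters: List[Callable[[Event], bool]]) -> Iterator[Event]:
    """Pass an event on only if every one of the n optional filters accepts it."""
    for event in stream:
        if all(keep(event) for keep in filters):
            yield event


def sink(stream: Iterable[Event]) -> List[Event]:
    """Consume the stream; a real sink would write to storage instead."""
    return list(stream)


def run_pipeline(readings: Iterable[float],
                 filters: List[Callable[[Event], bool]]) -> List[Event]:
    """Wire one source through the filters into one sink."""
    return sink(apply_filters(source(readings, "device-1"), filters))


if __name__ == "__main__":
    # Drop negative readings; an empty filter list would pass everything through.
    print(run_pipeline([1.0, -5.0, 3.0], [lambda e: e["value"] >= 0]))
```

Because generators are lazy, each event travels through the whole chain before the next one is read, which mirrors how a stream is processed element by element rather than as one stored data set.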


Crowdsourcing


Usual Citizens / ‘Citizen Scientist’

Data streams with data (low trust)

Individuals with domain as Hobby

Data streams with data (moderate trust)

Scientific/Engineering Domain Experts

Data streams with data (high trust)

[Figure: data volume scale TBs, PBs, EBs]

Crowdsourcing is the practice of obtaining data (streams) by soliciting contributions from a large group of people (e.g. online community) rather than from traditional employees/experts

The high amount of data (streams) obtained from crowdsourcing can originate from a wide variety of technical sources (e.g. handheld devices, phone, desktop computer, automatic sensor, etc.)

modified from [4] Wikipedia on ‘Crowdsourcing‘


Data-Stream Challenges – Quality Control


Streams of (big) data represent a difficult challenge for many software systems, because of their ‘high volume & velocity(!)’

Different data formats and data from different sources make storing and potentially ‘ad-hoc’ analysis/transformation hard

Missing data sets or error-prone measurements are hard to detect or to correct due to the high velocity of data streams

Quality Control Options (i.e. getting trust in data)

‘Correlate and/or validate’ data set with data from n different sources (e.g. overlap in measurement regions from devices)

‘Missing values or Outlier detection’ (e.g. by using statistical data mining methods or visualizations)

Outliers are data objects that have characteristics that are different from most of the other data objects in the data set (e.g. are clearly visible in visualizations, boxplot, etc.)

Outliers can also be values of an attribute that are unusual with respect to the typical values

modified from [5] Data-Mining Book
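
As one possible illustration of the ‘missing values or outlier detection’ option, the sketch below applies the boxplot rule (values outside 1.5 times the interquartile range) to a sliding window over a stream. The function name `flag_stream`, the window size, and the minimum number of points are assumptions chosen for the example; the lecture does not prescribe a specific method.

```python
from collections import deque
from statistics import quantiles
from typing import Iterable, List, Optional


def flag_stream(values: Iterable[Optional[float]],
                window_size: int = 20,
                min_points: int = 8) -> List[str]:
    """Label each incoming value as 'ok', 'missing', or 'outlier'.

    A value is an outlier if it lies outside the boxplot fences
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] computed from recent accepted values.
    """
    window = deque(maxlen=window_size)  # recent values accepted as 'ok'
    flags = []
    for value in values:
        if value is None:               # missing measurement
            flags.append("missing")
            continue
        label = "ok"
        if len(window) >= min_points:   # need enough history for quartiles
            q1, _, q3 = quantiles(window, n=4)
            iqr = q3 - q1
            if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
                label = "outlier"
        if label == "ok":
            window.append(value)        # outliers do not pollute the window
        flags.append(label)
    return flags


if __name__ == "__main__":
    stream = [10, 11, 9, 10, 12, 10, 11, 9, None, 10, 100, 11]
    print(list(zip(stream, flag_stream(stream))))
```

Only a bounded window is kept in memory, which is one way to cope with the high velocity of a stream; the trade-off is that slow drifts in the data shift the fences over time.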


Data-Stream Challenges – Persistent Identifiers (PIDs)


The nature of data streams often creates ‘unfinished datasets‘ or ‘open time series data‘, which makes it hard to assign persistent identifiers (PIDs) to them

Trade-off between a huge number of PIDs (millions per observation) or PIDs for collections (coarse)

How to reference real-time data streams?



‘Big Instrument’

Permanently storing data (of a subset)

Storage

How can we update unfinished datasets and add missing values later?

PID Assignment

Filter


Enabling Interactive Access


Interactive access enables real-time access and often some form of remote control of a device

Goal is to influence at the source its outgoing n data (objects) that are part of the data streams

Interactive access may be hindered due to security implications or scheduling conflicts (i.e. batch)

‘Web Actions’

Data stream with data ‘hidden’ in logs



‘Computational Simulation’

Data stream with data of HPC/HTC simulation

[Figure: data sources, data sinks]

Changes of parameters for Web applications (e.g. video timeline click)

Steer measurement device parameters (e.g. change angles, resolution)

Steer HPC simulation parameters on the fly (e.g. particle positions)


Computational Steering on Large-scale HPC Resources

Application Type Examples

Mechanic simulation element (e.g. open/shut doors in air ventilation flow)

Guide computational application to regions of interest (e.g. black hole)

Application Software exists to ‘instrument parallel code’

COVS Package, SciRun, etc.

The automated computation of results is influenced periodically with computational steering

Computational steering requires a ‘bi-directional data-stream channel’ to transport data from HPC resources to the visualization & steering parameters from the visualization to the HPC resources

Computational steering ‘method’ is used for parameter space exploration in MPI applications

‘Computational Simulation (e.g. using MPI)’

Data stream with data objects of HPC simulation

Steer HPC simulation parameters on the fly (e.g. particle positions)

HPC Resource Visualization

[12] M. Riedel et al., COVS

[13] University of Utah, SciRun
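
The ‘bi-directional data-stream channel’ can be pictured with a toy, single-process sketch: one queue carries status data from the simulation to the visualization, and a second queue carries steering parameters back. This is not how COVS or SciRun are implemented; the class `SteerableSimulation`, its attributes, and the one-dimensional ‘particle’ are purely hypothetical stand-ins for an MPI application.

```python
import queue


class SteerableSimulation:
    """Toy time-stepping simulation with a bi-directional channel (hypothetical)."""

    def __init__(self, velocity: float, dt: float = 0.1):
        self.params = {"velocity": velocity, "dt": dt}
        self.position = 0.0
        self.steering_in = queue.Queue()  # visualization -> simulation
        self.status_out = queue.Queue()   # simulation -> visualization

    def _apply_pending_steering(self) -> None:
        # Drain the steering channel without blocking the computation.
        while True:
            try:
                update = self.steering_in.get_nowait()
            except queue.Empty:
                return
            self.params.update(update)

    def step(self) -> None:
        """Apply steering received so far, advance one time step, publish status."""
        self._apply_pending_steering()
        self.position += self.params["velocity"] * self.params["dt"]
        self.status_out.put(self.position)


if __name__ == "__main__":
    sim = SteerableSimulation(velocity=1.0, dt=1.0)
    for _ in range(3):
        sim.step()
    sim.steering_in.put({"velocity": -1.0})  # user changes a parameter on the fly
    sim.step()
    print(sim.position)
```

Polling the steering queue once per time step is what makes the influence ‘periodic’: the running computation is never interrupted in the middle of a step.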


HPC Computational Steering Applications

Scientific case: research star cluster dynamics in astrophysics

Parallel computing application using nbody6++ parallel code

Steering: changing parameters during the run-time of simulation

[12] M. Riedel et al., computational steering, 2007

change parameters interactively

visualize status

MPI


Visual Analytics in Human Brain Research – Example

Interactive access is a ‘key ingredient’ for advanced visual analytics to enable cross updates of data and ad-hoc processing

Change of data focus in a display might recompute data in another display


[11] T. Kuhlen, ‘Visual Analysis of Human Brain Simulations’


Apache Flume Data Stream Tool in Clouds

Apache Flume as ‘open source tool’

Efficiently collects, aggregates, and moves large amounts of ‘log data’

Provides a distributed, reliable, and available service (in context of HDFS, cf. Lecture 5)

Apache Flume ‘data flow model’ to process data streams


[6] Apache Flume

‘Flume events’ are units of data flow having a ‘byte payload’ and an optional set of ‘string attributes’

External sources (e.g. Web servers) send ‘Flume events’ in a format that is recognized by the target Flume source

Modified from [6] Apache Flume


Apache Flume uses ‘channels‘ to process ‘event data streams‘

‘Multi-hop flows‘ enable events to travel through multiple agents

Set of events are ‘reliably’ passed from point to point (transaction)

Modify/drop events in-flight via various ‘interceptors’ (e.g. filters)

Apache Flume Data Stream Tool – Channels & Interceptors


Modified from [6] Apache Flume

When a Source receives an event, the event is ‘stored into n channels’

Channels are ‘passive stores’ that keep the events until they are consumed by a sink

Example: ‘File Channel’ stores on local filesystem until the ‘HDFS Flume Sink’ puts the file into HDFS

(interceptors)
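
The interplay of source, interceptors, channels, and sink can be mimicked with a small in-memory model. This is a conceptual sketch only and does not use the Apache Flume API: the names `Channel`, `take_batch`, `source_receive`, `drop_debug`, and `add_host`, as well as the host value `web-1`, are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]  # modelled as {"headers": {str: str}, "body": bytes}


class Channel:
    """Passive store: keeps events until a sink consumes them."""

    def __init__(self) -> None:
        self._events = deque()

    def put(self, event: Event) -> None:
        self._events.append(event)

    def take_batch(self, deliver: Callable[[List[Event]], None],
                   batch_size: int = 2) -> int:
        """Hand up to batch_size events to deliver(); remove them only on success."""
        batch = list(self._events)[:batch_size]
        try:
            deliver(batch)
        except Exception:
            return 0                      # rollback: events stay in the channel
        for _ in batch:
            self._events.popleft()        # commit: delivery succeeded
        return len(batch)

    def __len__(self) -> int:
        return len(self._events)


def drop_debug(event: Event) -> Optional[Event]:
    """Example interceptor: drop events whose payload starts with b'DEBUG'."""
    return None if event["body"].startswith(b"DEBUG") else event


def add_host(event: Event) -> Event:
    """Example interceptor: add a string attribute to the event headers."""
    return {"headers": {**event["headers"], "host": "web-1"}, "body": event["body"]}


def source_receive(event: Event, interceptors, channels) -> bool:
    """Run the interceptors in order, then store the event into all n channels."""
    for intercept in interceptors:
        event = intercept(event)
        if event is None:                 # an interceptor dropped the event
            return False
    for channel in channels:
        channel.put(dict(event))
    return True
```

A sink that fails (for example because the target file system is unavailable) leaves the batch in the channel, so a later retry can deliver it; this is the point-to-point reliability the slide refers to, reduced to its simplest form.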


Apache Spark Streaming Library – Revisited (cf. Lecture 3)


[7] Apache Spark

Apache Spark Streaming library enables users to write streaming jobs the same way they would write batch jobs, or to combine streaming with batch and interactive queries or filters

Usage example

Combine streaming with batch and interactive queries or using ‘windows’

E.g. find words with higher frequency than historic data

Recovers lost work without any extra code for users to write (e.g. cf. RDD fault tolerance features)
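
The usage example ‘find words with higher frequency than historic data’ can be sketched in plain Python to show the windowing idea. This deliberately does not use the Spark Streaming API; the function `trending_words`, the window length, and the threshold `factor` are assumptions for illustration. Historic data here means all micro-batches seen before the current one, and a word that never occurred in the history counts as trending.

```python
from collections import Counter, deque
from typing import Iterable, Iterator, List, Set


def trending_words(batches: Iterable[List[str]],
                   window_len: int = 3,
                   factor: float = 2.0) -> Iterator[Set[str]]:
    """For each micro-batch, yield the words whose relative frequency in the
    current window exceeds factor times their historic relative frequency."""
    history = Counter()                  # counts over all earlier batches
    window = deque(maxlen=window_len)    # counts of the most recent batches
    for batch in batches:
        window.append(Counter(batch))
        in_window = sum(window, Counter())
        window_total = sum(in_window.values())
        history_total = sum(history.values())
        trending = set()
        if history_total:                # nothing to compare against at the start
            for word, count in in_window.items():
                historic_rate = history[word] / history_total
                if count / window_total > factor * historic_rate:
                    trending.add(word)
        yield trending
        history.update(batch)            # current batch becomes part of history


if __name__ == "__main__":
    batches = [["a", "b", "a", "b"], ["a", "b", "a", "b"], ["a", "a", "a", "b"]]
    print(list(trending_words(batches, window_len=1, factor=1.4)))
```

In a Spark Streaming job the per-window counts and the comparison against stored historic counts would be expressed with the library's own windowed operations and a join against a batch data set; the sketch only shows the logic being computed.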


AWS Cloud Service Portfolio – Analytics (cf. Lecture 5)

Multiple analytics products

Extracting insights and actionable information from data requires technologies like analytics & machine learning

Products & Usage

Amazon Athena: Serverless Query Service

Amazon Elastic MapReduce: Hadoop

Amazon Elasticsearch Service: Elasticsearch on AWS

Amazon Kinesis: Streaming Data

Amazon QuickSight: Business Analytics

Amazon Redshift: Data Warehouse

…


[8] AWS Web page

Amazon Kinesis offers multiple analytics products that support extracting insights and actionable information from data using technologies like data analysis & machine learning

Amazon Kinesis consists of Kinesis Video Streams, Kinesis Data Streams, and Kinesis Data Firehose products that re-use other AWS products (e.g. SageMaker) or TensorFlow

[9] Amazon Kinesis


AWS Cloud Service Portfolio – Amazon Kinesis Example

Multiple analytics products

Combined together depending on application demands: Video Streams, Data Streams, Data Firehose

Re-use of other AWS services: e.g. Amazon SageMaker, Spark on Elastic MapReduce, TensorFlow / MxNet

E.g. application analyzing text data streams obtained from Twitter API


[9] Amazon Kinesis

[8] AWS Web page

(cf. Lecture 5)

(cf. Lecture 7 & 10)


AWS Sagemaker Example – Revisited (cf. Lecture 10)

AWS Cloud – Amazon SageMaker

Fully managed service that enables quick & easy machine & deep learning applications (cf. installation overheads of many required frameworks)

[15] AWS – Amazon SageMaker

AWS Amazon SageMaker is a SAAS oriented service that provides fully managed instances running Jupyter notebooks that include examples training & tuning various machine learning models

[16] Jupyter Web page


In-Memory Technology: Apache Flink

New Features

Open source platform

Enables distributed stream and batch processing

Data streaming core runtime

Experience from Practice

Apache projects need to be carefully reviewed w.r.t. functionality promised & stability


[10] Apache Flink

Apache projects evolve with enormous speed, and while they offer good functionality and performance increases for certain applications, their stability can sometimes be problematic

The maturity of projects like Apache Mahout, Apache Spark, or Apache Flink cannot be compared to the MPI/OpenMP standards, whose implementations evolved over decades

[10] YouTube, Apache Spark


Apache Flink – Streaming & In-Memory

Flink/Storm different approach to distributed computing

Provides a streaming dataflow engine approach

Offers fast techniques for data distribution, communication, and fault tolerance for distributed computing using data streams

Apache Flink libraries and APIs

Datastream API: streams embedded in Java/Scala

Dataset API: static data embedded in Java/Scala/Python

Table API: SQL-enabled language embedded in Java/Scala

Complex event processing (CEP) library

Machine Learning Library

Gelly as graph processing API and library


[10] Apache Flink


Online Learning Approach in Machine Learning

One general criterion used to categorize machine learning systems is whether or not the model and/or system can learn incrementally from a stream of incoming ‘big data‘

The ‘online learning‘ approach trains a system incrementally by feeding it data instances sequentially – either individually or by small groups that are called ‘mini-batches‘

The goal of ‘online learning‘ is to be fast and cheap, so that the system can learn about new data elements on the fly as they arrive in data streams or other sources


modified from [17] Book ‘Hands-On Machine Learning with Scikit-Learn & TensorFlow‘

(‘online learning‘ = ‘incremental learning‘)

(‘batch learning‘ trained using all available data, but not incrementally – all in one ‘batch‘)
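
A minimal sketch of the online learning idea follows, assuming a simple linear model trained by stochastic gradient descent on mini-batches taken from a stream. The class `OnlineLinearModel`, its `partial_fit` method, the helper `minibatches`, the synthetic `sensor_stream`, and the learning rate are all illustrative assumptions and are not taken from the cited book.

```python
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[float, float]  # (feature x, target y)


def minibatches(stream: Iterable, size: int) -> Iterator[List]:
    """Group a stream into small batches without ever holding the whole stream."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                       # last, possibly smaller, batch
        yield batch


class OnlineLinearModel:
    """Fits y = w * x + b incrementally, one mini-batch at a time."""

    def __init__(self, learning_rate: float = 0.5):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def partial_fit(self, batch: List[Sample]) -> None:
        """One gradient step on the mean squared error of this mini-batch only."""
        grad_w = grad_b = 0.0
        for x, y in batch:
            error = self.predict(x) - y
            grad_w += error * x
            grad_b += error
        n = len(batch)
        self.w -= self.learning_rate * grad_w / n
        self.b -= self.learning_rate * grad_b / n


def sensor_stream(n: int) -> Iterator[Sample]:
    """Synthetic noiseless stream following y = 2x + 1 (assumed for the demo)."""
    for i in range(n):
        x = (i % 10) / 10
        yield x, 2 * x + 1


if __name__ == "__main__":
    model = OnlineLinearModel()
    for batch in minibatches(sensor_stream(2000), size=4):
        model.partial_fit(batch)    # each batch is seen once, then discarded
    print(round(model.w, 3), round(model.b, 3))
```

Each mini-batch is used for one cheap update and can then be thrown away, which is what distinguishes this from batch learning, where the model is trained on all available data at once.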


[Video] EPOS Use Cases

[14] YouTube, EPOS Use Cases


Lecture Bibliography


Lecture Bibliography (1)

[1] Top 15 Most Popular Social Networking Sites, Online: http://www.ebizmba.com/articles/social-networking-websites

[2] www.mashable.com, ‘Graph Databases: The New Way to Access Super Fast Social Data’, Online: http://mashable.com/2012/09/26/graph-databases/

[3] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049

[4] Wikipedia on ‘Crowdsourcing‘, Online: http://en.wikipedia.org/wiki/Crowdsourcing

[5] P-N Tan, M. Steinbach, and V. Kumar, ‘Introduction to Data Mining’, Pearson International Edition, ISBN 0-321-42052-7

[6] Apache Flume Web Page, Online: http://flume.apache.org/index.html

[7] Apache Spark Web Page, Online: http://spark.apache.org/

[8] Amazon Web Services Web Page, Online: https://aws.amazon.com

[9] Amazon Kinesis Web Page, Online: https://aws.amazon.com/kinesis

[10] Apache Flink, Online: https://flink.apache.org/


Lecture Bibliography (2)

[11] T. Kuhlen, ‘Visual Analysis of Human Brain Simulations’, Changes Workshop 2013, Online: http://www.ncsa.illinois.edu/Conferences/CHANGES2013/agenda.html

[12] M. Riedel, Th. Eickermann, S. Habbinga, W. Frings, P. Gibbon, D. Mallmann, F. Wolf, A. Streit, Th. Lippert, Felix Wolf, Wolfram Schiffmann, Andreas Ernst, Rainer Spurzem, Wolfgang E. Nagel, "Computational Steering and Online Visualization of Scientific Applications on Large-Scale HPC Systems within e-Science Infrastructures," e-science, pp.483-490, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), 2007

[13] SciRun Computational Steering, Online: http://www.sci.utah.edu/cibc/software/106-scirun.html

[14] EPOS - European Plate Observing System – Use Cases, Online: http://www.youtube.com/watch?v=a4MLbZpHdvE

[15] Amazon Web Services – Amazon SageMaker, Online: https://aws.amazon.com/sagemaker/

[16] Jupyter Web page, Online: http://jupyter.org/

[17] Aurélien Géron, ‘Hands-On Machine Learning with Scikit-Learn & TensorFlow‘, O‘Reilly Book, ISBN 9781491962282, 574 pages, 2017

