Making Analytics Viable in Enterprises: Potential routes for Industry 4.0 Jorge Sanz Anusha Choori Business Analytics Center National University of Singapore

Post on 20-May-2020

Page 1: Making Analytics Viable in Enterprises: Potential routes ... · Apache Spark for distributed computing Edge Analytics Apache Spark Streaming for low latency analytics Managing Industrial

Making Analytics Viable in Enterprises:

Potential routes for Industry 4.0

Jorge Sanz

Anusha Choori

Business Analytics Center

National University of Singapore

Page 2

• Business Analytics as an enabler for Industry 4.0

• Cases from the field, typical challenges and lessons-learnt

• Viable roadmaps for Industrial Sector companies

• Potential opportunities for Luxembourg

• Conclusions

Agenda and Goals

Page 3

Industry 4.0

Source: 2016 Global Industry Survey – Industry 4.0: Building the digital enterprise - PwC

Industry 4.0 – The Multi-Faceted Goal Framework

Core Capability

Key Dimensions in the Framework

Page 4

Some Key Enablers of Industry 4.0

1. Capture and process large data sources

Collect, retrieve and query data

Analyze data (from reporting to prediction)

2. Integrate conventional IT silos more deeply

Integrate Information-based Insights into Process Lifecycle

Cloud and Everything-as-a-Service

3. Reduce cost-of-ownership for viability

Page 5

• Not only for Industrial Segments but also for most other industries …

• … realizing some critical capabilities for Industry 4.0:

Large-data analytics and enterprise architecture enable new thinking about production management and factory management

Analytical algorithms (some capable of learning from data) will be able to achieve more flexibility and robustness in manufacturing, supply chain and distribution

Different forms of “cognitive systems” to support decision-making are part of the emerging jargon (back to AI, NLP, etc., powered up by big data)

The Role of Business Analytics

A relatively new discipline that addresses the key enablers

Page 6

Key Imperative: Shorten the solution cycle and reduce costs

1. Exploit new opportunities based on business analytics (from production improvements to new business models)

2. Collect, transmit and analyze large data from devices to monitor and predict / anticipate service needs

3. Shorten the Ideate-to-Monetize value-generation path and reduce cost-of-ownership

Page 7

Knowledge Areas building the BA Domain

Business Competence

• Finance, accounting, marketing, supply chain, HR, channels, IT, customer relationship …

• Industry-specific competences: underwriting, fraud, claim life-cycle, product design, wealth management, traffic …

Organization Processes

• The design and transformation of work processes: decision-making processes, strategy processes, operational processes

• How information improves and innovates processes

Modeling and Technology

• Stochastic Models, Operations Research (and tools: R, SAS, SPSS, …)

• Data generation sources (e.g., mobile messaging, GPS locators, surveillance cameras, ATMs)

• Systems in support of Cognitive and Information processes (Watson, HANA, etc.)

Extending & Emerging Professions

Page 8

Business Analytics Applications (Most Active Markets)

Source: IDC

Page 9

Projects from NUS Business Analytics Center, 2016 (I)

- PnL Analytics

- Cyber Security

- Anti Money Laundering

- Fraud analytics

- Risk assessment model for investigation program

- Optimising maintenance schedule for fleet management

- Text Mining - Social Network and Geospatial Analytics in context of Insurance

- Motor Pricing KPIs

- Travel Pricing KPIs and exploratory analysis

- Cross-sell and up-sell in insurance

- POS transactional data

- News recommendation engine for high net worth customers

- Economic Scenario Stress Testing

- The Future of Audit (1)

- The Future of Audit (2)

- Analysis of Customer Queuing Time & Headcount planning

- Predicting High Risk Churn Segments Via Product Usage Data

- BlueMix & Watson Analytics

- Breakout detection for Hep C patients

- An Analytics Approach to Improve Subscription Rate for Nursing Course (prelim title)

- Case Study on Global FP&A Transformation

- Balance Sheet Forecasting

- IT Tools Comparison

- Case Study on Global FP&A Transformation

- Sales Forecasting and Tools for Predictive Analytics

Page 10

Projects from NUS Business Analytics Center, 2016 (II)

- Supply Chain Optimization 2.0

- Global logistics cost optimisation

- Social Media

- Automatic Rostering System

- Understanding Family Attitudes and Social Support Networks through Analytics

- Deriving Insights from NEHR (National Electronic Health Record)

- Developing an accurate model to provide estimates on how long a job should take given the characteristics of the job

- ALM Roll-Tagging Prediction

- Applications of Analytics to AML –Proposing a Risk Classification Model

- Analysing High Risk Segments in Auto Loan Portfolio

- Customer-Money Life Cycle

- Marketplace analysis

- Customer credit risk analysis

- Emergency Medical Service (EMS) Ambulance Demand Analytics & Prediction

- Sales Management Analytics

- HR Analytics

- Pricing Assessment Tool based on Analytics

- Healthcare analytics

- Frequent Attenders to the Emergency Department

- Online Analytics

- Analysis on overtime cost

- Analyzing cancer claims for policy holders

Page 11

Projects from NUS Business Analytics Center, 2016 (III)

- IoT / Event Analytics in Manufacturing

- Data Lake architecture to deliver a virtualization layer for disparate data sources

- Market Research

- Social Media / Digital Marketing / PR

- CRM / Markets

- Optimising Endowment Portfolio Performance

- Determining optimal level of markdowns through customer segmentation for revenue maximization

- Predicting eBay Auction Sales for Laptops

- Predicting Airbnb New User Bookings

- Predicting which hotel type an Expedia customer will book

- Analysing Residential Property Prices

Page 12

Industry 4.0

• Manufacturers could improve preventive and predictive maintenance of different production assets

• Many manufacturing systems find it difficult to collect, aggregate and benefit from data originating from large data sources because they lack novel analytical tools and appropriate infrastructure

– For example, unplanned and excessive equipment downtime increases, which directly affects operational cost and throughput

– This requires advanced prediction tools and algorithms, so that data can be systematically processed into information that explains uncertainties, breakdowns, failures and short-stops, and thereby supports more “informed” decisions

Manufacturing – Prevailing Challenges

By introducing analytics and more flexible production techniques,

manufacturers could boost their productivity by as much as 30 percent

promises … promises

Source: Industrial Insights Report, Accenture – Industrial-Internet-Changing-Competitive-Landscape-Industries-2015

Page 13

From Physical to Digital to Analytics …

Physical World (Entities)

Computational Space

1. Cyber – Physical Interaction

Learn & synchronize

from physical world:

Knowledge extraction &

accumulation

Feedback to the physical

world:

Production scheduling

Maintenance & Adaptation

2. Machine

Health Awareness

Analytics

3. Optimal

Decision Support

Analytics


An ensemble of digital life-cycles of entities deployed in different settings

Page 14

… yielding opportunities for new business models

Industrial companies are moving towards greater digital value-creation, from augmented products to serving digital ecosystems

• Augmented Digital Product Player: focus on digitally-endowed products (with sensors or communication devices, for example)

• Data Analytics, Content & Platform Integrator: focus on data analytics services; give customers access via a dedicated (online) platform (APIs)

• Complete Solution / Service Provider: focus on digital products and data services that provide a complete solution for the customer

• Integrated Digital Ecosystem Provider: integration of third-party partner or competitor products and control systems in a complete customer ecosystem

Axes: Asset Intensity, Data Intensity

Source: 2016 Global Industry Survey – Industry 4.0: Building the digital enterprise - PwC

Page 15

Predictive Maintenance Analytics-as-a-Service

ATM components: Card Reader, Keypad, Receipt Printer, Cash Dispenser, Cash Deposit Unit, Other Internal motors

Embedded Sensors

Data Transfer

Asset data aggregation

Monitoring / Maintenance

Bank and / or owner of ATMs

Page 16

Other forms of Analytics-as-a-Service

Engine related sensors

Combustion sensor

Front light sensor

Internal light sensor

Exhaust sensor

Mobile Application

Manufacturing / Assembly

Monitoring

Gas station

w/ intelligent

appliance

Dashboard

Cloud Repository

Analytics

Page 17

Sensor data from the equipment

Factory equipment: Gas Generator

Influencing variables: Temperature, Water Pressure, Humidity, Speed, Power

Response variable: Gas concentration

Predictive Analytics

(Chart series: Temperature, Water Pressure, Power, Gas Concentration)

• 200 sensors on every piece of equipment on the shop floor

• Sensors emit data every 500 milliseconds

Predictive analytics is performed to understand the underlying data patterns and to predict abnormal gas concentration in the equipment.
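As a rough illustration of the idea on this slide, the sketch below fits a straight line between one influencing variable (temperature) and the response variable (gas concentration), then flags readings that deviate from the fitted line. It is a minimal sketch, not the model used in the case study: the slide lists five influencing variables, only one is used here, and the function names, the tolerance and all numeric values are made-up assumptions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single influencing variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b  # intercept a, slope b

def is_abnormal(a, b, x, y, tolerance):
    """True when the observed response y is further than `tolerance` from the fitted line."""
    return abs(y - (a + b * x)) > tolerance

# Made-up illustrative values, not data from the case study.
temperature = [60.0, 62.0, 64.0, 66.0, 68.0]
gas_concentration = [1.0, 1.1, 1.2, 1.3, 1.4]

a, b = fit_line(temperature, gas_concentration)
print(is_abnormal(a, b, 65.0, 1.25, tolerance=0.05))  # False: on the fitted line
print(is_abnormal(a, b, 65.0, 1.60, tolerance=0.05))  # True: far above the fitted line
```

A production model would likely use all influencing variables and a tolerance derived from historical residuals; the single-variable version only shows the "predict, then compare" pattern.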

Page 18

But …

Current capabilities of Industry 4.0 segments in Analytics

Q.: Are companies ready for more predictive and innovative kinds of solutions?

A.: “Not yet”

58% of the companies have capabilities to collect data and analyze it

Only 40% of the companies can predict based on existing data

Fewer still, only 36%, can optimize operations from data insights

Source: Based on a survey conducted by GE and Accenture, 2015

Page 19

Three-Tier Scenarios in Business Analytics

Tiers: Analytics and Reporting, Integration, Sensor and Other Data Sources

Integration: Low-Latency, Higher-Latency

ERP Data, EDW, In-Memory

Real-Time, Near Real-time, Batch / Off-line

Depending on acceptable latency of decision-making and cost-of-ownership

Analytical situations differ between training mode and production

Page 20

Managing Industrial data sources and analytics infrastructure

- Open Source Infrastructure -

Data Lake Analytics: Hadoop for distributed storage, Apache Spark for distributed computing

Edge Analytics: Apache Spark Streaming for low latency analytics

Business Analytics – Types of analysis

Depending on the use case, the analytical approach differs:

• Offline analysis is performed on static data → Data Lake Analytics (or Data Store Analytics)

• Online analysis is performed on data that is streaming in → Edge Analytics
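The difference between the two modes can be sketched with a simple statistic. In the sketch below (plain Python, no specific analytics product implied; all names are illustrative assumptions), the offline version needs the complete stored dataset, while the online version updates its result one reading at a time and never stores the stream.

```python
from statistics import mean

def offline_mean(stored_readings):
    """Offline: the full, static dataset is available and scanned in one pass."""
    return mean(stored_readings)

class OnlineMean:
    """Online: the result is updated incrementally as each reading streams in."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, reading):
        self.count += 1
        # Incremental mean: shift the current value towards the new reading.
        self.value += (reading - self.value) / self.count
        return self.value

readings = [2.0, 4.0, 6.0]
online = OnlineMean()
for r in readings:
    online.update(r)

print(offline_mean(readings), online.value)
```

Both approaches reach the same number here; the trade-off is that the online version can act after every reading, while the offline version can revisit all the history.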

Types of analytics tools (Open Source)

Page 21

Managing Industrial Big Data and Analytics Infrastructure

- Open Source Infrastructure -

Industry 4.0 opportunities in the manufacturing unit of a leading packaging and

processing company

Case-Study

The organization's infrastructure comprises ERP (Enterprise Resource Planning) systems, Business Warehouse (BW) units, and traditional transactional databases (MES – Manufacturing Execution Systems) for capturing and analyzing sensor data and operations data

Data ingestion, storage and processing are all performed in the current environment, which consists of traditional data stores

The architecture leaves very little scope for analysis of massive data or for near-real-time analytics. Adding more BW support, databases and compute power risks incurring a HUGE cost

Current setup lacks:

Infrastructure to ingest/ store massive data

Framework to perform large-scale data analytics

Problem Statement

Page 22

Industry 4.0 Business Analytics – A case-study in the manufacturing

unit of a leading packaging and processing company

Proposed Infrastructure for Business Analytics

Existing infrastructure

On-Line Basic Processing

Data Input (Batch into SQL Server)

Storage and EDW

Processing and Analysis (Business Warehouse)

Descriptive Diagnostic Predictive Prescriptive

Machine Data and Sensor data

Data Ingestion (Kafka)

Storage (Central + Distributed) - HDFS

Processing (Central + Distributed) - Spark

Status Quo

Proposed architecture

Solution Overview

The right-hand side depicts the overall solution for analyzing both offline (historical) data and near-real-time data

Machine and Sensor data

Page 23

Proposed End-to-End Infrastructure

Industry 4.0 Business Analytics – A case-study in the manufacturing

unit of a leading packaging and processing company

Solution Overview

Sensor logs

Alert to Operator

Page 24

Why Apache Hadoop?

Temperature

Pressure

Water temp

Speed

Humidity

Ambient concentration

MES (Manufacturing Execution System)

Shop floor system

Data Acquisition System

SQL Server

Data reflected after 24 hours

- Bounded by the actual size of the database

- May need to perform truncation

- Cannot support unstructured data

- Scope for analytics is reduced

Appropriately routed to reach HDFS cluster in Hadoop

• Scalable

• No license fees

• Distributed storage

• Supports structured and unstructured data

Current Infrastructure

Proposed Infrastructure

Page 25

What is Apache Hadoop?

• Apache Hadoop is an open source framework built for:

– Distributed Storage – HDFS (Hadoop Distributed File System)

– Distributed Processing – MapReduce

• HDFS stores large files (structured and unstructured data) across several machines (laptops, PCs and commodity servers)

• Even though the data is spread across several machines in the cluster, the user is still guaranteed a “universal” view of the data through a single management interface

Inexpensive Storage

• Without the hassle of purchasing or licensing specialized hardware

• Having the capacity to store structured and unstructured data originating from sensors

Ability to scale easily

• No compromise on data storage

• No truncation of data points originally captured

Preliminary Analysis

• Seamless integration with developer systems

• Universal access of stored data within the cluster
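To make the MapReduce model concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop's API; the function names and the sample records are illustrative assumptions. Each inner list stands in for a block of a file stored on a different machine, the map step emits key/value pairs, the shuffle groups them by key, and the reduce step aggregates each group (here, the maximum reading per sensor).

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (sensor_id, value) pair per input record."""
    sensor_id, value = record
    yield sensor_id, value

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse one group into a single result (maximum reading)."""
    return key, max(values)

def run_job(blocks):
    """Run map over every block, then shuffle and reduce the emitted pairs."""
    pairs = [kv for block in blocks for record in block for kv in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# Made-up records; each inner list stands in for one distributed block.
blocks = [
    [("temp-1", 70.0), ("pressure-1", 2.1)],
    [("temp-1", 75.5), ("pressure-1", 1.9)],
]
print(run_job(blocks))
```

In a real cluster the map tasks would run on the machines that hold each block, and the shuffle would move data over the network; the sketch only mirrors the logical steps.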

Page 26

Analytics Infrastructure – Why Hadoop?

• The existing infrastructure cannot store petabytes of sensor data (each part of the equipment on the shop floor has about 15-20 sensor tags, and the sensors emit signals every 500 milliseconds)

• As a result, even simple off-line analysis on ALL the sensor data becomes challenging

• In addition, the organization does not want to incur additional licensing fees. Cloud subscription fees are more affordable, but data security concerns and the latency of uploading the data needed for analytics are caveats
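A back-of-the-envelope calculation shows how quickly these readings accumulate. The sketch below uses the figures from the slide (up to 20 sensor tags per equipment part, one signal every 500 milliseconds); the bytes-per-reading parameter is a hypothetical assumption, since the source does not state a record size.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def readings_per_day(sensor_tags: int, interval_ms: int) -> int:
    """Readings emitted in 24 hours by `sensor_tags` tags, one signal per interval."""
    return sensor_tags * MS_PER_DAY // interval_ms

def bytes_per_day(sensor_tags: int, interval_ms: int, bytes_per_reading: int) -> int:
    """Raw volume per day; `bytes_per_reading` is a hypothetical record size."""
    return readings_per_day(sensor_tags, interval_ms) * bytes_per_reading

# One equipment part with 20 tags at a 500 ms interval.
print(readings_per_day(20, 500))  # 3456000 readings per day
```

Multiplying by the number of parts, machines and retention days gives the order of magnitude the storage layer has to absorb.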

Proposed Solution

Problem Statement

• Storing all the individual sensor values from all the factory equipment on the shop floor calls for inexpensive storage that can scale as and when needed

• Moreover, since the organization does not want to incur the additional cost of purchasing special hardware to host petabytes of sensor data, the best solution is an open source distributed data sink that can be easily installed on inexpensive commodity hardware and that scales out as more machines are added to the cluster

• Hence, Apache Hadoop was chosen as the data store to host sensor data and to perform OFFLINE analytics

Page 27

Analytics Infrastructure – Why Apache Spark?

• Hadoop can store petabytes of machine sensor data (both structured and unstructured), but for complex analytics over that massive VOLUME of distributed data, Hadoop's MapReduce falls short of efficient and quick computation

• MapReduce operations do not fare well when the user joins two or more large datasets with several complex join conditions. MapReduce tasks generate a lot of overhead by re-reading and re-parsing data, which reduces computational efficiency to a LARGE extent, even for off-line analysis

• In addition, building complex models or applications in Hadoop requires deep Java programming skills

Proposed Solution

Problem Statement

• Given the volume of distributed data, APACHE SPARK is a natural choice of open source framework for efficient, parallelized computing

• The main advantage of adopting Spark is that it also runs on commodity hardware: it pools the memory of the available machines in the cluster, assigns jobs to them and orchestrates their execution in parallel, thereby lowering execution time and cost
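The "partition, compute in parallel, combine" pattern behind this can be sketched with the Python standard library. This is not Spark's API: worker threads in one process stand in for cluster machines, the data stays in memory between steps, and all names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split the in-memory dataset into n interleaved partitions."""
    return [data[i::n] for i in range(n)]

def partial_stats(part):
    """Work done independently on one partition: local sum and count."""
    return sum(part), len(part)

def parallel_mean(data, workers=4):
    """Compute the mean by combining per-partition results computed in parallel."""
    if not data:
        raise ValueError("data must not be empty")
    parts = [p for p in partition(data, workers) if p]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(partial_stats, parts))
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count

print(parallel_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], workers=3))  # 3.5
```

Only the small per-partition summaries are combined at the end, which is why this style of computation scales by adding workers rather than by buying a bigger single machine.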

Page 28

Apache Spark for Business Analytics in Industry 4.0

What is Apache Spark?

• Apache Spark is an open source cluster-computing framework which was built to overcome the limitations

of Hadoop’s MapReduce computing framework

• Spark was mainly built to achieve:

– Parallelism in data operations

– Distributed computing across the pooled memory (RAM) of the machines in the cluster

– Fault tolerance

– Scalability

• “Blends” in with Hadoop

Apache Spark – Indispensable Components

Spark Core

Spark Streaming, Spark SQL, Spark MLlib, GraphX

Page 29

Apache Spark Streaming – BA Infrastructure for Industry 4.0

• Data in a Streaming Analytics environment is processed (on-line) before it lands in a database

• Currently, in the manufacturing unit of the leading packaging and processing company, machine sensor data is stored only after a “significant change” is detected in the data acquisition system

• In addition, this data is truncated: sensor readings that may indicate the future working status of the equipment are lost by the time the data reaches the database

• There is no real-time streaming analytics to predict alerts further in advance
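The effect of storing data only on a "significant change" can be illustrated with a small sketch. The source does not define what counts as significant, so the rule below (store a reading only when it differs from the last stored value by at least a threshold) and all values are assumptions for illustration.

```python
def store_on_significant_change(readings, threshold):
    """Keep a reading only if it differs from the last stored one by >= threshold."""
    stored = []
    for reading in readings:
        if not stored or abs(reading - stored[-1]) >= threshold:
            stored.append(reading)
    return stored

# Made-up readings with a slow upward drift before a jump.
readings = [10.0, 10.2, 10.4, 10.6, 11.1, 11.2]
print(store_on_significant_change(readings, threshold=1.0))  # [10.0, 11.1]
```

The gradual drift from 10.2 to 10.6 never reaches the database, yet a slow drift of this kind is exactly the sort of early signal that predictive analytics would want to see.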

Proposed Solution

Problem Statement

• What if this sensor data is analyzed as it is streaming in? → Spark Streaming

• And what if decisions were made before it hits the database? → Spark Streaming

• Analyzed “industrial big-data” can then flow into a database of their choice (SQL Server), a distributed database (Cassandra, HBase) or a distributed file system (HDFS).
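The "analyze first, store afterwards" flow can be sketched in plain Python. This is not Spark Streaming's API; it only imitates the idea of cutting a stream into small batches, deciding on each batch, and then passing the data on to a sink. The batch size, the alert limit and the function names are illustrative assumptions.

```python
from itertools import islice

def micro_batches(stream, size):
    """Cut an incoming stream into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def process(stream, batch_size, limit):
    """Raise alerts per batch before the readings are written to the sink."""
    alerts, sink = [], []
    for batch in micro_batches(stream, batch_size):
        average = sum(batch) / len(batch)
        if average > limit:
            alerts.append(average)  # decision made before the data is stored
        sink.extend(batch)          # only then does the data land in storage
    return alerts, sink

alerts, sink = process([1.0, 1.0, 5.0, 7.0, 1.0], batch_size=2, limit=3.0)
print(alerts)  # [6.0]
print(sink)    # [1.0, 1.0, 5.0, 7.0, 1.0]
```

In this sketch the list `sink` stands in for SQL Server, Cassandra, HBase or HDFS; every reading is still stored, but the alert is available before the write happens.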

Page 30

Apache Spark Streaming

Analyzing streams of real-time sensor data as they arrive, instead of running large, data-intensive batch jobs on a daily basis

Page 31

Apache Kafka – BA Infrastructure for Industry 4.0

Problem Statement

Use-case in a real life scenario:

• The leading packaging and processing firm has about 200 sensors in every piece of equipment and machinery it owns. These sensors emit data every 500 milliseconds; this data has to be correctly captured, queued, analyzed and stored for further analysis

• When many real-time applications “consume” data from thousands of these sensors for reporting and analytics, “requesting data” from sensors becomes criss-crossed and random

• On top of that, there is a risk of losing data mid-way, listening to the “wrong” sensor reading, or receiving messages out of order

Apache Kafka can simplify the current messaging architecture by using a Producer-Consumer approach and orchestrating messaging services as a broker.

Proposed Solution

The coordination, replication, fault-tolerance, partitioning and parallelism of this architecture are handled by the Kafka cluster (brokers coordinated through ZooKeeper).

Producers publish to TOPICS → Kafka orchestrates messaging → Subscribers listen to these TOPICS
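The publish/subscribe flow above can be sketched with a toy in-memory broker. This is not Kafka's client API: the class and method names are hypothetical, and replication, partitioning and networking are left out. It shows two ideas only: a topic is an append-only log, and each consumer tracks its own read position (offset), so consumers do not interfere with one another.

```python
from collections import defaultdict

class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def read(self, topic, offset):
        """Return all messages in the topic from `offset` onwards, in order."""
        return self._logs[topic][offset:]

class ToyConsumer:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages

broker = ToyBroker()
broker.publish("temperature", {"sensor": "t-1", "value": 71.5})
broker.publish("temperature", {"sensor": "t-1", "value": 71.9})

reporting = ToyConsumer(broker, "temperature")
analytics = ToyConsumer(broker, "temperature")
print(reporting.poll())  # both messages, in publish order
print(reporting.poll())  # [] because nothing new has arrived
print(analytics.poll())  # both messages again: offsets are per consumer
```

Because sensors only talk to the broker and applications only read from topics, adding a new consuming application does not require any change on the sensor side.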

Page 32

Apache Kafka

Diagram: Producers → Kafka Broker (Topic 1, Topic 2; Partitions 1 to 4, each an ordered log with offsets 0, 1, 2, …) → Consumers, with Zookeeper alongside the broker. Other labels: Analytics Modelling, Database, Streaming applications.
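How a message ends up in a particular partition can be sketched as a hash of its key modulo the number of partitions. The sketch below uses CRC32 from the Python standard library purely as a stand-in; Kafka's own default partitioner uses a different hash function, and the key format shown is a made-up assumption.

```python
import zlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, so readings from one
# sensor stay in one ordered log.
first = choose_partition("machine-7/temperature", 4)
second = choose_partition("machine-7/temperature", 4)
print(first == second)
```

Keying messages by sensor is one way to address the out-of-order concern raised earlier: ordering is preserved within a partition, so all readings from a given sensor are consumed in the order they were published.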

Page 33

Apache Kafka – for Industry 4.0/ IoT

Apache Kafka’s role in Manufacturing

Use cases in IIoT (Industrial Internet of Things)

- Real time stream processing (coupled with Spark Streaming)

- General purpose message bus

- Collecting user activity data

- Collecting operational metrics from sensors, applications, servers or devices

- Log aggregation

- Change data capture

- Maintaining a commit log for distributed systems

Source: Confluent

Page 34

Cloud Strategy

• Reduce cost-of-ownership, simplify management of IT operations, and shorten

the path from invention to delivery of new applications

• Develop a new business model opportunity by creating a domain-specific

service-suite accessible to subscribers or pay-per-use through APIs

Cloud for Analytics Capabilities is a very important option to manage the complexity of

Business Analytics infrastructure

Scalability, High Availability

Data Model Flexibility, Data Mobility

Seamless work with an ecosystem of apps and tools

Built-in analytical tool support for faster and more efficient data analysis, on-line and off-line

CRITICAL: smooth integration with ERP capabilities, thus facilitating better bridges

between process management and information life-cycle in the industry 4.0 enterprise

Page 35

Offerings for key open source BA infrastructure

IBM BigInsights

Available on-premises, on-cloud, and integrated with other systems in use today

Text Analytics

POSIX Distributed File

system

Multi-workload, multi-tenant

scheduling

IBM BigInsights

Enterprise Management

Machine Learning on

Big R

Big R (R support)

IBM Open Platform with Apache Hadoop

(HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper)

IBM BigInsights

Data Scientist

IBM BigInsights

Analyst

Big SQL

BigSheets

Big SQL

BigSheets

Free Quick Start (non production):

• IBM Open Platform

• BigInsights Analyst, Data Scientist

features

• Community support

. . .

IBM BigInsights – On-premises version

Page 36

The Open Source on Cloud by SAP

SAP HANA Vora

SAP HANA + Vora + Hadoop

• SAP HANA Vora integrates SAP HANA data with data lakes (Hadoop)

• Seamless integration with HANA + Spark + Hadoop

• One can archive ERP data from HANA to Hadoop

• Integrated BI

(also on premises)

Page 37

More ICT offerings for key BA infrastructure on Cloud

MapR: On premises and Cloud

Hortonworks: On premises and Cloud

Cloudera: On premises and Cloud

Microsoft Azure HDInsight: Cloud Service

Page 38

Offerings for key Open Source BA infrastructure

Microsoft

Azure HDInsight:

Components offered:

Apache Hadoop/ YARN

Apache Tez

Apache Pig

Apache Hive

Apache HBase

Apache Sqoop

Apache Oozie

Apache Zookeeper

Apache Storm

Apache Mahout

Apache Spark

Cloud Service

Page 39

Amazon Web Services – Support for Open Source Capabilities on the Cloud

Hadoop – Elastic MapReduce

HDFS is automatically installed with Hadoop on Amazon’s EMR (Elastic MapReduce) cluster

EMR = Managed Hadoop framework service by Amazon

• Amazon EMR makes clusters easy to tune and helps reduce infrastructure maintenance and operational costs

• Supports multiple data stores

• Since it is elastic, one can provision hundreds or thousands of compute instances to process large datasets (and increase or decrease the number of instances)

Source: https://aws.amazon.com/emr/

Apache Spark – Elastic MapReduce

Spark is also supported on Amazon EMR clusters

In-memory caching, optimized execution, general batch processing, streaming analytics, machine learning, graph databases and ad-hoc queries are all supported in the cloud by Amazon EMR and EC2 (Elastic Compute Cloud)

• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud

Source: https://aws.amazon.com/emr/details/spark//

Page 40

Commercial vehicles for delivering viable BA solutions

IBM Predictive Maintenance

Predict Asset Failure / Extend Life

• Determine failure based on usage characteristics

• Identify conditions that lead to high failure

Predict Part Quality

• Detect anomalies within the process

• Conduct in-depth root cause analysis

IBM Watson Explorer

On premises and in cloud

Analytics on dispersed sources of structured and unstructured data

Commercial vehicles for delivering viable BA solutions

IBM Streams

On premises and in cloud

Commercial vehicles for delivering viable BA solutions

SAP HANA and Analytics

SAP HANA Platform

[Figure: SAP HANA Platform architecture diagram. Recoverable labels:]

• Application & UI services: Web Server; JavaScript, SQLScript, SQL

• Platform engines and libraries: Spatial, Search, Text Mining, Stored Procedure & Data Models, Business Function Library, Predictive Analytics Library, Database Services, Planning Engine, Rules Engine

• Data inputs: Transaction, Unstructured, Machine, Hadoop, Real-time, Location, Other Apps

• Connected applications and clients: Applications, Cloud Applications, Analytics, Excel, IoT, Mobile / Web API

SAP offers near-real-time in-memory computing capabilities and efficient reporting through HANA

Commercial vehicles for delivering viable BA solutions

SAP Smart Data Streaming

On premises

Opportunity for Luxembourg: High specialization in selected capabilities leading to new ICT Solutions with impact on Industry 4.0

Business Analytics: Key Topics

• Reference Architectures: Architectures integrating process and big data. Architectures for Cloud-based Applications and Services

• Infrastructure: Affordable options for big data and analytics needs

• Research and Innovation: Create a sandbox of new and custom algorithms for quick PoCs. Define an API App Cloud for Industry 4.0

• Security of Network Systems: How to use data analytics for better security

• Legal Framework: GDPR and data confidentiality assurance in Industry 4.0

• Training and Education: Programs in BA for executives, managers and engineers

• Others: Proper funding for collaboration between start-ups / R&D / industry

Paths to make BA viable for Industry 4.0

1. Focus on specific areas where large data sources and analytics may lead to operational savings and new business

– Do not boil the ocean with exotic mega-plans or tough ROI cases

– Start simple, with quick wins that secure executive management buy-in

2. Get help to assess the viability of the initiative very quickly (technically and financially)

– If your organization does not have the specific skills, do not rely on internal-only assessments (a good IT or Engineering department does not necessarily know BA)

– Partner with an R&D organization that can help you assess (for example, the FEDER project at LIST)

3. Use rented third-party ICT infrastructure to discover and validate solutions, architectures and options in depth

4. If you can afford to do some BA work on internal infrastructure, test ideas using simple tools

– Open tools are appealing for their zero-cost licenses, but the skills needed to use them properly and to maintain an informal development environment are very specialized

– Partner with an organization that can help you define a good architecture for the solution, toward a fast Go / No-go PoC (i.e., fail fast and cheaply)

Open source software - Considerations

Hadoop

• Programming language support: Java

• Runtime considerations: Distributed disk access; no firewall between intended systems

• Platform support: JDK, Java

• OS: Linux, Windows, Mac OS

Spark

• Programming language support: Scala, Java, Python, R via SparkR

• Runtime considerations: Distributed memory access; no firewall between intended systems; ICMP protocol should not be blocked

• Platform support: JDK, Java

• OS: Linux, Windows, Mac OS

Spark Streaming

• Programming language support: Scala, Java, Python

• Runtime considerations: Distributed memory access; system ports should not be blocked by firewall

• Platform support: JDK, Java

• OS: Linux, Windows, Mac OS

Kafka

• Programming language support: Several, including Scala, Java, Python, stdin/stdout

• Runtime considerations: System ports should not be blocked by firewall

• Platform support: JDK, Java, ZooKeeper

• OS: Linux, Windows, Mac OS

SAP Smart Data Streaming

• Programming language support: CCL

• Runtime considerations: Smart Data Streaming should be hosted on a separate server

• Platform support: [TBD]

• OS: Linux
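The firewall considerations above can be checked before installing anything. The sketch below, using only the Python standard library, tests whether TCP ports on a host accept connections. The helper names `port_is_reachable` and `check_ports` are made up for this example, and the host name in the usage note is a placeholder. It covers TCP ports only and does not test whether ICMP is blocked.

```python
import socket
from typing import Dict, Iterable


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host: str, ports: Iterable[int],
                timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether it is reachable on the given host."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

For example, `check_ports("worker-1.example.internal", [9092, 2181, 7077])` would probe the usual default ports of a Kafka broker, ZooKeeper and a Spark standalone master. Those numbers are common defaults only, and an actual deployment may be configured differently.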