ibm-cern - workshop sept 29 2016 - v2.pdf

IBM Machine Learning and Data Analytics

Collaboration Opportunities

Graham Mackintosh

IBM Emerging Technology – Project Executive

28 Sept 2016

Topics

IBM Emerging Technology – Quick Introduction

Workshop Context

Machine Learning and Deep Learning

Apache Spark

“open” CERN – openLabs, openData, SWAN, …

A few ideas to kick things off

IBM jStart – IBM Emerging Technology

jStart is the IBM Emerging Technologies client engagement team (ibm.com/jstart)

Solutions for global customers using open & emerging technologies.

Knowledge & experience transfer through customer engagements to IBM organizations and products.

Two examples of our active projects:

- Spark machine learning for signal classification - NASA, SETI, Stanford

- Predictive analytics and real time streaming with the US Cycling Team

jStart Projects and POC Process

Requirements driven - start with a simple use case and iterate

Low-friction PoC process to explore options & ideas - in-kind contribution

Every jStart engagement has an assigned jStart Project Manager and an experienced Architect with ML experience

Development Labs and Cloud-based POC environments

Experience with a variety of ML technologies (scikit-learn, MLLib, Keras, etc.)

Leverage third party & open source packages (e.g. HEP_ML for high energy physics)

The jStart Engagement Process (diagram): Solution Drivers & Boundaries, Requirements & Solution Scope, Detailed Design, Iterative Development, Deployment & Skills Transfer

Constant feedback on Business & Technology

Workshop Context

1. CERN openlab is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions

CERN openlab Machine Learning and Data Analytics workshop – April 2016

2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community

SWAN – open service for interactive analysis in the cloud

CERN evaluation of Spark to predict CMS data set popularity

Interest in MLLib, scikit-learn, Keras distributed deep learning, etc.

3. CERN is increasingly open to external citizen scientists

Collaboration with LAL for the Higgs ML Challenge in 2014

openData portal – access with controls for data embargoes

Workshop Context


• IBM Watson - $1B investment in deep learning and cognitive computing

• IBM DataWorks launched (Watson, Spark, Data Science Experience)

• IBM SystemML now an open Apache incubator project

• IBM is a core contributor to MLLib

• IBM Cognitive Compute Cluster for Deep Learning

Workshop Context

• IBM has announced strategic investment in Spark – now #2 contributor to Spark open source

• IBM Spark Technology Center opened in the heart of Silicon Valley

• Spark is linked to hundreds of other cloud services on IBM BlueMix

• Multiple Spark deployments and active POCs

“IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.”

“IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.”

“IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.”

Workshop Context

• IBM Data Science Experience

• IBM Data Exchange

• IBM collaboration with NASA Advanced Supercomputing Division to create training/test sets for ML models

• Example: Spark@SETI …

SETI Institute Backgrounder

Headquartered in Mountain View, CA. Founded 1984. 150 scientists, researchers and staff.

The mission of the SETI Institute is to explore the potential for extra-terrestrial life…. search for narrow band radio signals in the frequency range of 1GHz to 10GHz which could be evidence of intelligence outside our solar system.

Allen Telescope Array (ATA) – Phased Array Synthetic Dish – 3 Beams

42 Receiving Dishes

Each 6m diameter

1GHz to 10GHz

The Allen Telescope Array

Only the data with detected signals is saved for later analysis

4.5TB data every hour
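
For a sense of scale, the quoted 4.5TB per hour can be converted to a sustained rate. The sketch below is only arithmetic on the figure from the slide. It assumes decimal units (1TB = 1000GB) and round-the-clock observing, neither of which the slide states, and the function name is made up for illustration.

```python
def sustained_rates(tb_per_hour):
    """Convert a TB/hour figure into GB/second and TB/day.

    Assumes decimal units (1 TB = 1000 GB) and continuous observing;
    both are assumptions, not statements from the slides.
    """
    gb_per_second = tb_per_hour * 1000 / 3600  # 3600 seconds per hour
    tb_per_day = tb_per_hour * 24              # 24 hours per day
    return gb_per_second, tb_per_day


gb_s, tb_day = sustained_rates(4.5)
print(f"{gb_s:.2f} GB/s, {tb_day:.0f} TB/day")  # 1.25 GB/s, 108 TB/day
```

Under these assumptions, keeping everything would mean on the order of 100TB per day, which fits with saving only the data that contains detected signals.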

For example…. Spark@SETI

Spark@SETI

jStart project in collaboration with NASA and the SETI Institute

IBM Apache Spark Services allow large volumes of radio signal data to be analyzed in new ways

Deep data mining of the SETI 10-year data archives

Spark-enabled analysis of long-duration observations (~5TB each)

Intelligent signal classification with deep learning (Cognitive Compute Cluster)

Open environment to allow other institutions and world-experts to participate

NASA Space Science Division

Stanford University – Multiple concurrent research teams

Swinburne University, Australia – Wide-band signal detection experts

IBM Research Johannesburg – Square Kilometer Array research team

Import of signal data from SETI radio telescope data archives ~ 10 years

Shared repository of SETI data in Object Store

• 200M rows of signal event data

• 360,000 raw recordings of “signals of interest”

• Large “long duration” observations (~5TB each)

• ~20TB accessible data in storage

IBM Object Storage SWIFT

• IBM Spark@SETI GitHub repository

• Python Jupyter notebooks

• Python code install packages

• Standard GitHub Collaboration functions

Spark@SETI

Example Notebook

Jupyter notebook showing complex radio signals being classified based on morphology and other features.

Neural net model was developed on the IBM Cognitive Compute Cluster (GPU enhanced) and ported to IBM Spark on the cloud for use by other researchers

Spark@SETI – Technical pathfinder

Multi-terabyte data sources – hundreds of millions of records, millions of binary files ranging from 5MB to 5TB – hardened SWIFT connectivity from Spark to Object Store

CPU intensive algorithms for multi-variant data processing – hardened Spark services for multi-day wall time workloads

Multi-terabyte Ground-to-Cloud uploads … IBM TS2270 tapes, Softlayer Data Transfer Services, etc.

Advanced data visualization and notebook distribution

Integration with the IBM Cognitive Compute Cluster

Leverage deep learning models for real-time signal triage

Cluster availability monitoring and support

PUBLIC-Spark@SETI

Open invitation for external researchers and citizen scientists to analyze ATA signal data

Gallery of “greatest hits” and GitHub of notebooks for collaborative outcomes

Analytic challenges and “hackathons”

Review of results for potential use by the SETI Institute on the internal Spark environment

PUBLIC-Spark@SETI – Stanford University

Signal classification based on morphology and selected scalar metrics

Example: Randomly (?) modulated signals which are occasionally detected… signal of interest? faulty equipment?

The scale-invariant feature transform (SIFT)

Fisher Vector – “Squiggle Fingerprint”
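
The slides do not list which scalar metrics were used, so the following is only a hypothetical illustration of what such metrics could look like. It is not the Stanford team's method, and it is neither SIFT nor a Fisher Vector. Assuming a spectrogram stored as a list of time rows, each a list of power values per frequency bin, it tracks the strongest bin over time and reduces that track to two numbers: a drift rate from a least-squares line, and the RMS deviation from that line as a rough "squiggle" measure. All function names are invented for this sketch.

```python
from statistics import mean


def peak_track(spectrogram):
    """Return the index of the strongest frequency bin in each time row."""
    return [max(range(len(row)), key=row.__getitem__) for row in spectrogram]


def drift_and_wobble(spectrogram):
    """Fit peak_bin = intercept + drift * t by ordinary least squares.

    Returns (drift in bins per time step, RMS residual in bins).
    Both metrics are illustrative choices, not taken from the slides.
    """
    track = peak_track(spectrogram)
    n = len(track)
    if n < 2:
        raise ValueError("need at least two time rows to estimate drift")
    times = range(n)
    t_mean = mean(times)
    b_mean = mean(track)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (b - b_mean) for t, b in zip(times, track))
    drift = sxy / sxx
    intercept = b_mean - drift * t_mean
    squared = sum((b - (intercept + drift * t)) ** 2 for t, b in zip(times, track))
    return drift, (squared / n) ** 0.5


# Synthetic example: a narrow-band signal that moves up one bin per time step.
rows = [[1.0 if f == t + 1 else 0.0 for f in range(8)] for t in range(5)]
drift, wobble = drift_and_wobble(rows)
print(drift, wobble)
```

A steadily drifting carrier gives a non-zero drift with a wobble near zero, while a randomly modulated signal of the kind mentioned above would show a larger wobble. Values like these could sit alongside morphology-based features in a classifier.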

Getting back to the context of this workshop…

IBM experience is that these three themes (machine learning, Apache Spark, and CERN's openness to external contributors) are tightly linked

IBM is investing strategically in both Spark and DL

Spark community is a hotbed of ML and DL activity

IBM DSX and Spark Services are ideally suited to support public-facing initiatives

Externally contributed innovations can be leveraged for internal use (which is often the motivator)

This convergence is the basis for proposing that CERN & IBM should collaborate in these areas

Ideas: Two parallel work streams

1. POC for Internal Use Case – many possibilities from April workshop

jStart collaboration – no-charge exploration of the potential, iterative development/demos, begin knowledge transfer

Leverage of IBM Cognitive Compute Cluster and access to IBM Spark and DSX, Softlayer, Object Store, BlueMix services.

2. POC for Public facing Use Case – Spark@CERN

IBM Data Science Experience – Spark@CERN

Fully supported IBM cloud infrastructure 24x7

Expand and extend the reach of SWAN

Controlled access to CMS data

Hack-a-thons and ML challenges

Thank you

Supporting Material

The jStart Engagement Process

Solution Drivers & Boundaries

• Clear understanding of business problem to be solved

• Business and technical management commitment

• Funding in place

• Right skills identified and committed to project

Requirements & Solution Scope

• Solution definition

• Small team

• Define scope

• Map business needs and technology

• Deliverables: use cases, preliminary design, tentative schedule, initial sizing

Detailed Design

• Decision making context

• Detailed schedule

• Finalize scope

• Final technology selections

• Deliverables: design documents, project schedule

Iterative Development

• Early prototyping

• Regular code drops

• Testing throughout cycle

• Constant feedback from users

• Modifications via change request

Deployment & Skills Transfer

• Solution deployment

• Customer self-sufficiency

• Reusable assets

• Other business areas or technology

jStart and Apache Spark

Apache Foundation open source project

In-memory compute engine that works with data; not a data store

Enables highly iterative analysis on large volumes of data at scale

Unified rapid dev environment for developers and data engineers

Greatly simplifies the development of intelligent apps fueled by data

Ideal for Rapid Results POCs!

Thank You!
