ibm-cern - workshop sept 29 2016 - v2.pdf
IBM Machine Learning and Data Analytics
Collaboration Opportunities
Graham Mackintosh
IBM Emerging Technology – Project Executive
28 Sept 2016
Topics
IBM Emerging Technology – Quick Introduction
Workshop Context
Machine Learning and Deep Learning
Apache Spark
“open” CERN – openlab, openData, SWAN, …
A few ideas to kick things off
IBM jStart – IBM Emerging Technology
jStart is the IBM Emerging Technologies client engagement team (ibm.com/jstart).
Solutions for global customers using open & emerging technologies.
Knowledge & experience transfer through customer engagements to IBM organizations and products.
Two examples of our active projects:
- Spark machine learning for signal classification - NASA, SETI, Stanford
- Predictive analytics and real-time streaming with the US Cycling Team
jStart Projects and POC Process
Requirements driven - start with a simple use case and iterate
Low-friction PoC process to explore options & ideas - in-kind contribution
Every jStart engagement has an assigned jStart Project Manager and an
experienced Architect with ML experience
Development Labs and Cloud-based POC environments
Experience with a variety of ML technologies (scikit-learn, MLlib, Keras, etc.)
Leverage third party & open source packages (e.g. HEP_ML for high energy physics)
[Figure: The jStart Engagement Process. Phases: Solution Drivers & Boundaries, Requirements & Solution Scope, Detailed Design, Iterative Development, Deployment & Skills Transfer, with constant feedback on Business & Technology.]
Workshop Context
1. CERN openlab is promoting the use of Machine Learning at CERN in collaboration with external companies and research institutions
- CERN openlab Machine Learning and Data Analytics workshop – April 2016
2. Apache Spark enables interesting analytic capabilities and is well accepted by the global data science community
- SWAN – open service for interactive analysis in the cloud
- CERN evaluation of Spark to predict CMS data set popularity
- Interest in MLlib, scikit-learn, Keras distributed deep learning, etc.
3. CERN is increasingly open to external citizen scientists
- Collaboration with LAL for the Higgs ML Challenge in 2014
- openData portal – controlled access that respects data embargoes
Related IBM activity for item 1 (Machine Learning):
• IBM Watson - $1B investment in deep learning and cognitive computing
• IBM DataWorks launched (Watson, Spark, Data Science Experience)
• IBM SystemML now an open Apache incubator project
• IBM is a core contributor to MLlib
• IBM Cognitive Compute Cluster for Deep Learning
Related IBM activity for item 2 (Apache Spark):
• IBM has announced strategic investment in Spark – now #2 contributor to Spark open source
• IBM Spark Technology Center opened in the heart of Silicon Valley
• Spark is linked to hundreds of other cloud services on IBM Bluemix
• Multiple Spark deployments and active POCs
“IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.”
“IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.”
“IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.”
Related IBM activity for item 3 (openness to external scientists):
• IBM Data Science Experience
• IBM Data Exchange
• IBM collaboration with NASA Advanced Supercomputing Division to create training/test sets for ML models
• Example: Spark@SETI …
SETI Institute Backgrounder
Headquartered in Mountain View, CA. Founded 1984. 150 scientists, researchers and staff.
The mission of the SETI Institute is to explore the potential for extra-terrestrial
life … search for narrow-band radio signals in the frequency range of 1GHz to
10GHz which could be evidence of intelligence outside our solar system.
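As a rough illustration of what searching for a narrow-band signal involves, the sketch below hides a single tone in noise and recovers it by looking for the strongest bin of a power spectrum. This is a generic textbook approach, not the SETI Institute's detection pipeline. The sample count, tone position, noise level, and all function names are assumptions chosen for the example.

```python
import cmath
import math
import random


def power_spectrum(samples):
    """Return |DFT|^2 for bins 0..N/2-1 using a direct O(N^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def strongest_bin(samples):
    """Return (bin index, peak power / median power), ignoring the DC bin."""
    spectrum = power_spectrum(samples)
    peak = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    ordered = sorted(spectrum[1:])
    median = ordered[len(ordered) // 2]
    return peak, spectrum[peak] / median


# Synthetic data (hypothetical values, not telescope data):
# a unit-amplitude tone in bin 40 buried in Gaussian noise.
rng = random.Random(0)
N = 256
TONE_BIN = 40
samples = [
    math.sin(2 * math.pi * TONE_BIN * t / N) + rng.gauss(0.0, 1.0)
    for t in range(N)
]

peak_bin, peak_to_median = strongest_bin(samples)
print(f"strongest bin: {peak_bin}, peak/median power: {peak_to_median:.1f}")
```

A narrow-band signal concentrates its energy in one frequency bin while noise spreads across all of them, which is why the tone stands out even though it is invisible in the time-domain samples.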
Allen Telescope Array (ATA) – Phased Array Synthetic Dish – 3 Beams
• 42 receiving dishes
• Each 6m diameter
• 1GHz to 10GHz
The Allen Telescope Array
Only the data with detected signals is saved for later analysis
4.5TB data every hour
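To put the 4.5TB per hour figure in perspective, the short calculation below converts it to a sustained rate and a daily volume. It assumes decimal units (1 TB = 10^12 bytes) and continuous 24-hour observing. Neither assumption is stated on the slide, and the slide does not say whether the figure refers to raw or retained data.

```python
# Assumptions (not from the slides): decimal units and continuous observing.
TB = 10**12  # bytes per terabyte, decimal convention
RATE_TB_PER_HOUR = 4.5  # figure quoted on the slide


def bytes_per_second(tb_per_hour):
    """Convert a TB/hour rate into bytes per second."""
    return tb_per_hour * TB / 3600


def tb_per_day(tb_per_hour, hours_observing=24):
    """Total volume in TB for the given number of observing hours."""
    return tb_per_hour * hours_observing


rate_gb_s = bytes_per_second(RATE_TB_PER_HOUR) / 10**9
daily_tb = tb_per_day(RATE_TB_PER_HOUR)
print(f"{rate_gb_s:.2f} GB/s sustained, {daily_tb:.0f} TB per 24 hours")
```

Under these assumptions the rate works out to 1.25 GB/s, or 108 TB per day.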
For example…. Spark@SETI
Spark@SETI
jStart project in collaboration with NASA and the SETI Institute
IBM Apache Spark Services allow large volumes of radio signal data to be
analyzed in new ways
Deep data mining of the SETI 10-year data archives
Spark-enabled analysis of long-duration observations (~5TB each)
Intelligent signal classification with deep learning (Cognitive Compute Cluster)
Open environment to allow other institutions and world-experts to participate
NASA Space Science Division
Stanford University – Multiple concurrent research teams
Swinburne University, Australia – Wide-band signal detection experts
IBM Research Johannesburg – Square Kilometer Array research team
Import of signal data from SETI radio telescope data archives (~10 years)
Shared repository of SETI data in Object Store (IBM Object Storage, SWIFT)
• 200M rows of signal event data
• 360,000 raw recordings of “signals of interest”
• Large “long duration” observations (~5TB each)
• ~20TB accessible data in storage
IBM Spark@SETI GitHub repository
• Python Jupyter notebooks
• Python code install packages
• Standard GitHub collaboration functions
Spark@SETI
Example Notebook
Jupyter notebook showing complex radio signals being classified based on morphology and other features.
The neural net model was developed on the IBM Cognitive Compute Cluster (GPU enhanced) and ported to IBM Spark on the cloud for use by other researchers.
Spark@SETI – Technical pathfinder
Multi-terabyte data sources – hundreds of millions of records, millions of binary files ranging from 5MB to 5TB – hardened SWIFT connectivity from Spark to Object Store
CPU-intensive algorithms for multivariate data processing – hardened Spark services for multi-day wall-time workloads
Multi-terabyte ground-to-cloud uploads … IBM TS2270 tapes, SoftLayer Data Transfer Services, etc.
Advanced data visualization and notebook distribution
Integration with the IBM Cognitive Compute Cluster
Leverage deep learning models for real-time signal triage
Cluster availability monitoring and support
PUBLIC-Spark@SETI
Open invitation for external researchers and citizen scientists to analyze ATA signal data
Gallery of “greatest hits” and GitHub of notebooks for collaborative outcomes
Analytic challenges and “hackathons”
Review of results for potential use by the SETI Institute on the internal Spark environment
PUBLIC-Spark@SETI – Stanford University
Signal classification based on morphology and selected scalar metrics
Example: Randomly (?) modulated signals which are occasionally
detected… signal of interest? faulty equipment?
The scale-invariant feature transform (SIFT)
Fisher Vector – “Squiggle Fingerprint”
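The idea of classifying signals from selected scalar metrics can be sketched with a very small nearest-centroid classifier. This is an illustrative stand-in only, not the Stanford team's SIFT and Fisher Vector method. The two metrics (drift rate and bandwidth), the class labels, and every numeric value are hypothetical and are not SETI data.

```python
import math


def centroid(rows):
    """Mean feature vector of a list of equal-length feature tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def train(labelled):
    """Map each label to the centroid of its training examples."""
    return {label: centroid(rows) for label, rows in labelled.items()}


def classify(model, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical scalar metrics per signal: (drift rate in Hz/s, bandwidth in Hz).
training = {
    "narrowband": [(0.1, 1.0), (0.2, 1.5), (0.0, 0.8)],
    "squiggle": [(1.5, 20.0), (2.0, 25.0), (1.8, 30.0)],
}

model = train(training)
label = classify(model, (1.7, 22.0))
print(label)
```

In practice the features would need scaling so that one metric does not dominate the distance, and a morphology-based approach would work from the spectrogram image itself rather than from a pair of scalars.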
Getting back to the context of this workshop…
IBM experience is that these three workshop themes are tightly linked
IBM is investing strategically in both Spark and DL
The Spark community is a hotbed of ML and DL activity
IBM DSX and Spark Services are ideally suited to support public-facing initiatives
Externally contributed innovations can be leveraged for internal use (which is often the motivator)
This convergence is the basis for proposing that CERN & IBM should collaborate in these areas
Ideas: Two parallel work streams
1. POC for Internal Use Case – many possibilities from April workshop
jStart collaboration – no-charge exploration of the potential, iterative development/demos, begin knowledge transfer
Leverage the IBM Cognitive Compute Cluster and access to IBM Spark and DSX, SoftLayer, Object Store, Bluemix services.
2. POC for Public-facing Use Case – Spark@CERN
IBM Data Science Experience – Spark@CERN
Fully supported IBM cloud infrastructure, 24x7
Expand and extend the reach of SWAN
Controlled access to CMS data
Hackathons and ML challenges
Thank you
Supporting Material
The jStart Engagement Process
Solution Drivers & Boundaries
- Clear understanding of business problem to be solved
- Business and technical management commitment
- Funding in place
- Right skills identified and committed to project
Requirements & Solution Scope
- Solution definition
- Small team
- Define scope
- Map business needs and technology
- Deliverables: use cases, preliminary design, tentative schedule, initial sizing
Detailed Design
- Decision making context
- Detailed schedule
- Finalize scope
- Final technology selections
- Deliverables: design documents, project schedule
Iterative Development
- Early prototyping
- Regular code drops
- Testing throughout cycle
- Constant feedback from users
- Modifications via change request
Deployment & Skills Transfer
- Solution deployment
- Customer self-sufficiency
- Reusable assets
- Other business areas or technology
jStart and Apache Spark
Apache Foundation open source project
In-memory compute engine that works with data; not a data store
Enables highly iterative analysis on large volumes of data at scale
Unified rapid dev environment for developers and data engineers
Greatly simplifies the development of intelligent apps fueled by data
Ideal for Rapid Results POCs!
Thank You!