Scaling the Data Scientist
Dr. Ira Cohen, Chief Data Scientist, HP Software
2 Data Science Office @ HPSW
HP-Software and Data Science
HP-Software products collect huge amounts of IT data
Customers want us to transform the data to actionable information
System Monitoring
Events
Defects
Incidents
Logs
Changes
Configuration
Test data
Requirements
Security events
Network data
App Monitoring
“Big Data & Predictive Analytics: The Future of IT Management” Mike Gualtieri, Forrester
Need
Expertise
• Expertise in machine learning
• Expertise in the product domain
Infrastructure
• Data platforms
• Development tools
A tale of two worlds
Data Scientists
• Few
• Limited domain knowledge
• Tools: R, MATLAB, Mahout, KNIME, Weka, SAS, …
Developers/SMEs
• Plentiful
• Limited data science knowledge
• Tools: IDEs, Excel
Our solution
Developer → Data analytics specialist
How?
• Training
• Mentoring
• Community
• Data infrastructure
• New Dev tool
Training: Practical Machine Learning
• 4-day training
• Commitment to complete first project
Data
• Big data foundations
• Problem definition
Processing
• Attribute construction
• Transformations
Filtering
• Attribute selection
• Dimensionality reduction
Learning
• Supervised
• Unsupervised
Testing
• Validation methods
• Accuracy measures
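The "Testing" stage of the outline above names validation methods and accuracy measures without saying which ones the course covered. As a minimal sketch, assuming k-fold cross-validation as the validation method and accuracy, precision, and recall as the measures, the two ideas could look like this in plain Python. All function names here are hypothetical and do not come from the slides.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; sample i lands in fold i % k."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for held_out, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out for testing.
        train_idx = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        splits.append((train_idx, test_idx))
    return splits


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Of the samples predicted positive, the fraction that really are positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Of the truly positive samples, the fraction that were found."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0
```

Accuracy alone can look good on imbalanced data, which is why the sketch reports precision and recall alongside it.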
Practical Machine Learning
Ohad Assulin, Efrat Egozi Levi, Ira Cohen
• Automatic Event Prioritization: Anat Levinger & Roy Wallerstein
• Automatic Vulnerability Categorization: Barak Raz & Ben Feher
• Classifying Security Events: Yoni Roit & Omer Weissman
• Early detection of anomalous behavior in IT systems: Yonatan Ben Simhon & Yaneeve Shekel
• Cloud Delivery Optimization (CDO): Ran, Levi
• URL to Action Classification: Boaz Shor & Eyal Kenigsberg
• Predictive Analytics in Release Management: Sigalit Sade
• Sales Pipeline Early Warning: Gabriel, Alvarado
• Pushing My Buttons: Gil Zieder, Ofer Eliassaf, Boris Kozorovitzky
The process @ work
Data
• Problem definition
Processing
• Attribute construction
• Normalization
Filtering
• Attribute selection
Learning
• Supervised
• Classification
Testing
• Minimize false negatives
Problem: as a pusher or DevOps engineer on a project, you would like to know whether a given change set is safe to push into the production branch.
Data: 9 open source projects, 8806 individual commits. Each commit is labeled “good” or “bad” by running the tests after it: “good” means the tests pass, “bad” means the tests fail.
Attributes: 80 attributes per commit, based on source control, previous commits, and code complexity, e.g., average change frequency, previous commit state, cyclomatic complexity.
Attribute selection: rank-based.
Classification algorithms: K-NN, SVM, Decision Tree, Random Forest, …
Result: 87% accuracy with K-NN.
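The commit-classification process above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the project's actual code: min-max scaling stands in for "normalization", absolute Pearson correlation with the label stands in for the unspecified rank-based criterion, a false negative is assumed to mean a "bad" commit predicted as "good", and the three attributes and all values in the toy data are made up.

```python
import math
from collections import Counter
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def rank_attributes(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> List[int]:
    """Order attribute indices by |Pearson correlation| with the label, best first.

    Assumption: the slides only say "rank based"; correlation is one possible score.
    """
    n = len(labels)
    mean_y = sum(labels) / n
    dev_y = [y - mean_y for y in labels]
    norm_y = math.sqrt(sum(d * d for d in dev_y))

    def score(col: Sequence[float]) -> float:
        mean_x = sum(col) / n
        dev_x = [x - mean_x for x in col]
        norm_x = math.sqrt(sum(d * d for d in dev_x))
        if norm_x == 0 or norm_y == 0:
            return 0.0  # a constant column carries no ranking signal
        return abs(sum(a * b for a, b in zip(dev_x, dev_y)) / (norm_x * norm_y))

    scores = [score(col) for col in zip(*rows)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def select_attributes(rows: Sequence[Sequence[float]], keep: Sequence[int]) -> List[List[float]]:
    """Keep only the attribute columns listed in `keep`."""
    return [[row[i] for i in keep] for row in rows]


def knn_predict(train_x, train_y, query, k: int = 3) -> int:
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def false_negatives(y_true: Sequence[int], y_pred: Sequence[int], bad: int = 1) -> int:
    """Count "bad" commits that were predicted as "good" (assumed meaning of FN)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == bad and p != bad)


# Toy, made-up commits: [change_frequency, complexity, unrelated]; 1 = "bad" commit.
X = [[1, 10, 5], [2, 12, 1], [8, 40, 4], [9, 45, 2]]
y = [0, 0, 1, 1]
X_norm = min_max_normalize(X)
top_two = rank_attributes(X_norm, y)[:2]      # drop the least informative attribute
X_sel = select_attributes(X_norm, top_two)
print(knn_predict(X_sel, y, [0.9, 0.9], k=3))
```

Normalizing before K-NN matters because the distance would otherwise be dominated by whichever attribute has the largest numeric range.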
Analytic specialist program: Results
• > 70 developers trained (before: 4)
• > 30 new capabilities since April 2013 (before: 1)
• 1 data scientist per 10 new capabilities (before: 1:1)
• Development time reduced by 70% (before: 12 months)
Can we do better?
• Yes. From months to days!
• How?
  – Create a simple tool for analytic specialists
  – Automate the data scientist as much as possible
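The slides do not describe how this automation works. One thing "automating the data scientist" could mean, offered purely as an assumption, is letting a tool try several model settings and keep the one that validates best. The sketch below does this for the number of neighbors in K-NN using leave-one-out validation; the function names are hypothetical and this is not a description of the tool mentioned in the talk.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train_x, train_y, query, k: int) -> int:
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(xs: List[List[float]], ys: List[int], k: int) -> float:
    """Hold out each sample once, predict it from the rest, and report accuracy."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) == ys[i]:
            hits += 1
    return hits / len(xs)


def pick_best_k(
    xs: List[List[float]], ys: List[int], candidates: Sequence[int]
) -> Tuple[int, float]:
    """Return (best_k, score); ties go to the earliest candidate in the list."""
    scored = [(leave_one_out_accuracy(xs, ys, k), k) for k in candidates]
    best_score = max(score for score, _ in scored)
    best_k = next(k for score, k in scored if score == best_score)
    return best_k, best_score
```

The same loop generalizes to choosing between different algorithms: anything that can be trained and scored on held-out data can be a candidate.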
Project Titan
Titan: Demo
Scaling the data scientist
Analytic specialists
• Develop using standard machine learning
• Use a simplified tool
Data Scientist
• Provides expert advice
• Develops new types of machine learning solutions