MLconf NYC: Ted Willke
Post on 17-Oct-2014
CONTEXT
SEMANTICS!
Danny : isBrotherOf : Nezih
food cart : uses : bicycles
Frank : isFriendsWith : Mohit
Frank : isFriendsWith : Ted
Frank : likes : bicycles
Frank : likes : food carts
Ivy : isFriendsWith : Kushal
Ivy : isFriendsWith : Ted
Ivy : likes : bicycles
Ivy : likes : food carts
Kushal : isFriendsWith : Mohit
Kushal : isFriendsWith : Nezih
Nezih : isFriendsWith : Ted
Ted : likes : bicycles
This model... infers this interest.
[Figure: graph of the people and interests above. Nodes: Ted, Kushal, Mohit, Danny, Ivy, Frank, Nezih, FoodCart, Bicycles. Edge labels: friends, brothers, likes, uses. One candidate edge is marked "Likes?".]
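One way to read this slide: the triples form a small graph, and a model proposes a missing "likes" edge from what a person's friends like. The sketch below is a minimal illustration in Python, assuming a simple friend-vote heuristic. The slide does not say which model is actually used, and `friends_of`, `likes_of`, and `infer_likes` are hypothetical helper names.

```python
from collections import Counter

# Triples copied from the slide: (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def friends_of(person, triples=TRIPLES):
    """Collect a person's friends, treating isFriendsWith as symmetric (an assumption)."""
    friends = set()
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            if subject == person:
                friends.add(obj)
            elif obj == person:
                friends.add(subject)
    return friends


def likes_of(person, triples=TRIPLES):
    """Return the set of things a person is stated to like."""
    return {o for s, p, o in triples if p == "likes" and s == person}


def infer_likes(person, triples=TRIPLES):
    """For each item the person does not already like, count the friends who like it."""
    already = likes_of(person, triples)
    votes = Counter()
    for friend in friends_of(person, triples):
        for item in likes_of(friend, triples) - already:
            votes[item] += 1
    return dict(votes)


print(infer_likes("Ted"))
```

For Ted, both Frank and Ivy like food carts, so that candidate edge collects two votes. A real system would weigh many more signals than friend counts.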
Virtuous cycle of data
CLOUD
Richer data to analyze
CLIENTS
Richer data from devices
Richer user experiences
INTELLIGENT SYSTEMS
SEMANTIC INFORMATION IS FUEL FOR THE CYCLE
[Timeline: 1985, 1995, 2005, 2015. Labels: enterprise, NoSQL, Docs+, Semantics, RDF.]
WIDESPREAD MACHINE LEARNING ON THIS
IMAGINE THE POSSIBILITIES
Graph centrality
[Figure: graph of channel viewing behavior, showing current popular surfing patterns. Scale: Program Importance (Centrality), from High to Low.]
SH002463130000 EP005544723744
Changes in surfing behavior may predict customer churn.
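The slide does not say which centrality measure is used. As one possibility, the sketch below scores programs with PageRank computed by power iteration over a directed graph of channel switches. The channel names and switch edges are made up for illustration, and `pagerank` is a hypothetical helper, not part of any toolkit named in the talk.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing = {node: [] for node in nodes}
    for source, target in edges:
        outgoing[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = outgoing[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated
    return rank


# Hypothetical viewing behavior: each pair is "viewer switched from A to B".
SWITCHES = [
    ("news", "sports"),
    ("movies", "sports"),
    ("kids", "sports"),
    ("sports", "news"),
]

for program, score in sorted(pagerank(SWITCHES).items(), key=lambda kv: -kv[1]):
    print(f"{program}: {score:.3f}")
```

Recomputing the scores over successive time windows and comparing them per customer is one way the "changes in surfing behavior" idea could be explored.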
Preference and Similarity Recommendations
User
Movie
1.7MM Nodes, 23.9MM Edges
similar cast
prefers
similar topic
userId: A0A22A5
title: The Godfather
genre: Crime drama
cast: [M. Brando, Al Pacino]

title: Scarface
genre: Crime drama
cast: [Al Pacino, M. Pfeiffer]

title: The Departed
genre: Crime drama
cast: [L. DiCaprio, M. Damon]
weight=11.8
weight=0.67
weight=0.03
weight=14.98
Min-cost path search
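A min-cost path search over a weighted preference and similarity graph can be sketched with Dijkstra's algorithm. The node names and weight values below come from the slide, but which weight belongs to which edge is not recoverable from the transcript, so the assignment here is an assumption made only for illustration. `min_cost_path` is a hypothetical helper.

```python
import heapq


def min_cost_path(graph, start, goal):
    """Dijkstra's algorithm; graph maps node -> {neighbor: non-negative weight}."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # goal not reachable from start


# ASSUMPTION: the edge-to-weight assignment is invented for this example.
GRAPH = {
    "user:A0A22A5": {"The Godfather": 0.03},                    # prefers
    "The Godfather": {"Scarface": 0.67, "The Departed": 11.8},  # similar cast / similar topic
    "Scarface": {"The Departed": 14.98},                        # similar topic
}

print(min_cost_path(GRAPH, "user:A0A22A5", "The Departed"))
```

With these assumed weights, the direct hop from The Godfather to The Departed is cheaper than the detour through Scarface, so the search returns the two-edge path.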
URL Ground-Truth Data
IP/Domain Reputations
420MM Records
74.5MM Nodes, 185MM Edges
URL
Domain
IP Address
Calculation of priors
LBP Messaging
Loopy Belief Propagation on the (semantic) web
84.231.82.93
86.39.155.137
forum.vsichko.com
hermansonskok.se
euskzzbz.nonetheups.com
keesenbep.spaces.live.com
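The two stages named on the slide, calculating priors and then passing LBP messages, can be sketched on a toy graph. This is a generic sum-product loopy belief propagation over a pairwise model with two states (benign, malicious), not the implementation used in the talk. The node names, prior values, and the `same` coupling strength are all assumptions chosen for illustration.

```python
def loopy_bp(priors, edges, same=0.9, iterations=20):
    """Sum-product LBP. priors maps node -> [P(benign), P(malicious)]."""
    neighbors = {node: set() for node in priors}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Edge potential: neighboring nodes tend to share the same state.
    psi = [[same, 1.0 - same], [1.0 - same, same]]
    # messages[(i, j)][x] is what node i tells node j about state x of j.
    messages = {(i, j): [0.5, 0.5] for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            outgoing = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    term = priors[i][xi] * psi[xi][xj]
                    for k in neighbors[i]:
                        if k != j:  # exclude the recipient's own message
                            term *= messages[(k, i)][xi]
                    total += term
                outgoing.append(total)
            norm = sum(outgoing)
            updated[(i, j)] = [value / norm for value in outgoing]
        messages = updated

    beliefs = {}
    for i in priors:
        belief = list(priors[i])
        for k in neighbors[i]:
            belief = [belief[x] * messages[(k, i)][x] for x in (0, 1)]
        norm = sum(belief)
        beliefs[i] = [value / norm for value in belief]
    return beliefs


# Hypothetical nodes: two with ground-truth-style priors, three unknown.
PRIORS = {
    "url_known_bad": [0.05, 0.95],
    "url_known_good": [0.95, 0.05],
    "domain_x": [0.5, 0.5],
    "ip_y": [0.5, 0.5],
    "ip_z": [0.5, 0.5],
}
EDGES = [
    ("url_known_bad", "domain_x"),
    ("domain_x", "ip_y"),
    ("ip_y", "url_known_bad"),  # closes a cycle, hence "loopy"
    ("url_known_good", "ip_z"),
]

for node, (benign, malicious) in loopy_bp(PRIORS, EDGES).items():
    print(f"{node}: P(malicious) = {malicious:.2f}")
```

The unknown domain and IP that sit next to the known-bad URL drift toward "malicious", while the IP next to the known-good URL drifts toward "benign". On graphs with cycles the result is an approximation and convergence is not guaranteed in general.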
A yoga ball graph.
Really!?!
You may actually need this
• When the problem is an information network
• When a graph is a natural way of expressing the algorithm
• When you want to study specific relationships
• When you want faster machine learning or solvers on sparse data
shortest path
central influence
sub networks
triangle count
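Of the graph measures listed on this slide, triangle count is the easiest to show in a few lines. A minimal sketch in Python, assuming an undirected graph given as an edge list (the example edges are made up):

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    triangles = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # Require u < v < w so each triangle is counted exactly once.
                common = neighbors[u] & neighbors[v]
                triangles += sum(1 for w in common if w > v)
    return triangles


# Hypothetical network: triangles a-b-c and c-d-e share the vertex c.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
print(count_triangles(EDGES))
```

Storing neighbors as sets keeps the work proportional to the edges that exist, which is the usual argument for graph representations on sparse data.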
But there are challenges.
Handling all that data.
Finding people good at both handling all that data and data analysis.
Putting exploratory work into production fast enough to keep up with the competition.
Congratulations! You are a data scientist!
It’s a demanding job
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Skills shortage at intersection of systems engineering and data analysis
Painful data ingestion and preparation
Workflows that are not designed with loopbacks in mind
Few tools for analyzing semantics at scale
Composing pipeline is DIY
Decomposing the “data scientist”
Source: 2013 Report from Accenture Institute for High Performance
IMAGINE A PLATFORM FOR DATA SCIENTISTS
DOCS + SEMANTICS + MACHINE LEARNING
Ease-of-use: Making big data familiar
Python
R
Dataflow GUI
...
Datacenter / Cloud
Network
Client
BIG DATA
API
Connect
Manage
Secure
Analyze distributed and parallel
Manage
Secure
Connect
Analyze local
Query
Big Data Java/Scala/C++ Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling
Analyst Skills
The Other Skills
Delivering it
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP
APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
BIG DATA API
DATA SCIENCE SERVER (Query and Scripting)
Intel Analytics Toolkit
A UNIFIED DOCUMENT + SEMANTIC STORE
The Ask
Algorithm                                     Category                  Applications/Use Cases
Loopy Belief Propagation (LBP)                Structured Prediction     Personalized recs, image de-noising
Label Propagation                             Structured Prediction     Personalized recommendations
Alternating Least Squares (ALS)               Collaborative Filtering   Recommenders
Conjugate Gradient Descent (CGD)              Collaborative Filtering   Recommenders
Connected Components                          Graph Analytics           Network manipulation, image analysis
Latent Dirichlet Allocation (LDA)             Topic Modeling            Document Clustering
Structure Attribute                           Clustering                Network analysis, consumer seg
K-Truss                                       Clustering                Social network analysis
KNN*                                          Clustering                Recommenders
Logistic Regression*                          Classification            Fraud detection
Random Forest*                                Classification            Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson)  Non-linear Curve Fitting  Forecasting, pricing, market mix models
Association Rule Mining                       Data Mining               Market basket analysis, recommenders
Frequent Pattern Mining*                      Data Mining               Pattern Recognition
[Approach column label recovered from the slide: Graph]
Bringing a full spectrum of possibilities
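As a concrete instance of one entry in the table, connected components can be computed with a union-find structure. This is a generic single-machine sketch in Python, not the toolkit's distributed implementation. `connected_components` and the example graph are hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes of an undirected graph into components using union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk to the root, halving the path on the way up.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    # Sort for a stable, readable result.
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical network with two separate sub-networks and one isolated node.
NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
print(connected_components(NODES, EDGES))
```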
Article Tagging Problem
• Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
• Process is resource-intensive – can we automate it?
• Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
[Chart: Article Count by Hierarchy Level]
Demo: Graph Analytics For Medical Journal Analysis
INGEST & CLEAN
ENGINEER FEATURES
STRUCTURE GRAPH
QUERY & ANALYZE
LEARN
VISUALIZE

PARSE AND EXTRACT WORDS
• Medline™ XML
• MeSH Ontology XML
• Create list of unique words
• Stemming and lemmatization

CREATE ARTICLE/WORD LIST
• Index word list
• Transform articles into list of article/word pairs

BUILD GRAPH
• Extract vertices
• Assign id columns to vertex property
• Assign year and count edge properties

QUERY/VISUALIZE DATA
• Gremlin query for each visual
• Python web server and other libraries

DETECT CLUSTERS USING LDA
• Select optimization parameters
• Invoke LDA
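The middle of this pipeline (index the word list, turn articles into article/word pairs, attach year and count as edge properties) can be sketched in plain Python. The input records are invented, the field names are assumptions, and lowercase regex tokenization stands in for the stemming and lemmatization step, which the transcript does not detail.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase alphabetic runs (stand-in for stemming/lemmatization)."""
    return re.findall(r"[a-z]+", text.lower())


def build_article_word_graph(articles):
    """articles maps article id -> {"year": int, "text": str}.

    Returns (word_index, edges): word_index maps each unique word to an id,
    and each edge links one article to one word with year and count properties.
    """
    vocabulary = sorted({word for a in articles.values() for word in tokenize(a["text"])})
    word_index = {word: word_id for word_id, word in enumerate(vocabulary)}

    edges = []
    for article_id, article in articles.items():
        counts = Counter(tokenize(article["text"]))
        for word, count in sorted(counts.items()):
            edges.append({
                "article": article_id,
                "word": word_index[word],
                "year": article["year"],
                "count": count,
            })
    return word_index, edges


# Hypothetical records; real input would come from the Medline XML.
ARTICLES = {
    "a1": {"year": 2001, "text": "Sleep stages and REM sleep"},
    "a2": {"year": 2005, "text": "Attention and arousal in rats"},
}

word_index, edges = build_article_word_graph(ARTICLES)
print(len(word_index), "unique words,", len(edges), "article/word edges")
```

The article ids and word ids become the two vertex types, and each edge record carries the properties the slide lists.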
The Playbook?
PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA
Parse
Prepare graph data
Basic analysis
Run LDA
INSIGHTFUL RESULT
This never happens!
The Real Playbook
PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA
Parse
Correct mistake
Prepare graph data
Correct schema mistake
Correct aggregation mistake
Data validation
Correct dataset mistake
Guess LDA settings
Tune and re-run
Detect bias in dataset
WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
Build Frame
Build Graph
Query Vertices
LDA with 3 Topics
LDA with 5 Topics
LDA with 7 Topics
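The three slides above rerun LDA with different topic counts. The toolkit's own LDA call is not shown in the transcript, so the sketch below substitutes a small collapsed Gibbs sampler in plain Python to illustrate the "guess settings, tune and re-run" loop. The corpus, the hyperparameters `alpha` and `beta`, and the function name `lda_gibbs` are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocabulary = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocabulary)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocabulary, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    # Smoothed per-document topic mixture.
    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return doc_mix, topic_word


# Hypothetical corpus built from a few of the MeSH-style terms in this deck.
DOCS = [
    ["sleep", "rem", "dreams", "sleep", "arousal"],
    ["sleep", "wakefulness", "circadian", "rhythm"],
    ["rats", "conditioning", "animals", "rats"],
    ["attention", "child", "infant", "attention"],
]

for k in (3, 5, 7):
    mixtures, _ = lda_gibbs(DOCS, num_topics=k)
    print(k, "topics:", [max(range(k), key=mix.__getitem__) for mix in mixtures])
```

Each run prints the dominant topic per document. Comparing runs across topic counts is the manual tuning step the playbook slide complains about.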
Query Vertices Again – Now with ML Properties
Following Analysis
[Bar chart: Top MeSH terms that predict which category an article will be assigned. Axis range: 0 to 0.08. Terms shown: Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams.]
Reimagining 2014
New partnerships in big data
Contributions to the open source community
The Intel Analytics Toolkit – COMING SOON
SEMANTICS + MACHINE LEARNING
TOGETHER AT LAST!
INTERESTED IN THE INTEL ANALYTICS [email protected]
Legal Disclaimers
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security
Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2013, Intel Corporation. All rights reserved.