![Page 1: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/1.jpg)
Use of standards and related issues in predictive analytics
KDD 2016, SF 2016-08-16
Paco Nathan, @pacoid
Dir, Learning Group @ O’Reilly Media
![Page 2: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/2.jpg)
PMML referenced by 86 publications in Safari, 2001-2016 https://www.safaribooksonline.com/search/?query=PMML
![Page 3: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/3.jpg)
Pattern: PMML for Cascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) https://goo.gl/jk7829
![Page 4: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/4.jpg)
[Diagram: Cascading flow. Labels: CustomerOrders, Classify, PMMLModel, ScoredOrders, GroupBy token, Count, Assert, FailureTraps, ConfusionMatrix; M and R mark the map and reduce phases.]
Pattern – score a model, using pre-defined Cascading app
cascading.org/projects/pattern
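The idea of scoring from a PMML file can be illustrated without Cascading or Hadoop. The following is a minimal sketch, not Pattern's implementation: it parses a tiny PMML `RegressionModel` with the Python standard library and applies it to one record. The model, its coefficients, and the field names `x1`, `x2`, `y` are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical model (not from the talk): y = 1.5 + 2.0*x1 - 0.5*x2
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="demo">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Extract the intercept and per-field coefficients from a PMML string."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (ignores the optional exponent attribute)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
print(score(model, {"x1": 2.0, "x2": 4.0}))
```

The point of the sketch is that the scoring side only needs the PMML file, not the tool that trained the model.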
![Page 5: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/5.jpg)
[Figure: workflow diagram. Stages: ETL into cluster/cloud; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; visualize, reporting; actionable results; decisions, feedback. Annotations: representation, evaluation, optimization (circa 2010); "foo algorithms"; "bar developers".]
Algorithms and developer-centric template thinking only go so far in real-world workflows…
Results shown in blue, hard problems highlighted in red
Generalized Workflow for ML Use Cases in Big Data
![Page 6: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/6.jpg)
Portable Format for Analytics (PFA)
PFA updates the standards w.r.t. more contemporary issues of system architectures used for predictive analytics: distributed processing, in-memory computing, serialization, etc.
http://dmg.org/pfa/docs/motivation/
• much more support for distributed systems
• Avro data types
• forward-looking toward more streaming applications
• fits well with higher layers of abstraction, success of DSLs, etc.
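A PFA document is JSON with `input` and `output` types (Avro schemas) and an `action` made of nested expressions. The sketch below is a toy evaluator for a two-operator subset, written only to show the shape of such a document; it is not Titus, Hadrian, or any official PFA engine, and the function names `evaluate` and `run_action` are hypothetical.

```python
import json

# A small PFA-style document: take a double, return input * 2.0 + 100.0
PFA_DOC = json.loads("""
{"input": "double",
 "output": "double",
 "action": [{"+": [{"*": ["input", 2.0]}, 100.0]}]}
""")


def evaluate(expr, env):
    """Evaluate a nested expression; supports only numbers, symbols, '+' and '*'."""
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):
        # In PFA expressions a bare string is a symbol reference, e.g. "input".
        return env[expr]
    if isinstance(expr, dict) and len(expr) == 1:
        (op, args), = expr.items()
        left, right = (evaluate(a, env) for a in args)
        if op == "+":
            return left + right
        if op == "*":
            return left * right
    raise ValueError(f"unsupported expression in this toy subset: {expr!r}")


def run_action(doc, datum):
    """Run each expression in 'action'; the value of the last one is the output."""
    result = None
    for expr in doc["action"]:
        result = evaluate(expr, {"input": datum})
    return result


print(run_action(PFA_DOC, 3.0))
```

Because the whole scoring procedure is plain JSON, it can be serialized and shipped to workers in a distributed or streaming system like any other data.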
![Page 7: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/7.jpg)
Tuning Spark Streaming for Throughput Gerard Maas, Virdata (2014-12-22)
“One Size Fits All” Doesn’t Anymore
This common architectural pattern requires interchange…
![Page 8: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/8.jpg)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/
IoT alters “velocity” and “volume” dramatically
This growing category of use cases requires interchange…
![Page 9: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/9.jpg)
Lessons from the success of Apache Spark…
interchange is necessary for the ecosystem
major use cases tend to build their own ML libraries, even though a majority of committers support a common vision and encourage use of a canonical library (MLlib with DataFrames)
when a successful business grows over time, challenges arise by definition: managing separated teams, mergers and acquisitions, increased audits, regulations, etc.
therefore, lack of interchange for analytics represents a serious technical debt and potential liability
![Page 10: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/10.jpg)
[Figure: Spark stack diagram. Labels: Python, SQL, R, Streaming, Advanced Analytics, DataFrame, Tungsten Execution.]
Physical Execution: CPU Efficient Data Structures
Keep data closer to CPU cache (Tungsten)
Lessons from the success of Apache Spark…
direct use of “compilers” becomes atypical as abstraction layers become smarter for deferred optimization
![Page 11: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/11.jpg)
What to suggest for existing standards?
microservices: how to compose models + parameters from multiple/distinct services
support for API definitions in Swagger http://swagger.io/
consider the benefits of Parquet, e.g., how pushdown predicates enable better optimization of workflows
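To make the pushdown point concrete, here is a toy illustration in plain Python rather than actual Parquet code. It assumes data stored in row groups that each carry min/max statistics per column, which is the general mechanism Parquet readers rely on to skip data. The names `RowGroup` and `scan_greater_than` and the sensor readings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows plus per-column (min, max) statistics computed once."""
    rows: List[dict]
    stats: Dict[str, Tuple[float, float]] = field(init=False)

    def __post_init__(self):
        columns = self.rows[0].keys()
        self.stats = {
            c: (min(r[c] for r in self.rows), max(r[c] for r in self.rows))
            for c in columns
        }


def scan_greater_than(groups, column, threshold):
    """Return rows where row[column] > threshold, and how many groups were read.

    The predicate is "pushed down": a group whose max is <= threshold cannot
    contain a match, so its rows are never touched.
    """
    matched, groups_read = [], 0
    for group in groups:
        _, col_max = group.stats[column]
        if col_max <= threshold:
            continue  # skipped using statistics alone
        groups_read += 1
        matched.extend(r for r in group.rows if r[column] > threshold)
    return matched, groups_read


# Hypothetical sensor readings, split into three row groups.
GROUPS = [
    RowGroup([{"sensor": 1, "temp": 10.0}, {"sensor": 2, "temp": 20.0}]),
    RowGroup([{"sensor": 3, "temp": 21.0}, {"sensor": 4, "temp": 30.0}]),
    RowGroup([{"sensor": 5, "temp": 31.0}, {"sensor": 6, "temp": 45.0}]),
]

rows, groups_read = scan_greater_than(GROUPS, "temp", 30.0)
print(rows, groups_read)
```

An interchange standard that exposes a model's input predicates in a comparable way would let the storage layer skip data before scoring, rather than filtering afterwards.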
![Page 12: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/12.jpg)
What to suggest for existing standards?
additional standards emerging for other aspects of workflow definition:
Jupyter http://jupyter.org/
create and share documents that contain live code, equations, visualizations and explanatory text; at heart, a network protocol suite for distributed REPL environments, often along with containerization
see usage in Oriole http://oreilly.com/oriole/index.html
Dat http://dat-data.com/
shares versioned data through a decentralized network
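The protocol aspect of Jupyter mentioned above can be sketched with the standard library alone. The snippet builds the JSON parts of an `execute_request` message following the general layout of the Jupyter messaging specification (header, parent header, metadata, content). It is a simplified illustration: the real wire format is a multipart ZeroMQ message with an HMAC signature, which is omitted here, and the helper name, session id, username, and version string are assumptions.

```python
import json
import uuid
from datetime import datetime, timezone


def execute_request(code, session, username="demo"):
    """Build the dictionaries of an execute_request message (unsigned sketch)."""
    return {
        "header": {
            "msg_id": str(uuid.uuid4()),
            "session": session,
            "username": username,
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.0",  # protocol version; value is illustrative
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }


msg = execute_request("1 + 1", session="hypothetical-session-id")
print(json.dumps(msg, indent=2))
```

Since the message is plain JSON, any client that can reach the kernel can act as a front end, which is what makes the protocol usable for distributed REPL environments.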
![Page 13: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/13.jpg)
What to suggest for existing standards?
other lingering issues:
• data lineage / provenance
• metadata drift
• public dialog and law: https://public.resource.org/about/
![Page 14: Use of standards and related issues in predictive analytics](https://reader031.vdocument.in/reader031/viewer/2022030305/58735fc41a28abe7648b54c9/html5/thumbnails/14.jpg)
presenter:
Just Enough Math O’Reilly (2014) justenoughmath.com
monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/