knime big data workshop...© 2016 knime.com ag. all rights reserved. 41 knime big data architecture...
TRANSCRIPT
© 2016 KNIME.com AG. All Rights Reserved.
KNIME Big Data Workshop
Tobias Kötter and Björn Lohrmann
KNIME
© 2016 KNIME.com AG. All Rights Reserved. 2
Variety, Volume, Velocity
• Variety:– integrating heterogeneous data…– .. and tools
• Volume:– from small files...– ...to distributed data repositories (Hadoop)– bring the computation to the data
• Velocity:– from distributing computationally heavy
computations...– ...to real time scoring of millions of
records/sec.
2
© 2016 KNIME.com AG. All Rights Reserved. 3
Variety
© 2016 KNIME.com AG. All Rights Reserved. 4
The KNIME Analytics Platform: Open for Every Data, Tool, and User
KNIME Analytics Platform
Business Analyst
Externaland Legacy
Tools
NativeData Access, Analysis,
Visualization, and Reporting
ExternalData
Connectors
Distributed / Cloud Execution
Data Scientist
© 2016 KNIME.com AG. All Rights Reserved. 5
Data Integration
© 2016 KNIME.com AG. All Rights Reserved. 6
Integrating R and Python
© 2016 KNIME.com AG. All Rights Reserved. 7
Modular Integrations
© 2016 KNIME.com AG. All Rights Reserved. 8
Other Programming/Scripting Integrations
© 2016 KNIME.com AG. All Rights Reserved. 9
Volume
© 2016 KNIME.com AG. All Rights Reserved. 10
Database Extension
• Visually assemble complex SQL statements
• Connect to almost all JDBC-compliant databases
• Harness the power of your database within KNIME
© 2016 KNIME.com AG. All Rights Reserved. 11
In-Database Processing
• Operations are performed within the database
© 2016 KNIME.com AG. All Rights Reserved. 12
Tip
• SQL statements are logged in KNIME log file
© 2016 KNIME.com AG. All Rights Reserved. 13
Database Port Types
Database JDBC Connection Port (light red)• Connection information
Database Connection Port (dark red)• Connection information• SQL statement
Database Connection Ports can be connected to
Database JDBC Connection Ports but not vice versa
© 2016 KNIME.com AG. All Rights Reserved. 14
Database Connectors
• Nodes to connect to specific Databases – Bundling necessary JDBC drivers– Easy to use– DB specific behavior/capability
• Hive and Impala connector part of the commercial KNIME Big Data Connectors extension
• General Database Connector– Can connect to any JDBC source– Register new JDBC driver via
preferences page
© 2016 KNIME.com AG. All Rights Reserved. 15
Register JDBC Driver
Open KNIME and go toFile -> Preferences
Increase connection timeout forlong running database operations
Register new JDBC drivers
© 2016 KNIME.com AG. All Rights Reserved. 16
Query Nodes
• Filter rows and columns
• Join tables/queries
• Extract samples
• Bin numeric columns
• Sort your data
• Write your own query
• Aggregate your data
© 2016 KNIME.com AG. All Rights Reserved. 17
Database GroupBy – Manual Aggregation
Returns number of rows per group
© 2016 KNIME.com AG. All Rights Reserved. 18
Database GroupBy – Pattern Based Aggregation
Tick this option if the search pattern is a
regular expression otherwise it is treated as string with wildcards ('*' and '?')
© 2016 KNIME.com AG. All Rights Reserved. 19
Database GroupBy – Type Based Aggregation
Matches all columns
Matches all numericcolumns
© 2016 KNIME.com AG. All Rights Reserved. 20
Database GroupBy – DB Specific Aggregation Methods
PostgreSQL 25 aggregation functions
SQLite 7 aggregation functions
© 2016 KNIME.com AG. All Rights Reserved. 21
Database GroupBy – Custom Aggregation Function
© 2016 KNIME.com AG. All Rights Reserved. 22
Database Writing Nodes
• Create table as select
• Insert/append data
• Update values in table
• Delete rows from table
© 2016 KNIME.com AG. All Rights Reserved. 23
Performance Tip
– Increase batch size in database manipulation nodes
Increase batch size forbetter performance
© 2016 KNIME.com AG. All Rights Reserved. 24
KNIME Big Data Connectors
• Package required drivers/libraries for HDFS, Hive, Impala access
• Preconfigured connectors
– Hive
– Cloudera Impala
– Extends the open source database integration
© 2016 KNIME.com AG. All Rights Reserved. 25
Hive/Impala Loader
• Batch upload a KNIME data table to Hive/Impala
© 2016 KNIME.com AG. All Rights Reserved. 26
HDFS File Handling
• New nodes
– HDFS Connection
– HDFS File Permission
• Utilize the existing remote file handling nodes
– Upload/download files
– Create/list directories
– Delete files
© 2016 KNIME.com AG. All Rights Reserved. 27
HDFS File Handling
© 2016 KNIME.com AG. All Rights Reserved. 28
KNIME Spark Executor
• Based on Spark MLlib
• Scalable machine learning library
• Runs on Hadoop
• Algorithms for– Classification (decision tree, naïve bayes, …)
– Regression (logistic regression, linear regression, …)
– Clustering (k-means)
– Collaborative filtering (ALS)
– Dimensionality reduction (SVD, PCA)
© 2016 KNIME.com AG. All Rights Reserved. 29
Familiar Usage Model
• Usage model and dialogs similar to existing nodes
• No coding required
© 2016 KNIME.com AG. All Rights Reserved. 30
MLlib Integration
• MLlib model ports for model transfer
• Native MLlib model learning and prediction
• Spark nodes start and manage Spark jobs
– Including Spark job cancelation
Native MLlib model
© 2016 KNIME.com AG. All Rights Reserved. 31
Data Stays Within Your Cluster
• Spark RDDs as input/output format
• Data stays within your cluster
• No unnecessary data movements
• Several input/output nodes e.g. Hive, hdfs files, …
© 2016 KNIME.com AG. All Rights Reserved. 32
Machine Learning – Unsupervised Learning Example
© 2016 KNIME.com AG. All Rights Reserved. 33
Machine Learning – Supervised Learning Example
© 2016 KNIME.com AG. All Rights Reserved. 34
Mass Learning – Fast Event Prediction
• Convert supported MLlib models to PMML
© 2016 KNIME.com AG. All Rights Reserved. 35
Sophisticated Learning - Mass Prediction
• Supports KNIME models and pre-processing steps
© 2016 KNIME.com AG. All Rights Reserved. 36
Closing the Loop
Apply modelon demand
Sophisticated model learning
Apply modelat scale
Learn modelat scale
PMML model MLlib model
© 2016 KNIME.com AG. All Rights Reserved. 37
Mix and Match
• Combine with existing KNIME nodes such as loops
© 2016 KNIME.com AG. All Rights Reserved. 38
Modularize and Execute Your Own Spark Code
© 2016 KNIME.com AG. All Rights Reserved. 39
Lazy Evaluation in Spark
• Transformations are lazy
• Actions trigger evaluation
© 2016 KNIME.com AG. All Rights Reserved. 40
Spark Node Overview
© 2016 KNIME.com AG. All Rights Reserved. 41
KNIME Big Data Architecture
Hadoop Cluster
Submit Spark jobsvia HTTP(S)
*Software provided by KNIME, based on https://github.com/spark-jobserver/spark-jobserver
Spar
k Jo
b
Serv
er *
Workflow Upload via HTTP(S)
Hiv
eser
ver
2Im
pal
a
Submit Hive queriesvia JDBC
Submit Impala queriesvia JDBC
KNIME Server with extensions: • KNIME Big Data Connectors • KNIME Big Data Executor for Spark
Scheduled executionand RESTful workflow
submission
KNIME Analytics Platform with extensions:• KNIME Big Data Connectors • KNIME Big Data Executor for Spark
Build Sparkworkflows graphically
© 2016 KNIME.com AG. All Rights Reserved. 42
Velocity
© 2016 KNIME.com AG. All Rights Reserved. 43
Velocity
• Computationally Heavy Analytics:
– Distributed Execution of one workflow branch
– Parallel Execution of workflow branches
– Parallel Execution of workflows
© 2016 KNIME.com AG. All Rights Reserved. 44
KNIME Cluster Executor: Distributed Data
© 2016 KNIME.com AG. All Rights Reserved. 45
KNIME Cluster Execution: Distributed Analytics
© 2016 KNIME.com AG. All Rights Reserved. 46
KNIME Batch Executors on Hadoop (via Oozie)
Hadoop Master Node Hadoop Worker Node
HDFS
YARN
Oozie Server
KNIME Workflow Designer
YARN Container
Executes
Designs and uploadsworkflow
Submit workflowparameters via REST
REST Client
© 2016 KNIME.com AG. All Rights Reserved. 47
Hadoop Worker Node
YARN Container
Hadoop Worker Node
YARN Container
KNIME Executors on Hadoop (via YARN)
KNIME Server
KNIME WorkflowDesigner
KNIME Workflow Users
ExecutesExecute
workflow Executes
YARN
RESTClient
© 2016 KNIME.com AG. All Rights Reserved. 48
Velocity
• Computationally Heavy Analytics:
– Distributed Execution of one workflow branch
– Parallel Execution of workflow branches
– Parallel Execution of workflows
• High Demand Scoring/Prediction:
– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
© 2016 KNIME.com AG. All Rights Reserved. 49
High Performance Scoring via Workflows
• Record (or small batch) based processing
• Exposed as RESTful web service
© 2016 KNIME.com AG. All Rights Reserved. 50
High Performance Scoring using Models
• KNIME PMML Scoring via compiled PMML
• Deployed on KNIME Server
• Exposed as RESTful web service
• Partnership with Zementis
– ADAPA Real Time Scoring
– UPPI Big Data Scoring Engine
© 2016 KNIME.com AG. All Rights Reserved. 51
Velocity
• Computationally Heavy Analytics:– Distributed Execution of one workflow branch
– Parallel Execution of workflow branches
– Parallel Execution of workflows
• High Demand Scoring/Prediction:– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
• Continuous Data Streams– Streaming in KNIME
– Spark Streaming
© 2016 KNIME.com AG. All Rights Reserved. 52
Streaming in KNIME
© 2016 KNIME.com AG. All Rights Reserved. 53
Spark Streaming
© 2016 KNIME.com AG. All Rights Reserved. 54
Big Data, IoT, and the three V
• Variety:– KNIME inherently well-suited: open platform– broad data source/type support– extensive tool integration
• Volume:– Bring the computation to the data– Big Data Extensions cover ETL and model learning
• Velocity:– Distributed Execution of heavy workflows– High Performance Scoring of predictive models– Streaming execution
© 2016 KNIME.com AG. All Rights Reserved. 55
Demo
© 2016 KNIME.com AG. All Rights Reserved. 56
Virtual Machines
• Hortonworks:
– http://hortonworks.com/products/hortonworks-sandbox/
• Cloudera:
– http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms.html
• KNIME VM based on Hortonworks sandbox with preinstalled Spark Job Server is coming soon
© 2016 KNIME.com AG. All Rights Reserved. 57
Resources
– SQL Syntax and Examples (www.w3schools.com)
– Apache Spark MLlib (http://spark.apache.org/mllib/)
– The KNIME Website (www.knime.org)• Database Documentation (https://tech.knime.org/database-
documentation)
• Big Data Extensions (https://www.knime.org/knime-big-data-extensions)
• Forum (tech.knime.org/forum)
• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
• Blog for news, tips and tricks (www.knime.org/blog)
– KNIME TV channel on
– KNIME on @KNIME
© 2016 KNIME.com AG. All Rights Reserved. 58
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.