© 2017 KNIME.com AG. All Rights Reserved.
KNIME Big Data Workshop
Tobias Kötter and Björn Lohrmann
KNIME
Variety, Velocity, Volume
• Variety:
– Integrating heterogeneous data...
– ... and tools
• Velocity:
– Real time scoring of millions of records/sec
– Continuous data streams
– Distributed computation
• Volume:
– From small files...
– ...to distributed data repositories
– Moving computation to the data
Variety
The KNIME Analytics Platform: Open for Every Data, Tool, and User
[Figure: KNIME Analytics Platform, used by Business Analysts and Data Scientists. Components: Native Data Access, Analysis, Visualization, and Reporting; External Data Connectors; External and Legacy Tools; Distributed / Cloud Execution]
Data Integration
Integrating R and Python
Modular Integrations
Other Programming/Scripting Integrations
Velocity
Velocity
• High Demand Scoring/Prediction:
– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
• Continuous Data Streams
– Streaming in KNIME
• Distributed Computation
– KNIME Cluster Executor
High Performance Scoring via Workflows
• Record (or small batch) based processing
• Exposed as RESTful web service
High Performance Scoring using Models
• KNIME PMML Scoring via compiled PMML
• Deployed on KNIME Server
• Exposed as RESTful web service
• Partnership with Zementis
– ADAPA Real Time Scoring
– UPPI Big Data Scoring Engine
Velocity
• High Demand Scoring/Prediction:
– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
• Continuous Data Streams
– Streaming in KNIME
• Distributed Computation
– KNIME Cluster Executor
Streaming in KNIME
Velocity
• High Demand Scoring/Prediction:
– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
• Continuous Data Streams
– Streaming in KNIME
• Distributed Computation
– KNIME Cluster Executor
KNIME Cluster Executor: Distributed Data
KNIME Cluster Execution: Distributed Analytics
Volume
Moving computation to the data
Volume
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor
Database Extension
• Visually assemble complex SQL statements
• Connect to almost all JDBC-compliant databases
• Harness the power of your database within KNIME
In-Database Processing
• Operations are performed within the database
Tip
• SQL statements are logged in KNIME log file
Database Port Types
Database JDBC Connection Port (light red)
• Connection information
Database Connection Port (dark red)
• Connection information
• SQL statement
Database Connection Ports can be connected to Database JDBC Connection Ports but not vice versa
Database Connectors
• Nodes to connect to specific databases
– Bundling necessary JDBC drivers
– Easy to use
– DB specific behavior/capability
• Hive and Impala connector part of the commercial KNIME Big Data Connectors extension
• General Database Connector
– Can connect to any JDBC source
– Register new JDBC driver via preferences page
Register JDBC Driver
Open KNIME and go to File -> Preferences
Increase connection timeout for long running database operations
Register single jar file JDBC drivers
Register new JDBC driver with companion files
Query Nodes
• Filter rows and columns
• Join tables/queries
• Extract samples
• Bin numeric columns
• Sort your data
• Write your own query
• Aggregate your data
Database GroupBy – Manual Aggregation
Returns number of rows per group
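A count aggregation of this kind boils down to a `GROUP BY` statement executed inside the database. The following is a minimal sketch using Python's built-in `sqlite3`; the `sales` table, its columns, and the helper `rows_per_group` are hypothetical, and the SQL that the node actually generates may look different.

```python
import sqlite3

def rows_per_group(conn, table, group_col):
    """Count rows per distinct value of group_col (illustrative helper)."""
    # Identifiers are interpolated directly: acceptable in a sketch,
    # not for untrusted input.
    query = (
        f'SELECT "{group_col}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{group_col}" ORDER BY "{group_col}"'
    )
    return conn.execute(query).fetchall()

# Hypothetical example data in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 7.5)],
)

counts = rows_per_group(conn, "sales", "region")
print(counts)
```

Only the aggregated result (one row per group) leaves the database, which is the point of in-database processing.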
Database GroupBy – Pattern Based Aggregation
Tick this option if the search pattern is a regular expression; otherwise it is treated as a string with wildcards ('*' and '?')
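The difference between the two pattern modes can be sketched in Python. This is an illustration only: the column names and the helper `match_columns` are hypothetical, and it assumes that a pattern must match the whole column name, which may not be how the node behaves.

```python
import fnmatch
import re

def match_columns(columns, pattern, is_regex=False):
    """Return the column names selected by the pattern."""
    if is_regex:
        compiled = re.compile(pattern)
    else:
        # Wildcard mode: '*' matches any run of characters, '?' exactly one.
        # fnmatch.translate converts such a pattern into a regular expression.
        compiled = re.compile(fnmatch.translate(pattern))
    return [name for name in columns if compiled.fullmatch(name)]

columns = ["amount_2016", "amount_2017", "region", "amt_7"]

by_star = match_columns(columns, "amount_*")
by_question_mark = match_columns(columns, "amt_?")
by_regex = match_columns(columns, r"amount_\d{4}", is_regex=True)
```

Here the wildcard pattern `amount_*` and the regular expression `amount_\d{4}` select the same two columns, but the regular expression would reject a name such as `amount_total`.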
Database GroupBy – Type Based Aggregation
Matches all columns
Matches all numeric columns
Database GroupBy – DB Specific Aggregation Methods
PostgreSQL 25 aggregation functions
SQLite 7 aggregation functions
Database GroupBy – Custom Aggregation Function
Database Writing Nodes
• Create table as select
• Insert/append data
• Update values in table
• Delete rows from table
Performance Tip
– Increase batch size in database manipulation nodes
Increase batch size for better performance
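Why batch size matters can be sketched with Python's built-in `sqlite3`: each batch corresponds to one statement execution, so larger batches mean fewer calls for the same rows. The `events` table and the helper `insert_in_batches` are hypothetical. An in-memory SQLite database has no network latency, so this shows only the number of batches, not a real speed-up.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size):
    """Insert rows in chunks of batch_size; return the number of batches sent."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One executemany call and one commit per batch
        conn.executemany("INSERT INTO events (id, value) VALUES (?, ?)", chunk)
        conn.commit()
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value TEXT)")
rows = [(i, f"v{i}") for i in range(1000)]

small_batches = insert_in_batches(conn, rows, batch_size=10)
large_batches = insert_in_batches(conn, rows, batch_size=500)
total_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With a remote database, every batch typically costs at least one network round trip, which is where the larger batch size pays off.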
Volume
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor
Apache Hadoop
• Open-source framework for distributed storage and processing of large data sets
• Designed to scale up to thousands of machines
• Does not rely on hardware to provide high availability
– Handles failures at application layer instead
• First release in 2006
– Rapid adoption, promoted to top level Apache project in 2008
– Inspired by Google File System (2003) paper
• Spawned diverse ecosystem of products
Hadoop Ecosystem
• Storage: HDFS
• Resource Management: YARN
• Processing: Spark, Tez, MapReduce
• Access: HIVE
HDFS
• Hadoop distributed file system
• Stores large files across multiple machines
[Figure: a large file is split into blocks (default: 64 MB in Hadoop 1.x, 128 MB since Hadoop 2.x) that are stored on DataNodes]
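The block arithmetic can be sketched as follows, using the 64 MB block size quoted on the slide (newer Hadoop versions default to 128 MB). The helper `split_into_blocks` is illustrative and not part of any Hadoop API.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the sizes in bytes of the blocks a file of file_size is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the remaining bytes
        sizes.append(remainder)
    return sizes

blocks_200mb = split_into_blocks(200 * MB, 64 * MB)
blocks_1gb_at_64 = len(split_into_blocks(1024 * MB, 64 * MB))
blocks_1gb_at_128 = len(split_into_blocks(1024 * MB, 128 * MB))
```

A 200 MB file therefore becomes three full 64 MB blocks plus one 8 MB block, and doubling the block size halves the number of blocks for a 1 GB file.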
HDFS – NameNode and DataNode
• NameNode
– Master server that manages file system namespace
  • Maintains metadata for all files and directories in filesystem tree
  • Knows on which datanode blocks of a given file are located
– Whole system depends on availability of NameNode
• DataNodes
– Workers, store and retrieve blocks per request of client or namenode
– Periodically report to namenode that they are running and which blocks they are storing
Reading Data from HDFS
[Figure: on the client node, the HDFS client works with DistributedFileSystem and FSDataInputStream, which contact the NameNode and three DataNodes]
1: open
2: get block locations (from NameNode)
3: read
4: read (from DataNode)
5: read (from DataNode)
6: close
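The read sequence can be mimicked with a toy in-memory model. This is a sketch of the idea only: the classes `NameNode` and `DataNode`, the function `read_file`, and the file path are hypothetical and unrelated to the real Hadoop client API.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""
    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    # Steps 1-2: open the file and ask the NameNode for block locations
    locations = namenode.get_block_locations(path)
    data = b""
    # Steps 3-5: read the blocks in order, each directly from a DataNode
    for block_id, holders in locations:
        data += datanodes[holders[0]].read_block(block_id)
    # Step 6: close (nothing to release in this toy model)
    return data

datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
datanodes["dn1"].blocks["b1"] = b"hello "
datanodes["dn2"].blocks["b1"] = b"hello "
datanodes["dn3"].blocks["b2"] = b"world"

namenode = NameNode()
namenode.block_map["/data/a.txt"] = [("b1", ["dn1", "dn2"]), ("b2", ["dn3"])]

content = read_file("/data/a.txt", namenode, datanodes)
```

Note that file contents never pass through the NameNode in this model; it only answers the metadata question of where the blocks are.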
HDFS – Data replication and file size
Data Replication
• Each file is stored as a sequence of blocks
• Blocks of a file are replicated for fault tolerance (usually 3 replicas)
– Aims: improve data reliability, availability, and network bandwidth utilization
[Figure: File 1 consists of blocks B1, B2, and B3; each block is stored three times across nodes n1 to n4 in racks 1, 2, and 3; the NameNode is shown alongside the racks]
HDFS – Access and File Size
• Several ways to access HDFS data
– FileSystem (FS) shell commands
  • Direct RPC connection
  • Requires Hadoop client to be installed
– WebHDFS
  • Provides REST API functionality, lets external applications connect via HTTP
  • Direct transmission of data from node to client
  • Needs access to all nodes in cluster
– HttpFS
  • All data is transmitted to client via one single node -> gateway
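A WebHDFS request is an ordinary HTTP URL of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and does not send them. The host name, the user path, and the helper `webhdfs_url` are hypothetical, and port 50070 is assumed as the NameNode HTTP port of a Hadoop 2.x cluster, which may differ on a given installation.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **(params or {})})
    # quote() keeps '/' intact and escapes other special characters in the path
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

open_url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.csv", "OPEN")
list_url = webhdfs_url("namenode.example.com", 50070, "/user/alice", "LISTSTATUS")
print(open_url)
```

With WebHDFS, an `OPEN` request sent to the NameNode is redirected to a DataNode that holds the data, which is why the client needs access to all nodes. HttpFS exposes the same REST interface through a single gateway host instead.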
File Size
• Hadoop is designed to handle fewer large files instead of lots of small files
• Small file: File significantly smaller than Hadoop block size
• Problems:
– Namenode memory
– MapReduce performance
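The NameNode memory problem can be made concrete with a back-of-the-envelope estimate. The sketch assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); that figure is an assumption for illustration, not a measured value, and the helper `namenode_bytes` is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value

def namenode_bytes(total_mb, file_mb, block_mb=64):
    """Estimate NameNode memory for total_mb of data stored in files of file_mb each."""
    num_files = math.ceil(total_mb / file_mb)
    blocks_per_file = math.ceil(file_mb / block_mb)
    # One metadata object per file plus one per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same 10 GB of data, stored as many 1 MB files or as a few 1 GB files
small_files = namenode_bytes(total_mb=10240, file_mb=1)
large_files = namenode_bytes(total_mb=10240, file_mb=1024)
```

Under these assumptions the small-file layout needs about 3 MB of NameNode memory against about 25 KB for the large-file layout, for identical data volume.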
YARN
• Cluster resource management system
• Two elements
– Resource manager (one per cluster):
  • Knows where workers are located and how many resources they have
  • Scheduler: Decides how to allocate resources to applications
– Node manager (many per cluster):
  • Launches application containers
  • Monitors resource usage and reports to Resource Manager
YARN
[Figure: YARN architecture with Clients, the Resource Manager, and Node Managers hosting Application Masters and Containers. Arrow labels: Job Submission, Node Status, Resource Request, MapReduce Status]
Hive
• Infrastructure on top of Hadoop
• Provides data summarization, query, and analysis
• SQL-like language (HiveQL)
• Converts queries to MapReduce, Apache Tez, and Spark jobs
• Supports various file formats:
– Text/CSV
– SequenceFile
– Avro
– ORC
– Parquet
Spark
• Cluster computing framework for large-scale data processing
• Keeps large working datasets in memory between jobs
– No need to always load data from disk -> can be much faster than MapReduce
• Great for:
– Iterative algorithms
– Interactive analysis
Spark – Basic Concepts
• SparkContext
– Main entry point for Spark functionality
– Represents connection to a Spark cluster
– Create RDDs, accumulators, and broadcast variables on cluster
• RDD: Resilient Distributed Dataset
– Read-only multiset of data items distributed over cluster of machines
– Fault-tolerant: Lost partition automatically reconstructed from RDDs it was computed from
– Lazy evaluation: Computation only happens when action is required
Spark – DataFrame and Dataset
• DataFrame
– Distributed collection of data organized in named columns
– Similar to table in relational database
– Can be constructed from many sources: structured data files, Hive table, RDDs...
• Dataset
– Extension of DataFrame API
– Strongly-typed, immutable collection of objects mapped to a relational schema
– Catches syntax and analysis errors at compile time
Volume
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor
KNIME Big Data Connectors
• Package required drivers/libraries for HDFS, Hive, Impala access
• Preconfigured connectors
– Hive
– Cloudera Impala
• Extends the open source database integration
Hive/Impala Loader
• Batch upload a KNIME data table to Hive/Impala
HDFS File Handling
• New nodes
– HDFS Connection
– HDFS File Permission
• Utilize the existing remote file handling nodes
– Upload/download files
– Create/list directories
– Delete files
HDFS File Handling
Volume
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor
KNIME Spark Executor
• Based on Spark MLlib
• Scalable machine learning library
• Runs on Hadoop
• Algorithms for
– Classification (decision tree, naïve Bayes, …)
– Regression (logistic regression, linear regression, …)
– Clustering (k-means)
– Collaborative filtering (ALS)
– Dimensionality reduction (SVD, PCA)
Familiar Usage Model
• Usage model and dialogs similar to existing nodes
• No coding required
MLlib Integration
• MLlib model ports for model transfer
• Native MLlib model learning and prediction
• Spark nodes start and manage Spark jobs
– Including Spark job cancelation
Native MLlib model
Data Stays Within Your Cluster
• Spark RDDs as input/output format
• Data stays within your cluster
• No unnecessary data movements
• Several input/output nodes, e.g. Hive, HDFS files, …
Machine Learning – Unsupervised Learning Example
Machine Learning – Supervised Learning Example
Mass Learning – Fast Event Prediction
• Convert supported MLlib models to PMML
Sophisticated Learning - Mass Prediction
• Supports KNIME models and pre-processing steps
Closing the Loop
[Figure labels: Apply model on demand; Sophisticated model learning; Apply model at scale; Learn model at scale; PMML model; MLlib model]
![Page 64: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/64.jpg)
Mix and Match
• Combine with existing KNIME nodes such as loops
![Page 65: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/65.jpg)
Modularize and Execute Your Own Spark Code
![Page 66: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/66.jpg)
Lazy Evaluation in Spark
• Transformations are lazy
• Actions trigger evaluation
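The two bullets above can be illustrated with a small plain-Python analogy. This is only a sketch and not Spark itself: `LazyDataset`, `double`, and `calls` are hypothetical names introduced here. As with an RDD, `map` and `filter` only record what should happen, and `collect` is the action that runs the recorded pipeline.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a Spark RDD (assumption: illustration only, not the Spark API)."""

    def __init__(self, source: Iterable[int], steps: Optional[List[Callable]] = None):
        self._source = source
        self._steps = steps or []

    # Transformations: return a new dataset with one more recorded step; no data is touched.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [lambda it: (x for x in it if pred(x))])

    # Action: runs every recorded step and returns a concrete result.
    def collect(self) -> list:
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []  # records each time the mapped function really runs


def double(x: int) -> int:
    calls.append(x)
    return 2 * x


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has been evaluated at this point
result = ds.collect()             # the action triggers evaluation of the whole chain
```

In this sketch `double` is not called until `collect()` runs, which mirrors how Spark defers work until an action requires a result.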
![Page 67: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/67.jpg)
Spark Node Overview
![Page 68: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/68.jpg)
KNIME Big Data Architecture
Hadoop Cluster
• Spark Job Server*: Submit Spark jobs via HTTP(S)
• HiveServer2: Submit Hive queries via JDBC
• Impala: Submit Impala queries via JDBC
*Software provided by KNIME, based on https://github.com/spark-jobserver/spark-jobserver
KNIME Analytics Platform with extensions:
• KNIME Big Data Connectors
• KNIME Big Data Executor for Spark
Build Spark workflows graphically
Workflow upload via HTTP(S)
KNIME Server with extensions:
• KNIME Big Data Connectors
• KNIME Big Data Executor for Spark
Scheduled execution and RESTful workflow submission
![Page 69: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/69.jpg)
Executing KNIME Nodes on Spark
![Page 70: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/70.jpg)
Behind the Scenes
Execute KNIME workflow on Spark
[Diagram: KNIME Analytics Platform / KNIME Server holds the KNIME Workflow. Two Cluster Worker Nodes each run a Spark Executor JVM hosting a Workflow Replica (KNIME Workflow, OSGi). The Input RDD and the Output RDD each consist of RDD Partitions spread across the worker nodes.]
![Page 71: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/71.jpg)
Behind the Scenes
Execute KNIME workflow on Spark
• Variation (1): Send RDD data through a single workflow replica
[Diagram: KNIME Analytics Platform / KNIME Server holds the KNIME Workflow. Two Cluster Worker Nodes each run a Spark Executor JVM, but only one hosts a Workflow Replica (KNIME Workflow, OSGi). The RDD Partitions of the Input RDD all feed that one replica, which produces the Output RDD.]
![Page 72: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/72.jpg)
Behind the Scenes
Execute KNIME workflow on Spark
• Variation (2): Send pre-grouped RDD data through workflow replicas
[Diagram: KNIME Analytics Platform / KNIME Server holds the KNIME Workflow. Two Cluster Worker Nodes each run a Spark Executor JVM hosting a Workflow Replica (KNIME Workflow, OSGi). The RDD Partitions of the Input RDD are regrouped before they pass through the replicas, which produce the RDD Partitions of the Output RDD.]
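The execution modes on the slides above can be sketched in plain Python. This is an assumption-laden illustration, not KNIME's implementation: partitions are modelled as lists of rows, `workflow` is a hypothetical stand-in for a KNIME workflow replica, and the function names and the `sensor` grouping key are invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, int]
Partition = List[Row]


def workflow(rows: Partition) -> Partition:
    """Hypothetical workflow replica: emits one summary row per input table."""
    return [{"rows": len(rows), "total": sum(r["value"] for r in rows)}]


def run_per_partition(rdd: List[Partition], wf: Callable) -> List[Partition]:
    # Assumed default mode: each partition passes through its own replica.
    return [wf(p) for p in rdd]


def run_single_replica(rdd: List[Partition], wf: Callable) -> List[Partition]:
    # Variation (1): all partitions are funnelled through one replica.
    merged = [row for p in rdd for row in p]
    return [wf(merged)]


def run_pre_grouped(rdd: List[Partition], wf: Callable, key: str) -> List[Partition]:
    # Variation (2): rows are grouped by key first, then each group passes through a replica.
    groups = defaultdict(list)
    for p in rdd:
        for row in p:
            groups[row[key]].append(row)
    return [wf(groups[k]) for k in sorted(groups)]


input_rdd = [
    [{"sensor": 1, "value": 10}, {"sensor": 2, "value": 20}],
    [{"sensor": 1, "value": 30}, {"sensor": 2, "value": 40}, {"sensor": 2, "value": 5}],
]

per_partition = run_per_partition(input_rdd, workflow)
single = run_single_replica(input_rdd, workflow)
grouped = run_pre_grouped(input_rdd, workflow, "sensor")
```

The sketch shows why the choice matters: a workflow that aggregates rows gives per-partition totals in the first mode, one global total in variation (1), and one total per group in variation (2).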
![Page 73: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/73.jpg)
Big Data, IoT, and the Three Vs
• Variety:
  – KNIME inherently well-suited: open platform
  – Broad data source/type support
  – Extensive tool integration
• Velocity:
  – High-performance scoring of predictive models
  – Streaming execution
• Volume:
  – Bring the computation to the data
  – Big Data Extensions cover ETL and model learning
  – Distributed execution of KNIME workflows
![Page 74: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/74.jpg)
Demo
![Page 75: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/75.jpg)
Want to try it at home?
• Hadoop cluster
  – Use your own Hadoop cluster
  – Use a preconfigured virtual machine
    • http://hortonworks.com/products/hortonworks-sandbox/
    • http://www.cloudera.com/downloads/quickstart_vms.html
• Download and install a compatible Spark Job Server
  – See installation steps at https://www.knime.org/knime-spark-executor#install
• For a free 30-day trial go to https://www.knime.org/knime-big-data-extensions-free-30-day-trial
![Page 76: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/76.jpg)
Resources
– SQL Syntax and Examples (www.w3schools.com)
– Apache Spark MLlib (http://spark.apache.org/mllib/)
– The KNIME Website (www.knime.org)
  • Database Documentation (https://tech.knime.org/database-documentation)
  • Big Data Extensions (https://www.knime.org/knime-big-data-extensions)
  • Forum (tech.knime.org/forum)
  • LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
  • Blog for news, tips and tricks (www.knime.org/blog)
– KNIME TV channel on YouTube
– KNIME on Twitter: @KNIME
![Page 77: KNIME Big Data WorkshopApache Hadoop •Open-source framework for distributed storage and processing of large data sets •Designed to scale up to thousands of machines •Does not](https://reader030.vdocument.in/reader030/viewer/2022041017/5ec996d7e42f933a7879efab/html5/thumbnails/77.jpg)
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.