Kettle – ETL Tool
DESCRIPTION
Pentaho Kettle ETL tool demonstration and gist of the ETL process
TRANSCRIPT
MaxQDPro Team: Anjan.K, Harish.R
II Sem M.Tech CSE
04/10/23 MaxQDPro: Kettle- ETL Tool 1
A Pentaho Data Integration tool
Introduction
◦ ETL Process
◦ Pentaho’s Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
JDBC
Spoon
◦ Transformations
◦ Jobs
4 major components:
◦ Extracting
   Gathering raw data from source systems and storing it in the ETL staging environment
   Data profiling
   Identifying data that changed since the last load
◦ Transforming: Cleaning and Conforming
   Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
   Data cleansing
   Recording error events
   Audit dimensions
   Creating and maintaining conformed dimensions and facts
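The extraction step "identifying data that changed since last load" can be sketched in Python as below. This is only an illustration under stated assumptions: the `customers` and `stg_customers` tables, the `updated_at` column and the timestamp watermark are hypothetical and not taken from the slides, and SQLite stands in for a real source system.

```python
import sqlite3

def build_demo_source():
    """Create a small in-memory source table (hypothetical sample data)."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
    source.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "Asha", "2008-01-10 09:00:00"),
            (2, "Ravi", "2008-03-02 14:30:00"),
            (3, "Meena", "2008-03-05 08:15:00"),
        ],
    )
    return source

def extract_changed_rows(source, last_load_time):
    """Return only the rows modified after the previous load (the watermark)."""
    cursor = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_load_time,),
    )
    return cursor.fetchall()

def stage_rows(staging, rows):
    """Store the extracted rows unchanged in the staging table."""
    staging.executemany(
        "INSERT INTO stg_customers (id, name, updated_at) VALUES (?, ?, ?)", rows
    )
    staging.commit()
```

Comparing ISO-formatted timestamp strings keeps the sketch short; a real source would more likely use a native timestamp column or a change log.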
Data filtering
◦ Is not null, greater than, less than, includes
Field manipulation
◦ Trimming, padding, upper and lowercase conversion
Data calculations
◦ + - X / , average, absolute value, arctangent, natural logarithm
Date manipulation
◦ First day of month, last day of month, add months, week of year, day of year
Data type conversion
◦ String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
◦ Look up in a database, in a text file, an Excel sheet, …
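A few of the transformation operations listed above, written as a small Python sketch. In Kettle these are configured graphically as steps. The code only illustrates what such steps compute, and the row layout, the `COUNTRY_LOOKUP` table and all function names are assumptions made for the example.

```python
import calendar
from datetime import date

def first_day_of_month(d):
    return d.replace(day=1)

def last_day_of_month(d):
    # monthrange returns (weekday of first day, number of days in month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def add_months(d, months):
    # Shift by whole months and clamp the day to the target month's length
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

# Hypothetical in-memory lookup table (could equally be a database or a text file)
COUNTRY_LOOKUP = {"IN": "India", "BE": "Belgium"}

def transform(row):
    """Filter, clean and enrich one row; return None to drop the row."""
    if row["amount"] is None:                  # data filtering: "is not null"
        return None
    return {
        "name": row["name"].strip().upper(),   # trimming, uppercase conversion
        "amount": abs(float(row["amount"])),   # string to number, absolute value
        "country": COUNTRY_LOOKUP.get(row["country"].strip(), "Unknown"),  # lookup
        "customer_id": str(row["id"]).rjust(6, "0"),  # number to string, padding
    }
```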
◦ Loading
   Loading data into data warehouse tables
   Managing hierarchies in dimensions
   Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
   Fact table loading
   Building and maintaining bridge dimension tables
   Handling late arriving data
   Management of conformed dimensions
   Administration of fact tables
   Building aggregations
   Building OLAP cubes
   Transferring DW data to other environments for specific purposes
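A minimal sketch of the loading phase, assuming a hypothetical star schema with one dimension (`dim_customer`) and one fact table (`fact_sales`) in SQLite. It shows a surrogate-key lookup during fact loading and a simple aggregation. None of the table or column names come from the slides.

```python
import sqlite3

def create_warehouse():
    """Create an in-memory warehouse with one dimension and one fact table."""
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE dim_customer ("
        "customer_key INTEGER PRIMARY KEY AUTOINCREMENT, "
        "customer_id TEXT UNIQUE, name TEXT)"
    )
    dw.execute(
        "CREATE TABLE fact_sales (customer_key INTEGER, sale_date TEXT, amount REAL)"
    )
    return dw

def get_or_create_customer_key(dw, customer_id, name):
    """Look up the surrogate key; insert a new dimension row if it is missing."""
    row = dw.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if row:
        return row[0]
    cursor = dw.execute(
        "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)",
        (customer_id, name),
    )
    return cursor.lastrowid

def load_sales(dw, staged_rows):
    """Load fact rows, replacing the business key with the surrogate key."""
    for r in staged_rows:
        key = get_or_create_customer_key(dw, r["customer_id"], r["name"])
        dw.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?)", (key, r["sale_date"], r["amount"])
        )
    dw.commit()

def sales_by_customer(dw):
    """Build a simple aggregation over the fact table."""
    return dw.execute(
        "SELECT d.name, SUM(f.amount) FROM fact_sales f "
        "JOIN dim_customer d ON d.customer_key = f.customer_key "
        "GROUP BY d.name ORDER BY d.name"
    ).fetchall()
```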
Complexity and significant operational problems
Data that exceeds the designers' expectations
Data profiling of a source
Data warehouses typically grow asynchronously
Establishing the scalability of an ETL system across its lifetime
Many off-the-shelf tools exist
High-end tools may not justify their value for smaller warehouses
Proprietary ETL
◦ High upfront cost
◦ Long-term maintenance
Custom code
◦ Low upfront cost
◦ Support effort grows as business requirements change
Tool                                    Vendor
Oracle Warehouse Builder (OWB)          Oracle
Data Integrator (BODI)                  Business Objects
IBM Information Server (Ascential)      IBM
SAS Data Integration Studio             SAS Institute
PowerCenter                             Informatica
Oracle Data Integrator (Sunopsis)       Oracle
Data Migrator                           Information Builders
Integration Services                    Microsoft
Talend Open Studio                      Talend
DataFlow                                Group 1 Software (Sagent)
Data Integrator                         Pervasive
Transformation Server                   DataMirror
Transformation Manager                  ETL Solutions Ltd.
Data Manager                            Cognos
DT/Studio                               Embarcadero Technologies
ETL4ALL                                 IKAN
DB2 Warehouse Edition                   IBM
Jitterbit                               Jitterbit
Pentaho Data Integration                Pentaho
Kettle – Kettle Extraction, Transformation, Transportation & Loading tool
Kettle is the data integration tool in Pentaho's open source business intelligence suite. Pentaho was founded in 2004.
Products of Pentaho
◦ Mondrian – OLAP server written in Java
◦ Kettle – ETL tool
◦ Weka – Machine learning and data mining tool
Data is everywhere
Data is inconsistent
◦ Records are different in each system
Performance issues
◦ Running queries that summarize data over a long period ties up the operational system (OS)
◦ Brings the OS to maximum load
Data is never all in the data warehouse
◦ Excel sheets, acquisitions, new applications
Metadata, model-driven approach
◦ What to do? And how to do it?
◦ Complex transformations with zero code
◦ Graphically design data transformations and jobs
100% Java with cross-platform support
Extensible architecture
Repository-based
Full-featured ETL
Integration with the Pentaho Open BI Platform
Prerequisites
◦ Java Runtime Environment 1.5 and above
◦ Compatible with almost any platform
◦ Compatible with a wide range of database technologies
Recent Releases
◦ 4/25 Data Integration 3.0.3 GA
◦ 4/18 Data Integration 3.1 Milestone
◦ 2/8 Data Integration 3.0.2 GA
◦ 12/12 Data Integration 3.0.1 GA
◦ 11/15 Data Integration 3.0 GA
◦ 10/31 Data Integration 3.0 RC2
◦ 10/24 Data Integration 2.5.2 GA
◦ 10/08 Data Integration 3.0 RC1
◦ 08/24 Data Integration 2.5.1 GA
Pan
◦ A program to execute transformations designed in Spoon and stored in XML or in a database repository.
◦ Transformations are scheduled in batch mode to be run automatically at regular intervals.
Carte
◦ Simple web server to execute transformations and jobs remotely.
◦ Accepts XML (through a small servlet) that contains the transformation to execute and the execution configuration.
◦ Allows you to remotely monitor, start and stop transformations and jobs.
◦ A server running Carte is called a slave server.
Spoon
◦ GUI that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen.
◦ Transformations and jobs can be described in an XML file or stored in a Kettle database repository.
◦ Spoon is available as a shell script and a batch file so the tool can be used in heterogeneous environments.
◦ Latest version of Spoon is the 3.2 beta.
Kitchen
◦ Executes jobs designed in Spoon and stored in XML or in a database repository.
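Since Pan and Kitchen are command-line programs, a scheduler can launch them with a transformation (.ktr) or job (.kjb) file. The Python helper below is a hedged sketch that only builds the argument list. It assumes the Unix launcher scripts `pan.sh` and `kitchen.sh` are on the PATH and uses the `-file=` and `-level=` options. The helper name `build_command` and the file paths are hypothetical.

```python
def build_command(tool, path, log_level="Basic"):
    """Build the argument list to run a transformation (pan) or a job (kitchen).

    Assumption: the Unix launcher scripts accept -file=<path> and -level=<level>.
    """
    scripts = {"pan": "pan.sh", "kitchen": "kitchen.sh"}
    if tool not in scripts:
        raise ValueError(f"unknown tool: {tool!r} (expected 'pan' or 'kitchen')")
    return [scripts[tool], f"-file={path}", f"-level={log_level}"]
```

The resulting list could be passed to `subprocess.run(..., check=True)` from a cron-driven script, which corresponds to the batch-mode scheduling mentioned for Pan.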
Installing
◦ Ensure JRE 1.5 is installed.
◦ Unzip the binary distribution in any folder.
Launching
◦ spoon.bat on the Windows platform
◦ spoon.sh on Unix-like platforms
◦ Create a shortcut with spoon.ico pointing to the .bat file
Works on most operating systems
Supported platforms
◦ Microsoft Windows, including Vista
◦ Linux GTK: on i386 and x86_64 processors
◦ Apple's OSX: works on both PowerPC and Intel machines
◦ Solaris: using a Motif interface
◦ AIX, HP-UX, FreeBSD
JDBC: Java Database Connectivity, the Java API for database access.
Latest: JDBC 3.0
Comes in four different driver types
◦ Type 1: JDBC-ODBC bridge
◦ Type 2: Native-API, partly Java driver
◦ Type 3: Middleware Java drivers
◦ Type 4: Direct-to-database Java drivers
Microsoft-based databases like MS Access rely on Type 1 drivers.
Oracle and MySQL can be connected with the other types, but the Type 4 driver is traditionally used.
JDBC can also operate in a distributed environment.
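With a Type 4 driver, a connection is identified by a driver class and a JDBC URL. The Python sketch below only assembles such URLs for a few common databases. The strings are what a Java program would hand to `DriverManager.getConnection`. The dictionary, the function name `jdbc_url` and the host and database values are illustrative assumptions.

```python
# Driver class and URL template per database (Type 4, pure Java drivers)
JDBC_TYPE4_DRIVERS = {
    "mysql": ("com.mysql.jdbc.Driver", "jdbc:mysql://{host}:{port}/{database}"),
    "oracle": (
        "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@{host}:{port}:{database}",  # database = Oracle SID
    ),
    "postgresql": ("org.postgresql.Driver", "jdbc:postgresql://{host}:{port}/{database}"),
}

def jdbc_url(db_type, host, port, database):
    """Return the JDBC URL for a supported database type."""
    try:
        _driver_class, template = JDBC_TYPE4_DRIVERS[db_type]
    except KeyError:
        raise ValueError(f"no Type 4 driver registered for {db_type!r}") from None
    return template.format(host=host, port=port, database=database)
```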
Key Improvements
◦ Execution Results pane for logs, metrics and performance graphs
◦ Improved Database Connection dialog
◦ Snap to grid (graphical workspace)
◦ Zoom (graphical workspace)
◦ Easier-to-use left panel for the objects palette
◦ Over 30 new or improved transformation steps
◦ 13 new or improved job entries
◦ Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
◦ Improved translations
Repository connection establishment
Auto login
◦ By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.
Login
◦ By default PDI provides admin as both the login username and password.
◦ It is strictly advised to change the default password to avoid any security vulnerability.
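The auto-login variables can be prepared by a launcher script. The Python sketch below builds an environment containing the three variables named above. The function name and the repository, user and password values are placeholders, not values from the slides.

```python
import os

def repository_environment(repository, user, password, base_env=None):
    """Return a copy of the environment with the Kettle auto-login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "KETTLE_REPOSITORY": repository,
            "KETTLE_USER": user,
            "KETTLE_PASSWORD": password,
        }
    )
    return env
```

The returned dictionary could be passed as the `env` argument of `subprocess.Popen` when starting spoon.sh or spoon.bat. Keeping the password in an environment variable is convenient but exposes it to other processes of the same user, which is one more reason to change the default credentials.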
Transformation
◦ Value: values are part of a row and can contain any type of data.
◦ Row: a row consists of 0 or more values.
◦ Output stream: an output stream is a stack of rows that leaves a step.
◦ Input stream: an input stream is a stack of rows that enters a step.
◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
◦ Note: a note is a piece of information that can be added to a transformation.
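The vocabulary above (values, rows, streams, steps, hops) can be illustrated with Python generators: each function plays the role of a step, and passing one generator into the next plays the role of a hop. This is a conceptual sketch with made-up step names and data, not how Kettle is implemented. Kettle runs steps as parallel threads, whereas this chain is evaluated lazily in a single thread.

```python
def input_step():
    """Input step: emits rows (each row is a list of values) on its output stream."""
    yield ["anjan", "10"]
    yield ["harish", "25"]
    yield ["", "7"]          # row with an empty name, filtered out downstream

def filter_step(rows):
    """Reads its input stream and passes on only rows with a non-empty name."""
    for row in rows:
        if row[0]:
            yield row

def calc_step(rows):
    """Capitalizes the name and doubles the amount (string to number first)."""
    for name, amount in rows:
        yield [name.title(), int(amount) * 2]

def run_transformation():
    stream = input_step()            # output stream of the input step
    stream = filter_step(stream)     # hop: input_step -> filter_step
    stream = calc_step(stream)       # hop: filter_step -> calc_step
    return list(stream)
```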
Engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
Jobs
◦ Job entry: a job entry is one part of a job and performs a certain task.
◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
◦ Note: a note is a piece of information that can be added to a job.
A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps
Table Output Step
Insert / Update Output Step
Besides the execution order, a job hop specifies the condition under which the next job entry is executed:
· “Unconditional” - the next job entry is executed regardless of the result of the originating job entry.
· “Follow when result is true” - the next job entry is executed only when the result of the originating job entry is true.
· “Follow when result is false” - the next job entry is executed only when the result of the originating job entry is false.
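The three hop conditions can be modelled in a few lines of Python. The sketch assumes each job entry is a callable returning True or False and that the job graph has no cycles. The names `run_job`, `UNCONDITIONAL`, `ON_TRUE` and `ON_FALSE` are invented for this illustration.

```python
UNCONDITIONAL = "unconditional"
ON_TRUE = "follow when result is true"
ON_FALSE = "follow when result is false"

def run_job(entries, hops, start):
    """Run job entries in hop order.

    entries: dict mapping entry name -> callable returning True/False
    hops:    list of (from_entry, to_entry, condition) tuples
    Returns the names of the executed entries, in execution order.
    """
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        result = entries[name]()
        executed.append(name)
        for source, target, condition in hops:
            if source != name:
                continue
            follow = (
                condition == UNCONDITIONAL
                or (condition == ON_TRUE and result)
                or (condition == ON_FALSE and not result)
            )
            if follow:
                pending.append(target)
    return executed
```

For example, a failing load entry with hops to a success mail (true), an error mail (false) and a cleanup entry (unconditional) would run the load, the error mail and the cleanup, but not the success mail.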
Brief introduction to the ETL process
JDBC
Repository Connection
Pentaho Data Integration Tool
◦ Components
   Pan
   Carte
   Kitchen
   Spoon
◦ Transformation with different input data sources
◦ Jobs
kettle.pentaho.org
◦ Kettle project homepage
kettle.javaforge.com
◦ Kettle community website: forum, source, documentation, tech tips, samples, …
www.pentaho.org/download/
◦ All Pentaho modules, pre-configured with sample data
◦ Developer forums, documentation
◦ Ventana Research Open Source BI Survey
www.mysql.com
◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
◦ Roland Bouman blog on Pentaho Data Integration and MySQL - http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html