White Paper
Abstract
This white paper explains how Pentaho Data Integration
(Kettle) can be configured and used with Greenplum
database in three tier architectures. This allows a quick
verification and validation of connectivity and
interoperability of Pentaho Data Integration with
Greenplum.
September 2011
WORKING WITH PENTAHO DATA INTEGRATION USING GREENPLUM
The interoperability between Pentaho Data Integration (Kettle) and Greenplum Database
Copyright © 2011 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is
accurate as of its publication date. The information is
subject to change without notice.
The information in this publication is provided “as is”.
EMC Corporation makes no representations or
warranties of any kind with respect to the information in
this publication, and specifically disclaims implied
warranties of merchantability or fitness for a particular
purpose.
Use, copying, and distribution of any EMC software
described in this publication requires an applicable
software license.
For the most up-to-date listing of EMC product names,
see EMC Corporation Trademarks on EMC.com.
VMware is a registered trademark of VMware, Inc. All
other trademarks used herein are the property of their
respective owners.
Part Number h8294
Table of Contents
Executive summary
  Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
Using JDBC drivers for Greenplum database connections
  Installation of new driver
  Configuration
  Usage
    Simple Example Job that reads from and writes to Greenplum Database
Future expansion and interoperability
Conclusion
References
Executive summary
Pentaho Data Integration (PDI), also known as Kettle, is one of the most
popular open source business intelligence data integration products for
working with analytical databases such as the EMC Greenplum Database.
The EMC Greenplum Database can manage, store, and analyze terabytes
to petabytes of data in data warehouses. Pentaho Data Integration unifies
the ETL, modeling, and visualization processes into a single, integrated
environment, which, combined with Greenplum, helps joint customers
drive better business decisions and speed up business intelligence
development and deployment. Currently, Pentaho Data Integration
connects to Greenplum through JDBC (Java Database Connectivity)
drivers. Greenplum Database can be used on both the source and target
sides of Pentaho ETL transformations.
Audience
This white paper is intended for EMC field-facing employees such as sales,
technical consultants, and support staff, as well as customers who will be
using the Pentaho Data Integration tool for their ETL work. It is neither an
installation guide nor introductory material on Pentaho. It documents
Pentaho's connectivity and operation capabilities, and shows readers
how the tool can be used in conjunction with Greenplum database to
retrieve, transform, and present data to users. Although readers are not
expected to have any prior Pentaho knowledge, a basic understanding of
data integration concepts and ETL tools will help.
Organization of this paper
This paper covers the following topics:
Overview of Pentaho Data Integration
Overview of Greenplum database
Using JDBC drivers for Greenplum database connections
Future expansion and interoperability
Overview of Pentaho Data Integration
Pentaho Data Integration (PDI) delivers comprehensive Extraction,
Transformation and Loading (ETL) capabilities using a metadata-driven
approach. It is commonly used for building data warehouses, designing
business intelligence applications, migrating data, and integrating data
models. It consists of several components:
Spoon – Main GUI, Graphical Jobs/Transformation Designer
Carte – HTTP server for remote execution of Jobs/Transformations
Pan – Command line execution of Transformations
Kitchen – Command line execution of Jobs
Encr – Command line tool for encrypting strings for storage
Enterprise Edition (EE) Data Integration Server – Data Integration Engine,
Security integration with LDAP/Active Directory, Monitor/Scheduler,
Content Management
Pentaho is capable of loading huge data sets into Greenplum Database
taking full advantage of the massively parallel processing environment
provided by the Greenplum product family.
Overview of Greenplum Database
Greenplum Database is based on a shared-nothing MPP
(Massively Parallel Processing) architecture that facilitates business
intelligence and analytical processing on commodity hardware. Data is
distributed across multiple segment servers in the Greenplum Database,
so there is no disk-level sharing. The segment servers process queries
in parallel, providing a high degree of parallelism and scalability.
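The distribution just described is declared per table with a DISTRIBUTED BY clause; a minimal sketch (the table and column names here are illustrative only):

```sql
-- Rows are hash-distributed on cust_id, so each segment server holds
-- a slice of the table and scans it in parallel with the others.
CREATE TABLE sales (
    cust_id integer,
    amount  numeric(10,2)
) DISTRIBUTED BY (cust_id);
```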
Highlights of the Greenplum Database:
Dynamic Query Prioritization
Provides continuous real-time balancing of the resources
across queries.
Self-Healing Fault Tolerance
Provides intelligent fault detection and fast online
differential recovery.
Polymorphic Data Storage – Multi-Storage/SSD Support
Includes tunable compression and support for both row- and
column-oriented storage.
Analytics and Language Support
Supports analytical functions for advanced in-database
analytics.
Health Monitoring and Alerting
Provides integrated email and SNMP notification for
advanced support capabilities.
Integration of Pentaho PDI and Greenplum Database
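Under the hood, the integration is plain JDBC: Greenplum Database speaks the PostgreSQL wire protocol, so the PostgreSQL driver shipped with PDI can connect to it directly. A minimal sketch of the connection settings, assuming the default Greenplum master port 5432 and illustrative host and database names:

```
Driver class: org.postgresql.Driver
JDBC URL:     jdbc:postgresql://mdw.example.com:5432/gpdb
```

These are the same values that Spoon assembles from the Host Name, Database Name and Port Number fields described in the next section.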
Using JDBC drivers for Greenplum database
connections
Pentaho Kettle ships with many different JDBC drivers, each packaged in
a Java archive (.jar) file in the libext/JDBC directory. By default, Pentaho
PDI ships with a PostgreSQL JDBC jar file, which is used to connect to
Greenplum.
Java 1.6 is required for the installation.
A startup script adds all of these .jar files to the classpath.
Installation of new driver
To add a new driver, simply drop the .jar file containing the driver into the
appropriate driver directory. For example:
• For Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
• For Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
• For BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
• For Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/
If you installed a new JDBC driver for Greenplum to the BI Server or DI
Server, you have to restart all affected servers to load the newly installed
database driver. In addition, if you want to establish a Greenplum data
source in the Pentaho Enterprise Console, you must install that JDBC driver
in both Enterprise Console and the BI Server to make it effective.
In brief, to update the driver for the PDI client, replace the jar file in
/data-integration/libext/JDBC/.
Assuming that a Greenplum database is installed and ready to use,
users can define Greenplum database connections in the Database
Connection dialog: give the connection a name, choose Greenplum as
the Connection Type, choose “Native (JDBC)” as Access, and enter the
Host Name, Database Name, Port Number, User Name and Password in
the Settings section.
Special attention may be required to set up the host files and
configuration files in the Greenplum database, as well as on the hosts
where Pentaho is installed. For instance, in the Greenplum database, the
user may need to configure pg_hba.conf with the IP address of the
Pentaho host. In addition, users may need to add the hostnames and the
corresponding IP addresses on both systems (i.e., the Pentaho PDI server
as well as the Greenplum Database) to ensure both machines can
communicate.
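For illustration, the corresponding entries might look like the following (the IP addresses and hostnames are placeholders, not defaults):

```
# pg_hba.conf on the Greenplum master: allow the Pentaho host to connect
# TYPE  DATABASE  USER     ADDRESS         METHOD
host    gpdb      gpadmin  192.0.2.10/32   md5

# /etc/hosts entries on both machines
192.0.2.10    pentaho-host
192.0.2.20    greenplum-master
```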
Configuration
Detailed steps for JDBC configuration are shown below, with screenshots
illustrating how to choose the existing JDBC driver for Greenplum
Database as your source or target database in Spoon:
1) After a user opens a Job in Spoon, the View panel contains a folder
called “Database Connections” for the job he/she is working on. The
user can right-click on “Database Connections”, then choose either
“New Connection” or “New Connection Wizard”.
2) If the user chooses the “New Connection” option, he/she needs to
enter the Greenplum database connection details as follows:
3) If the user chooses the “New Connection Wizard” option, the wizard will
guide the user through the JDBC definition process.
First, select the database name and type:
Second, set the JDBC settings and click “Next”:
Third, input the username and password and click “Finish”:
Last, test the database connection. If the input details are correct, a
prompt like this should appear:
Usage
There are a few ways to apply Greenplum database connections, such as:
1) The following diagram shows how to apply the newly defined
Greenplum database connection as the data source:
2) The following diagrams show how to apply the newly defined
Greenplum database connection to the target tables to be loaded:
Example 1: Dimension Lookup/Update:
Example 2: Insert/Update for loading the target table:
Simple Example Job that reads from and writes to Greenplum Database
Here is a simple transformation to test the JDBC connectivity defined
above. (Assumption: gpadmin is the user.)
A source table called category is created, which contains:
CREATE TABLE category
(
category_id serial NOT NULL,
name varchar(25) NOT NULL,
last_update timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT category_pkey PRIMARY KEY (category_id)
) DISTRIBUTED BY (category_id);
ALTER TABLE category OWNER TO gpadmin;
Populate the category table with data, using either INSERT or COPY
commands:
insert into category (category_id, name, last_update) values (1,'Action','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (2,'Animation','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (3,'Children','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (4,'Classics','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (5,'Comedy','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (6,'Documentary','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (7,'Drama','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (8,'Family','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (9,'Foreign','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (10,'Games','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (11,'Horror','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (12,'Music','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (13,'New','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (14,'Sci-Fi','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (15,'Sports','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (16,'Travel','2006-02-15 09:46:27');
Then, a target table called category_demo_target is created:
CREATE TABLE category_demo_target
(
category_id integer,
showname varchar(25),
last_update timestamp without time zone
) DISTRIBUTED BY (category_id);
ALTER TABLE category_demo_target OWNER TO gpadmin;
A similar sample job can be created as follows:
Sample Configuration of the source table:
Sample Configuration of the target table:
Click the green arrow in the top left corner to execute the transformation/job.
Now, check the target table category_demo_target to see if data is loaded into this
target Greenplum database table.
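A quick way to confirm the load, once the transformation has run, is to compare row counts between the source and target tables (the 16 rows come from the INSERT statements above):

```sql
-- Both counts should return 16 if every source row was loaded.
SELECT count(*) FROM category;
SELECT count(*) FROM category_demo_target;
```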
Users can add different components to this transformation, or incorporate
it into a well-developed job, for transforming data from source to target.
Future expansion and interoperability
Both Greenplum and Pentaho are increasing their capabilities to address
the Big Data trend, as data sizes are growing significantly across the
industry. Therefore, both companies are expanding their interoperability
to meet upcoming demands. For example:
One of the latest Pentaho enhancements includes a native bulk
loader integration with EMC Greenplum to improve the data loading
process and overall performance. Pentaho offers a native adapter
supporting the Greenplum gpload capability (bulk loader), which
enables joint customers to leverage data integration capabilities to
capture, transform and quickly load massive amounts of data into
Greenplum Databases, especially in the form of data warehouses.
Pentaho has certified Pentaho Data Integration (PDI) as working with
EMC Greenplum Hadoop, EMC Greenplum Database, and the
corresponding data warehouse products. Future collaborations will be
established between Pentaho and the EMC Greenplum Hadoop HD
solutions. Complementing the Greenplum distribution of Hadoop,
Pentaho provides an end-to-end data integration and business
intelligence suite that improves data movement into and out of
Hadoop, with the cost advantages of commercial open source. This
will benefit customers by providing more choices with enhanced
performance and better cost-saving options.
The EMC Greenplum Data Integration Accelerator (DIA) will be
integrated with Pentaho PDI (with the fast-loading adapter invoking
the gpfdist utility) to deliver the best performance for data staging
and data loading. To meet the challenges of fast data loading, the
DIA is purpose-built for batch and micro-batch loading, and leverages
a growing number of data integration applications such as Pentaho.
Conclusion
In this white paper, the process of installing and applying a JDBC driver
to connect Pentaho Data Integration with Greenplum Database was
discussed. It covers only the preliminary interoperability between
Pentaho PDI and Greenplum database for basic data integration and
business intelligence projects.
It also briefly discussed the anticipated interoperability and integrations
of both technologies to accommodate the Big Data trend, such as the
Greenplum native bulk loader, Pentaho integration with the Greenplum
Hadoop HD solutions, and Greenplum DIA integration with Pentaho BI/DI
servers and tools. We will discuss these future expansions in upcoming
white papers.
References
1) Pentaho Kettle Solutions – Building Open Source ETL Solutions with
Pentaho Data Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)
2) Getting Started with Pentaho Data Integration guide from
www.pentaho.com
3) The PostgreSQL Pagila schema for use in Greenplum database