pdi lab guide

44
Lab Guide for “Pentaho Data Integration (V-4.0.1)” Popularly known as “Kettle”

Upload: neelesh-kamath

Post on 17-Feb-2016

23 views

Category:

Documents


0 download

DESCRIPTION

Used for analysis of business. This guide is very helpful for beginners.

TRANSCRIPT

Page 1: Pdi Lab Guide

Lab Guide for “Pentaho Data Integration (V-4.0.1)” Popularly known as “Kettle”

Page 2: Pdi Lab Guide

1

PDI Lab Guide

HOW TO INSTALL AND OPEN THE PENTAHO DATA INTEGRATION IDE: ................................. 2

HOW TO CREATE A NEW REPOSITORY:.......................................................................................... 2

HOW TO CONNECT TO A REPOSITORY: .......................................................................................... 8

CREATION OF AN ODBC CONNECTION: ....................................................................................... 9

CREATING THE FIRST TRANSFORMATION: ................................................................................. 10

USING THE “SELECT VALUES” TRANSFORMATION: ….. ........................................................ 29

CREATING A JOB: .............................................................................................................................. 42

QUESTIONS: ....................................................................................................................................... 43

Page 3: Pdi Lab Guide

2

How to Install and Open the Pentaho Data Integration IDE

Step 1: First install Java Runtime Environment 1.4 or a higher version in your system.

Step 2: Go to the website http://www.pentaho.com and download Pentaho Data Integration 4.0.1.

Step 3: Unzip the downloaded PDI zip file, open the data integration folder and double click on spoon.bat

file.

How to Create a New Repository

Concept of Repository: The Kettle repository is a workspace that the data integrator works on. This

workspace is a physical region of the hard-drive that is designated exclusively for Kettle. In the repository,

all information about transformations, jobs, schedules, etc. is stored. The repository concept promotes re-

usability, which in turn saves time and effort. We can create a new repository in two ways:

1. Kettle database repository: This repository uses a central relational database to store the ETL

(Extraction Transformation Load) metadata.

2. Kettle file repository: This is a repository stored in a file in a certain folder.

Creating a repository through Kettle database repository

Step 1: In the Repository Connection dialog box click on the button. The Select the repository type

dialog box will appear.

Step 2: Select Kettle database repository and click on OK. Step 3: Now in the Repository information dialog box click on the New button.

Page 4: Pdi Lab Guide

3

Step 4: In the Database Connection dialog box give a name in the Connection Name textbox, select MS

SQL Server in the Connection Type list, and select Native (JDBC) in the Access list. In the

Settings level, provide Host Name (Example: localhost/machine name/servername), Database

Name, Instance Name (optional), User Name and Password.

Step 5: Now click on the Test button to see the connection status.

Page 5: Pdi Lab Guide

4

Step 6: In the Repository information dialog box, select the database connection from the drop down

box, provide ID (to be displayed in spoon title bar) and Name (the actual name of the repository)

and click on the Create and Upgrade button.

Step 7: In the OK dialog box, in response to the question, “Are you sure you want to create the

repository on the specified database connection?,” if you click on the No button, everything will

be created in the database and if you click on the Yes button, it will ask for a dry run.

Step 8: In the Dry run dialog box click on Yes, if you want to see the SQL statements as well as make

some modification. Else click on the No button.

Page 6: Pdi Lab Guide

5

Page 7: Pdi Lab Guide

6

Creating a repository using the Kettle file repository

Step 1: In the Repository Connection dialog box click on the [ ] button. The Select the repository type

dialog box will appear.

Step 2: Select Kettle file repository and click on OK.

Step 3: Now in the File repository settings dialog box click on the Browse button, select a folder, fill the ID

and Name, and click on the OK button.

Page 8: Pdi Lab Guide

7

Click on the “+” button to create a new repository.

Select the second option (kettle file repository) and click ok.

Page 9: Pdi Lab Guide

8

How to Connect to a Repository

Step 1: Normally the Repository Connection dialog box automatically opens with the opening of the Spoon

window. If the Repository Connection dialog box does not open with the Spoon window or suppose

you want to open it manually or you want to change your Repository, then:

In the Spoon window from menu bar, click on Tools and select the Repository option and click on

Connect to open the Repository Connection dialog box.

OR

Page 10: Pdi Lab Guide

9

Press CTRL+R from your keyboard to open the Repository Connection dialog box.

Step 2: To connect with your desired repository, select the corresponding repository name from the

Repository Connection dialog box and click on OK.

Note: In case of a database repository connection, you have to give User Name and Password (both

are by default admin) and in case of a file repository connection, User Name and Password are not

required.

Creation of an ODBC Connection

Step 1: Click on Start Control Panel Administrative Tools Data Sources (ODBC). Then in the

ODBC Data Source Administrator dialog box select the User DSN tab.

Step 2: Click on the Add button, select SQL Server Native Client, and click on Finish. Then in the “Create

a New Data Source to SQL Server” dialog box provide a name for the data source, select server (“.”

for local host), click on the Next button twice, select the check box and select the Northwind

database. Then click on the Next button followed by the Finish button.

Page 11: Pdi Lab Guide

10

Creating the First Transformation

Learning Objective: To create a simple PDI transformation for transferring data from a database table to a

flat file.

Source: The Products table from the Northwind database (MS SQL SERVER 2008).

Destination: The Products flat file.

Step 1: First connect to a repository (either to the database repository or file repository).

Step 2: From the menu bar click on File New Transformation.

Page 12: Pdi Lab Guide

11

Step 3: In the Design mode click on the input folder, drag the Table input item and drop it on the center

pane/Transformation page.

Step 4: On the Transformation page, double click on the Table input to open its properties/configuration

dialog box.

Step 5: In the name text box rename Table input to PRODUCTS_TABLE and click on the New button for

establishing the connection with the MS SQL server.

Step 6: Then in the Database Connection dialog box, select the General tab from the left pane, type a name

in the Connection Name text box, select MS SQL Server (Native) from the list in the Connection

Type list box, select ODBC from the Access list box, type the DSN name in the ODBC DSN

Source Name, provide the User Name and Password, click on the Test button to check the status

and click on the OK button twice to return to the Database Connection pane. (You can also

establish database connection through Native (JDBC).)

Step 7: Now click on the “Get SQL select statement” button, select the Products table from the database

explorer, click on OK to close the database explorer, click on Yes from question dialog box to

Page 13: Pdi Lab Guide

12

include the field-names in SQL and click on OK to close the properties dialog box of the Table

input.

Page 14: Pdi Lab Guide

13

Click on the Preview button to see the contents of the table.

Page 15: Pdi Lab Guide

14

Step 8: Click on the Output folder in left pane, drag the Text file output item and drop it on the

transformation pane.

Step 9: For transferring data from source to destination, Hop is used.

To create a hop:

a) Click on the PRODUCTS_TABLE and press the <SHIFT> key and draw a line to the Text

file output.

OR

b) Place your mouse pointer on the PRODUCTS_TABLE until the hover menu appears. Then

drag the hop output connector to Text file output.

OR

c) Place your mouse pointer on PRODUCTS_TABLE, press the middle button of the mouse and

drag the hop pointer and release on Text file output.

Step 10: Double click on Text file output. The following window will appear:

Page 16: Pdi Lab Guide

15

Step 11: In the Text file output properties window, type PRODUCTS_TEXTFILE in the Step name text

box, and click on the Browse button to set the path of the text file.

Step 12: Click on the Content tab (fields under the Content tab allow you to define how your data is

formatted).

Step 13: Click on the Fields tab to get fields and set their length, type, format, etc., and finally click on OK.

Page 17: Pdi Lab Guide

16

Step 14: Save the transformation by clicking File Save from the menu bar. The Transformation

Properties window will appear.

Step 15: Type a name in the Transformation name text box, click on (folder icon) to select a repository

folder (you can also create a new directory by right clicking on the root directory) where you will

save your transformation.Then click on OK twice.

Page 18: Pdi Lab Guide

17

Step 16: After saving, execute by pressing F9 or click on Action Run from menu bar. Now the Execute

a transformation window will appear. Click on the Launch button.

Click on this button for execution of your transformation

Page 19: Pdi Lab Guide

18

This panel shows the details of your transformation.

Transformation has finished.

Step 17: Open the text file and you will observe the following output.

Page 20: Pdi Lab Guide

19

Learning Objective: To create a simple PDI transformation for transferring data from a database table to

an Excel file.

Source: The Products table from the Northwind database (MS SQL SERVER 2008)

Destination: The Products Excel file.

Step 1: Follow steps 1–6 from the previous transformation for the PRODUCTS_TABLE source.

Step 2: After executing steps 1–6, click on the output folder in the left pane, drag the Excel output item and

drop it on the transformation pane.

Step 3: Now drag the hop from PRODUCTS_TABLE to the Excel output and double click on Excel output.

Step 4: In the Excel output configuration window, type PRODUCTS_EXCEL in the step name text field,

then click on the browse button to set the path of the Excel file.

Page 21: Pdi Lab Guide

20

You can also give a name to the sheet in the Excel file (by default, it is sheet1).

Step 5: Now click on the Fields tab to get fields and set their type, format, etc., and click on OK.

Page 22: Pdi Lab Guide

21

Step 6: Save your transformation and go for execution.

Step 7: Open the Excel file and you will observe the following output:

Learning Objective: To create a simple PDI transformation for transferring data from an Excel file to a

database table.

Source: The PRODUCTS_EXCEL Excel file.

Page 23: Pdi Lab Guide

22

Destination: The TBL_PRODUCTS table in the Northwind database (MS SQL SERVER 2008).

Step 1: From the menu bar click on File New Transformation. In the Design tab click on the input

folder, drag the Excel input item and drop it in the transformation window.

Step 2: Double click on Excel input to open the Excel input configuration window.

Step 3: In the Excel configuration window, type PRODUCTS_EXCELFILE in the Step name text box,

click on the Browse button to select the PRODUCTS_EXCEL.xls and then click on the Add button.

Step 4: Now click on the Get sheetnames(s) button under the Sheets tab, select PRODUCTS from the

available list and click on OK. Then, to get fieldnames from the header row, click on the Get fields

from header row button under the Fields tab.

Step 5: Under the design tab, expand the output folder, drag the Table output item and drop it in the

transformation pane.

Step 6: Create a hop in between PRODUCTS_EXCELFILE and Table output then double click on Table

output.

Page 24: Pdi Lab Guide

23

Page 25: Pdi Lab Guide

24

Click on this button to see the contents of the file

Step 7: In the Table output configuration dialog box, type PRODUCTS in the Step name text box, establish

the connection with the MS SQL server in the Connection text box, type TBL_PRODUCTS in the

Target table text box, click on the SQL button to open the Simple SQL editor dialog box, click on

the Execute button for the creation of a table in the MS SQL server and click on the Close button.

Step 8: Now click on the OK button in table output configuration dialog box, save your transformation and

go for execution.

Page 26: Pdi Lab Guide

25

Learning Objective: To create a simple PDI transformation for transferring data from a text file to a

database table.

Source: PRODUCTS.txt

Destination: TABLE_PRODUCTS table in the Northwind database.

Step 1: Create a new transformation by expanding the input folder under the Design Tab. Drag the Text file

input item and drop it in the transformation pane.

Step 2: Double click on Text file input to open its configuration window, rename Text file input to

PRODUCTS_TEXTFILE in the Step name text box, then click on the Browse button to select the

PRODUCTS.txt file and click on the Add button.

Step 3: Now click on the Content tab and observe the Separator and Enclosure text fields.

Page 27: Pdi Lab Guide

26

Step 4: Then open the Fields tab, click on Get fields and enter 0 to see the scan results of the flat file and

click on the Close button.

Step 5: You can also see the text file contents by clicking on the Preview rows button.

Step 6: Now from the Design tab expand the output folder, drag and drop the Table output in the

transformation pane and so on.

Step 7: Finally save the transformation and execute it.

Page 28: Pdi Lab Guide

27

Page 29: Pdi Lab Guide

28

These are not the desired outputs.

Well the transformation has been successfully executed but the output is not the desired output!!!

Page 30: Pdi Lab Guide

29

Using the “Select Values” Transformation

Step 1: Now delete the hop in between PRODUCTS_TEXTFILE and TABLE_PRODUCTS by right

clicking on the hop and selecting Delete hop.

Step 2: From the design tab expand the Transform folder, drag the Select values item and drop it in the

transformation pane.

Step 3: Create a hop in between PRODUCTS_TEXTFILE and select values then double click on Select

values to open its configuration dialog box.

Step 4: In select/rename dialog box rename select values to CHANGE_TYPE and click on meta-data tab.

Step 5: In meta-data tab click on Get fields to change, then set each fieldname type accordingly and click

on OK button.

Step 6: Now create a hop in between CHANGE_TYPE item and TABLE_PRODUCTS item.

Step 7: Then delete the TABLE_PRODUCTS table from the database and configure the

TABLE_PRODUCTS once again.

Step 8: Finally save the transformation, execute it and open the TABLE_PRODUCTS table from the

database.

Page 31: Pdi Lab Guide

30

Learning Objective: To use the Filter rows for dividing the source data into two destinations.

Step 1: Here, we will split the data coming from products table and place it in two different files. We will

divide products based on price.

Step 2: From the Design tab expand the Flow folder, drag the Filter rows item and drop it in the database

to Excel transformation.

Step 3: Join the hop in between PRODUCTS_EXCEL and Filter rows. Then double click on Filter rows to

open its configuration dialog box.

Step 4: In the Filter rows dialog box, rename Filter rows to filter rows on UnitPrice, then in the condition

pane set the condition shown in the following snapshot.

Page 32: Pdi Lab Guide

31

Step 5: Now from the Design tab expand the Output folder, drag the two text file outputs and drop in

the transformation pane.

Step 6: Drag the hop from Filter rows on UnitPrice and drop on a text file output with the result being

TRUE. Similarly drag another hop from same Filter rows on UnitPrice and drop it on second

text file with result being FALSE.

Step 7: Now open the text file configuration dialog box for both these files and set their properties.

Step 8: Then save your transformation and go for execution.

Note: In the Filter rows item either both or none of the TRUE and the FALSE steps need to be

provided, otherwise it will show an error.

Page 33: Pdi Lab Guide

32

Learning Objective: To use the Add constants for adding another field in the destination file.

Step 1: In the transformation pane add PRODUCTS_TABLE as source and Excel output as destination.

Step 2: Then from the Design pane expand the Transformation folder, drag the Add constants item and

drop it in the transformation window.

Step 3: Join the hop in between the PRODUCTS_TABLE and Add constants.

Step 4: Now double click on the Add constants item to open its configuration window.

Step 5: In Add constants value configuration window, rename Add constants to ADD_NEW_FIELD and

type QUALITY in the Name column, String in the Type column, 7 in Length, GOOD in Value and

click on OK.

Step 6: Join the hop in between ADD_NEW_FIELD and Excel file output.

Step 7: Now configure the Excel output item, save the transformation and go for execution.

Page 34: Pdi Lab Guide

33

Page 35: Pdi Lab Guide

34

Learning Objective: To use the calculator for calculation of source data and to display it in destinations.

Step 1: Drag a Table input from the input folder, drop it in the transformation pane and configure it with the

order details table from the Northwind database.

Step 2: From the Design pane expand the Transform folder, drag the calculator item and drop it in the

Transformation pane.

Step 3: Join the hop in between ORDER_DETAILS_INPUT and calculator. Then double click on the

calculator item to open its configuration window.

Step 4: In the calculator dialog box, type TOTAL PRICE under New Field and click on the calculator field.

Step 5: Now, the Select the calculation type dialog box will appear. Here, select A*B and click on OK.

Step 6: Next in the Field A column, select UnitPrice, and in the Field B column select Quantity. In the

Value type column select Number and click on OK.

Page 36: Pdi Lab Guide

35

Step 7: Now drag an Excel output item,drop it in the transformation pane, join the hop in between the

calculator and Excel output item and configure it.

Step 8: Finally save the trnasformation and execute it.

Page 37: Pdi Lab Guide

36

Learning Objective: To implement a lookup using the Lookup Transformation

To implement a lookup transformation:

Step 1: Create a new Transformation with Table input as its source.

Step 2: Connect the Table input with the order details table through its properties dialog box.

Page 38: Pdi Lab Guide

37

Step 3: Now, in the Design pane expand the Lookup folder, drag the Database lookup item and drop it in

the Transformation pane.

Step 4: Then connect the hop in between ORDER_DETAILS and Database lookup. Now double click on

the Database lookup.

Step 5: In the Database value lookup editor establish the ODBC connection in the Connection text box. Go

to the Lookup table text box, click on the Browse button to select the Orders table for lookup, then

in the key(s) to look up select OrderID in both Table field and Field1 with comparator as “=”.

Page 39: Pdi Lab Guide

38

Step 6: Now in the Values to return from the lookup table area, select the Field column and set the

data types in Type field or click on the Get lookup fields button and click on OK.

Step 7: Then join the hop in between the “Database lookup” and “Excel output” with selection of

“Main output of step”.

Step 8: Now configure the Excel output, save the transformation and go for execution.

Learning Objective: Merge two record sets.

Source 1: Emp.xls

Source 2: Emp.txt

Destination: Employee table in MS SQL SERVER 2008

Emp_ID F_Name M_Name L_Name Gender

12345 Shailendra Narayan Nayak M

12346 Jairam K M

Page 40: Pdi Lab Guide

39

Emp.xls

Emp.txt

Employee Table

Step 1: Create a new Transformation with one Text file input (for Emp.txt) and another Excel input

(Emp.xls).

Step 2: Configure both the inputs, drag the Select values item from the Transform folder and drop it in the transformation pane.

Step 3: Connect the hops and configure both the Select Values (Field Name, Data types).

Step 4: Using Sort rows from the Transform folder, sort both the input records (because before merging both the records must be sorted).

Step 5: Drag one Sorted Merge item from the Joins folder, place it in the transformation pane and

configure it accordingly.

Step 6: For changing “M” to “Male” or “F” to “Female”, use one Value Mapper item from the Transform folder and configure it accordingly.

Step 7: Now put one Table output item from the Output folder and configure it.

Step 8: Save the transformation in the repository and go for execution.

12347 Vandana Sharma F

12348 Hema A. F

Page 41: Pdi Lab Guide

40

[Select Values] [Value Mapper]

Learning Objective: Take data from the flat file and perform basic data quality operation on the same and load it back to a flat file (data quality operations such as trimming spaces and checking for null values).

Source file: Account.txt

[Account.txt]

Step 1: Create a new transformation with Text file input as its source.

Step 2: Connect the Text file input with Account.txt through its properties dialog box.

Step 3: Expand the scripting folder, drag the formula item and drop it in the Transformation pane.

Step 4: Join the hop in between the text file input and formula, and double click on formula to open its

dialog box.

Page 42: Pdi Lab Guide

41

Step 5: In the formula dialog box type Account_ID in New field, click on the formula column, type TRIM

[acct_id] in the upper part of the dialog box, click on OK and also set the value type as String.

Similarly, do this for TXN_ID and AMOUNT as shown in the below snapshots:

Delete these three fields(acct_id,txn_id,amt)

Step 6: Add a Text file output in the pane and configure it, save the transformation and go for execution.

Page 43: Pdi Lab Guide

42

Creating a Job

Step 1: First connect to a repository (either to a database repository or a file repository).

Step 2: From the menu bar click on File New Job.

Step 3: In the Design mode, click on the General folder, drag the START item and drop it on the centre

pane.

Step 4: In the centre pane, double click on START to open its scheduling dialog box and configure the Job

Scheduling.

Step 5: Click on the General folder in the left pane, drag the Transformation item and drop it on the

Transformation pane.

Step 6: Now drag the hop from START to Transformation and double click on the transformation item to

open its configuration dialog box.

Step 7: In the Job entry details for this transformation dialog box, select the right radio button and select the

transformation file or directory and click on OK.

Page 44: Pdi Lab Guide

43

Step 8: Save your job in a database repository or file repository (depending on your repository connection).

Step 9: Now press F5 or click on the execute button. Depending on your schedule as already set in START

the job will be executed (either once or multiple times).

Questions

1. What is the difference between a Database Repository and a File Repository in Kettle?

2. What is the difference between a Transformation and a Job in Kettle (PDI)?

3. What is the difference between connecting a database through JDBC and connecting it through

ODBC?

4. Create a transformation to import a table from MS SQL Server database to MS Access database.

5. Create a transformation to concatenate FirstName and LastName from the EMPLOYEES table in

the Northwind database and store it in a database in a field EMP_NAME.