Cloud Computing With MySQL and Pentaho Data Integration Matt Casters Chief Data Integration at Pentaho Kettle project founder


Post on 25-Feb-2020


Page 1: Pentaho Data Integration Overview Rev Oct-4-07ibridge.be/files/Cloud_Computing_with_MySQL_and_Kettle.pdf · Cloud Computing With MySQL and Pentaho Data Integration Matt Casters Chief

Cloud Computing With MySQL and Pentaho Data Integration

Matt Casters
Chief Data Integration at Pentaho
Kettle project founder


© 2006-2007 Pentaho Corporation. All Rights Reserved

Agenda

Introduction to Kettle
  Introduction
  Use-cases + load demo

Performance / Scalability

Kettle Slave Servers
  What & How

Kettle Cluster Schemas
  What & How

The Cloud

Cloud Examples

Q & A


Pentaho Data Integration - Kettle

PDI is the product associated with the KETTLE open source project

KETTLE provides open source software

PDI is a “whole product”

Member of the Pentaho BI Suite

Kettle = PDI CE

PDI EE
  Management Services Console
  Knowledge Base
  Documentation
  Portal
  Support
  License Indemnification
  ...


Introduction – Kettle : Kettle

Kettle

Extraction

Transportation

Transformation

Loading

Environment


Introduction - Kettle : Extraction

Extract data from :
  35+ database types
    MySQL, PostgreSQL, SQLite, ...
    Oracle, SQL Server, etc
  Text files
  XML files
  XLS files
  Xbase files (dBase, Foxpro, etc)
  File systems information
  Generated data
  MS Access files
  LDAP
  Geo-data
  ...


Introduction - Kettle : Transportation

Transportation of data
  Engine based data transfer (no code generator)
  Very flexible pathways:
    splitting
    partitioning
    merging
    joining
    duplicating
    clustering (MPP)


Introduction - Kettle : Transformation

Flexibly transform data
  Looking up data
    databases
    files
    memory
    ...
  Calculating
  Scripting
    JavaScript, SQL, RegExp
  Splitting
  Mapping
  Selecting
  Filtering
  Pivoting
  ...


Introduction - Kettle : Loading

Load data into a target format
  Database loads
  Data warehouse population
  Partitioned loading
  Bulk loading
  Parallel loading
  Clustering


Introduction - Kettle : Environment

Full GUI called “Spoon” to edit every option in Kettle
  Drag & Drop
  Debugger
  Rich GUI

Command line tools
  execute jobs
  execute transformations

Web server
  clustering
  remote execution

Programming API for Java

Plugin eco-system

...


Introduction – Kettle : Conceptual model


Introduction – Kettle : User community

Paying Pentaho customers
  Large and small corporations
  All possible sectors

Lone rangers & hobbyists

All regions on Earth

Meet on our Forum : 30,000+ posts in 3 years

Use our JIRA case tracking systems

Download more than 10,000 copies of Kettle per month

http://www.ohloh.net/projects/3624?p=Kettle

http://www.softpedia.com/progClean/Kettle-Clean-80094.html


Typical use-cases

Load data from text files and store it into a database [demo]

Export data from a database to text files or to other databases

Data migration between database applications

Exploration of data in existing databases (tables, views, etc.)

Information improvement using lookups

Data cleaning

Application integration

Data warehouse population

Report data generation

...
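The first use-case (text file into a database) can be sketched outside Kettle in a few lines. This is only an illustration, not the demo itself: it assumes a semicolon-delimited file with `id` and `name` columns, a hypothetical `customers` table, and it swaps in an in-memory SQLite database for MySQL so that it runs without a server.

```python
import csv
import io
import sqlite3


def load_rows(text_file, connection, table="customers"):
    """Read delimited rows from text_file and insert them into table."""
    reader = csv.DictReader(text_file, delimiter=";")
    # Minimal "transformation": convert the id and trim the name.
    rows = [(int(r["id"]), r["name"].strip()) for r in reader]
    with connection:  # one transaction, committed when the block succeeds
        connection.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, name TEXT)"
        )
        connection.executemany(
            f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows
        )
    return len(rows)


# Stand-in for a text file on disk (hypothetical sample data).
sample = io.StringIO("id;name\n1;Alice\n2; Bob \n")
conn = sqlite3.connect(":memory:")
loaded = load_rows(sample, conn)
names = [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY id")]
print(loaded, names)
```

In Kettle the same flow is drawn in Spoon as a text file input step connected to a table output step, with no code to write.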


Party time!!!!

You're preparing for a big event

You bought plenty of food:
  500 hot-dogs
  500 burgers

Disaster strikes: invitees are not showing up!

What do you do with all that food?


Call Joey Chestnut!!

Fastest eater in the world **

Local boy from Vallejo

103 burgers in 8 minutes

66 hot dogs in 12 minutes

** Ranked #1 by the International Federation of Competitive Eating


Problems with the chosen solution

You would need at least 10 Joey Chestnuts to get the “job” done

Getting Joeys to show up costs money $

It's more fun with a crowd!


Performance / Scalability

Demo: what about MySQL limits? [loading data continued]


Performance / Scalability

MySQL bulk loading limitations...
  Memory backed B-tree grows and needs to swap out to disk
  Kills performance
  MySQL Performance Blog (Percona)
    Predicting how long data load would take
    Predicting performance improvements from memory increase


Performance / Scalability

Limits
  Single threaded limits
  Multi-threaded limits
  The weakest link

Solutions?
  Optimizing
  Tweaking
  Prodding
  Removing bottlenecks
  ...
  Scaling out

Scaling out with Pentaho Data Integration
  Clustering
  Partitioning
  Database sharding/partitioning


Performance / Scalability

The “Ideal” architecture has linear scalability

N systems

N times faster
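Written as a formula, with T(1) the run time on one system and T(N) the run time on N systems (symbols introduced here for illustration, they are not on the slide), linear scalability means:

```latex
T(N) = \frac{T(1)}{N}
\qquad\Longrightarrow\qquad
S(N) = \frac{T(1)}{T(N)} = N
```

Here S(N) is the speed-up. Real systems usually fall short of this ideal because of the limits listed on the previous slide.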


Scalability

The answer from PDI to this challenge is multi-layered:
  Clustering : multiple servers running in parallel
  Partitioning : directing data
  Database sharding : scaling the database

[Figure: a Sales table partitioned by year (2003, 2004, 2005, 2006); Sales tables with the same year partitions on databases DB1, DB2, DB3 and DB4]
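A rough sketch of how partitioning and sharding combine when directing rows. The year-based partitions and the names DB1 to DB4 come from the slide; the `customer_id` column, the modulo rule and the `sales_<year>` naming are assumptions made up for this example, not how PDI implements it.

```python
from collections import defaultdict

DATABASES = ["DB1", "DB2", "DB3", "DB4"]  # shard names as on the slide


def partition_for(row):
    """Partitioning: direct a sales row to the partition for its year."""
    return f"sales_{row['year']}"  # hypothetical partition naming


def shard_for(row, databases=DATABASES):
    """Sharding: pick a database from the customer id (hypothetical rule)."""
    return databases[row["customer_id"] % len(databases)]


def route(rows):
    """Group rows by (database, partition) so each group can load in parallel."""
    targets = defaultdict(list)
    for row in rows:
        targets[(shard_for(row), partition_for(row))].append(row)
    return dict(targets)


rows = [
    {"customer_id": 1, "year": 2003, "amount": 10.0},
    {"customer_id": 2, "year": 2004, "amount": 20.0},
    {"customer_id": 5, "year": 2003, "amount": 30.0},
    {"customer_id": 8, "year": 2006, "amount": 40.0},
]
routed = route(rows)
for target, group in sorted(routed.items()):
    print(target, len(group))
```

Each (database, partition) group is independent of the others, which is what lets several servers load them at the same time.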


Slave Servers

Building block of the PDI clustering offering

Small embedded webserver (Jetty)

Controlled over HTTP

Spits out XML or HTML [demo]

Easy to start / configure / use

Available HTTP services:
  Start / Stop transformation or Job
  Pause transformation
  Adding (posting) transformation or job
  Get status of server, transformation or job
  Cleanup of transformation
  Allocate a socket port
  Register slave
  Get list of slaves
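Because the slave server is controlled over HTTP, any HTTP client can talk to it. The sketch below is an illustration under assumptions: the `/kettle/status/` and `/kettle/transStatus/` paths, the `xml=Y` parameter, the `cluster`/`cluster` credentials and the `statusdesc` element are taken from Carte as I know it from later releases, so check them against your version. Only the URL building and XML parsing are exercised here; the network call is left commented out.

```python
import base64
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def service_url(host, port, service, **params):
    """Build a slave-server service URL; xml=Y asks for XML instead of HTML."""
    query = urllib.parse.urlencode({**params, "xml": "Y"})
    return f"http://{host}:{port}/kettle/{service}/?{query}"


def fetch(url, user, password):
    """GET a service with HTTP basic authentication and return the body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def status_description(xml_text):
    """Extract the status text from a reply (element name is an assumption)."""
    node = ET.fromstring(xml_text).find("statusdesc")
    return node.text if node is not None else None


url = service_url("localhost", 8080, "status")
print(url)
# Against a running slave server (not executed here):
# print(status_description(fetch(url, "cluster", "cluster")))
```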


Slave Server Configuration

Simple configuration : Hostname & HTTP Control port

sh carte.sh localhost 8080

XML configuration:
  Optionally look at network interface to grab address
  Optionally report to a central server
  Etc.
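As an illustration of the XML route, this sketch generates a small configuration file for a slave that reports to a central master. The element names (`slave_config`, `masters`, `slaveserver`, `report_to_masters`, `network_interface`) follow the sample Carte configuration files as I recall them and should be treated as assumptions to verify against your Kettle version; the server names and ports are made up.

```python
import xml.etree.ElementTree as ET


def server_element(name, hostname, port, is_master):
    """Describe one server as a <slaveserver> element."""
    node = ET.Element("slaveserver")
    fields = (
        ("name", name),
        ("hostname", hostname),
        ("port", str(port)),
        ("master", "Y" if is_master else "N"),
    )
    for tag, value in fields:
        ET.SubElement(node, tag).text = value
    return node


def slave_config(slave, master=None, network_interface=None):
    """Build the configuration; slave and master are (name, hostname, port)."""
    root = ET.Element("slave_config")
    if master is not None:
        # Optionally report to a central server.
        ET.SubElement(root, "masters").append(server_element(*master, True))
        ET.SubElement(root, "report_to_masters").text = "Y"
    node = server_element(*slave, False)
    if network_interface:
        # Optionally look at a network interface to grab the address.
        ET.SubElement(node, "network_interface").text = network_interface
    root.append(node)
    return ET.tostring(root, encoding="unicode")


config_xml = slave_config(
    ("slave1", "localhost", 8081), master=("master1", "localhost", 8080)
)
print(config_xml)
```

The resulting file would be passed to `carte.sh` in place of the hostname and port arguments.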


Clustering Schema

Consists of a collection of one or more slave servers

Is made up of
  Master : at least one per cluster
  Slaves : parallel worker nodes

Simple local sample [demo]

[Figure: a Cluster with one Master and several Slaves]
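The structure of a cluster schema can be modelled as a small data structure. This is a sketch for illustration only: the class and field names are invented here and are not Kettle's Java API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SlaveServer:
    name: str
    hostname: str
    port: int
    master: bool = False  # True for a master, False for a worker node


@dataclass
class ClusterSchema:
    name: str
    servers: List[SlaveServer] = field(default_factory=list)

    def masters(self):
        return [s for s in self.servers if s.master]

    def slaves(self):
        return [s for s in self.servers if not s.master]

    def validate(self):
        """Enforce the rule from the slide: at least one master per cluster."""
        if not self.masters():
            raise ValueError(f"cluster '{self.name}' has no master")
        return self


# A simple local cluster: every server runs on localhost on its own port.
local = ClusterSchema("local", [
    SlaveServer("master", "localhost", 8080, master=True),
    SlaveServer("slave1", "localhost", 8081),
    SlaveServer("slave2", "localhost", 8082),
]).validate()
print(len(local.masters()), "master,", len(local.slaves()), "slaves")
```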


The Cloud

Wikipedia:

“Cloud computing is a style of computing in which dynamically scalable and often virtualised resources are provided as a service over the Internet”


The Cloud : types

Infrastructure as a Service (IaaS) : Amazon EC2, Eucalyptus, GoGrid, Nimbus, ...

Platform as a Service (PaaS) : Amazon Web Services, AppJet (ex Google guys), Azure Services Platform (MS), Force.com (SalesForce), ...

Software as a Service (SaaS) : SalesForce, Google, Lucidera and many others


The Cloud : white paper

Bayon Technologies (Pentaho Partner)

http://www.bayontechnologies.com

Sorting 600M line-item rows from TPC-H


The Cloud : Demo time

Start a dynamic cluster on EC2

Job:
  Execute DDL on all slave servers

Design a first transformation :
  Load a file on the cloud in parallel into 10 MySQL databases, in 10 tables, splitting the rows
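The two demo steps can be mimicked in a few lines to show the idea. This is a sketch under assumptions: the database names `db1` to `db10`, the `lines_N` tables and their columns are hypothetical, and round-robin is only one possible way to split the rows.

```python
TARGETS = 10  # the demo uses 10 MySQL databases with one table each


def ddl_per_target(targets=TARGETS):
    """One CREATE TABLE statement per target (all names are hypothetical)."""
    return {
        f"db{i}": f"CREATE TABLE lines_{i} (line_nr INT, line VARCHAR(255))"
        for i in range(1, targets + 1)
    }


def split_rows(rows, targets=TARGETS):
    """Deal rows out round-robin so target sizes differ by at most one row."""
    buckets = [[] for _ in range(targets)]
    for index, row in enumerate(rows):
        buckets[index % targets].append(row)
    return buckets


ddl = ddl_per_target()
buckets = split_rows([f"line {n}" for n in range(25)])
print(len(ddl), [len(b) for b in buckets])
```

In the demo the DDL runs once per slave server and each slave loads only its own share of the file.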


The Kettle Cloud : links

My blog : http://www.ibridge.be

http://wiki.pentaho.com/display/EAI/Dynamic+clusters


Q & A

Our homepage: http://kettle.pentaho.org

Our Forum: http://forums.pentaho.org/forumdisplay.php?f=69

Our case tracker: http://jira.pentaho.org/browse/PDI

Our wiki : http://wiki.pentaho.org/
http://wiki.pentaho.com/display/EAI

Our IRC Channel: ##pentaho (on Freenode)

Developers mailing list:

http://groups.google.com/group/kettle-developers

My humble blog: http://www.ibridge.be

My coordinates: [email protected]