Talend Data Integration and Management

Upload: phamtu

Post on 05-May-2018


www.robertomarchetto.com

Data Integration

Data Integration involves combining data residing in different sources and providing the user with a unified view of the data.

Data Management combines different disciplines to manage data as a valuable resource.

Talend

● Talend is a company focused on Data Integration and Data Management solutions
● Named a "Cool Vendor" by Gartner (2010)
● Present in more than 12 locations around the world
● Fast-growing company

Talend Open Studio

● Open Source, professional tool
● Draw procedures by linking components; each component performs an operation
● DB vendor-specific optimized components
● Produces fully editable Java (or Perl) code
● Deployment as small, fast compiled Java or as a Web Service
● Eclipse-based IDE, excellent flexibility
● BI platform independent, DB vendor independent

Automatic code generation, different deployment options

Extraction, Transformation, Loading (ETL)

● ETL is a common process in Data Integration
● Extract: reading data from different data sources (databases, flat files, spreadsheet files, web services, etc.)
● Transform: converting data into a form that can be placed in another container (database, web services, files, etc.). Cleaning, computations and verifications are also performed
● Load: writing the data in the target format
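
The three ETL steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code generated by Talend: the CSV content, the `sales` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a flat file with inconsistent formatting.
SOURCE_CSV = """id,name,amount
1, Alice ,10.50
2,BOB,
3,carol,7
"""

def extract(text):
    """Extract: read rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names, convert types, drop rows that fail verification."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # verification: skip rows without an amount
            continue
        cleaned.append((int(row["id"]), row["name"].strip().title(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the rows into the target table (hypothetical 'sales' table)."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(result)
```

The row for id 2 is dropped because its amount is missing, which is the kind of verification the Transform step typically performs.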

Tutorial, Source data

Tutorial, Destination data (Datawarehouse)

Tutorial, Metadata

● Talend requires a preliminary definition of the metadata

● As in programming languages, a strong metadata definition often leads to fast, robust and maintainable applications

● ..demo..

Tutorial, Talend jobs basics

● Place components on the designer
● Link components to build a transformation
● Main type of link: row flow
● Schema metadata is propagated and must be coherent
● ..demo..
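
The idea of linking components through a row flow with coherent schemas can be mimicked outside the tool. The sketch below is a plain-Python analogy, not Talend's generated code or API: the `Component` class, the `link` function and the sample components are all hypothetical names introduced for illustration.

```python
class Component:
    """Hypothetical stand-in for a job component with declared schemas."""
    def __init__(self, name, in_schema, out_schema, func):
        self.name = name
        self.in_schema = in_schema    # column names expected on the incoming row flow
        self.out_schema = out_schema  # column names produced on the outgoing row flow
        self.func = func              # function mapping one row (dict) to one row

def link(*components):
    """Link components in order, refusing links whose schemas are not coherent."""
    for upstream, downstream in zip(components, components[1:]):
        if upstream.out_schema != downstream.in_schema:
            raise ValueError(
                f"schema mismatch between {upstream.name} and {downstream.name}"
            )

    def run(rows):
        # Each row flows through every component in sequence.
        for row in rows:
            for component in components:
                row = component.func(row)
            yield row

    return run

schema = ["name", "age"]
trim = Component("trim", schema, schema,
                 lambda r: {"name": r["name"].strip(), "age": r["age"]})
to_int = Component("to_int", schema, schema,
                   lambda r: {**r, "age": int(r["age"])})

job = link(trim, to_int)
rows = list(job([{"name": " Ann ", "age": "31"}]))
print(rows)
```

Checking the schemas when the link is created, rather than when rows arrive, reflects the point that a mismatch is a design error and should surface before the job runs.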

Tutorial, users_dimension

Test the job

Tutorial, accounts_dimension

Tutorial, dates_dimension

Tutorial, write a Java library

Tutorial, opportunities_fact

Tutorial, define a root job

Deploy and run

Extensibility, community plugins

● Many official components

● Components for every task released by the community

● Geospatial components, log analysis, Google Analytics, data encryption, etc.

Scheduler

And now.. reports, dashboards, OLAP, Geoanalysis, KPIs..

Do you trust your data?

What about data quality?

● Customer A is present 5 times with different names
● Null values can distort statistical indexes such as the mean
● Duplicated records
● Blank values
● Some records can contain errors (e.g. field values of -1)
● Some records can be garbage
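
Several of these problems can be detected with simple checks. The following sketch uses made-up records and a deliberately naive name normalization (lowercase, alphanumeric characters only); it is an assumption-laden illustration, not how a profiling tool works internally.

```python
from collections import Counter
from statistics import mean

# Hypothetical records showing the problems listed above.
records = [
    {"customer": "ACME Ltd", "revenue": 100.0},
    {"customer": "acme ltd.", "revenue": 100.0},  # same customer, different spelling
    {"customer": "Beta", "revenue": None},        # null value
    {"customer": "", "revenue": 50.0},            # blank value
    {"customer": "Gamma", "revenue": -1},         # -1 used as an error marker
]

def normalize(name):
    """Naive normalization: lowercase and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

name_counts = Counter(normalize(r["customer"]) for r in records if r["customer"].strip())
duplicates = {name: n for name, n in name_counts.items() if n > 1}

blank_count = sum(1 for r in records if not r["customer"].strip())
null_count = sum(1 for r in records if r["revenue"] is None)
sentinel_count = sum(1 for r in records if r["revenue"] == -1)

# Mean with nulls treated as 0 and the -1 marker left in, versus a cleaned mean.
naive_mean = mean(0 if r["revenue"] is None else r["revenue"] for r in records)
clean_mean = mean(r["revenue"] for r in records
                  if r["revenue"] is not None and r["revenue"] >= 0)

print(duplicates, blank_count, null_count, sentinel_count)
print(naive_mean, clean_mean)
```

The gap between the two means shows why nulls and error markers need to be handled before any statistic is computed.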

Talend Open Profiler

What about data storage size?

● Some fields can be oversized for the data they contain
● Sometimes fields are related and can be calculated
● Some keys or values are never used
● When data grows, garbage grows
● Data storage is not free (disks, electricity, backups, DB licenses)
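
Oversized and never-used fields can be spotted by comparing declared widths with the values actually stored. The sketch below assumes a hypothetical table with three text columns; the column names, declared widths and the `profile_columns` function are invented for illustration.

```python
# Hypothetical declared column widths, e.g. VARCHAR sizes from the table definition.
declared_widths = {"code": 50, "city": 100, "notes": 255}

# Hypothetical sample of stored rows.
table_rows = [
    {"code": "A1", "city": "Rome", "notes": None},
    {"code": "B22", "city": "Milan", "notes": None},
]

def profile_columns(rows, widths):
    """Report, per column, the declared width, the longest stored value and usage."""
    report = {}
    for column, width in widths.items():
        values = [row[column] for row in rows if row[column] is not None]
        max_used = max((len(v) for v in values), default=0)
        report[column] = {
            "declared": width,
            "max_used": max_used,
            "never_used": not values,  # True when every row holds None
        }
    return report

column_report = profile_columns(table_rows, declared_widths)
for column, stats in column_report.items():
    print(column, stats)
```

A column whose longest value is far below its declared width, or which is never used at all, is a candidate for resizing or removal.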

Data is "the black gold" that can produce knowledge

● Data is a resource from which you can extract knowledge
● A lot of data produces concise information
● Data storage is not free, and a lot of data can slow systems down
● Data cleansing is a central process in statistical analysis and Data Mining

Talend Master Data Management