Data Integration on Hadoop
Sanjay Kaluskar, Senior Architect, Informatica
Feb 2011


DESCRIPTION

Presentation at Hadoop user summit in Bangalore on Feb 16, 2011 by Sanjay Kaluskar

TRANSCRIPT

Page 1: Data integration-on-hadoop


Data Integration on Hadoop
Sanjay Kaluskar, Senior Architect, Informatica

Feb 2011

Page 2: Data integration-on-hadoop


Introduction

• Challenges
  • Results of analysis or mining are only as good as the completeness & quality of the underlying data
  • Need for the right level of abstraction & tools
• Data integration & data quality tools have tackled these challenges for many years!

More than 4,200 enterprises worldwide rely on Informatica

Page 3: Data integration-on-hadoop

[Diagram: enterprise data sources (files, applications, databases) reached through many access methods & languages (ODBC, JMS, C/C++, OCI, BAPI, SQL, web services, XQuery, JDBC, CLI, PL/SQL, Transact-SQL) and developer tools (Word, Excel, Notepad, vi, Java), contrasted with the Hadoop stack (HDFS, HBase, PIG, Hive, Java, Sqoop).]

• Developer productivity
• Vendor neutrality/flexibility

Page 4: Data integration-on-hadoop


Lookup example

HDFS file (records containing a dept id):

‘Bangalore’, …, 234, …
‘Chennai’, …, 82, …
‘Mumbai’, …, 872, …
‘Delhi’, …, 11, …
‘Chennai’, …, 43, …
‘xxx’, …, 2, …

Database table:

Dept id  Name
82       Stationery
2        Clothing
11       Jewellery

Your choices:
• Move the table to HDFS using Sqoop and join
  • Could use PIG/Hive to leverage the join operator
• Implement Java code to look up the database table
  • Need to use an access method based on the vendor
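The second choice above can be sketched concretely. This is a hypothetical illustration, written in Python rather than Java for brevity: the small department table is cached in memory and each HDFS record is enriched with its department name as it streams by. The data values come from the slide's example; the field layout is simplified.

```python
# Hypothetical sketch of the "implement Java code to look up the database
# table" option. The small dimension table is cached in memory; fact
# records are enriched one by one as they stream past.

def build_lookup(dept_rows):
    """Cache the small dimension table as a dict keyed on dept id."""
    return {dept_id: name for dept_id, name in dept_rows}

def enrich(records, lookup):
    """Stream (city, dept_id) records, attaching the department name."""
    for city, dept_id in records:
        yield city, dept_id, lookup.get(dept_id)

# Values from the slide's example
depts = [(82, "Stationery"), (2, "Clothing"), (11, "Jewellery")]
records = [("Chennai", 82), ("Delhi", 11), ("xxx", 2)]

for row in enrich(records, build_lookup(depts)):
    print(row)
# → ('Chennai', 82, 'Stationery') and so on
```

In the real Java version, `build_lookup` would instead open a vendor-specific connection (JDBC, OCI, …) to fetch the table, which is exactly the vendor lock-in the slide is pointing at.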

Page 5: Data integration-on-hadoop


Or… leverage Informatica’s Lookup

a = load 'RelLookupInput.dat' as (deptid: double);
b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
store b into 'RelLookupOutput.out';

Page 6: Data integration-on-hadoop


Or… you could start with a mapping

[Diagram: a mapping with Load → Filter → Store transformations.]

Page 7: Data integration-on-hadoop


Goals of the prototype

• Enable Hadoop developers to leverage Data Transformation and Data Quality logic
  • Ability to invoke mapplets from Hadoop
• Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
  • Ability to run a mapping on Hadoop

Page 8: Data integration-on-hadoop


Mapplet Invocation

• Generation of a UDF of the right type
  • Output-only mapplet → Load UDF
  • Input-only mapplet → Store UDF
  • Input/output mapplet → Eval UDF
• Packaging into a jar
  • Compiled UDF
  • Other metadata: connections, reference tables
• Invokes the Informatica engine (DTM) at runtime
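The type-selection rule is simple enough to state as code. The following is a hypothetical Python restatement of that generation rule, not the prototype's actual implementation (which generates Java UDFs for PIG):

```python
# Hypothetical restatement of the UDF-type generation rule:
# the UDF flavor follows from which ports the mapplet has.

def udf_kind(has_input, has_output):
    """Pick the PIG UDF flavor generated for a mapplet."""
    if has_output and not has_input:
        return "Load UDF"    # output-only mapplet acts as a data source
    if has_input and not has_output:
        return "Store UDF"   # input-only mapplet acts as a data sink
    if has_input and has_output:
        return "Eval UDF"    # input/output mapplet transforms tuples in flight
    raise ValueError("a mapplet must have input ports, output ports, or both")
```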

Page 9: Data integration-on-hadoop


Mapplet Invocation (contd.)

• Challenges
  • UDF execution is per-tuple; mapplets are optimized for batch execution
  • Connection info/reference data need to be plugged in
  • Runtime dependencies: 280 jars, 558 native dependencies
• Benefits
  • PIG user can leverage Informatica functionality
    • Connectivity to many (50+) data sources
    • Specialized transformations
  • Re-use of already developed logic
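The per-tuple vs. batch mismatch can be illustrated with a small sketch. This is not Informatica's actual mechanism; it is a hypothetical adapter that buffers the tuples a UDF receives one at a time and hands them to a batch-oriented engine entry point, amortizing the per-invocation overhead. All names here are illustrative.

```python
# Hypothetical adapter between a per-tuple UDF interface and a
# batch-optimized engine: tuples are buffered and processed in batches.

class BatchingAdapter:
    def __init__(self, run_batch, batch_size=1000):
        self.run_batch = run_batch    # assumed batch-oriented engine call
        self.batch_size = batch_size
        self._buffer = []
        self._output = []

    def accept(self, tup):
        """Called once per tuple, as a UDF would be."""
        self._buffer.append(tup)
        if len(self._buffer) >= self.batch_size:
            self._flush()

    def _flush(self):
        if self._buffer:
            self._output.extend(self.run_batch(self._buffer))
            self._buffer = []

    def results(self):
        """Flush any trailing partial batch and return all results."""
        self._flush()
        return self._output
```

Note the tension the slide identifies: a PIG Eval UDF must return its result synchronously on each call, so this kind of buffering does not fit the Eval model directly; that is precisely why per-tuple execution against a batch-optimized engine is listed as a challenge.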

Page 10: Data integration-on-hadoop


Mapping Deployment: Idea

• Leverage PIG
  • Map to equivalent operators where possible
  • Let the PIG compiler optimize & translate to Hadoop jobs
• Wrap some transformations as UDFs
  • Transformations with no equivalents, e.g., standardizer, address validator
  • Transformations with richer functionality, e.g., case-insensitive sorter

Page 11: Data integration-on-hadoop


Leveraging PIG Operators

Page 12: Data integration-on-hadoop


Leveraging Informatica Transformations

[Diagram: a pipeline interleaving native PIG operators with Informatica transformations translated to PIG UDFs: Source UDFs and a Lookup UDF feed a Case Converter UDF and a Target UDF, with native PIG steps in between.]

Page 13: Data integration-on-hadoop


Mapping Deployment

• Design
  • Leverages PIG operators where possible
  • Wraps other transformations as UDFs
  • Relies on optimization by the PIG compiler
• Challenges
  • Finding equivalent operators and expressions
  • Limitations of the UDF model: no notion of a user-defined operator
• Benefits
  • Re-use of already developed logic
  • Easy way for Informatica users to start using Hadoop; can also use the designer
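The deployment design reduces to a translation step: map each transformation to an equivalent PIG operator when one exists, and fall back to a UDF wrapper otherwise. A hypothetical Python sketch follows; the operator table is illustrative, not the prototype's actual mapping.

```python
# Illustrative sketch of the operator-mapping step in mapping deployment.
# Transformations with assumed direct PIG equivalents:
PIG_EQUIVALENTS = {
    "Filter": "FILTER",
    "Sorter": "ORDER BY",
    "Joiner": "JOIN",
    "Aggregator": "GROUP ... FOREACH",
}

def translate(transformation):
    """Return the PIG construct a transformation compiles to."""
    op = PIG_EQUIVALENTS.get(transformation)
    if op is not None:
        return ("operator", op)
    # No equivalent (e.g., standardizer, address validator): wrap as a UDF
    return ("udf", transformation)
```

Everything routed through the `"operator"` branch is visible to the PIG compiler's optimizer; everything in the `"udf"` branch is opaque to it, which is the cost of the UDF model's lack of user-defined operators noted above.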

Page 14: Data integration-on-hadoop

Informatica & Hadoop: Big Picture

[Diagram: a Hadoop cluster (HDFS data nodes, name node, job tracker) sits between data sources (weblogs, enterprise applications, databases, semi-structured and unstructured data) and BI / DW-DM targets. Informatica contributes a metadata repository, a graphical IDE for Hadoop development, enterprise connectivity for Hadoop programs, and a transformation engine for custom data processing.]

Page 15: Data integration-on-hadoop

[Diagram: the data sources / access methods / developer tools picture from Page 3, now with the Hadoop stack (HDFS, HBase, PIG, Hive, Java, Sqoop) shown alongside.]

• Developer productivity
  • Connectivity
  • Rich transforms
  • Designer tool
• Vendor neutrality/flexibility
  • Without losing performance

Page 16: Data integration-on-hadoop


Informatica Extras…

• Specialized transformations
  • Matching
  • Address validation
  • Standardization
• Connectivity
• Other tools
  • Data federation
  • Analyst tool
  • Administration
  • Metadata manager
  • Business glossary

Page 17: Data integration-on-hadoop


Page 18: Data integration-on-hadoop


Hadoop Connector for Enterprise data access

• Opens up all the connectivity available from Informatica for Hadoop processing
  • Sqoop-based connectors
  • Hadoop sources & targets in mappings
• Benefits
  • Load data from enterprise data sources into Hadoop
  • Extract summarized data from Hadoop to load into the DW and other targets
  • Data federation
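The "extract summarized data into the DW" flow can be sketched end to end. This is a hedged illustration only: `sqlite3` stands in for the warehouse, a list of lines stands in for HDFS part-file output, and the table and column names are invented for the example.

```python
# Hypothetical sketch of loading summarized Hadoop output into a warehouse.
# sqlite3 simulates the DW; part_lines simulates lines read from HDFS
# part files produced by a summarizing job.
import sqlite3

def load_summary(part_lines, conn):
    """Parse 'city,total' lines and load them into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS city_summary (city TEXT, total INTEGER)")
    rows = [(city, int(total))
            for city, total in (line.split(",") for line in part_lines)]
    conn.executemany("INSERT INTO city_summary VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_summary(["Bangalore,234", "Chennai,125"], conn)
print(conn.execute(
    "SELECT total FROM city_summary WHERE city = 'Chennai'").fetchone()[0])
# → 125
```

With the connector described on this slide, the same movement would be expressed as a mapping with a Hadoop source and a DW target rather than hand-written parsing and insert code.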

Page 19: Data integration-on-hadoop

Informatica Developer Tool for Hadoop

[Diagram: an Informatica developer builds Hadoop mappings in the Hadoop Designer, backed by the metadata repository, and deploys them to the Hadoop cluster: the mapping becomes a PIG script and mapplets become PIG UDFs, executed on HDFS data nodes by the Informatica eDTM with enterprise data access and complex transformations (address cleansing, dedup/matching, hierarchical data parsing).]

Page 20: Data integration-on-hadoop

Reuse Informatica Components in Hadoop

[Diagram: a Hadoop developer invokes Informatica transformations from MapReduce/PIG scripts: mapplets from the metadata repository are packaged as PIG UDFs and executed on HDFS data nodes by the Informatica eDTM, providing enterprise data access and complex transformations (dedup/matching, hierarchical data parsing).]