building a data lake: data integration between hadoop · pdf fileolh – hive partition...

32
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Constructing a Data Lake: Hadoop and Oracle Database United!

Upload: phungdan

Post on 06-Mar-2018

230 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Sharon Sophia Stephen

Big Data PreSales Consultant

February 21, 2015

Constructing a Data Lake: Hadoop and Oracle Database United!

Page 2: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor

The following is intended to outline our general product direction. It is

intended for information purposes only, and may not be incorporated

into any contract. It is not a commitment to deliver any material, code,

or functionality, and should not be relied upon in making purchasing

decisions. The development, release, and timing of any features or

functionality described for Oracle’s products remains at the sole

discretion of Oracle.

Page 3: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Program Agenda

Hadoop Overview

Data Lake

How the technologies can work together

Tools for Integration

Q&A

1

2

3

4

5

Oracle Confidential – Internal/Restricted/Highly Restricted 3

Exadata +

Oracle Database

Page 4: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

What Is Hadoop?

“The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

Page 5: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

What is HDFS?

• HDFS is the primary storage system underlying Hadoop

• Fault tolerant, scalable, highly available

• Designed to be well-suited to distributed processing –Splits large files into blocks

–Multiple copies stored on different disks on separate nodes

• Is superficially structured like a UNIX file system

The Hadoop Distributed File System

Page 6: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

A MapReduce (True distributed computing) Analogy Going From Estimates To Actuals

Page 7: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

A MapReduce (True distributed computing) Example Putting The Analogy Into Practice

Page 8: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

MapReduce Phases – Map

• Each Map task usually works on a single input split

• Hadoop tries to run map tasks on the slave node that contains the stored HDFS data block (data locality)

• The input is presented to the Map phase as a key-value pair

– Shuffle and Sort • Groups all the values together for each key (using the intermediate output data from all of the

completed mappers)

• Sorts the keys

– Reduce • The intermediate output of the Shuffle and Sort phase is the input to the Reduce phase

• The Reduce function (developer) generates the final output

Page 9: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

What else is Hadoop

• Apache Hadoop

• Apache Hive

• Apache Pig

• Apache HBase

• Apache Zookeeper

• Apache Flume

• Apache Sqoop

• Apache Mahout

• Apache Whirr

• Apache Oozie

• Fuse-DFS

• Hue

• Plus Additional projects

• Impala

• BDR

• Sentry

Page 10: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Two Great Tastes That Go Great Together

Hadoop

Hadoop Database

A. Hadoop can do some things a Database reasonably does not.

Database

Things you can do (Big Oval)

Things you can do cost-effectively (Small Oval)

B. Hadoop expands the amount of things you can do cost-effectively.

Page 11: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Two Great Tastes That Go Great Together

Page 12: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data lake

Oracle Confidential – Internal/Restricted/Highly Restricted 12

Page 13: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Page 14: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Focus in this Session

14

Data organized for fast query

Structured schema

Complex programming models

Read, write, delete, update

Access specific record

Relational Hadoop

Data in files

Schema on read

Simple programming model for

large scale data processing

Append only

Sequential access of blocks

Page 15: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data Connectors

Data Load Oracle Loader for Hadoop

Data Access Oracle SQL Connector for

HDFS

Oracle Data Integrator Knowledge Modules

R Analytics Oracle R Advanced Analytics

on Hadoop

XML/XQuery Oracle XQuery on Hadoop

XQuery R Client

Optimized for Hadoop: Maximize parallelism

Fast performance Analyze data on Hadoop using

familiar client tools

Oracle Bigdata SQL (for eng. System)

Page 16: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Integrating data – Hadoop and Oracle databases

Transferring from Hadoop to Oracle RDBMS Transferring from Oracle RDBMS to Hadoop

OSCH – For CSV Text

OSCH – Hive Non Partition

OLH – For CSV text

OLH – Hive Non-partition CopytoBDA (only on Engineered system platform)

OLH – Data pump (offline) Oracle Data Integrator

OLH – Hive partition Oracle Golden gate

OLH – Hive Partition parquet Sqoop

Sqoop – – For CSV text

Sqoop – Hive partition

Sqoop – Hive partition parquet

Sqoop 2 - Hive partition parquet

Page 17: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Loader for Hadoop

• Parallel load, optimized for Hadoop

• Automatic load balancing

• Convert to Oracle format on Hadoop

– Save database CPU

• Load specific Hive partitions

• Kerberos authentication

• Load directly into In-Memory table

JSON Log

files Hive

Text Parquet Avro Sequence

files

Compressed

files And more …

Page 18: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Loader for Hadoop

Performance • Extremely fast performance

• Sample numbers (on Oracle Engineered Systems)

– 4.4 TB/hour end-to-end (load + Hadoop process)

• Much higher than typical customer requirements

• Optimized for Oracle Big Data Appliance and Oracle Exadata: InfiniBand Connectivity

Page 19: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle SQL Connector for HDFS

OSCH

Hive Text

OSCH

OSCH

OSCH

External

Table

create table customer_address

( ca_customer_id number(10,0)

, ca_street_number char(10)

, ca_state char(2)

, ca_zip char(10))

organization external (

TYPE ORACLE_LOADER

DEFAULT DIRECTORY DEFAULT_DIR

ACCESS PARAMETERS

(…)

PREPROCESSOR “HDFS_BIN_PATH:hdfs_stream”)

LOCATION (‘addr1’, ‘addr2’, ‘addr3’))

• Parallel query and load

• Load into database or query in place

• Access text or Hive over text

• Access compressed data

• Access specific Hive partitions

• Kerberos authentication

Compressed

files

Page 20: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle SQL Connector for HDFS

• Includes tool to generate external table

• Performance on Engineered Systems

– 15 TB/hour load time

• Query and load Oracle Data Pump files

– Binary file in Oracle format

– Uses less database CPU cycles during query/load

Page 21: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle SQL Connector for Hadoop works with multiple versions

Database versions (on any operating system*)

10.2.0.5 and greater

11.2.0.3 and greater

12c

Hadoop versions Certified by

Apache Hadoop 2.x Oracle

CDH 4.x (Cloudera) Oracle

CDH 5.x (Cloudera) Oracle

HDP 1.3 (Hortonworks) Hortonworks

HDP 2.1 (Hortonworks) Hortonworks *Oracle SQL Connector for HDFS requires

Hadoop client to be supported on the operating system

Page 22: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Test Case

Oracle Confidential – Internal/Restricted/Highly Restricted 22

Page 23: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Copyright © 2012, Oracle and/or its affiliates. All rights reserved. 23

Data Generation

• Data was generated using a tool

– Built by the product development team.

– Used for benchmarking.

– Generates data within HDFS.

– Random data.

• The data-files were a combination of various data types such as int, float, varchar, date and timestamp.

Page 24: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Integrating data – Oracle databases to Hadoop

Things to consider while moving data from Hadoop to Oracle RDBMS.

• OSCH has proven consistently to be the best tool if you have to move

massive amount of text data . OSCH moved 14 TB in an hour on

Engineered systems.

• OLH is the best tool if you are dealing with large amount of partitioned

tables on both Source and Target.

• Generally aligning # of files at the source to the # of partition at the target

and with DOP gives better performance.

Page 25: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Loader for Hadoop Vs Oracle SQL connector for Hadoop

Oracle Loader for

Hadoop

Oracle SQL

connector for

Hadoop

Page 26: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Data Integration for Big Data and Hadoop

Oracle Confidential 26

Comprehensive data integration platform designed to work with all data Oracle Data Integrator

(Data Transformation)

Enterprise Data Quality (Profile, Cleanse, Match and De-duplicate)

Fast Load

Oracle GoldenGate (Data Replication)

Enterprise Metadata Management (Lineage, Impact Analysis and Data Provenance)

Data Replication

Continuous data staging into Hadoop

Data Transformation

– Pushdown processing in Hadoop

Data Federation

– Query Hadoop SQL via JDBC

Data Quality

– Fix quality at the source or invoke Machine Learning in Hadoop

Metadata Management

– Lineage and Impact Analysis w/Hadoop

Data Service Integrator (Data Federation)

Synchronization

Realtime Staging

Pushdown Data Transformations

Page 27: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL Query All Data without Application Change or Data Conversion

Page 28: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Summary

• Fast, easy, integration of all data in your Big Data solution

• Oracle Big Data Connectors

• Oracle Big Data SQL (on Oracle Engineered Systems)

• Oracle Data Integrator

• Oracle Golden Gate

• Apache Sqoop

Page 30: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Page 31: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake:

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Page 32: Building a Data Lake: Data Integration between Hadoop · PDF fileOLH – Hive partition Oracle Golden gate ... Oracle SQL Connector for Hadoop works with ... Building a Data Lake: