Copyright (c) 2014 Steve O'Hearn http://www.databasetraining.com
TRANSCRIPT

Page 4:

1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop

2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS

3. Understand the purpose, benefits, and limitations of the various techniques and tools

Page 6:

What is the Oracle RDBMS? What is Hadoop?

Page 7:

Source: Oracle Database Concepts, 12c Release 1 (12.1)

Page 8:

Framework of tools
Open source – Java, Apache Software Foundation projects
Several tools: HDFS (storage) and MapReduce (analysis); also HBase, Hive, Pig, Sqoop, Flume, and more
Network sockets

Page 9:

Hadoop Distributed File System
Text files, binary files
Very large data blocks: 64MB minimum, 1GB or higher possible; typical is 128MB or 256MB
Replication – 3-copy default
Namenode and Datanodes
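To make the block-size and replication bullets concrete, here is a small sketch of the arithmetic. The class, constant, and method names are ours for illustration, not part of Hadoop:

```java
// Sketch of HDFS-style block sizing (illustrative only; names are ours,
// not Hadoop's). A file is stored as a sequence of fixed-size blocks,
// and each block is replicated (3 copies by default).
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128MB, a typical block size

    // Number of blocks needed for a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        // A 1GB file occupies 8 blocks of 128MB; with 3-way replication
        // the cluster stores 24 block copies in total.
        System.out.println(blockCount(oneGB) + " blocks, "
                + blockCount(oneGB) * 3 + " replicated copies");
    }
}
```

The large block size is a deliberate trade-off: fewer, bigger blocks keep Namenode metadata small and favor sequential scans over random access.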

Page 10:

Analytical engine of Hadoop
JobTracker, TaskTracker
Interprets data at runtime instead of using predefined schemas

Page 11:

[Diagram: the Hadoop stack – Pig and Hive atop MapReduce; HBase and the HDFS API atop HDFS; HDFS holds text files, binary files, and other formats. NOTE: There are file systems other than HDFS.]

Page 12:

            Oracle RDBMS                Hadoop
Integrity   High                        Low
Schema      Structured                  Unstructured
Use         Frequent reads and writes   Write once, read many
Style       Interactive and batch       Batch

Page 13:

1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop

2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS

3. Understand the purpose, benefits, and limitations of the various techniques and tools

Page 14:

Sample scenario:
Move data from Oracle into Hadoop
Perform MapReduce on datasets that include Oracle data
Move MapReduce results back into Oracle for analysis, reporting, etc.

Other uses:
Oracle queries that join with Hadoop datasets
Scheduled batch MapReduce results to be warehoused

Page 15:

Oracle and Hadoop together form a comprehensive platform for managing all forms of data, both structured and unstructured.

Hadoop provides "big data" processing. Oracle provides analytic capabilities not found in Hadoop. (NOTE: This is changing.)

Page 16:

1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop

2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS

3. Understand the purpose, benefits, and limitations of the various techniques and tools

Page 18:

[Diagram: PUSH and PULL paths between the Oracle RDBMS and HDFS – SELECT via SQL, PL/SQL, and Java program units (stored procedures, etc.); custom Java with JDBC, DBOutputFormat, the FileSystem API, and Avro; custom Java with JDBC, DBInputFormat, the FileSystem API, and Avro; Sqoop; Oracle Loader for Hadoop; Oracle SQL Connector for HDFS.]

Page 19:

SQL's SELECT statement
Use Java to spool and control output
Use string concatenation to create delimited text HDFS files
Use the Java Avro API to create serialized binary HDFS files

PUSH
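A minimal sketch of the string-concatenation technique. The class and method names are illustrative; in practice the column values would come from a java.sql.ResultSet while spooling SELECT output:

```java
// Minimal sketch of building one comma-delimited line for an HDFS text file
// from a row's column values. Hypothetical names; real code would pull the
// values from a java.sql.ResultSet while spooling a SELECT.
public class DelimitedRow {
    // Join values with the delimiter, quoting any value that contains it.
    static String toLine(String delim, String... values) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) line.append(delim);
            String v = values[i];
            line.append(v.contains(delim) ? "\"" + v + "\"" : v);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // One row: ID, name, date. The embedded comma forces quoting.
        System.out.println(toLine(",", "1001", "Smith, John", "24-APR-2014"));
        // -> 1001,"Smith, John",24-APR-2014
    }
}
```

Each such line is appended to a file that is then copied into HDFS, where MapReduce jobs can split it on the delimiter at read time.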

Page 20:

To connect to the RDBMS: JDBC
Interacts with the RDBMS:
  DBInputFormat: reading from a database
  DBOutputFormat: dumping to a database
  Generates SQL; best for smaller amounts of data
  org.apache.hadoop.mapreduce.lib.db
To interact with HDFS files: FileSystem API; Avro API (for binary files)

PUSH

PULL

Page 21:

Sqoop = "SQL to Hadoop"
Command line
Works with any JDBC-compliant RDBMS
Works with any external system that supports bulk data transfer into Hadoop (HDFS, HBase, Hive)
Strength: transfer of bulk data between Hadoop and RDBMS environments
Read / Write / Update / Insert / Delete
Stored procedures (warning: parallel processing)

PUSH

PULL

Page 22:

Open source / Java
Apache Top-Level Project (graduated from incubator level March 2012)
Bundled with:
  Oracle Big Data Appliance
  CDH (Cloudera Distribution including Apache Hadoop)
Also available from the Apache Software Foundation: http://sqoop.apache.org
Latest version of Sqoop2: 1.99.3 (as of 4/15/14)
Wiki: https://cwiki.apache.org/confluence/display/SQOOP

PUSH

PULL

Page 23:

Note: Sqoop cannot currently load SequenceFile or Avro into Hive.

Text: human-readable
Binary: precision, compression; examples: SequenceFile (Java-specific), Avro

PUSH

PULL

Page 24:

Interacts with structured data stores outside of HDFS

Moves data from structured data stores into HBase

Moves analytic results out of Hadoop to a structured data store


PUSH

PULL

Page 25:

Interrogates the RDBMS data dictionary for the target schema
Uses MapReduce to import data into Hadoop
Parallel operation – configurable
Fault tolerance – configurable
Datatype mapping: Oracle SQL data types to Java data types (VARCHAR2 = String; INTEGER = Integer, etc.)
Generates a Java class for the structured schema
  Bean: "get" methods
  Write methods:

public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;

PUSH

PULL
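The generated class can be pictured as follows. This is a hand-written sketch, not actual Sqoop output: the real generated class also implements Hadoop's Writable/DBWritable interfaces, while this stripped-down version keeps only the bean accessors and the two methods named above, so it compiles with the JDK alone. The table and column names are hypothetical:

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hand-written sketch of a Sqoop-style record class for a hypothetical
// table EMP(EMP_ID INTEGER, EMP_NAME VARCHAR2). Not actual Sqoop output.
public class EmpRecord {
    private Integer empId;   // INTEGER  -> java.lang.Integer
    private String empName;  // VARCHAR2 -> java.lang.String

    // Bean-style "get" methods, as on the slide.
    public Integer getEmpId() { return empId; }
    public String getEmpName() { return empName; }

    // Populate this record from the current row of a JDBC result set.
    public void readFields(ResultSet dbResults) throws SQLException {
        empId = dbResults.getInt(1);
        empName = dbResults.getString(2);
    }

    // Bind this record's fields to an INSERT statement's parameters.
    public void write(PreparedStatement dbStmt) throws SQLException {
        dbStmt.setInt(1, empId);
        dbStmt.setString(2, empName);
    }
}
```

One such object is materialized per row, which is how Sqoop's mappers shuttle typed records between JDBC and HDFS in parallel.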

Page 26:

$ sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen              Generate code to interact with database records
  create-hive-table    Import a table definition into Hive
  eval                 Evaluate a SQL statement and display the results
  export               Export an HDFS directory to a database table
  help                 List available commands
  import               Import a table from a database to HDFS
  import-all-tables    Import tables from a database to HDFS
  job                  Work with saved jobs
  list-databases       List available databases on a server
  list-tables          List available tables in a database
  merge                Merge results of incremental imports
  metastore            Run a standalone Sqoop metastore
  version              Display version information

PUSH

PULL

Page 27:

$ sqoop list-databases --connect "jdbc:mysql://localhost" --username steve --password changeme

14/04/24 15:35:21 INFO manager.SqlManager: Using default fetchSize of 1000

netcents2dt_siteexam_moduleteam_wiki

$

PUSH

PULL

Page 28:

Generic JDBC Connector
Connectors for major RDBMSs: Oracle, MySQL, SQL Server, DB2, PostgreSQL
Third party: Teradata, Netezza, Couchbase
Third-party connectors may support direct import into third-party data stores

PUSH

PULL

Page 29:

User Guides: Installation; Upgrade; Five Minute Demo; Command Line Client

Developer Guides: Building Sqoop2; Development Environment Setup; Java Client API Guide; Developing Connector; REST API Guide

PUSH

PULL

Page 30:

Essentially "SQL*Loader" for Hadoop
Java MapReduce application
Runs as a Hadoop utility with a configuration file
Extensible (attention, Java programmers)
Command line or standalone process
Online and offline modes
Requires an existing target table (staging table!)
Loads data only; cannot edit Hadoop data
Pre-partitions data if necessary
Can pre-sort data by primary key or user-specified columns before loading
Leverages Hadoop's parallel processing

PULL

Page 31:

PULL

[Diagram: Oracle Loader for Hadoop flow – mappers read from HDFS, pre-sorting feeds a reducer, and the reducer loads the RDBMS via JDBC.]

Page 32:

Oracle Loader advantages:
  Regular expressions (vs. as-is delimited file import)
  Faster throughput (vs. Sqoop JDBC)
  Data dictionary interrogation during load
  Support for runtime rollback (Sqoop generates INSERT statements with no rollback support)

Sqoop advantages:
  One system for bi-directional transfer support

Page 33:

Essentially the "external table" feature for Hadoop
Text files only – no binary file support
Treats HDFS as an external table
Read-only (no import / transfer)
No indexing
No INSERT, UPDATE, or DELETE
As-is data import
Full table scan

PULL

Page 34:

CREATE TABLE CUSTOMER_LOGFILES (
    LOGFILE_ID   NUMBER(20),
    LOG_NOTE     VARCHAR2(120),
    LOG_DATE     DATE)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY log_file_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        BADFILE log_file_dir:'log_bad'
        LOGFILE log_file_dir:'log_records'
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL (
            LOGFILE_ID,
            LOG_NOTE,
            LOG_DATE CHAR DATE_FORMAT DATE MASK "dd-mon-yyyy"))
    LOCATION ('log_data_1.csv', 'log_data_2.csv'))
PARALLEL
REJECT LIMIT UNLIMITED;

Page 35:

Oracle to CDH via Sqoop
Freeware plug-in to CDH (Cloudera Distribution including Apache Hadoop)
Java command-line utility
Saves Hive HQL output to an Oracle database

PULL

Page 36:

There is no one best solution.

Apache Sqoop and Java APIs:
  Bi-directional
  Read/Write/Insert/Update/Delete
  Limitation: JDBC and available connectors
  Requires knowledge of Java

Oracle Loader:
  Offers preprocessing and speed
  Unidirectional

Oracle SQL Connector:
  Integrates with existing SQL calls
  Limited to HDFS text files

Third-party tools (Cloudera, Hortonworks, etc.) are adding features to Hadoop that may reduce demand for moving data back to Oracle.

Page 37:

Steve O'Hearn
[email protected]
DatabaseTraining.com and Corbinian.com