real-time data loading from oracle and mysql to data warehouses, analytics

32
© 2014 VMware Inc. All rights reserved. Real-time Data Loading from Oracle and MySQL to Data Warehouses and Oracle to Analytics MC Brown, Senior Product Line Manager

Upload: vmware-continuent

Post on 15-Jul-2015

114 views

Category:

Software


5 download

TRANSCRIPT

Page 1: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

© 2014 VMware Inc. All rights reserved.

Real-time Data Loading from Oracle and MySQL to Data Warehouses and Oracle to Analytics MC Brown, Senior Product Line Manager

Page 2: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Continuent Quick Introduction

2

History Products 2004 Continuent established in USA

2009 3rd Generation Continuent Tungsten (aka VMware Continuent) ships

2014 100+ customers running business-critical applications

Oct 2014 Acquisition by VMware: Now part of the vCloud Air Business Unit

Oct 2015 Continuent solutions available through VMware sales

Industry-leading clustering and replication for open source DBMS

Clustering – Commercial-grade HA, performance scaling, and data management for MySQL

Replication– Flexible, high-performance data movement

Page 3: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Business-Critical Deployment Examples

High Availability for MySQL

Largest cluster deployment performs 800M+ transactions/day on 275 TB of relational data

Business Continuity Cross-site cluster topologies widely deployed including primary/DR and multi-master

High Performance Replication

Largest installations transfer billions of transactions daily using high speed, parallel replication

Heterogeneous Integration

Customers replicate from MySQL to Oracle, Hadoop, Redshift, Vertica, and others

Real-time Analytics Optimized data loading for data warehouses with deployments of up to 200 MySQL masters feeding to Hadoop

Continuent Facts

3

Page 4: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Select Continuent Customers

4

Page 5: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Data Warehouse Integration is Changing •  Traditional data warehouse usage was based on dump from transactional store,

loads into data warehouse •  Data warehouse and analytics were done off historical data loaded •  Data warehouses often use merged data from multiple sources, which was hard to

handled •  Data warehouses are now frequently sources as well as targets for data, i.e.:

–  Export data to data warehouse –  Analyze data

–  Feed summary data back to application to display stats to users

Page 6: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Modern Data Warehouse Sequences

Page 7: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

How do we cope with that model •  Traditional Extract-Transform-Load (ETL) methods take too long •  Data needs to be replicated into a data warehouse in real-time •  Continuous stream of information •  Replicate everything •  Use data warehouse to provide join and analytics

Page 8: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Data Warehouse Choices •  Hadoop

–  General purpose storage platform –  Map Reduce for data processing –  Front-end interfaces for interaction in SQL-like (Hive, HBase, Impala) and non-

SQL (Pig, native, Spark)

–  JDBC/ODBC Interfaces improving •  Vertica

–  Massive cluster-based column store –  SQL and ODBC/JDBC Interface

•  Amazon Redshift –  Highly flexible column store –  Easy to deploy

Page 9: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

9

(software formerly known as Tungsten Replicator) is a fast,

open source, database replication engine

Designed for speed and flexibility

GPL V2 license 100% open source

Annual support subscription available

VMware Continuent for Replication/Data Warehouses

Page 10: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

10

Master

(Transactions + Metadata)

Slave

Replicator

(Transactions + Metadata)

Replicator

Download transactions via network

Apply using JDBC

Binlog

THL

THL

Continuent Master/Slave in Action (MySQL)

Page 11: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

11

Master

(Transactions + Metadata)

Slave

Replicator

(Transactions + Metadata)

Replicator

Download transactions via network

Apply using JDBC

CDC

THL

THL

Continuent Master/Slave in Action (Oracle)

Page 12: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

12

Transactional Store Data Warehouse

Dump/Provision

Transactions? X

Batch

The Data Warehouse Impedance Mismatch

Page 13: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Transactional and Data Warehouse Metadata •  Replicating data is not just about the data •  Table structures must be replicated too •  ddlscan handles the translation

–  Migrates an existing MySQL or Oracle schema into the target schema –  Template based

–  Handles underlying datatype matches –  Needs to be executed before replication starts

Page 14: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Replicating into Vertica

Replicator

Replicator

CSV

JS JDBC

cpimport

staging

base

merge

Page 15: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Vertica Demo

Page 16: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Replicating into Redshift

Replicator

Replicator

CSV

JS JDBC

s3cmd

staging

base

merge

COPY

Page 17: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Replicating into Hadoop

Replicator

Replicator

CSV

JS

hadoop fs

Page 18: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Initial Materialization within Hadoop

load-reduce-check

Migrate staging/base DDL

Hive materialization

CSV

StagingTable

Base Table

Page 19: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Hadoop Demo 1

Page 20: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Ongoing Materialization within Hadoop

materialize

Hive materialization

CSV

StagingTable

Base Table

Page 21: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Hadoop Demo 2

Page 22: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Provisioning Options (all data warehouses) •  MySQL

–  Traditional CSV export and import –  Dump and load through Blackhole engine –  Use tungsten_provision_thl

•  Oracle

–  Traditional CSV export and import –  Use parallel extractor

Page 23: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Provisioning Options (Hadoop) •  MySQL

–  Traditional CSV export and import –  Dump and load through Blackhole engine –  Use tungsten_provision_thl –  Use Sqoop

•  Oracle –  Traditional CSV export and import –  Use parallel extractor –  Use Sqoop

Page 24: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Comparing Loading Methods for Hadoop Manual via CSV Sqoop Tungsten

Replicator

Process Manual/Scripted Manual/Scripted Fully Automated

Incremental Loading

Possible with DDL changes

Requires DDL changes

Fully Supported

Latency Full-load Intermittent Real-time

Extraction Requirements

Full table scan Full and partial table scans

Low-impact CDC/binlog scan

Page 25: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Hadoop Demo 3

Page 26: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

Sqoop and Materialization within Hadoop

Hive materialization

CSV

StagingTable

Base Table

Sqoop

Replicate

Page 27: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

27

Op Seqno ID Msg

I 1 1 Hello World!

I 2 2 Meet MC

D 3 1

I 3 1 Goodbye World

Op Seqno

ID Msg

I 2 2 Meet MC

I 3 1 Goodbye World

How the Materialization Works

Page 28: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

• Extract from MySQL or Oracle • Hadoop Support

– Cloudera (Certified), HortonWorks, MapR, Pivotal, Amazon EMR, IBM (Certified), Apache

• Provision using Sqoop or parallel extraction • Schema generation for Hive •  Tools for generating materialized views

• Parallel CSV file loading

• Partition loaded data by commit time

• Schema Change Notification

28

Replication support (Hadoop specific)

Page 29: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

29

1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930313233343536373839404142434445

Monday Wednesday Friday

Data Warehouse Possibilities: Point in Time Tables

Page 30: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

30

Op Seqno

ID Date Msg

I 1 1 1/6/14 Hello World!

I 2 2 2/6/14 Meet MC

I 3 1 2/6/14 Goodbye World

I 4 1 3/6/14 Hello Tuesday

I 4 2 3/6/14 Ruby Wednesday

I 5 1 4/6/14 Final Count

ID Date Msg

1 1/6/14 Hello World!

1 2/6/14 Goodbye World

1 3/6/14 Hello Tuesday

1 4/6/14 Final Count

Data Warehouse Possibilities: Time Series Generation

Page 31: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

•  Tungsten Replicator builds are available on code.google.com http://code.google.com/p/tungsten-replicator/ •  Replicator documentation is available on Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html •  Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop

31

Contact Continuent for support

Getting Started!

Page 32: Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics

For more information, contact us: Robert Noyes Alliance Manager, USA & Canada [email protected] +1 (650) 575-0958 Philippe Bernard Alliance Manager, EMEA & APAC [email protected] +41 79 347 1385

MC Brown Senior Product Line Manager [email protected] Eero Teerikorpi Sr. Director, Strategic Alliances [email protected] +1 (408) 431-3305