replicating in real-time from mysql to amazon redshift

©Continuent 2014

Replicating in Real-Time from MySQL to Amazon Redshift

Featuring Continuent Tungsten

Robert Hodges, CEO

Jeff Mace, Director of Services

Presentation Topics

• Introductions

• Tungsten Replicator Overview

• Real-Time Data Warehouse Loading

• Adding Redshift to the Mix

• Replication from Tungsten Clusters to Redshift

• Wrap-up

Introducing Continuent

• The leading provider of clustering and replication for open source DBMS

• Our Product: Continuent Tungsten

• Clustering - Commercial-grade HA, performance scaling and data management for MySQL

• Replication - Flexible, high-performance data movement

Continuent Tungsten Customers

Quick Continuent Facts

• Largest Tungsten installation by data volume processes over 800 million transactions per day on 225 terabytes of relational data

• Largest installation by transaction volume handles up to 8 billion transactions daily

• Wide variety of topologies including MySQL, Oracle, Vertica, and Hadoop are in production now

• Amazon Redshift is the newest data warehouse target

Tungsten Replicator Overview

What is Tungsten Replicator?

A real-time, high-performance, open source database replication engine

GPL V2 license - 100% open source

Download from https://code.google.com/p/tungsten-replicator/

Annual support subscription available from Continuent

“Golden Gate without the Price Tag”®

Tungsten Replicator Overview

[Diagram: on the master, a Replicator extracts transactions from the DBMS logs into the THL (transactions + metadata); a second Replicator holds its own THL (transactions + metadata).]

©Continuent 2014

Replication Services and Parallel Apply

[Diagram: a pipeline is a sequence of stages, and each stage runs Extract, Filter, Apply. Data flows from the Master DBMS to the Transaction History Log, then to the Parallel Queue, then through several parallel Extract/Filter/Apply stages to the Slave DBMS.]
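The Extract / Filter / Apply stage structure can be illustrated with a minimal sketch. This is not Tungsten code: the `Stage` class, the event dictionaries, and both sample filters are hypothetical names chosen for illustration, assuming each stage extracts an event, passes it through its filters in order, and applies whatever survives.

```python
from typing import Callable, Iterable, List, Optional

Event = dict  # hypothetical shape: {"seqno": int, "table": str, "row": dict}
Filter = Callable[[Event], Optional[Event]]


class Stage:
    """One pipeline stage: extract events, run filters, apply the survivors."""

    def __init__(self, extract: Iterable[Event], filters: List[Filter],
                 apply: Callable[[Event], None]):
        self.extract = extract
        self.filters = filters
        self.apply = apply

    def run(self) -> None:
        for event in self.extract:
            for f in self.filters:
                event = f(event)
                if event is None:  # a filter may drop the event entirely
                    break
            if event is not None:
                self.apply(event)


def ignore_table(name: str) -> Filter:
    """Build a filter that drops every event for one table."""
    return lambda e: None if e["table"] == name else e


def upper_case_names(event: Event) -> Event:
    """Filter that maps table and column names to upper case."""
    return {
        "seqno": event["seqno"],
        "table": event["table"].upper(),
        "row": {k.upper(): v for k, v in event["row"].items()},
    }


# Illustrative input standing in for a transaction log.
log = [
    {"seqno": 1, "table": "foo", "row": {"id": 1, "data": "hello!"}},
    {"seqno": 2, "table": "scratch", "row": {"id": 9}},
]
applied: List[Event] = []
Stage(log, [ignore_table("scratch"), upper_case_names], applied.append).run()
```

After the run, `applied` holds only the `foo` event, with its names upper-cased. Chaining stages amounts to using one stage's apply target as the next stage's extract source.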

[Diagram: replication topologies between MySQL and Oracle servers: master-slave, fan-in slave, all-masters, heterogeneous.]

Real-Time Data Warehouse Loading

Why MySQL Needs A Data Warehouse

Sales Table

id  cust_id  prod_id  ...
1   335301   532      ...
2   2378     6235     ...
3   ...      ...      ...

Product Table

id   sku     type
532  C00135  consumer
533  S09957  specialty
...  ...

Prod_ID Index

prod_id  id
6235     2
...      ...

• Row format makes table scans very slow

• Indexes slow OLTP

• Low/no data compression

• Limited index types

• Limited join

• Single-threaded query: no parallelization

Current Data Warehouse Options

OLTP Data

MySQL Master

Scalable row store with star schema and materialized view support

CSV files stored on HDFS with access via Hive/MapReduce

Column storage with compression and built-in time series support

Traditional ETL Approach

Sales Table

Extract -> Transfer -> Load

Date columns = intrusive

Batch-oriented = not timely

Scan for changes = performance hit

Sales Table

Data Warehouse

Real-Time Data Replication

Sales Table

Fast propagation = timely

No SQL changes = transparent

Automatic change capture = low impact

DBMS Logs

Data Replication

Sales Table

Data Warehouse

Loading to an Oracle Data Warehouse

MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master Replicator (oracle) -> Tungsten Slave Replicator (oracle)

Tungsten Master Replicator (oracle)
• MySQLExtractor
• Special Filters
  * Transform enum to string

Tungsten Slave Replicator (oracle)
• Special Filters
  * Ignore extra tables
  * Map names to upper case
  * Optimize updates to remove unchanged columns

Data stored in rows like

How about Loading to Hadoop?

Replication

Binlog -> Buffered Transactions -> CSV Files

Provisioning

Data stored as append-only files

Typical Raw Hadoop Data Layout

(CSV file)

556,MALTESE HOPE,4.99,127\n
557,MANCHURIAN CURTAIN,3.99,177\n
558,MANNEQUIN WORST,2.99,71\n
559,MARRIED GO,2.99,114\n

• field separator
• record separator
• file partitioning
• compression
• type conversions
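The layout choices called out on this slide can be sketched in a few lines. This is an illustration only, not the replicator's own CSV writer: the function name `rows_to_csv` and its parameters are hypothetical, and the rows are taken from the sample above.

```python
import csv
import gzip
import io
from decimal import Decimal


def rows_to_csv(rows, field_sep=",", record_sep="\n"):
    """Serialize rows to CSV text with explicit field and record separators."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=field_sep, lineterminator=record_sep)
    for row in rows:
        # Type conversion: csv.writer calls str() on non-string values.
        writer.writerow(row)
    return buf.getvalue()


rows = [
    (556, "MALTESE HOPE", Decimal("4.99"), 127),
    (557, "MANCHURIAN CURTAIN", Decimal("3.99"), 177),
]
text = rows_to_csv(rows)

# Optional compression step before the file is shipped to storage.
compressed = gzip.compress(text.encode("utf-8"))
```

Whatever separators and compression are chosen on the writing side have to be declared identically on the reading side, which is why the slide lists them together.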

Loading to a Hadoop Data Warehouse

Replicator (hadoop) receives transactions from master

Javascript load script, e.g. hadoop.js

• Write data to CSV Files
• Load to Staging "Tables" using hadoop command
• (Generate Hive table definitions)
• Merge to Base Tables (Materialized Views) using external MapReduce job

Hadoop Materialized View Generation

Transaction logs + Snapshot

UNION ALL

SHUFFLE: Sort by key(s), transaction order

REDUCE: Emit last row per key if not a delete

Materialized view including all updates
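The union / shuffle / reduce flow can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the MapReduce job itself: the function name `materialize`, the tuple layout, the opcodes "I" and "D", and the convention that snapshot rows count as inserts at sequence number 0 are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def materialize(snapshot, changes):
    """Rebuild the current view from a snapshot plus later change records.

    snapshot: iterable of (key, value)
    changes:  iterable of (key, seqno, opcode, value), with seqno > 0
    """
    # UNION ALL: treat snapshot rows as inserts that precede every change.
    union = [(key, 0, "I", value) for key, value in snapshot]
    union += list(changes)

    # SHUFFLE: sort by key, then by transaction order.
    union.sort(key=itemgetter(0, 1))

    # REDUCE: emit the last row per key if it is not a delete.
    view = {}
    for key, group in groupby(union, key=itemgetter(0)):
        *_, last = group
        if last[2] != "D":
            view[key] = last[3]
    return view


snapshot = [(1, "hello!"), (2, "salutations!")]
changes = [
    (2, 5, "D", None),            # row 2 deleted
    (3, 6, "I", "bye!"),          # row 3 added
    (1, 7, "I", "hello again!"),  # row 1 replaced by a newer version
]
view = materialize(snapshot, changes)
```

Because only the last record per key matters, the same job can be rerun over a growing set of change files without tracking which ones were already merged.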

How about Loading to Vertica?

Replication

Binlog -> Buffered Transactions -> CSV Files

Provisioning?

Data stored in column format

Data Layout in a Column Store

Sales Table columns: cust_id (335301), prod_id, quantity

Product Table columns: sku (C00135, S09957), type (consumer, specialty)

• Fast scans on columns

• Updates to single rows are hideously slow

• Every column is an index

• Good compression

• Fast joins with parallel
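The trade-off on this slide can be made concrete with a toy model. This sketch only illustrates the layout idea and says nothing about how Vertica or Redshift store data internally; the helper names `to_columns` and `update_row` are hypothetical, and the sample rows reuse values from the Sales Table shown earlier.

```python
def to_columns(rows):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def update_row(columns, index, new_values):
    """Updating one logical row has to touch every affected column list."""
    for name, value in new_values.items():
        columns[name][index] = value


rows = [
    {"id": 1, "cust_id": 335301, "prod_id": 532},
    {"id": 2, "cust_id": 2378, "prod_id": 6235},
]
columns = to_columns(rows)

# A scan on prod_id reads only that column, not whole rows.
matching_ids = [
    columns["id"][i]
    for i, prod_id in enumerate(columns["prod_id"])
    if prod_id == 6235
]

# A single-row update is spread across separate column lists.
update_row(columns, 0, {"cust_id": 42, "prod_id": 533})
```

In this model a scan reads one contiguous list, while a row update writes to as many lists as there are changed columns, which is the reason batch loading is preferred over row-at-a-time updates.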

Loading to a Vertica Data Warehouse

MySQL Binlog (binlog_format=row)

Replicator (vertica)
• MySQLExtractor
• Special Filters
  * pkey - Fill in pkey info
  * colnames - Fill in names
  * replicate - Ignore tables

Replicator (vertica)
• CSV Files
• Large transaction batches to leverage load parallelization

Vertica Batch Loading--The Details

Replicator (vertica) receives transactions from master

Merge Script
• COPY to stage tables
• SELECT to base tables

(or) COPY directly to base tables

No external merge required!

Provisioning Using a Sandbox Server

OLTP Server

Temporary Sandbox Server

Vertica Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

Adding Amazon Redshift to the Mix

Tungsten Support for Amazon Redshift

• Tungsten Replicator 3.0 supports real-time replication to Amazon Redshift

• Loading builds on data warehouse support for Vertica and Hadoop

• Beta program in progress now

• Production ready in September 2014

Redshift = Cloud Data Warehouse

Redshift Capabilities

• On-demand column store in Amazon Web Services

• Looks like a PostgreSQL 8.x DBMS

• $1,000 per terabyte per year or $0.25/hour

• Automatic backup and cluster management

• Integrated into Amazon VPC

• Data loading from Amazon S3 using SQL COPY command

Connecting to Amazon Redshift

$ psql -h redshift-webinar.cub79gczb9kq.us-east-1.redshift.amazonaws.com -U tungsten -d dev -p 5439
psql (9.2.6, server 8.0.2)
WARNING: psql version 9.2, server version 8.0.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.

dev=# select * from test.foo;
 id |     data
----+--------------
  2 | salutations!
  3 | bye!
  1 | hello!
(3 rows)

dev=# \q

Tungsten Loading to Redshift

Tungsten Slave (redshift) receives transactions from Tungsten Master

Javascript load script, e.g. hadoop.js: Write data to CSV

1. Load to S3 Storage using s3cmd

2. COPY to Redshift Staging Tables

3. Merge using SQL to Redshift Base Tables
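The three load steps can be sketched as follows. This is not the replicator's load script: the function names, the bucket and key arguments, and the merge SQL are hypothetical, and the sketch assumes that an update arrives in the staging table as a delete ('D') followed by an insert ('I'). The merge is exercised against an in-memory SQLite database purely as a stand-in; the statements have not been verified against Redshift here, and the s3cmd and COPY steps are only built as strings.

```python
import sqlite3


def s3cmd_put(local_path, bucket, key):
    """Step 1: argv for uploading one CSV file to S3 with s3cmd."""
    return ["s3cmd", "put", local_path, f"s3://{bucket}/{key}"]


def copy_statement(staging, bucket, key, access_key, secret_key):
    """Step 2: Redshift COPY from S3 into the staging table."""
    creds = f"aws_access_key_id={access_key};aws_secret_access_key={secret_key}"
    return f"COPY {staging} FROM 's3://{bucket}/{key}' CREDENTIALS '{creds}' CSV"


def merge_statements(base, staging, key, columns):
    """Step 3: apply staged changes to the base table with plain SQL."""
    cols = ", ".join(columns)
    newer = (
        f"SELECT 1 FROM {staging} n WHERE n.{key} = s.{key} AND ("
        "n.tungsten_seqno > s.tungsten_seqno OR ("
        "n.tungsten_seqno = s.tungsten_seqno AND "
        "n.tungsten_row_id > s.tungsten_row_id))"
    )
    return [
        # Remove every base row touched by the batch.
        f"DELETE FROM {base} WHERE {key} IN (SELECT {key} FROM {staging})",
        # Re-insert the latest staged version of each key unless it is a delete.
        f"INSERT INTO {base} ({cols}) SELECT {cols} FROM {staging} s "
        f"WHERE s.tungsten_opcode = 'I' AND NOT EXISTS ({newer})",
        # Empty the staging table for the next batch.
        f"DELETE FROM {staging}",
    ]


# SQLite stand-in for the base and staging tables (schema names omitted).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE foo (id INT PRIMARY KEY, data VARCHAR(120))")
db.execute(
    "CREATE TABLE stage_xxx_foo (tungsten_opcode CHAR(2), tungsten_seqno INT, "
    "tungsten_row_id INT, tungsten_commit_timestamp TIMESTAMP, "
    "id INT, data VARCHAR(120))"
)
db.executemany("INSERT INTO foo VALUES (?, ?)",
               [(1, "hello!"), (2, "salutations!")])
ts = "2014-07-01 00:00:00"  # illustrative commit timestamp
db.executemany("INSERT INTO stage_xxx_foo VALUES (?, ?, ?, ?, ?, ?)", [
    ("D", 10, 1, ts, 1, None),            # update of id 1: delete ...
    ("I", 10, 2, ts, 1, "hello again!"),  # ... then insert
    ("I", 11, 1, ts, 3, "bye!"),          # new row
    ("D", 12, 1, ts, 2, None),            # delete of id 2
])
for statement in merge_statements("foo", "stage_xxx_foo", "id", ["id", "data"]):
    db.execute(statement)
result = db.execute("SELECT id, data FROM foo ORDER BY id").fetchall()
```

Keeping the merge in SQL means the warehouse does the work in bulk, one batch at a time, instead of receiving row-by-row updates.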

Prerequisites for Redshift Loading

MySQL master host (redshift)
• binlog_format=row (MySQL Binlog)
• Java 6+, Ruby 1.8.5+

Tungsten replicator host (redshift)
• Java 6+, Ruby 1.8.5+
• s3cmd 1.0+
• S3 credentials

Amazon Redshift
• Allow access from Tungsten host

S3 Storage
• Bucket for CSV uploads

Application Prerequisites

• Primary keys on all tables

• UTF-8 character set--or at least be consistent

• Use GMT timezone--or be very consistent about dates

• Avoid “Cowboy” schema changes to MySQL master(s)

Generate Schema Using ddlscan

• Data types?
• Column lengths?
• Naming conventions?
• Staging tables?

MySQL Tables -> ddlscan -> Amazon Redshift

Staging Table Schema

$ bin/ddlscan -db test -template ddl-mysql-redshift-staging.vm
...

CREATE SCHEMA test;

DROP TABLE test.stage_xxx_foo;
CREATE TABLE test.stage_xxx_foo
(
  tungsten_opcode CHAR(2),
  tungsten_seqno INT,
  tungsten_row_id INT,
  tungsten_commit_timestamp TIMESTAMP,
  id INT,
  data VARCHAR(120) /* VARCHAR(30) */,
  PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
);

Base Table Schema

$ bin/ddlscan -db test -template ddl-mysql-redshift.vm
...

CREATE SCHEMA test;

DROP TABLE test.foo;
CREATE TABLE test.foo
(
  id INT,
  data VARCHAR(120) /* VARCHAR(30) */,
  PRIMARY KEY (id)
);
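Comparing the two ddlscan outputs, the staging table is the base table's columns preceded by four tungsten_ metadata columns, with the primary key moved to that metadata. The sketch below reproduces that relationship only; it is not how ddlscan works internally, and the function name `staging_ddl` and its `prefix` parameter are hypothetical.

```python
# Metadata columns as they appear in the staging table output above.
TUNGSTEN_COLUMNS = [
    ("tungsten_opcode", "CHAR(2)"),
    ("tungsten_seqno", "INT"),
    ("tungsten_row_id", "INT"),
    ("tungsten_commit_timestamp", "TIMESTAMP"),
]


def staging_ddl(schema, table, columns, prefix="stage_xxx_"):
    """Build a staging-table CREATE statement from base-table columns."""
    all_columns = TUNGSTEN_COLUMNS + list(columns)
    lines = [f"  {name} {sql_type}," for name, sql_type in all_columns]
    lines.append(
        "  PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)"
    )
    body = "\n".join(lines)
    return f"CREATE TABLE {schema}.{prefix}{table}\n(\n{body}\n);"


ddl = staging_ddl("test", "foo", [("id", "INT"), ("data", "VARCHAR(120)")])
print(ddl)
```

The metadata columns are what let a merge step order the staged changes and tell inserts from deletes before touching the base table.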

Sandbox Provisioning for Redshift

OLTP Server

Temporary Sandbox Server

Redshift Cluster

1. Restore logical backup

2. Replicate restored transactions

3. Replicate normally after restore loads

Amazon Redshift

Setting up MySQL to Redshift Data Loading

Replicating from Tungsten Clusters to Redshift

60 Second Intro to Tungsten Clusters

Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:

• 24x7 data access
• Scaling of load on replicas
• Simple management commands

...without app changes or data migration

[Diagram: the GonzoPortal.com application (apache/php) in Amazon US West reaches the cluster through two Connectors.]

Replicating from a Cluster to Redshift

Tungsten Cluster1

master

Tungsten 3.0 Replicator

cluster1 Amazon Redshift

Filters and Special Configuration
* pkey - Fill in pkey info
* colnames - Fill in names
* fixmysqlstrings - Convert to UTF-8
* Replicate from both nodes for higher availability

Fan-in from Multiple Tungsten Clusters

Redshift Cluster

cluster2

Tungsten Replicator

cluster1 Amazon Redshift

Getting Started with MySQL to Redshift Loading

Where Is Everything?

• Tungsten Replicator 3.0 builds are available on code.google.com http://code.google.com/p/tungsten-replicator/

(Visit the nightly builds page for latest features)

• Replicator documentation is available on Continuent website https://docs.continuent.com/tungsten-replicator-3.0/deployment-redshift.html

Contact Continuent for support

In Conclusion…

• Tungsten Replicator 3.0 now supports real-time loading from MySQL to Amazon Redshift

• You can start using it yourself or join our beta program

• Software will be production-ready in September 2014

• Continuent has a wealth of data loading features on tap so stay tuned!

www.continuent.com

Follow us on Twitter @continuent

Tungsten Replicator: http://code.google.com/p/tungsten-replicator

Our Blogs:
http://scale-out-blog.blogspot.com
http://datacharmer.org/blog
http://www.continuent.com/news/blogs
http://flyingclusters.blogspot.com/

560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com
