MySQL Data Warehousing – A Survival Guide

MySQL Data Warehousing Survival Guide Marius Moscovici ([email protected]) Steffan Mejia ([email protected])

Upload: oleksiy-kovyrin

Post on 14-Oct-2014


Page 1: Mysql Data Warehousing – A Survival Guide

MySQL Data Warehousing Survival Guide

Marius Moscovici ([email protected]), Steffan Mejia ([email protected])

Page 2: Mysql Data Warehousing – A Survival Guide

Topics

• The size of the beast
• Evolution of a Warehouse
• Lessons Learned
• Survival Tips
• Q&A

Page 3: Mysql Data Warehousing – A Survival Guide

Size of the beast

• 43 servers
o 36 active
o 7 standby spares

• 16 TB of data in MySQL
• 12 TB archived (pre-S3 staging)
• 4 TB archived (S3)
• 3.5B rows in main warehouse
• Largest table ~ 500M rows (MySQL)

Page 4: Mysql Data Warehousing – A Survival Guide

Warehouse Evolution - First came slaving

Problems:

• Reporting slaves easily fall behind
• Reporting limited to one-pass SQL

Page 5: Mysql Data Warehousing – A Survival Guide

Warehouse Evolution - Then came temp tables

Problems:
• Easy to lock replication with temp table creation
• Slaving becomes fragile

Page 6: Mysql Data Warehousing – A Survival Guide

Warehouse Evolution - A Warehouse is Born

Problems:
• Warehouse workload limited by what can be performed by a single server

Page 7: Mysql Data Warehousing – A Survival Guide

Warehouse Evolution - Workload Distributed

Problems:
• No real-time application integration support

Page 8: Mysql Data Warehousing – A Survival Guide

Warehouse Evolution - Integrate Real Time Data

Page 9: Mysql Data Warehousing – A Survival Guide

Lessons Learned - Warehouse Design

Workload exceeds available memory

Page 10: Mysql Data Warehousing – A Survival Guide

Lessons Learned - Warehouse Design

• Keep joins < available memory

• Heavily denormalize data for effective reporting

• Minimize joins between large tables

• Aggressively archive historical data 

Page 11: Mysql Data Warehousing – A Survival Guide

Lessons Learned - Data Movement

• mysqldump is your friend

• Sequence parent/child data loads based on ETL assumptions
o Orders without order lines
o Order lines without orders

• Data movement use cases
o Full
o Incremental
o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)

Page 12: Mysql Data Warehousing – A Survival Guide

Full Table Loads

• Good for small tables

• Works for tables with no primary key 

• Data is fully replaced on each load

Page 13: Mysql Data Warehousing – A Survival Guide

Incremental Loads

• Table contains new rows but no updates

• Good for insert-only tables

• High-water mark included in the mysqldump WHERE clause
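The high-water-mark pattern can be sketched in a few lines, with SQLite standing in for MySQL (table and column names here are illustrative, not taken from the deck):

```python
import sqlite3

# Source and target connections stand in for the production and warehouse servers.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE event_log (id INTEGER PRIMARY KEY, event_time TEXT)")

src.executemany("INSERT INTO event_log VALUES (?, ?)",
                [(1, "2014-01-01"), (2, "2014-01-02"), (3, "2014-01-03")])

# The warehouse already holds the first row from a previous load.
dst.execute("INSERT INTO event_log VALUES (1, '2014-01-01')")

# 1) Find the high-water mark already loaded into the warehouse.
(hwm,) = dst.execute(
    "SELECT IFNULL(MAX(event_time), '2000-01-01') FROM event_log").fetchone()

# 2) Pull only rows past the mark -- the same predicate a
#    mysqldump --where="event_time > '...'" invocation would carry.
new_rows = src.execute(
    "SELECT id, event_time FROM event_log WHERE event_time > ?", (hwm,)).fetchall()
dst.executemany("INSERT INTO event_log VALUES (?, ?)", new_rows)
```

Because the table is insert-only, nothing below the mark ever needs to be revisited.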

Page 14: Mysql Data Warehousing – A Survival Guide

Upsert Loads

• Table contains new and updated rows

• Table must have primary key

• Can be used to update only subset of columns
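A minimal upsert sketch follows; SQLite's ON CONFLICT clause plays the role of MySQL's INSERT ... ON DUPLICATE KEY UPDATE, and the table and columns are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, status TEXT, type TEXT)")
db.execute("INSERT INTO users VALUES (1, 'active', 'free')")

# Incoming batch contains one updated row (user 1) and one new row (user 2).
# Only the status column is refreshed on conflict, demonstrating the
# subset-of-columns upsert mentioned on the slide.
batch = [(1, "churned", "free"), (2, "active", "paid")]
db.executemany(
    """INSERT INTO users (user_id, status, type) VALUES (?, ?, ?)
       ON CONFLICT(user_id) DO UPDATE SET status = excluded.status""",
    batch)
```

The primary key is what lets the load distinguish "new row" from "changed row"; without one, only a full reload works.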

Page 15: Mysql Data Warehousing – A Survival Guide

Lessons Learned - ETL Design

• Avoid large joins like the plague

• Break out ETL jobs into bite-size-bites

• Ensure target data integrity on ETL failure

• Use memory staging tables to boost performance 

Page 16: Mysql Data Warehousing – A Survival Guide

ETL Design - Sample Problem

Build a daily summary of customer event log activity

Page 17: Mysql Data Warehousing – A Survival Guide

ETL Design - Sample Solution

Page 18: Mysql Data Warehousing – A Survival Guide

ETL Pseudo code - Step 1

1) Create staging table & Find High Water Mark:

SELECT IFNULL(MAX(calendar_date),'2000-01-01') INTO @last_loaded_date FROM user_event_log_summary;

set max_heap_table_size = <big enough number to hold several days data>

CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

CREATE INDEX user_idx USING HASH ON user_event_log_summary_staging(user_id);

Page 19: Mysql Data Warehousing – A Survival Guide

ETL Pseudo code - Step 2

2) Summarize events:

INSERT INTO user_event_log_summary_staging (calendar_date, user_id, event_type, event_count)

SELECT DATE(event_time), user_id, event_type, COUNT(*)
FROM event_log
WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
GROUP BY 1, 2, 3;

Page 20: Mysql Data Warehousing – A Survival Guide

ETL Pseudo code - Step 3

3) Set denormalized user columns:

UPDATE user_event_log_summary_staging log_summary,
       user
SET log_summary.type = user.type,
    log_summary.status = user.status
WHERE user.user_id = log_summary.user_id;

Page 21: Mysql Data Warehousing – A Survival Guide

ETL Pseudo code - Step 4

4) Insert into target table:

INSERT INTO user_event_log_summary (...)
SELECT ...
FROM user_event_log_summary_staging;
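The four steps above can be strung together as a runnable sketch. SQLite temporary tables stand in for the MEMORY-engine staging table, and the schemas are illustrative rather than the deck's actual ones:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE event_log (event_time TEXT, user_id INTEGER, event_type TEXT);
CREATE TABLE user (user_id INTEGER PRIMARY KEY, type TEXT, status TEXT);
CREATE TABLE user_event_log_summary (
    calendar_date TEXT, user_id INTEGER, event_type TEXT,
    event_count INTEGER, type TEXT, status TEXT);
INSERT INTO event_log VALUES ('2014-01-02 10:00:00', 1, 'login');
INSERT INTO event_log VALUES ('2014-01-02 11:00:00', 1, 'login');
INSERT INTO user VALUES (1, 'paid', 'active');
""")

# Step 1: create the staging table and find the high-water mark.
(last_loaded,) = db.execute(
    "SELECT IFNULL(MAX(calendar_date), '2000-01-01') "
    "FROM user_event_log_summary").fetchone()
db.execute("""CREATE TEMPORARY TABLE staging (
    calendar_date TEXT, user_id INTEGER, event_type TEXT,
    event_count INTEGER, type TEXT, status TEXT)""")

# Step 2: summarize events newer than the mark.
db.execute("""INSERT INTO staging (calendar_date, user_id, event_type, event_count)
    SELECT DATE(event_time), user_id, event_type, COUNT(*)
    FROM event_log
    WHERE event_time > ? || ' 23:59:59'
    GROUP BY DATE(event_time), user_id, event_type""", (last_loaded,))

# Step 3: denormalize user attributes onto the staged rows.
db.execute("""UPDATE staging SET
    type   = (SELECT type   FROM user WHERE user.user_id = staging.user_id),
    status = (SELECT status FROM user WHERE user.user_id = staging.user_id)""")

# Step 4: copy staged rows into the target table.
db.execute("INSERT INTO user_event_log_summary SELECT * FROM staging")
```

The expensive GROUP BY and the denormalizing UPDATE both hit only the small staged slice, never the full target table, which is the point of staging in memory.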

Page 22: Mysql Data Warehousing – A Survival Guide

Functional Partitioning

• Benefits depend on

o Partition Execution Times

o Data Move Times

o Dependencies between functional partitions

Page 23: Mysql Data Warehousing – A Survival Guide

Functional Partitioning

Page 24: Mysql Data Warehousing – A Survival Guide

Job Management

• Run everything single-threaded on a server

• Handle dependencies between jobs across servers

• Smart re-start key to survival

• Implemented 3-level hierarchy of processing
o Process (collection of build steps and data moves)
o Build steps (ETL 'units of work')
o Data moves
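A smart-restart scheme like the one described can be sketched by persisting step completion, so a re-run skips finished steps. All names here are hypothetical, not from the deck:

```python
import sqlite3

def run_process(db, steps):
    """Run each named build step once, skipping any step recorded as done,
    so a failed process can be restarted without redoing earlier work."""
    db.execute("CREATE TABLE IF NOT EXISTS completed_steps (name TEXT PRIMARY KEY)")
    executed = []
    for name, fn in steps:
        done = db.execute(
            "SELECT 1 FROM completed_steps WHERE name = ?", (name,)).fetchone()
        if done:
            continue
        fn()  # if this raises, the step is never marked complete
        db.execute("INSERT INTO completed_steps VALUES (?)", (name,))
        db.commit()
        executed.append(name)
    return executed

db = sqlite3.connect(":memory:")
log = []
steps = [("extract", lambda: log.append("extract")),
         ("transform", lambda: log.append("transform"))]
first = run_process(db, steps)   # runs both steps
second = run_process(db, steps)  # restart: both already done, both skipped
```

Marking completion only after the step succeeds, and committing immediately, is what makes the restart "smart": a crash mid-process resumes at the failed step.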

Page 25: Mysql Data Warehousing – A Survival Guide

DW Replication

• Similar to other MySQL environments
o Commodity hardware
o Master-slave pairs for all databases

• Mixed environments can be difficult
o Use rsync to create slaves
o But not with ssh (on private network)

 • Monitoring 

o Reporting queries need to be monitored
- Beware of blocking queries
- Only run reporting queries on the slave (temp table issues)

o Nagios
o Ganglia
o Custom scripts

Page 26: Mysql Data Warehousing – A Survival Guide

Infrastructure Planning

• Replication latency
o Warehouse slave unable to keep up
o Disk utilization > 95%
o Required frequent re-sync

• Options evaluated
o Higher-speed conventional disks
o RAM increase
o Solid-state disks

Page 27: Mysql Data Warehousing – A Survival Guide

Optimization

• Check / reset HW RAID settings
• Use general query log to track ETL / queries
• Application timing
o Isolate poor-performing parts of the build
• Optimize data storage - automatic roll-off of older data

Page 28: Mysql Data Warehousing – A Survival Guide

Infrastructure Changes

• Increased memory 32GB -> 64GB
• New servers have 96GB RAM

• SSD solution
o 12- and 16-disk configurations
o RAID6 vs. RAID10
o 2.0TB or 1.6TB formatted capacity
o SATA2 HW BBU RAID6
o ~ 8 TB data on SSD

Page 29: Mysql Data Warehousing – A Survival Guide

Results

• Sometimes it pays to throw hardware at a problem
o 15-hour warehouse builds on old system
o 6 hours on optimized system
o No application changes

Page 30: Mysql Data Warehousing – A Survival Guide

Finally...Archive

Two-tiered solution
• Move data into archive tables in separate DB
• Use SELECT to dump data - efficient and fast
• Archive server handles migration
o Dump data
o GPG
o Push to S3
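The dump / encrypt / push flow might be sketched as a command builder; the flags, bucket, and recipient below are illustrative guesses (the aws CLI is a modern stand-in for whatever S3 client was used), and returning the commands rather than running them keeps the sketch self-contained:

```python
def archive_commands(table, dump_file, recipient, bucket):
    """Build the three-stage archive pipeline: dump the table,
    GPG-encrypt the dump, then push the encrypted file to S3."""
    return [
        ["mysqldump", "--single-transaction", "warehouse", table,
         f"--result-file={dump_file}"],
        ["gpg", "--encrypt", "--recipient", recipient, dump_file],
        ["aws", "s3", "cp", f"{dump_file}.gpg", f"s3://{bucket}/{table}/"],
    ]

# Hypothetical invocation for one archived table.
cmds = archive_commands("orders_2009", "/tmp/orders_2009.sql",
                        "archive@example.com", "example-archive-bucket")
```

Keeping the stages as discrete commands lets the archive server retry any single failed stage without redoing the dump.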

Page 31: Mysql Data Warehousing – A Survival Guide

Survival Tips

• Efforts to scale are non-linear
o As you scale, it becomes increasingly difficult to manage
o Be prepared to supplement your warehouse strategy
- Dedicated appliance
- Distributed processing (Hadoop, etc.)

• You can gain a great deal of headroom by optimizing I/O
o Optimize current disk I/O path
o Examine SSD / Flash solutions
o Be pragmatic about table designs

• It's important to stay ahead of the performance curve
o Be proactive - monitor growth, scale early

• Monitor everything, including your users
o Bad queries can bring replication down