Building High-Performance MySQL Query Systems and Analytic Applications

1 2009 Calpont Corporation – Confidential & Proprietary Building High-Performance MySQL Query Systems and Analytic Applications Robin Schumacher


Post on 19-Nov-2014





DESCRIPTION

This presentation gives practical advice and tips on how to build high-performance, read-intensive databases, and discusses innovations such as column-oriented databases.

TRANSCRIPT

Page 1: Building High Performance MySql Query Systems And Analytic Applications


2009 Calpont Corporation – Confidential & Proprietary

Building High-Performance MySQL Query Systems and Analytic Applications

Robin Schumacher

Page 2: Building High Performance MySql Query Systems And Analytic Applications


Agenda

• The importance of query and analytic applications

• Core recommendations for building fast query / analytic applications

• Practical techniques for creating high-performance query / analytic systems

• Conclusions

Page 3: Building High Performance MySql Query Systems And Analytic Applications


What are we talking about?

• We’re talking about databases that are used – primarily – for servicing read-intensive applications

• These systems could be 100% devoted to query activity or a hybrid application that services both read-intensive work and traditional OLTP activities

• The design and performance of these systems differ greatly from traditional OLTP databases

Page 4: Building High Performance MySql Query Systems And Analytic Applications


Reporting and Business Intelligence DB’s

• All companies recognize the need for BI

• Challenges come in the forms of large data volumes, performance, and cost

• Staffing and lack of experience can also cause issues

Page 5: Building High Performance MySql Query Systems And Analytic Applications


Data Warehouses/Marts/Analytic DB’s

[Diagram: data warehouse architecture. Operational source data (OLTP, Files/XML, Log Files) → Staging or ODS (Staging Area) → ETL → Final ETL → Data Warehouse → Reporting, BI, Notification Layer (Ad-Hoc, Dashboards, Reports, Notifications) → Users. A Warehouse Archive handles Purge/Archive; Data Warehouse and Metadata Management spans the whole pipeline.]

Page 6: Building High Performance MySql Query Systems And Analytic Applications


Reporting Databases

[Diagram labels: End Users, Application Servers, OLTP Database, Read Shard One, Reporting Database; links labeled ETL, Data Archiving Link, and Replication.]

Page 7: Building High Performance MySql Query Systems And Analytic Applications


Application Sharding / Partitioning

• Read ‘sharding’ or partitioning is becoming very popular, especially in high-traffic Web environments

• Basic tactic is to direct all read/query traffic to one set of databases and OLTP work to a different database

• Sometimes involves a fair amount of application work to ensure traffic is directed to the proper databases, but many find it not all that difficult
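The traffic-directing work mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ReadWriteRouter`, the host names, and the rule "statements starting with SELECT, SHOW, or EXPLAIN are reads" are hypothetical and not from the presentation.

```python
import itertools


class ReadWriteRouter:
    """Send read-only statements to read databases and everything else to OLTP."""

    # Assumption: these leading keywords identify read-only statements.
    READ_PREFIXES = ("select", "show", "explain")

    def __init__(self, oltp_host, read_hosts):
        if not read_hosts:
            raise ValueError("at least one read host is required")
        self.oltp_host = oltp_host
        # Round-robin over the read hosts so query traffic is spread evenly.
        self._read_cycle = itertools.cycle(read_hosts)

    def route(self, sql):
        """Return the name of the host that should run this statement."""
        words = sql.split(None, 1)
        first_word = words[0].lower() if words else ""
        if first_word in self.READ_PREFIXES:
            return next(self._read_cycle)
        return self.oltp_host


# Hypothetical host names, for illustration only.
router = ReadWriteRouter("oltp-db", ["read-db-1", "read-db-2"])
print(router.route("SELECT COUNT(*) FROM orders"))
print(router.route("UPDATE orders SET status = 'shipped' WHERE id = 7"))
```

A real application would also have to account for replication lag, for example by sending reads that must see a just-committed write to the OLTP database.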

Page 8: Building High Performance MySql Query Systems And Analytic Applications


Read Sharding / Partitioning

Page 9: Building High Performance MySql Query Systems And Analytic Applications


What are the core rules to follow in order to avoid anxiety over building fast read-intensive, reporting, and analytic databases?

Page 10: Building High Performance MySql Query Systems And Analytic Applications


#1 Only Read the Data You Need

• Not as easy as it sounds in a legacy RDBMS

• Indexing is not always the answer (and can actually make things worse in some cases)

• All I/O is important – excessive logical I/O can cripple a system every bit as fast as too much physical I/O can

• Data ‘traffic jams’ oftentimes occur because of unnecessary I/O

Page 11: Building High Performance MySql Query Systems And Analytic Applications


#2 Exploit Modern Hardware

• Use all available CPUs/cores

• Some MySQL storage engines (InnoDB, Cluster) now scale beyond 4 CPUs but they aren’t the best for query activity

• Use all available disk/storage devices

• Look into distributed caching architectures

Page 12: Building High Performance MySql Query Systems And Analytic Applications


#3 Divide and Conquer

• This generally equates to parallel processing and application partitioning in hybrid systems

• Queries should be parallelized across CPUs/cores on single box

• Queries should be parallelized across multiple nodes in MPP fashion

• Only way to truly tackle large volumes of data
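As a rough illustration of the divide-and-conquer idea, the sketch below splits an aggregate into independent partial results and then combines them. It is a toy under stated assumptions: the function names are hypothetical, and a thread pool is used only to keep the example portable. In CPython, threads do not give true CPU parallelism for this kind of work, so a real engine would use processes, native threads, or separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum_and_count(chunk):
    """Work done independently on one partition: a partial SUM and COUNT."""
    return sum(chunk), len(chunk)


def parallel_average(rows, n_workers=4):
    """Scatter rows across workers, then gather and combine the partials."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


print(parallel_average(list(range(1, 101))))
```

The key design point is that AVG is computed from partial SUMs and COUNTs rather than from partial averages, so the combined answer is the same no matter how the rows are split.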

Page 13: Building High Performance MySql Query Systems And Analytic Applications


#4 Scale both I/O and User Connections

• Divide and conquer applies to both the I/O layer and user connectivity layer

• Distribution of I/O via MPP allows linear performance gains when properly done

• Even idle user connections can eat up resources fast; should have way of scaling concurrency

Page 14: Building High Performance MySql Query Systems And Analytic Applications


#5 Provide Transparent Expansion and Failover

• Should have way to increase capacity and resources without stopping query activity

• For critical systems, will need to have way to failover to stand-by servers if primary fails

• Both should be as transparent to the end user as possible

Page 15: Building High Performance MySql Query Systems And Analytic Applications


#6 Load New Data with Minimal Impact

• For real time or near real time applications, data loads must have minimal impact on query activity

• Data (obviously…) should be loaded as quickly as possible, which means parallel load processing

• Scheduled loads and ETL operations should be auto-monitored

• Watch impact of new data on query times

Page 16: Building High Performance MySql Query Systems And Analytic Applications


#7 Quickly Troubleshoot Poor Read Performance

• Must have way to exonerate the innocent and implicate the guilty – in other words, is it the database or not?

• Bad database design is the #1 cause of poor performance

• Poorly coded SQL is the #2 cause

Page 17: Building High Performance MySql Query Systems And Analytic Applications


Good suggestions, but how can I practically do all these things…?

Page 18: Building High Performance MySql Query Systems And Analytic Applications


What is Calpont’s InfiniDB?

InfiniDB is an open source, column-oriented database architected to handle data warehouses, data marts, analytic/BI systems, and other read-intensive applications. It delivers true scale-up (more CPUs/cores, RAM) and massively parallel processing (MPP) scale-out capabilities for MySQL users. Linear performance gains are achieved either by adding more capacity to one box or by using commodity machines in a scale-out configuration.

[Diagrams: Scale Up and Scale Out]

Page 19: Building High Performance MySql Query Systems And Analytic Applications


#1 Only Read the Data You Need

• Column databases only read the columns needed to satisfy a query vs. full rows

• Column databases (most of them…) remove the need for indexing because the column is the index

• Column databases automatically eliminate unnecessary I/O both logically and physically

• As a rule of thumb, column databases provide 5-10x the query performance of legacy RDBMS’s

• InfiniDB has a column-oriented architecture

Recommendation: Start using a column-oriented database

Caveat: if you are reading all (select *) or most of the columns in a table, then a column database may not be right for your application.

Page 20: Building High Performance MySql Query Systems And Analytic Applications


Column vs. Row Orientation

A column-oriented architecture looks the same on the surface, but stores data differently than legacy/row-based databases…
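The difference in storage layout can be illustrated with a small Python sketch. The table, column names, and functions below are hypothetical and exist only for illustration; the sketch counts how many stored values each layout has to touch to answer a query that needs a single column.

```python
# Hypothetical three-column table used only for illustration.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 42.0},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in row_store] for name in row_store[0]}


def sum_amount_row_store(row_store):
    """Row layout: every full row is fetched, even though one field is used."""
    values_read = sum(len(row) for row in row_store)
    return sum(row["amount"] for row in row_store), values_read


def sum_amount_column_store(column_store):
    """Column layout: only the 'amount' column is fetched."""
    amounts = column_store["amount"]
    return sum(amounts), len(amounts)


columns = to_columns(rows)
print(sum_amount_row_store(rows))        # (total, values touched) for the row layout
print(sum_amount_column_store(columns))  # same total, fewer values touched
```

Both layouts return the same total, but the row layout touches every field of every row while the column layout touches only one value per row. The gap widens as tables get wider, and it disappears for `select *` style queries, which is the caveat noted earlier.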

Page 21: Building High Performance MySql Query Systems And Analytic Applications


#2 Exploit Modern Hardware

• A column-based database with scale up capabilities is a great combination – not only do you read only the data that’s needed but it is accelerated by all a machine’s processing power

• Scale up abilities generally equate to having a multi-threaded database architecture

• Currently, the only internal MySQL engines that offer scale up are InnoDB and MySQL Cluster, neither of which is optimal for complex, analytic queries.

• InfiniDB from Calpont is both column-oriented and multi-threaded

Recommendation: Use databases/storage engines that scale up (i.e. use available CPUs/cores)

Page 22: Building High Performance MySql Query Systems And Analytic Applications


InfiniDB Community – Scale Up

InfiniDB Community edition is a FOSS, multi-threaded database server that is capable of using a machine’s CPUs/cores to process queries

SSB Query (@100 scale) | InfiniDB 1 core (elapsed time in seconds) | InfiniDB 8 cores (elapsed time in seconds) | Overall percent reduction with additional cores
Q2.1 | 210.21 | 44.65 | 79%
Q2.2 | 151.20 | 19.70 | 87%
Q2.3 | 121.33 | 15.94 | 87%
Q3.1 | 316.79 | 55.04 | 83%
Q3.2 | 164.12 | 22.14 | 87%
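The percent-reduction column can be checked from the elapsed times with a short calculation. The sketch below uses the figures from the slide; the dictionary and function names are hypothetical, and the speedup factor is an extra derived number that does not appear in the original.

```python
# Elapsed seconds from the slide: (1 core, 8 cores).
ssb_100 = {
    "Q2.1": (210.21, 44.65),
    "Q2.2": (151.20, 19.70),
    "Q2.3": (121.33, 15.94),
    "Q3.1": (316.79, 55.04),
    "Q3.2": (164.12, 22.14),
}


def percent_reduction(before, after):
    """Share of the original elapsed time that was eliminated, in percent."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """How many times faster the second run is than the first."""
    return before / after


for query, (one_core, eight_cores) in ssb_100.items():
    print(
        f"{query}: {percent_reduction(one_core, eight_cores):.0f}% reduction, "
        f"{speedup(one_core, eight_cores):.1f}x speedup"
    )
```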

Page 23: Building High Performance MySql Query Systems And Analytic Applications


#3 Divide and Conquer

• For Web and general purpose database applications, look into read sharding/partitioning to service queries. Can be done via replication or ETL

• For data warehousing/analytic databases, domain or time-based partitioning across multiple machines via ETL can help

• Memcached usage can help in certain cases

• InfiniDB provides true MPP query capabilities to deliver a real divide-and-conquer strategy

Recommendation: Use Scale-Out in addition to Scale-up

Page 24: Building High Performance MySql Query Systems And Analytic Applications


InfiniDB Enterprise – Scale Up and Out

[Diagram: User Connections → User Modules 1…n → Performance Modules 1…n → Shared Storage (database files, system catalog)]

Page 25: Building High Performance MySql Query Systems And Analytic Applications


#3 Divide and Conquer

SSB Query (@1000) | 1 PM (elapsed time in seconds) | 2 PM (elapsed time in seconds) | 4 PM (elapsed time in seconds) | 8 PM (elapsed time in seconds) | Overall percent reduction from 1 to 8 PMs
Q2.1 | 531.34 | 261.35 | 129.90 | 68.21 | 87%
Q2.2 | 430.25 | 214.87 | 106.37 | 56.41 | 87%
Q2.3 | 386.66 | 192.03 | 96.03 | 51.36 | 87%
Q3.1 | 848.79 | 425.25 | 316.50 | 134.21 | 84%
Q3.2 | 597.97 | 297.46 | 148.49 | 77.74 | 87%

InfiniDB also ‘divides and conquers’ by:

• Shared nothing data cache provides distributed data cache across all nodes

• Distributed hash joins, which are tailor-made for large join operations
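To make the distributed hash join idea concrete, here is a minimal single-process sketch. It is not InfiniDB's implementation: the function names, the sample tables, and the choice to partition both inputs by a hash of the join key are assumptions made for illustration. Each loop iteration stands in for the work one node would do on its own partition.

```python
from collections import defaultdict


def partition_by_key(rows, key, n_nodes):
    """Assign each row to a node by hashing its join key."""
    partitions = [[] for _ in range(n_nodes)]
    for row in rows:
        partitions[hash(row[key]) % n_nodes].append(row)
    return partitions


def local_hash_join(build_rows, probe_rows, key):
    """Classic hash join on one node: build a hash table, then probe it."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return [
        {**match, **row}
        for row in probe_rows
        for match in table.get(row[key], [])
    ]


def distributed_hash_join(small, large, key, n_nodes=4):
    """Partition both inputs the same way, join each partition, merge results."""
    small_parts = partition_by_key(small, key, n_nodes)
    large_parts = partition_by_key(large, key, n_nodes)
    joined = []
    for node in range(n_nodes):  # each iteration stands in for one node
        joined.extend(local_hash_join(small_parts[node], large_parts[node], key))
    return joined


# Hypothetical sample data.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bo"}]
orders = [
    {"cust_id": 1, "total": 10},
    {"cust_id": 1, "total": 5},
    {"cust_id": 3, "total": 7},
]
print(distributed_hash_join(customers, orders, "cust_id"))
```

Because rows with equal keys always hash to the same partition, no node needs to see another node's data to produce its share of the join.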

Page 26: Building High Performance MySql Query Systems And Analytic Applications


#4 Scale both I/O and User Connections

Recommendation: Use modular architecture

[Diagram: User Connections → User Modules 1…n → Performance Modules 1…n → Shared Storage (database files, system catalog)]

Add more Performance Modules to scale I/O

Add more User Modules to scale concurrency

Page 27: Building High Performance MySql Query Systems And Analytic Applications


#5 Provide Transparent Expansion and Failover

• A combination of replication and application sharding / partitioning can provide for capacity expansion and failover

• Failover is not built into MySQL but can be implemented via replication and floating IPs or other products like DRBD

• InfiniDB allows new Performance Module nodes to be transparently added and removed. Failover is automatically handled at the performance module level

• InfiniDB allows new User Modules to be added and configured to an existing setup. Failover involves aiming existing users at other participating nodes from a failed user module

Recommendation: Use either replication or MPP
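The idea of aiming users at another participating node when one fails can be sketched on the client side as follows. Everything here is hypothetical: `run_with_failover`, `NodeDown`, the node names, and the stand-in `fake_run` function are illustrative and do not describe how MySQL or InfiniDB implement failover.

```python
class NodeDown(Exception):
    """Raised by the hypothetical run_on_node callable when a node is unreachable."""


def run_with_failover(sql, nodes, run_on_node):
    """Try each node in order and return (node, result) from the first that works."""
    last_error = None
    for node in nodes:
        try:
            return node, run_on_node(node, sql)
        except NodeDown as exc:
            last_error = exc  # remember the failure and move on to the next node
    raise RuntimeError("no nodes available") from last_error


def fake_run(node, sql):
    """Stand-in for a real database call; 'um1' is simulated as failed."""
    if node == "um1":
        raise NodeDown(node)
    return f"{node} ran: {sql}"


print(run_with_failover("SELECT 1", ["um1", "um2"], fake_run))
```

In practice this logic often lives in a load balancer or behind a floating IP rather than in application code, which keeps the redirect transparent to end users.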

Page 28: Building High Performance MySql Query Systems And Analytic Applications


#5 Provide Transparent Expansion and Failover

[Diagram: sharding architecture. Browsers → Web/App Servers → shards holding Cust_id 1-999, Cust_id 1000-1999, and Cust_id 2000-2999, with MySQL replication.]
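A range-based shard lookup for the Cust_id ranges on this slide might look like the sketch below. The range boundaries come from the slide; the host names and the function `shard_for_customer` are hypothetical.

```python
import bisect

# Inclusive upper bound of each shard's Cust_id range, taken from the slide.
SHARD_UPPER_BOUNDS = [999, 1999, 2999]
# Hypothetical host names, one per range.
SHARD_HOSTS = ["shard-a", "shard-b", "shard-c"]


def shard_for_customer(cust_id):
    """Return the host that holds the given customer, using binary search."""
    if cust_id < 1 or cust_id > SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"no shard covers Cust_id {cust_id}")
    # bisect_left finds the first range whose upper bound is >= cust_id.
    return SHARD_HOSTS[bisect.bisect_left(SHARD_UPPER_BOUNDS, cust_id)]


for cust_id in (1, 999, 1000, 2500):
    print(cust_id, shard_for_customer(cust_id))
```

Keeping the boundaries in one sorted list means adding a shard for a new range is a data change rather than a code change.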

Page 29: Building High Performance MySql Query Systems And Analytic Applications


#5 Provide Transparent Expansion and Failover

[Diagram: User Connections → User Modules 1…n → Performance Modules 1…n → Shared Storage (database files, system catalog)]

If one Performance Module fails, traffic resumes with the remaining nodes

User queries can be redirected to other User Modules if one fails

Page 30: Building High Performance MySql Query Systems And Analytic Applications


#6 Load New Data with Minimal Impact

• For incremental data feeds, you can use ETL tools to write new data to flat files on read database host and then load them with high-speed loader vs. incremental inserts

• Storage engines supporting MVCC should be able to support concurrent loads/queries

• InfiniDB supports MVCC

• InfiniDB has high-speed, multi-threaded, non-blocking loader that loads data and simply moves a table’s high-water mark once the load has been completed

Recommendation: Use two-step ETL feed with non-blocking load utilities and/or MVCC database engine
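The high-water-mark technique mentioned above can be illustrated with a toy in-memory table. This is a conceptual sketch only, not InfiniDB's loader: the class `AppendOnlyTable` and its `stage`/`publish` methods are hypothetical names chosen for the example.

```python
class AppendOnlyTable:
    """Toy table where readers only see rows below the high-water mark."""

    def __init__(self):
        self._rows = []
        self._high_water_mark = 0  # number of rows visible to queries

    def stage(self, new_rows):
        """Append rows past the mark; readers cannot see them yet."""
        self._rows.extend(new_rows)

    def publish(self):
        """Move the mark once, making everything staged so far visible."""
        self._high_water_mark = len(self._rows)

    def scan(self):
        """Return the rows a query is allowed to see right now."""
        return self._rows[:self._high_water_mark]


table = AppendOnlyTable()
table.stage(["row-1", "row-2"])
print(table.scan())  # load still in progress, so nothing new is visible
table.publish()
print(table.scan())  # the whole load becomes visible in one step
```

Because queries never look past the mark, a load in progress neither blocks readers nor exposes a half-loaded batch.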

Page 31: Building High Performance MySql Query Systems And Analytic Applications


#6 Load New Data with Minimal Impact

[Diagram: data warehouse architecture with a high-speed load step. Operational source data (OLTP, Files/XML, Log Files) → Staging or ODS (Staging Area) → ETL → High-speed Load Utility → Data Warehouse → Ad-Hoc, Dashboards, Reports, Notifications → Users. Data Warehouse and Metadata Management spans the whole pipeline.]

Page 32: Building High Performance MySql Query Systems And Analytic Applications


#7 Quickly Troubleshoot Poor Read Performance

• MySQL 5.1 and above ships with the mysqlslap utility, which can help with load testing; other third-party tools exist as well

• The MySQL 5.1 SQL profiler is a good utility for examining SQL performance

• InfiniDB offers both a SQL statement diagnostic utility as well as a more detailed trace utility for troubleshooting slow running code

• InfiniDB removes the need for indexing, partitioning, and most other database design tuning; no heavy-duty expertise required to build a very fast database

Recommendation: Proactively use load testing; reactively use SQL analysis and tracing

Page 33: Building High Performance MySql Query Systems And Analytic Applications


InfiniDB Extent Map – No Indexing Needed

Extent Map

Extent | Col1 Min | Col1 Max | Col2 Min | Col2 Max
Ext 1 | 1 | 100 | 100 | 10000
Ext 2 | 101 | 200 | 10100 | 20000
Ext 3 | 201 | 300 | 20100 | 30000
Ext 4 | 301 | 400 | 30100 | 40000

If a column WHERE filter of “COL1 BETWEEN 220 AND 250 AND COL2 < 10000” is specified, InfiniDB will eliminate extents 1, 2 and 4 from the first column filter, then, looking at just the matching extents for COL2 (i.e. just extent 3), it will determine that no extents match and return zero rows without doing any I/O at all.

Also enables logical range partitioning of data…
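The extent elimination walk-through above can be reproduced with a small sketch. The min/max values are copied from the extent map; the data structure, the function `extents_overlapping`, and the assumption that the columns hold integers (so `COL2 < 10000` becomes an inclusive upper bound of 9999) are illustrative and not taken from InfiniDB itself.

```python
# (extent number, min, max) per column, copied from the extent map above.
EXTENT_MAP = {
    "COL1": [(1, 1, 100), (2, 101, 200), (3, 201, 300), (4, 301, 400)],
    "COL2": [(1, 100, 10000), (2, 10100, 20000), (3, 20100, 30000), (4, 30100, 40000)],
}


def extents_overlapping(column, low=None, high=None, candidates=None):
    """Return extent numbers whose [min, max] range can hold values in [low, high].

    If candidates is given, only those extents are considered, which models
    applying a second filter to the survivors of the first.
    """
    keep = set()
    for ext, ext_min, ext_max in EXTENT_MAP[column]:
        if candidates is not None and ext not in candidates:
            continue
        if low is not None and ext_max < low:
            continue  # whole extent lies below the requested range
        if high is not None and ext_min > high:
            continue  # whole extent lies above the requested range
        keep.add(ext)
    return keep


# COL1 BETWEEN 220 AND 250
after_col1 = extents_overlapping("COL1", low=220, high=250)
# COL2 < 10000 (assuming integers, the inclusive upper bound is 9999)
after_col2 = extents_overlapping("COL2", high=9999, candidates=after_col1)
print(after_col1, after_col2)
```

Only the min/max metadata is consulted here, which is why the query in the example can be answered as "zero rows" before any column data is read.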

Page 34: Building High Performance MySql Query Systems And Analytic Applications


Summary

Recommendation | General Technique | InfiniDB
Only read the data you need | Use column database | Is column-oriented
Exploit modern hardware | Use DB’s/storage engines that are multi-threaded | Is multi-threaded and uses multiple CPUs / Cores
Divide and Conquer | Spread load via replication or MPP | Supports MPP scale out
Scale concurrency and I/O | Application partition | Modular architecture for scaling both concurrency and I/O
Provide transparent expansion and failover | Use replication and load balancers | Does transparent failover for I/O and manual for connectivity
Load data with minimal impact | Use two-step ETL and bulk load process | Has high-speed loader with no blocking and MVCC
Method for troubleshooting poor read performance | Use load testing and SQL analysis tools | Provides both diagnostic and tracing tools; no major design tuning efforts

Page 35: Building High Performance MySql Query Systems And Analytic Applications


Calpont Solutions

Calpont Analytic Database Server Editions

• InfiniDB Community Server: Column-Oriented, Multi-threaded, Terabyte Capable, Single Server

• InfiniDB Enterprise Server: Scale out / Parallel Processing, Automatic Failover

Calpont Analytic Database Solutions

• InfiniDB Enterprise Solution: Monitoring, 24x7 Support, Auto Patch Management, Alerts & SNMP Notifications, Hot Fix Builds, Consultative Help

Page 36: Building High Performance MySql Query Systems And Analytic Applications


InfiniDB Community & Enterprise Server Comparison

Core Database Server Features | InfiniDB Community | InfiniDB Enterprise
MySQL front end | Yes | Yes
Column-oriented | Yes | Yes
Logical data compression | Yes | Yes
High-Speed bulk loader w/ no blocking queries while loading | Yes | Yes
Crash-recovery | Yes | Yes
Transaction support (ACID compliant) | Yes | Yes
INSERT/UPDATE/DELETE (DML) support | Yes | Yes
Multi-threaded engine (queries/writes will use all CPUs/cores on box) | Yes | Yes
No indexing necessary | Yes | Yes
Automatic vertical (column) and logical horizontal partitioning of data | Yes | Yes
MVCC support – snapshot read (readers don’t block writers) | Yes | Yes
Alter Table with online add column capability | Yes | Yes
High concurrency supported | Yes | Yes
Terabyte database capable | Yes | Yes
Multi-Node, MPP scale out capable w/ failover | No | Yes
Support | Forums Only | Formal Production Support

Page 37: Building High Performance MySql Query Systems And Analytic Applications


For More Information

• Download InfiniDB Community Edition

• Download InfiniDB documentation

• Read the InfiniDB technical white paper

• Read InfiniDB intro articles on MySQL dev zone

• Visit InfiniDB online forums

• Trial the InfiniDB Enterprise Edition: http://www.calpont.com

www.infinidb.org

Page 38: Building High Performance MySql Query Systems And Analytic Applications


Building High-Performance MySQL Query Systems and Analytic Applications

Thanks…!