MariaDB ColumnStore - LONDON MySQL Meetup

ColumnStore OpenSource Engine for Analytics/BI Bruno Šimić Solutions Engineer

Upload: ivan-zoratti

Post on 11-Apr-2017

TRANSCRIPT

Page 1: MariaDB ColumnStore - LONDON MySQL Meetup

ColumnStore: Open Source Engine for Analytics/BI

Bruno Šimić, Solutions Engineer

Page 2: MariaDB ColumnStore - LONDON MySQL Meetup

Data and big data

OLTP vs. OLAP

Page 3: MariaDB ColumnStore - LONDON MySQL Meetup

OLTP: On-line Transaction Processing
• large number of short on-line transactions (INSERT, UPDATE, DELETE)
• very fast query processing, maintaining data integrity in multi-access environments
• effectiveness measured by number of transactions per second
• operational (detailed and current) data
• OLTP systems are the original data source

OLAP: On-line Analytical Processing
• characterized by a low volume of concurrent transactions
• complex queries, often involving aggregations
• response time is the effectiveness measure
• data is aggregated, historical and stored in multi-dimensional schemas
• OLAP data comes from the various OLTP databases

Rows/Data Size Scope
Rows: 1 | 100 | 10,000 | 1,000,000 | 100,000,000 | 10,000,000,000 | 100,000,000,000
Data size: 10-100GB | 100-1000GB | 1-10TB | 10-100TB | ... PB
MariaDB: OLTP
MariaDB ColumnStore: OLAP

Page 4: MariaDB ColumnStore - LONDON MySQL Meetup

1. Descriptive Analytics: What is happening?

Traditional OLAP

2. Diagnostic Analytics: Why did it happen?

3. Predictive Analytics: What is likely to happen?

4. Prescriptive Analytics: What should I do about it?

Big Data Analytics

Social Media

Sensors

Node 1

Biometrics

Mobile

Data Collection

MariaDB ColumnStore

Data Processing

BI Tools, Data Science Applications

Connectors, Spark Integration etc.

MariaDB MaxScale

Transactional, Operational

Analytics Insight

Page 5: MariaDB ColumnStore - LONDON MySQL Meetup

MariaDB ColumnStore Architecture

Application: BI Tool | SQL Client | Custom Big Data App

User Module (UM): MariaDB SQL Front End

Performance Module (PM): Distributed Query Engine

Data Storage: Columnar Distributed Data Storage (Local Storage | SAN | EBS | GlusterFS)

Page 6: MariaDB ColumnStore - LONDON MySQL Meetup

User Modules
• mysqld - The MariaDB server
• ExeMgr - MariaDB's interface to ColumnStore
• cpimport - high-performance data import

Query Processing - UM
• SQL operations are translated into thousands of primitives
• Parallel/distributed SQL
• Extensible with parallel/distributed UDFs
• Query is parsed by mysqld on the UM node
• Parsed query is handed over to ExeMgr on the UM node
• ExeMgr breaks down the query into primitive operations

MariaDB SQL Front End

User Module (UM)

Page 7: MariaDB ColumnStore - LONDON MySQL Meetup

Performance Modules
• PrimProc - Primitives Processor
• WriteEngineServ - Database file writing processor
• DMLProc - DML writes processor
• DDLProc - DDL processor

Query Processing - PM
• Primitives are processed on the PM
• One thread works on a range of rows
• Typically 1/2 million rows, stored in a few hundred blocks of data
• Execute all column operations required (restriction and projection)
• Execute any group by/aggregation against local data
• Each primitive executes in a fraction of a second
• Primitives are run in parallel and fully distributed

Distributed Query Engine

Performance Module (PM)
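
The split between UM and PM work can be illustrated with a small sketch. It is not ColumnStore code: the function names (`split_into_primitives`, `run_primitive`, `execute`) are hypothetical, a Python list stands in for a column, and threads stand in for PM nodes. It only shows the idea of breaking a scan into row ranges of about half a million rows, filtering and aggregating each range locally, and merging the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS_PER_PRIMITIVE = 512 * 1024  # "typically 1/2 million rows" per the slide


def split_into_primitives(total_rows, chunk=ROWS_PER_PRIMITIVE):
    """UM side: break a scan into (start, end) row ranges, one per primitive."""
    return [(start, min(start + chunk, total_rows))
            for start in range(0, total_rows, chunk)]


def run_primitive(column, row_range, predicate):
    """PM side: filter one row range and aggregate it locally."""
    start, end = row_range
    matching = [v for v in column[start:end] if predicate(v)]
    return len(matching), sum(matching)


def execute(column, predicate, workers=4):
    """Run all primitives in parallel, then merge the partial aggregates."""
    ranges = split_into_primitives(len(column))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda r: run_primitive(column, r, predicate), ranges))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return count, total
```

Calling `execute(column, lambda v: v % 2 == 0)` returns the count and sum of the even values, computed range by range.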

Page 8: MariaDB ColumnStore - LONDON MySQL Meetup

Storage Architecture

• Columnar storage
  – Each column stored as a separate file
  – No index management for query performance tuning
  – Online schema changes: add a new column without impacting running queries

• Automatic horizontal partitioning
  – Logical partition every 8 million rows
  – In-memory metadata of partition min and max
  – No partition management for query performance tuning

• Compression
  – Accelerate decompression rate
  – Reduce I/O for compressed blocks

Data automatically arranged by
• Column – acts as vertical partitioning
• Extents – act as horizontal partitioning

[Diagram: Column 1, Column 2, Column 3, ... Column N, each a vertical partition; Extent 1 (8 million rows, 8MB~64MB), Extent 2 (8 million rows), ... Extent M (8 million rows), each a horizontal partition]
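
A minimal sketch of this layout, assuming in-memory lists in place of column files. The names `build_extents` and `store_table` are made up for illustration and the dictionary layout is not ColumnStore's on-disk format. Each column is stored on its own (vertical partitioning), cut into extents of a fixed row count (horizontal partitioning), and each extent keeps its min and max value as metadata.

```python
EXTENT_ROWS = 8_000_000  # one logical partition every 8 million rows (per the slide)


def build_extents(values, extent_rows=EXTENT_ROWS):
    """Split one column into extents and record min/max metadata for each."""
    extents = []
    for first_row in range(0, len(values), extent_rows):
        chunk = values[first_row:first_row + extent_rows]
        extents.append({
            "first_row": first_row,
            "values": chunk,
            "min": min(chunk),
            "max": max(chunk),
        })
    return extents


def store_table(rows, column_names, extent_rows=EXTENT_ROWS):
    """Vertical partitioning: one list of extents per column."""
    return {
        name: build_extents([row[i] for row in rows], extent_rows)
        for i, name in enumerate(column_names)
    }
```

With a tiny extent size, for example `store_table(rows, ["key", "state"], extent_rows=2)`, the per-extent min/max values are easy to inspect by hand.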

Page 9: MariaDB ColumnStore - LONDON MySQL Meetup

High Performance Data Ingestion

• Fully parallel high speed data load
  – Parallel data loads on all PMs simultaneously
  – Multiple tables can be loaded simultaneously
  – Read queries continue without being blocked

• Micro-batch loading for real-time data flow

[Diagram: Column 1, Column 2, ... Column N, split into Extent 1 (8 million rows, 8MB~64MB), Extent 2 (8 million rows), ... Extent M (8 million rows), each a horizontal partition. A High Water Mark separates the data accessed by running queries from the new data being loaded.]
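
The high water mark idea can be sketched as follows. This is an illustration under assumptions, not ColumnStore's implementation: the class `ColumnFile` and its methods are hypothetical, and a single list stands in for a column file. Readers only see rows below the mark, so a load can append new rows without blocking them, and the load becomes visible only when the mark moves.

```python
class ColumnFile:
    """Append-only column with a high water mark (HWM) for readers."""

    def __init__(self):
        self._values = []
        self._hwm = 0  # readers only see rows in [0, hwm)

    def bulk_load(self, new_values):
        """Append new data above the HWM; it stays invisible until commit."""
        self._values.extend(new_values)

    def commit(self):
        """Publish the loaded rows by moving the HWM to the end."""
        self._hwm = len(self._values)

    def rollback(self):
        """Discard anything that was loaded above the HWM."""
        del self._values[self._hwm:]

    def scan(self):
        """What a running query sees: only rows below the HWM."""
        return self._values[:self._hwm]
```

A query that calls `scan()` while `bulk_load()` is in progress keeps seeing the previously committed rows.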

Page 10: MariaDB ColumnStore - LONDON MySQL Meetup

Column-oriented storage

Differences to row-oriented storage

Page 11: MariaDB ColumnStore - LONDON MySQL Meetup

Row-oriented vs. Column-oriented format

Row oriented: rows stored sequentially in a file.

Column oriented: each column is stored in a separate file. Each column for a given row is at the same offset.

Key  Fname     Lname  State  Zip    Phone           Age  Sex
1    Bugs      Bunny  NY     11217  (718) 938-3235  34   M
2    Yosemite  Sam    CA     95389  (209) 375-6572  52   M
3    Daffy     Duck   NY     10013  (212) 227-1810  35   M
4    Elmer     Fudd   ME     04578  (207) 882-7323  43   M
5    Witch     Hazel  MA     01970  (978) 744-0991  57   F

Column files:
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
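
The two layouts can be compared with a short sketch that uses the slide's sample data. The helper names `to_columns` and `read_row` are hypothetical, and Python lists stand in for files. Reassembling a row works because every column holds the value for a given row at the same offset.

```python
COLUMNS = ["Key", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex"]

# Row-oriented: each record is stored as one unit, one after another.
ROWS = [
    (1, "Bugs", "Bunny", "NY", "11217", "(718) 938-3235", 34, "M"),
    (2, "Yosemite", "Sam", "CA", "95389", "(209) 375-6572", 52, "M"),
    (3, "Daffy", "Duck", "NY", "10013", "(212) 227-1810", 35, "M"),
    (4, "Elmer", "Fudd", "ME", "04578", "(207) 882-7323", 43, "M"),
    (5, "Witch", "Hazel", "MA", "01970", "(978) 744-0991", 57, "F"),
]


def to_columns(rows, column_names):
    """Column-oriented: one list (standing in for a file) per column."""
    return {name: [row[i] for row in rows]
            for i, name in enumerate(column_names)}


def read_row(columns, offset):
    """Rebuild a row by reading the same offset from every column."""
    return tuple(values[offset] for values in columns.values())
```

A query that only needs `State` reads `to_columns(ROWS, COLUMNS)["State"]` and never touches the other seven lists.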

Page 12: MariaDB ColumnStore - LONDON MySQL Meetup

Single-Row Operations - Insert

Row oriented: new rows appended to the end.

Column oriented: new value added to each file.

Key  Fname     Lname    State  Zip    Phone           Age  Sex
1    Bugs      Bunny    NY     11217  (718) 938-3235  34   M
2    Yosemite  Sam      CA     95389  (209) 375-6572  52   M
3    Daffy     Duck     NY     10013  (212) 227-1810  35   M
4    Elmer     Fudd     ME     04578  (207) 882-7323  43   M
5    Witch     Hazel    MA     01970  (978) 744-0991  57   F
6    Marvin    Martian  CA     91602  (818) 761-9964  26   M

Column files:
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F

New value added to each file: 6 | Marvin | Martian | CA | 91602 | (818) 761-9964 | 26 | M

Columnar insert is not efficient for singleton insertions (OLTP). Batch loads touch rows vs. columns. Batch load on column-oriented storage is faster (compression, no indexes).

Page 13: MariaDB ColumnStore - LONDON MySQL Meetup

Single-Row Operations - Delete

Row oriented: rows deleted.

Column oriented: value deleted from each file.

Key  Fname     Lname    State  Zip    Phone           Age  Sex
1    Bugs      Bunny    NY     11217  (718) 938-3235  34   M
2    Yosemite  Sam      CA     95389  (209) 375-6572  52   M
3    Daffy     Duck     NY     10013  (212) 227-1810  35   M
4    Elmer     Fudd     ME     04578  (207) 882-7323  43   M
5    Witch     Hazel    MA     01970  (978) 744-0991  57   F
6    Marvin    Martian  CA     91602  (818) 761-9964  26   M

Column files:
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F

Value deleted from each file: 6 | Marvin | Martian | CA | 91602 | (818) 761-9964 | 26 | M

Recommended: Partition Drop, which allows dropping column data in bulk.

Page 14: MariaDB ColumnStore - LONDON MySQL Meetup

Single-Row Operations - Update

Row oriented: updating 100% of rows means changing 100% of blocks on disk.

Column oriented: only the blocks that need to change are updated.

Key  Fname     Lname  State  Zip    Phone           Age  Sex
1    Bugs      Bunny  NY     11217  (718) 938-3235  34   M
2    Yosemite  Sam    CA     95389  (209) 375-6572  52   M
3    Daffy     Duck   NY     10013  (212) 227-1810  35   M
4    Elmer     Fudd   ME     04578  (207) 882-7323  43   M
5    Witch     Hazel  MA     01970  (978) 744-0991  57   F

Column files:
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F

Page 15: MariaDB ColumnStore - LONDON MySQL Meetup

Changing the table structure

Row oriented: requires rebuilding the whole table.

Column oriented: create a new file for the new column.

Key  Fname     Lname  State  Zip    Phone           Age  Sex  Active
1    Bugs      Bunny  NY     11217  (718) 938-3235  34   M    Y
2    Yosemite  Sam    CA     95389  (209) 375-6572  52   M    N
3    Daffy     Duck   NY     10013  (212) 227-1810  35   M    N
4    Elmer     Fudd   ME     04578  (207) 882-7323  43   M    Y
5    Witch     Hazel  MA     01970  (978) 744-0991  57   F    N

Column files:
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
Active: Y, N, N, Y, N

Column-oriented storage is very flexible for adding columns; no full rebuild is required.
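
A small sketch of why adding a column is cheap in a columnar layout, using in-memory lists in place of files. Both function names are hypothetical. The column-oriented version creates one new list and leaves the existing ones untouched, while the row-oriented version has to rewrite every record.

```python
def add_column_columnar(columns, name, default=None):
    """Column store: create one new 'file'; existing columns are not touched."""
    row_count = len(next(iter(columns.values()))) if columns else 0
    columns[name] = [default] * row_count
    return columns


def add_column_row_store(rows, default=None):
    """Row store: every record is rewritten with the extra field."""
    return [row + (default,) for row in rows]
```

For example, `add_column_columnar(columns, "Active", "N")` adds an `Active` list of the right length and returns the same dictionary.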

Page 16: MariaDB ColumnStore - LONDON MySQL Meetup

High Performance Query Processing

Storage Architecture reduces I/O

• Only touch column files that are in projection, filter and join conditions

• Eliminate disk block touches to partitions outside filter and join conditions

SELECT Fname FROM Table1 WHERE State = 'NY'

Horizontal partitions (8 million rows each):
Extent 1: Min State: CA, Max State: NY
Extent 2: Min State: OR, Max State: WY (ELIMINATED PARTITION)
Extent 3: Min State: IA, Max State: TN

[Diagram: table with columns ID, Fname, Lname, State, Zip, Phone, Age, Sex, each a vertical partition. First rows as shown on the slide:

ID  Fname     Lname  State  Zip    Phone           Age  Sex
1   Bugs      Bunny  NY     11217  (718) 938-3235  34   M
2   Yosemite  Sam    CA     95389  (209) 375-6572  52   M
3   Daffy     Duck   NY     10013  (212) 227-1810  35   M
4   Hazel     Fudd   ME     04578  (207) 882-7323  43   F

State values per extent:
Extent 1 (ID 1 to 8M): NY, CA, NY, ME, ..., MN
Extent 2 (ID 8M+1 to 16M): WY, TX, OR, ..., VA
Extent 3 (ID 16M+1 to 24M): TN, IA, NY, ..., PA

Later rows of the Fname column show Jane and Elmer.]
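
Extent elimination for the query on this slide can be sketched as a comparison against the per-extent min and max values. The function name `extents_to_scan` is hypothetical and the dictionary is just a stand-in for the in-memory extent metadata; the ranges are the ones listed on the slide.

```python
def extents_to_scan(extent_ranges, value):
    """Keep only extents whose [min, max] range can contain the value."""
    return [name for name, (low, high) in extent_ranges.items()
            if low <= value <= high]


# Min/max State per extent, as listed on the slide.
state_ranges = {
    "Extent 1": ("CA", "NY"),
    "Extent 2": ("OR", "WY"),
    "Extent 3": ("IA", "TN"),
}

print(extents_to_scan(state_ranges, "NY"))  # ['Extent 1', 'Extent 3']
```

Extent 2 is skipped without reading any of its blocks, because 'NY' sorts before its minimum value 'OR'.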

Page 17: MariaDB ColumnStore - LONDON MySQL Meetup

Analytics

– In-database distributed analytics with complex join, aggregation, window functions

– Extensible UDF for custom analytics

– Cross Engine Join with other storage engines

Window functions

– PARTITION BY / ORDER BY

– Aggregate functions: MAX, MIN, COUNT, SUM, AVG, STD, STDDEV_SAMP, STDDEV_POP, VAR_SAMP, VAR_POP

– Ranking: ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, PERCENTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN

Daily Running Average Revenue for each item

SELECT item_id, server_date, revenue,
       AVG(revenue) OVER (PARTITION BY item_id
                          ORDER BY server_date
                          RANGE INTERVAL '1' DAY PRECEDING) running_avg
FROM web_item_sales

Item ID  Server_date  Revenue    Running Average
1        02-01-2014   20,000.00  20,000.00
1        02-02-2014    5,001.00  12,500.50
2        02-01-2014   15,000.00  15,000.00
2        02-04-2014   34,029.00  34,029.00
2        02-05-2014    7,138.00  20,583.50
3        02-01-2014   17,250.00  17,250.00
3        02-03-2014   25,010.00  25,010.00
3        02-04-2014   21,034.00  23,022.00
3        02-05-2014    4,120.00  12,577.00
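
The frame used by this window function can be mimicked in plain Python to see which rows feed each average. This is only a sketch: `running_avg` is a hypothetical helper, the revenue figures are copied from the slide's table, and the dates are assumed to be in MM-DD-YYYY form. For each row, the frame holds the rows of the same item whose date falls between one day earlier and the current date.

```python
from datetime import date, timedelta

# (item_id, server_date, revenue) from the slide; dates assumed MM-DD-YYYY.
SALES = [
    (1, date(2014, 2, 1), 20000.00),
    (1, date(2014, 2, 2), 5001.00),
    (2, date(2014, 2, 1), 15000.00),
    (2, date(2014, 2, 4), 34029.00),
    (2, date(2014, 2, 5), 7138.00),
    (3, date(2014, 2, 1), 17250.00),
    (3, date(2014, 2, 3), 25010.00),
    (3, date(2014, 2, 4), 21034.00),
    (3, date(2014, 2, 5), 4120.00),
]


def running_avg(sales, window=timedelta(days=1)):
    """AVG over rows of the same item dated within `window` before each row."""
    result = []
    for item_id, day, revenue in sales:
        frame = [r for i, d, r in sales
                 if i == item_id and day - window <= d <= day]
        result.append((item_id, day, revenue, sum(frame) / len(frame)))
    return result
```

For item 1 on 02-02-2014 the frame holds 20,000.00 and 5,001.00, which gives 12,500.50; for item 2 on 02-04-2014 there is no sale on the previous day, so the average is the day's own revenue.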

Page 18: MariaDB ColumnStore - LONDON MySQL Meetup

MariaDB ColumnStore

Best practices

Page 19: MariaDB ColumnStore - LONDON MySQL Meetup

Best Practices

General

• Not suited for OLTP, needs big data to process fast (millions of records)

• Micro-batch load allows near real-time behaviour

• Infrequently used columns do not impact other queries

• Columnar suitable for sparse columns (nulls compress nicely)

Query Modeling

• Star-schema optimizations are generally a good idea

• Conservative data typing is important
  – fixed-length vs. dictionary boundary (8 bytes)
  – IP Address vs. IP Number

• Break down compound fields into individual fields
  – Trivializes searching for sub-fields
  – Can avoid dictionary overhead
  – Cost to re-assemble is generally small
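
Two of the query modeling tips can be illustrated with a short sketch. It assumes that "IP Address vs. IP Number" means storing an IPv4 address as an integer rather than as a dotted string, so that the value fits a small fixed-length column. The helper names are hypothetical, and `split_phone` assumes the "(718) 938-3235" format used in the sample data.

```python
import ipaddress


def ip_to_number(address):
    """Dotted IPv4 string -> integer that fits a 4-byte unsigned column."""
    return int(ipaddress.IPv4Address(address))


def number_to_ip(number):
    """Integer -> dotted IPv4 string, for display."""
    return str(ipaddress.IPv4Address(number))


def split_phone(phone):
    """Break '(718) 938-3235' into area code, exchange and line number."""
    area, rest = phone.strip("(").split(") ")
    exchange, line = rest.split("-")
    return area, exchange, line
```

With the area code in its own column, a search by area code becomes a simple equality filter on a short fixed-length value, and re-assembling the full number for display is cheap.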

Page 20: MariaDB ColumnStore - LONDON MySQL Meetup

Data Ingestion

cpimport

• Fastest way to load data from a CSV file, standard input or a binary source file

• Multiple tables can be loaded in parallel by launching multiple jobs

• Read queries continue without being blocked

• Successful cpimport is auto-committed

• In case of errors, the entire load is rolled back

LOAD DATA INFILE

• Traditional way of importing data into any MariaDB storage engine table

• Up to 2 times slower than cpimport for large imports

• Both successful and failed operations can be rolled back

Page 21: MariaDB ColumnStore - LONDON MySQL Meetup

High Availability

HA at UM node

• When one UM node goes down, another UM node takes over

HA at PM node

• SAN/AWS EBS - When a PM node goes down, the data volumes attached to the failed PM node get attached to another PM

• Local Disks - If a PM node goes down, the data on its disks is not available, though queries continue on the remaining data set

HA at Data Storage

• AWS EBS (Elastic Block Store)

• GlusterFS - Multiple copies of each data block across storage. If a disk on a PM node fails, another PM node will have access to the copy of the data

Page 22: MariaDB ColumnStore - LONDON MySQL Meetup

Where to find MariaDB ColumnStore?

SOFTWARE DOWNLOAD https://mariadb.com/downloads/columnstore

SOURCE https://github.com/mariadb-corporation/mariadb-columnstore-engine

DOCUMENTATION https://mariadb.com/kb/en/mariadb/mariadb-columnstore/

BLOGS https://mariadb.com/blog-tags/columnstore

Page 23: MariaDB ColumnStore - LONDON MySQL Meetup

Thank you

Bruno Šimić
[email protected]