
Upload: dataversity

Post on 20-Aug-2015


Page 1: Wed 1030 mc_knight_william_color

Copyright © 2011 McKnight Consulting Group, LLC All Rights Reserved – Confidential and Proprietary Slide 1

Unlock Potential

Columnar Databases:

Data Does the Twist and Analytics Shout

William McKnight, President, McKnight Consulting Group

Page 2: Wed 1030 mc_knight_william_color


William McKnight,

www.mcknightcg.com

Helping organizations adopt business-effective information management practices and technologies.

Page 3: Wed 1030 mc_knight_william_color


Agenda

• Row-Wise Design

• Columnar Storage

• Materialization

• Wrap-Up

Page 4: Wed 1030 mc_knight_william_color


Row-Wise Design

© McKnight Consulting Group, 2010

Page 5: Wed 1030 mc_knight_william_color


DBMS Design over the years

RDBMS design is virtually unchanged, except for parallelism

Hardware, however:

Disk capacity has increased tremendously (and become far cheaper)

CPU performance has improved too, but…

Transfer rates and seek times have improved only modestly

Page 6: Wed 1030 mc_knight_william_color


L2 Cache Misses

[Diagram: the memory hierarchy, from fastest to slowest: CPU, L1 cache, L2 cache, Memory, Disk]

Page 7: Wed 1030 mc_knight_william_color


Row-Wise DBMS Stores Data in Rows

CustomerID CompanyName ContactFirstName ContactLastName ContactTitle PhoneNumber

1119 m4ii dhamotharan achaiyan solutions architect 91222507176

1120 Aris Doug Johnson Practice Director 206-676-5636

1121 Stolt Offshore MS Ltd Craig Lennox Mr +66 1226 712519

1122 Medtronic, Inc. Mark Kohls Principle Database Administrator 763.516.2557

1123 Beckman Coulter Tim Parsons Business Systems Manager +61 22 996 0963

1124 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032

1126 The Boeing Company Mike Roberts Senior Business Process Architect (206)655-7155

1127 IT/1 Consulting Leif B. Soerensen Data Warehouse Consultant +65 26236691

1128 Banco de Bogotá JOSE ALFREDO LOPEZ ARIAS Administrador DWH 5713320032

1133 The HArtford Jimmy Chen Business System Analyst 215-653-2662

1134 CGI Group Terry Petherick Senior Consultant 613-236-2155

1135 Metavante Corporation Ron Kundinger Assistant Vice President 616-577-9227

1138 CP Associates Wilson Mak Consultant 252-92593731

1142 PRSB Ming Long Wu Assistant Administrator 226-2-23931261 ext 719

1143 aft greg tanner cto 303.233.6122

1144 Zamba Solutions Jeff McCall Executive Vice President 602-626-6125

1146 MR Consultancy Mukesh Rughani Mr +66 (0)1379 662219

1147 Intellor Group Robin Martin Project Coordinator 301-202-6766

1148 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032

Page 8: Wed 1030 mc_knight_william_color


Data Page Layout

[Diagram: a row-store data page made up of a Page Header, Records, Row IDs, and a Page Footer.]

Records:

1120 Aris Doug Johnson Practice Director 206-676-5636 [email protected]

1121 Stolt Offshore MS Ltd Craig Lennox Mr +66 1226 71269 [email protected]

1122 Medtronic, Inc. Mark Kohls Principle Database Administrator 763.516.2557 [email protected]

Page 9: Wed 1030 mc_knight_william_color


Traditional databases

Date Store # State Class Sales Category …

3/1/2010 32 NY A 6 Gen

3/1/2010 35 CT A 9 Spec

3/1/2010 36 CT C 11 Gen

3/1/2010 39 SD D 8 Gen

3/1/2010 42 KY A 5 Spec

3/1/2010 43 VT C 14 Spec

3/1/2010 47 GA A 31 Gen

3/1/2010 51 MD A 4 Sub

3/1/2010 55 DC D 16 Gen

3/1/2010 59 NY B 7 Gen

3/1/2010 62 NJ C 9 Spec

Calculate the average sales for the “A” stores in “NY”

Traditional approach:

• Data stored by row using small data pages (4K or 8K)

• For queries, select a ‘filter’

- Build a B-tree index for filters

- BUT if the filter is not selective enough, then scan the table

- Go to selected pages and add up sales numbers

- Randomly distributed data will result in most pages being read

- Still have to read irrelevant data in each page
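As a rough illustration of the row-wise cost described above, the following Python sketch answers the slide's question by scanning whole rows. The in-memory list of tuples and the `fields_read` counter are assumptions made for illustration; a real row store reads pages from disk, not Python tuples.

```python
# Sample rows from the slide: (date, store, state, class, sales, category).
ROWS = [
    ("3/1/2010", 32, "NY", "A", 6, "Gen"),
    ("3/1/2010", 35, "CT", "A", 9, "Spec"),
    ("3/1/2010", 36, "CT", "C", 11, "Gen"),
    ("3/1/2010", 39, "SD", "D", 8, "Gen"),
    ("3/1/2010", 42, "KY", "A", 5, "Spec"),
    ("3/1/2010", 43, "VT", "C", 14, "Spec"),
    ("3/1/2010", 47, "GA", "A", 31, "Gen"),
    ("3/1/2010", 51, "MD", "A", 4, "Sub"),
    ("3/1/2010", 55, "DC", "D", 16, "Gen"),
    ("3/1/2010", 59, "NY", "B", 7, "Gen"),
    ("3/1/2010", 62, "NJ", "C", 9, "Spec"),
]


def average_sales_row_store(rows, state, store_class):
    """Full scan: every field of every row is read, relevant or not."""
    total = 0
    matches = 0
    fields_read = 0  # hypothetical counter standing in for I/O volume
    for row in rows:
        fields_read += len(row)  # the whole row comes off the page
        _date, _store, row_state, row_class, sales, _category = row
        if row_state == state and row_class == store_class:
            total += sales
            matches += 1
    average = total / matches if matches else None
    return average, fields_read


average, fields_read = average_sales_row_store(ROWS, "NY", "A")
print(average, fields_read)  # 6.0 66
```

Only store 32 is a class "A" store in NY, yet all 66 field values are touched to find it.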

Page 10: Wed 1030 mc_knight_william_color

Columnar Storage

Page 11: Wed 1030 mc_knight_william_color


Columnar DBMS Stores Data in Columns

CustomerID: 1119 | 1120 | 1121 | 1122 | 1123 | 1124 | 1126 | 1127 | 1128 | 1133 | 1134

CompanyName: m4ii | Aris | Stolt Offshore MS Ltd | Medtronic, Inc. | Beckman Coulter | Banco de Bogotá | The Boeing Company | IT/1 Consulting | Banco de Bogotá | The HArtford | CGI Group

ContactFirstName: dhamotharan | Doug | Craig | Mark | Tim | José Alfredo | Mike | Leif B. | JOSE ALFREDO | Jimmy | Terry

ContactLastName: achaiyan | Johnson | Lennox | Kohls | Parsons | López Arias | Roberts | Soerensen | LOPEZ ARIAS | Chen | Petherick

ContactTitle: solutions architect | Practice Director | Mr | Principle Database Administrator | Business Systems Manager | Administrador DWH | Senior Business Process Architect | Data Warehouse Consultant | Administrador DWH | Business System Analyst | Senior Consultant

PhoneNumber: 91222507176 | 206-676-5636 | +66 1226 712519 | 763.516.2557 | +61 22 996 0963 | 5713320032 | (206)655-7155 | +65 26236691 | 5713320032 | 215-653-2662 | 613-236-2155

Page 12: Wed 1030 mc_knight_william_color


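Columnar Data Page Layout

[Diagram: a columnar data page made up of a Page Header, Records, and a Page Footer. Each record holds a single column value.]

Records: 1120, 1121, 1122, 1123, 1124, 1125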

Page 13: Wed 1030 mc_knight_william_color


Vertical Partitioning of Data

Columnar - Columns are stored independently

Date Store # State Class Sales Category …

3/1/2010 32 NY A 6 Gen

3/1/2010 35 CT A 9 Spec

3/1/2010 36 CT C 11 Gen

3/1/2010 39 SD D 8 Gen

3/1/2010 42 KY A 5 Spec

3/1/2010 43 VT C 14 Spec

3/1/2010 47 GA A 31 Gen

3/1/2010 51 MD A 4 Sub

3/1/2010 55 DC D 16 Gen

3/1/2010 59 NY B 7 Gen

3/1/2010 62 NJ C 9 Spec

Benefits:

• Consistent data types are easy to compress

• Resulting storage size is typically less than 50% of the size of the raw data
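To make the effect of vertical partitioning concrete, here is a minimal Python sketch that computes the average sales for class "A" stores in NY from independently stored columns. The dictionary of lists and the `values_read` counter are illustrative assumptions, not how any particular columnar product lays out its data.

```python
# Each column from the slide is stored as its own list (hypothetical layout).
COLUMNS = {
    "date": ["3/1/2010"] * 11,
    "store": [32, 35, 36, 39, 42, 43, 47, 51, 55, 59, 62],
    "state": ["NY", "CT", "CT", "SD", "KY", "VT", "GA", "MD", "DC", "NY", "NJ"],
    "class": ["A", "A", "C", "D", "A", "C", "A", "A", "D", "B", "C"],
    "sales": [6, 9, 11, 8, 5, 14, 31, 4, 16, 7, 9],
    "category": ["Gen", "Spec", "Gen", "Gen", "Spec", "Spec",
                 "Gen", "Sub", "Gen", "Gen", "Spec"],
}


def average_sales_column_store(columns, state, store_class):
    """Read only the three columns the query mentions."""
    states = columns["state"]
    classes = columns["class"]
    sales = columns["sales"]
    values_read = len(states) + len(classes) + len(sales)
    matching = [
        amount
        for row_state, row_class, amount in zip(states, classes, sales)
        if row_state == state and row_class == store_class
    ]
    average = sum(matching) / len(matching) if matching else None
    return average, values_read


average, values_read = average_sales_column_store(COLUMNS, "NY", "A")
print(average, values_read)  # 6.0 33
```

The date, store, and category columns are never touched, so the query reads 33 values where a scan of complete rows would read all 66.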

Page 14: Wed 1030 mc_knight_william_color


Columnar Storage Options

Decomposed Storage Model

Positional Representation

Modified B-Tree/Run-Length Encoding

Bitmap

Page 15: Wed 1030 mc_knight_william_color


Modified B-Tree/Run-Length Encoding

Original data:

Qtr Store# Sales
Q1 32 6
Q1 35 9
Q1 36 11
Q1 39 8
Q1 42 5
Q1 43 14
Q2 32 31
Q2 35 4
Q2 36 16
Q2 39 7
Q2 42 9

Encoded Qtr column:

Q1 1 500
Q2 501 999
Q3 1000 1498

Encoded Store# column:

32 1 1
35 2 2
36 3 3

(Value, StartPosition, Count)
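A small Python sketch of run-length encoding into (Value, StartPosition, Count) triples may help here. It takes the slide's label literally, so the third element is a run length and positions are 1-based; the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, start_position, count)."""
    runs = []
    position = 1  # 1-based, as on the slide
    for value, group in groupby(column):
        count = sum(1 for _ in group)
        runs.append((value, position, count))
        position += count
    return runs


def rle_decode(runs):
    """Expand (value, start_position, count) triples back into a column."""
    column = []
    for value, _start, count in runs:
        column.extend([value] * count)
    return column


# The Qtr column from the sample data: six Q1 rows followed by five Q2 rows.
qtr = ["Q1"] * 6 + ["Q2"] * 5
encoded = rle_encode(qtr)
print(encoded)  # [('Q1', 1, 6), ('Q2', 7, 5)]
```

Eleven stored values shrink to two triples. The gain depends on the column being sorted or otherwise clustered, so that equal values sit next to each other.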

Page 16: Wed 1030 mc_knight_william_color


Row-based

CustomerID CompanyName ContactFirstName ContactLastName ContactTitle PhoneNumber

1119 m4ii dhamotharan achaiyan solutions architect 91222507176

1120 Aris Doug Johnson Practice Director 206-676-5636

1121 Stolt Offshore MS Ltd Craig Lennox Mr +66 1226 712519

1122 Medtronic, Inc. Mark Kohls Principle Database Administrator 763.516.2557

1123 Beckman Coulter Tim Parsons Business Systems Manager +61 22 996 0963

1124 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032

1126 The Boeing Company Mike Roberts Senior Business Process Architect (206)655-7155

1127 IT/1 Consulting Leif B. Soerensen Data Warehouse Consultant +65 26236691

1128 Banco de Bogotá JOSE ALFREDO LOPEZ ARIAS Administrador DWH 5713320032

1133 The HArtford Jimmy Chen Business System Analyst 215-653-2662

1134 CGI Group Terry Petherick Senior Consultant 613-236-2155

1135 Metavante Corporation Ron Kundinger Assistant Vice President 616-577-9227

1138 CP Associates Wilson Mak Consultant 252-92593731

1142 PRSB Ming Long Wu Assistant Administrator 226-2-23931261 ext 719

1143 aft greg tanner cto 303.233.6122

1144 Zamba Solutions Jeff McCall Executive Vice President 602-626-6125

1146 MR Consultancy Mukesh Rughani Mr +66 (0)1379 662219

1147 Intellor Group Robin Martin Project Coordinator 301-202-6766

1148 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032

Workload Splitting

Same data in both structures

Optimizer or user determines which to use

Columnar

CustomerID CompanyName ContactFirstName ContactLastName ContactTitle PhoneNumber

1119 m4ii dhamotharan achaiyan solutions architect 91222507176

1120 Aris Doug Johnson Practice Director 206-676-5636

1121 Stolt Offshore MS Ltd Craig Lennox Mr +66 1226 712519

1122 Medtronic, Inc. Mark Kohls Principle Database Administrator 763.516.2557

1123 Beckman Coulter Tim Parsons Business Systems Manager +61 22 996 0963

1124 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032

1126 The Boeing Company Mike Roberts Senior Business Process Architect (206)655-7155

1127 IT/1 Consulting Leif B. Soerensen Data Warehouse Consultant +65 26236691

1128 Banco de Bogotá JOSE ALFREDO LOPEZ ARIAS Administrador DWH 5713320032

1133 The HArtford Jimmy Chen Business System Analyst 215-653-2662

1134 CGI Group Terry Petherick Senior Consultant 613-236-2155

1135 Metavante Corporation Ron Kundinger Assistant Vice President 616-577-9227

1138 CP Associates Wilson Mak Consultant 252-92593731

1142 PRSB Ming Long Wu Assistant Administrator 226-2-23931261 ext 719

1143 aft greg tanner cto 303.233.6122

1144 Zamba Solutions Jeff McCall Executive Vice President 602-626-6125

1146 MR Consultancy Mukesh Rughani Mr +66 (0)1379 662219

1147 Intellor Group Robin Martin Project Coordinator 301-202-6766

1148 Banco de Bogotá José Alfredo López Arias Administrador DWH 5713320032


Page 17: Wed 1030 mc_knight_william_color


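The Value of Performance

“How many MALES are NOT INSURED in CALIFORNIA?”

RDBMS (10M rows, 800 bytes/row):

Gender State Insured
M NY Y
M CA Y
F CT N
M MA Y
M CA N
- -

(800 bytes x 10M rows) / 16K page = 500,000 I/Os

Process large amounts of unused data

Often requires full table scan

Columnar (one bitmap of 10M bits per column):

Row Gender Insured State
1 M Y CA
2 M N CA
3 F Y NY
4 M N CA

Bitmaps (1 = the row satisfies the predicate):

Gender = M: 1 1 0 1
Insured = N: 0 1 0 1
State = CA: 1 1 0 1

Gender + Insured + State = 2 (rows with a 1 in all three bitmaps)

(10M bits x 3 columns / 8 bits per byte) / 16K page = 235 I/Os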
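The arithmetic and the bitmap intersection on this slide can be checked with a short Python sketch. It assumes that "16K" means 16,000 bytes, which reproduces the slide's rounded figures, and the function names are hypothetical.

```python
import math

PAGE_BYTES = 16_000  # assumption: "16K" read as 16,000 bytes per page


def row_store_ios(rows, bytes_per_row, page_bytes=PAGE_BYTES):
    """Pages read by a full scan of complete rows."""
    return math.ceil(rows * bytes_per_row / page_bytes)


def bitmap_ios(rows, columns, page_bytes=PAGE_BYTES):
    """Pages read when each column involved is a bitmap with one bit per row."""
    total_bytes = rows * columns / 8  # 8 bits per byte
    return math.ceil(total_bytes / page_bytes)


print(row_store_ios(10_000_000, 800))  # 500000
print(bitmap_ios(10_000_000, 3))       # 235

# The four sample rows as bitmaps (1 = the row satisfies the predicate).
gender_is_male = [1, 1, 0, 1]
not_insured = [0, 1, 0, 1]
state_is_ca = [1, 1, 0, 1]

# ANDing the three bitmaps and counting the 1s answers the question.
matches = sum(g & i & s for g, i, s in zip(gender_is_male, not_insured, state_is_ca))
print(matches)  # 2
```

With these assumptions the bitmap approach reads roughly one page for every 2,100 pages the full row scan reads.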

Page 18: Wed 1030 mc_knight_william_color

Materialization

Page 19: Wed 1030 mc_knight_william_color


Materialization Strategies

Function of ‘projection’

- Row-stores = removes unneeded columns from result set

- Column-stores = when to GLUE the columns back together into rows

Early Materialization

- Construct rows before processing

- Decompress all compressed columns first

Late Materialization

- Wait until the end of the operation to construct rows

Page 20: Wed 1030 mc_knight_william_color


Early Materialization

SELECT custID, price

FROM Sales

WHERE (prodID = 4) AND (storeID = 1)

Stored columns:

prodID: (4,1,4)
storeID: 2 1 3 1
custID: 2 3 3 3
price: 7 13 42 80

Materialize (decompress prodID and construct rows):

custID price storeID prodID
2 7 2 4
3 13 1 4
3 42 3 4
3 80 1 4

Selection (where):

3 13 1 4
3 80 1 4

Projection (select):

3 13
3 80
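The early materialization pipeline above can be sketched in Python as follows. The column lists come from the slide; the function names and the in-memory representation are assumptions for illustration, and the prodID triple is read as (Value, StartPosition, Count).

```python
def rle_decode(runs):
    """Expand (value, start_position, count) triples into a plain column."""
    column = []
    for value, _start, count in runs:
        column.extend([value] * count)
    return column


# Stored columns from the slide; prodID is run-length encoded.
PROD_ID_RLE = [(4, 1, 4)]
STORE_ID = [2, 1, 3, 1]
CUST_ID = [2, 3, 3, 3]
PRICE = [7, 13, 42, 80]


def early_materialization():
    # Step 1: decompress and glue all four columns into rows up front.
    prod_id = rle_decode(PROD_ID_RLE)
    rows = list(zip(CUST_ID, PRICE, STORE_ID, prod_id))
    # Step 2: selection (WHERE prodID = 4 AND storeID = 1) on full rows.
    selected = [row for row in rows if row[3] == 4 and row[2] == 1]
    # Step 3: projection (SELECT custID, price) drops the other columns.
    return [(cust, price) for cust, price, _store, _prod in selected]


result = early_materialization()
print(result)  # [(3, 13), (3, 80)]
```

All four rows are built and prodID is fully decompressed even though only two rows and two columns survive to the result.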

Page 21: Wed 1030 mc_knight_william_color


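Late Materialization

SELECT custID, price

FROM Sales

WHERE (prodID = 4) AND (storeID = 1)

Stored columns used by the predicates:

prodID: (4,1,4)
storeID: 2 1 3 1

Select prodID = 4: 1 1 1 1

Select storeID = 1: 0 1 0 1

AND: 0 1 0 1

Values fetched at the matching positions:

custID: 3 3
price: 13 80

Construct:

3 13
3 80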
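A comparable Python sketch of late materialization is shown below. It evaluates each predicate on its own column, including directly on the run-length encoded prodID, and only constructs rows for the positions that survive. The helper names are hypothetical and the prodID triple is read as (Value, StartPosition, Count).

```python
# Stored columns from the slide; prodID stays run-length encoded.
PROD_ID_RLE = [(4, 1, 4)]
STORE_ID = [2, 1, 3, 1]
CUST_ID = [2, 3, 3, 3]
PRICE = [7, 13, 42, 80]


def rle_predicate_bits(runs, wanted, length):
    """Build a position bitmap from RLE runs without decompressing values."""
    bits = [0] * length
    for value, start, count in runs:
        if value == wanted:
            for position in range(start - 1, start - 1 + count):
                bits[position] = 1
    return bits


def late_materialization():
    # Evaluate each predicate against its own column.
    prod_bits = rle_predicate_bits(PROD_ID_RLE, 4, len(STORE_ID))
    store_bits = [1 if store == 1 else 0 for store in STORE_ID]
    # AND the bitmaps to find the qualifying positions (0-based here).
    positions = [i for i, (p, s) in enumerate(zip(prod_bits, store_bits)) if p & s]
    # Only now fetch custID and price, and only at those positions.
    return [(CUST_ID[i], PRICE[i]) for i in positions]


result = late_materialization()
print(result)  # [(3, 13), (3, 80)]
```

The result matches the early strategy, but custID and price are read only at the two qualifying positions and prodID is never expanded into individual values.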

Page 22: Wed 1030 mc_knight_william_color

Wrap-Up

Page 23: Wed 1030 mc_knight_william_color


Summary: Column Databases

Is an alternative to row storage

Is seeing more adoption – vendors/customers

Stores each column independently

Addresses idle CPUs and disk bottlenecks

Is great for compression

Is best when there is a lot of data, long rows and when you can isolate the loads

Is great for high column selectivity queries

Takes longer to load

Page 24: Wed 1030 mc_knight_william_color


Columnar Databases: Data Does the Twist and Analytics Shout

Presented by:

William McKnight

President

McKnight Consulting Group LLC

(214) 514-1444

[email protected]

www.mcknightcg.com

Twitter @williammcknight