Column Store Index and Batch Mode
Scalability
An independent SQL consultant; a user of SQL Server from version 2000 onwards, with 12+ years of experience.
About me . . .
The scalability challenges we face . . . .
Slides borrowed from Thomas Kejser with his kind permission
CPU Cache, Memory and IO Subsystem Latency
[Diagram: four CPU cores, each with private L1 and L2 caches, sharing an L3 cache; access latency rises from 1ns through 10ns, 100ns, 10us and 100us up to 10ms]
The "Cache Out" Curve
[Chart: throughput vs. touched data size, falling in steps as the working set spills from CPU cache to TLB reach, NUMA remote memory and finally storage]
Every time we drop out of a cache and use the next slower one down, we pay a big throughput penalty
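That drop-off can be observed from user code. A minimal Python sketch (an assumed micro-benchmark for illustration, not the tooling behind these slides; CPython object overheads blur the exact cache boundaries) that measures random-read throughput over a growing working set — on real hardware the reads-per-second figure falls each time the set spills out of a cache level:

```python
import random
import time

def random_read_throughput(size_bytes, reads=50_000):
    """Random reads over a working set of roughly size_bytes; returns reads/second."""
    n = max(1, size_bytes // 8)          # treat each slot as ~8 bytes (rough)
    data = list(range(n))                # the working set we touch
    idx = [random.randrange(n) for _ in range(reads)]
    start = time.perf_counter()
    total = 0
    for i in idx:
        total += data[i]                 # each read may miss a cache level
    elapsed = time.perf_counter() - start
    return reads / elapsed

# Sweep working sets from KBs to MBs and watch throughput fall.
for size in (32 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    rate = random_read_throughput(size)
    print(f"{size // 1024:>6} KB: {rate:,.0f} reads/s")
```

The absolute numbers are dominated by interpreter overhead; only the downward trend as the working set grows is the point.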
Sequential Versus Random Page CPU Cache Throughput
[Chart: service time + wait time vs. size of accessed memory (0-32,000 MB) for single page, sequential page and random page access]
“Transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
Spinning disk state of play:
- Interfaces have evolved
- Areal density has increased
- Rotation speed has peaked at 15K RPM
- Not much else . . .
Up until NAND flash, disk-based IO subsystems have not kept pace with CPU advancements.
With next generation storage (resistive RAM, etc.) CPUs and storage may follow the same curve.
Moore's Law Vs. Advancements In Disk Technology
How Execution Plans Run
How do rows travel between iterators?
[Diagram: control flow passes down the iterator tree; data flows back up, row by row]
What Is Required
Query execution which leverages CPU caches.
Breakthrough levels of compression to bridge the performance gap between IO subsystems and modern processors.
Better query execution scalability as the degree of parallelism increases.
Optimizer Batch Mode
First introduced in SQL Server 2012, greatly enhanced in 2014. A batch is roughly 1,000 rows in size and is designed to fit into the L2/L3 cache of the CPU (remember the slide on latency). Moving batches around is very efficient*:
One test showed that a regular row-mode hash join consumed about 600 instructions per row, while the batch-mode hash join needed about 85 instructions per row, and in the best case (small, dense join domain) was as low as 16 instructions per row.
* From: Enhancements To SQL Server Column Stores, Microsoft Research
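The per-row saving comes from amortising per-call overhead over many rows. A toy Python sketch of the row-at-a-time versus batch-at-a-time pattern (illustrative only, not SQL Server's operator code):

```python
def sum_row_mode(rows):
    """Row mode: one iterator call (loop iteration) per row."""
    total = 0
    for value in rows:
        total += value
    return total

def sum_batch_mode(rows, batch_size=1000):
    """Batch mode: one call hands over ~1,000 rows, sized to fit L2/L3 cache."""
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        total += sum(batch)          # tight inner loop over the whole batch
    return total

values = list(range(10_000))
assert sum_row_mode(values) == sum_batch_mode(values)
```

Both produce identical results; the batch version simply pays the per-call cost once per ~1,000 rows instead of once per row.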
Stack Walking The Database Engine
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
xperfview stackwalk.etl
How do we squeeze an entire column store index into a CPU L2/L3 cache?
Answer: it's pipelined into the CPU.
[Diagram: segments are loaded into the blob (LOB) cache; blobs are broken into batches, which are pipelined into the CPU cache]
Conceptual view . . . and what's happening in the call stack
What Difference Does Batch Mode Make ?
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
[Chart: elapsed time (s), row mode vs. batch mode; batch mode is ~12x faster at DOP 2]
What Are The Pre-Requisites For Batch Mode ?
Feature                                 | SQL Server 2012 | SQL Server 2014
Presence of column store indexes        | Yes             | Yes
Parallel execution plan                 | Yes             | Yes
No outer joins, NOT INs or UNION ALLs   | Yes             | No
Hash joins do not spill from memory     | Yes             | No
Scalar aggregates cannot be used        | Yes             | No
Row mode Hash Match Aggregate: 445,585 ms*
vs.
Batch mode Hash Match Aggregate: 78,400 ms*
* Timings are a statistical estimate
Colour column: Red, Red, Blue, Blue, Green, Green, Green
Dictionary: Lookup ID 1 = Red, 2 = Blue, 3 = Green
Segment: Lookup ID 1, run length 2; Lookup ID 2, run length 2; Lookup ID 3, run length 3
Optimizing Serial Scan Performance
Compressing data going down the column is far superior to compressing data going across the row; we also only retrieve the column data that is of interest.
Run length compression is used in order to achieve this.
SQL Server 2012 introduced column store compression; SQL Server 2014 adds more features to this.
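The Red/Blue/Green example can be sketched in a few lines of Python (a toy model of dictionary encoding plus run-length compression, not the actual on-disk segment format):

```python
def encode_column(values):
    """Dictionary-encode a column, then run-length compress the lookup IDs."""
    dictionary, ids = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary) + 1   # lookup IDs start at 1
        ids.append(dictionary[v])
    runs = []                                     # (lookup_id, run_length) pairs
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1] = (i, runs[-1][1] + 1)       # extend the current run
        else:
            runs.append((i, 1))                   # start a new run
    return dictionary, runs

colour = ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"]
dictionary, segment = encode_column(colour)
# dictionary -> {'Red': 1, 'Blue': 2, 'Green': 3}
# segment    -> [(1, 2), (2, 2), (3, 3)]
```

Seven string values collapse to a three-entry dictionary and three (ID, run length) pairs, which is why sorted, low-cardinality columns compress so well.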
SQL Server 2014 Column Store Storage Internals
[Diagram: rows are divided into row groups; each row group is split into its columns (A, B, C); each column is encoded and compressed into a segment; segments are stored as blobs]
Column Store Index Split Personality
- Inserts of 102,400 rows and over are compressed straight into column store segments.
- Inserts of fewer than 102,400 rows go to a delta store (a B-tree); an update = an insert into the delta store + an insert into the deletion bitmap.
- The tuple mover compresses full delta stores into column store segments.
- Supporting structures: local dictionary, global dictionary, deletion bitmap.
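The 102,400-row threshold reduces to a tiny routing rule. A simplification of the slide in Python (not engine code; the constant is the one quoted above):

```python
BULK_LOAD_THRESHOLD = 102_400   # rows, per the slide

def route_insert(row_count):
    """Where an insert lands: compressed segments or the delta store."""
    if row_count >= BULK_LOAD_THRESHOLD:
        return "column store segments"   # compressed directly on load
    return "delta store"                 # B-tree, drained later by the tuple mover

assert route_insert(102_400) == "column store segments"
assert route_insert(50_000) == "delta store"
```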
SELECT [ProductKey]
      ,[OrderDateKey]
      ,[DueDateKey]
      ,[ShipDateKey]
      ,[CustomerKey]
      ,[PromotionKey]
      ,[CurrencyKey] ..
INTO FactInternetSalesBig
FROM [dbo].[FactInternetSales]
CROSS JOIN master..spt_values AS a
CROSS JOIN master..spt_values AS b
WHERE a.type = 'p'
AND b.type = 'p'
AND a.number <= 80
AND b.number <= 100
What Levels Of Compression Are Achievable? Our 'Big' FactInternetSales Table
494,116,038 rows
[Chart: size (MB) by compression type, with savings of 57%, 74%, 92% and 94%]
What Levels Of Compression Are Achievable? Stack Exchange Posts* Table
[Chart: size (MB) for heap, row compression, page compression, clustered column store index, and clustered column store index with archive compression; savings of 59%, 53%, 64% and 72%]
* Posts tables from the four largest Stack Exchanges combined (Super User, Server Fault, Maths and Ubuntu)
SQL Server 2012 / 2014 Column Store Comparison
Feature                                                                       | SQL Server 2012 | SQL Server 2014
Column store indexes                                                          | Yes             | Yes
Clustered column store indexes                                                | No              | Yes
Updateable column store indexes                                               | No              | Yes
Column store archive compression                                              | No              | Yes
Columns in a column store index can be dropped                                | No              | Yes
Support for GUID, binary, datetimeoffset precision > 2, numeric precision > 18 | No             | Yes
Enhanced compression by storing short strings natively (instead of 32-bit IDs) | No             | Yes
Bookmark support (row_group_id:tuple_id)                                      | No              | Yes
Mixed row / batch mode execution                                              | No              | Yes
Optimized hash build and join in a single iterator                            | No              | Yes
Hash memory spills cause row mode execution                                   | No              | Yes
Iterators supported                                                           | Scan, filter, project, hash (inner) join and (local) hash aggregate | Yes
Column Store Index and Batch ModeTest Drive
Disclaimer: your own mileage may vary depending on your data, hardware and queries
Hardware:
- 2 x 2.0 GHz 6-core Xeon CPUs
- Hyper-threading enabled
- 22 GB memory
- RAID 0: 6 x 250 GB SATA III HDD, 10K RPM
- RAID 0: 3 x 80 GB Fusion IO
Software:
- Windows Server 2012
- SQL Server 2014 CTP 2
- AdventureWorksDW DimProduct table
- Enlarged FactInternetSales table
Test Set Up
SELECT SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig]
Sequential Scan Performance
[Chart: elapsed time (ms) by compression type; scan throughputs of 2,050 MB/s, 678 MB/s and 256 MB/s at 85%, 98% and 98% CPU respectively]
No compression: 545,761 ms* vs. page compression: 1,340,097 ms*
* All stack trace timings are a statistical estimate
[Chart: elapsed time (ms) for HDD column store, HDD column store archive, flash column store and flash column store archive; 52 MB/s at 99% CPU and 27 MB/s at 56% CPU]
Clustered column store index: 60,651 ms vs. clustered column store index with archive compression: 61,196 ms
Takeaways
- CPU used for IO consumption + CPU used for decompression < total CPU capacity: compression works for you (what most people tend to have).
- CPU used for IO consumption + CPU used for decompression > total CPU capacity: compression works against you.
- CPU used for IO consumption + CPU used for decompression = total CPU capacity: nothing to be gained or lost from using compression.
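The three cases reduce to a single comparison. A back-of-envelope helper in Python (the inputs are hypothetical CPU fractions for illustration, not measurements from the slides):

```python
def compression_verdict(cpu_for_io, cpu_for_decompression, total_capacity=1.0):
    """Compare CPU spent on IO + decompression against total CPU capacity."""
    used = cpu_for_io + cpu_for_decompression
    if used < total_capacity:
        return "compression works for you"
    if used > total_capacity:
        return "compression works against you"
    return "nothing to be gained or lost"

# e.g. a scan needing 30% of CPU for IO and 40% for decompression
assert compression_verdict(0.3, 0.4) == "compression works for you"
```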
Testing Join Scalability
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
We will look at the best we can do without column store indexes:
- Partitioned heap fact table with page compression for spinning disk
- Partitioned heap fact table without any compression on our flash storage
- Non-partitioned column store indexes on both types of storage, with and without archive compression
Join Scalability
[Chart: time (ms) vs. degree of parallelism (2-24) for the HDD page-compressed partitioned fact table and the flash partitioned fact table]
[Chart: time (ms) vs. degree of parallelism (2-24) for HDD column store, HDD column store archive, flash column store and flash column store archive]
A simple join between a dimension and fact table using batch mode is an order of magnitude faster than the row mode equivalent.
For flash, the cost of decompressing the column store is more than offset by:
- CPU cycle savings made by moving rows around in batches
- CPU cycle savings made through the reduction of cache misses
Takeaways
Diving Deeper into Batch Mode Scalability
[Chart: average CPU utilisation (%) and elapsed time (ms) vs. degree of parallelism (2-24)]
Wait and Spinlock Analysis At 100% CPU Utilisation
Hypothesis: could it be that main memory cannot keep up?
Wait                      Wait_S    Resource_S  Signal_S  Waits   Percentage
------------------------  --------  ----------  --------  ------  ----------
HTBUILD                   0.490000  0.477000    0.013000  138     55.30
SOS_SCHEDULER_YIELD       0.245000  0.050000    0.195000  46131   27.65
QUERY_TASK_ENQUEUE_MUTEX  0.079000  0.053000    0.026000  23      8.92
LATCH_EX                  0.036000  0.034000    0.002000  89      4.06
HTDELETE                  0.024000  0.011000    0.013000  138     2.71
Total spinlock spins = 554397
[Chart: cycles per instruction (CPI) vs. degree of parallelism (2-24)]
Going past one memory channel per physical core
Memory bandwidth is a function of:
- Memory channels
- Number of DIMMs
- DIMM speed
= Total CPU core consumption capacity
Takeaway
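As a worked example of that function (the DDR3-1600 figures below are assumed for illustration and are not from the slides):

```python
def memory_bandwidth_gb_per_s(channels, transfers_mt_per_s, bus_width_bytes=8):
    """Peak bandwidth = channels x transfer rate (MT/s) x bytes per transfer."""
    return channels * transfers_mt_per_s * bus_width_bytes / 1000.0

# 4 channels of DDR3-1600 on a 64-bit bus: 4 * 1600 * 8 / 1000 = 51.2 GB/s
print(memory_bandwidth_gb_per_s(4, 1600))
```

Once the batch-mode scan saturates this figure, adding further cores raises CPI rather than throughput, which matches the CPI chart above 12 cores.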
Further Reading
- Enhancements To Column Store Indexes (SQL Server 2014), Microsoft Research
- SQL Server Clustered Columnstore Tuple Mover, Remus Rusanu
- SQL Server Columnstore Indexes at TechEd 2013, Remus Rusanu
- The Effect of CPU Caches and Memory Access Patterns, Thomas Kejser
Thanks To My Reviewer and Contributor
Thomas Kejser
Former SQL CAT member and CTO of Livedrive
http://uk.linkedin.com/in/wollatondba
Contact Details
ChrisAdkin8
Questions ?