Column Store Index and Batch Mode
Scalability
An independent SQL consultant; a user of SQL Server from version 2000 onwards, with 12+ years of experience.
About me . . .
The scalability challenges we face . . . .
Slides borrowed from Thomas Kejser with his kind permission
CPU Cache, Memory and IO Subsystem Latency
[Diagram: four CPU cores, each with private L1 and L2 caches, sharing an L3 cache; access latency rises from 1ns through 10ns, 100ns, 10us and 100us up to 10ms]
The "Cache Out" Curve
[Chart: throughput vs. touched data size, falling in steps as the working set spills from CPU cache to TLB reach, NUMA remote memory and finally storage]
Every time we drop out of a cache and use the next slower one down, we pay a big throughput penalty
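That drop-off can be observed from user code. A minimal Python sketch (an assumed micro-benchmark for illustration, not the tooling behind these slides; CPython object overheads blur the exact cache boundaries) that measures random-read throughput over a growing working set — on real hardware the reads-per-second figure falls each time the set spills out of a cache level:

```python
import random
import time

def random_read_throughput(size_bytes, reads=50_000):
    """Random reads over a working set of roughly size_bytes; returns reads/second."""
    n = max(1, size_bytes // 8)          # treat each slot as ~8 bytes (rough)
    data = list(range(n))                # the working set we touch
    idx = [random.randrange(n) for _ in range(reads)]
    start = time.perf_counter()
    total = 0
    for i in idx:
        total += data[i]                 # each read may miss a cache level
    elapsed = time.perf_counter() - start
    return reads / elapsed

# Sweep working sets from KBs to MBs and watch throughput fall.
for size in (32 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    rate = random_read_throughput(size)
    print(f"{size // 1024:>6} KB: {rate:,.0f} reads/s")
```

The absolute numbers are dominated by interpreter overhead; only the downward trend as the working set grows is the point.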
Sequential Versus Random Page CPU Cache Throughput
[Chart: service time + wait time vs. size of accessed memory (0-32,000 MB) for single page, sequential page and random page access]
“Transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
Spinning disk state of play:
- Interfaces have evolved
- Areal density has increased
- Rotation speed has peaked at 15K RPM
- Not much else . . .
Up until NAND flash, disk-based IO subsystems have not kept pace with CPU advancements.
With next generation storage (resistive RAM, etc.) CPUs and storage may follow the same curve.
Moore's Law Vs. Advancements In Disk Technology
How Execution Plans Run
How do rows travel between iterators?
[Diagram: control flow passes down the iterator tree; data flows back up, row by row]
What Is Required
Query execution which leverages CPU caches.
Breakthrough levels of compression to bridge the performance gap between IO subsystems and modern processors.
Better query execution scalability as the degree of parallelism increases.
Optimizer Batch Mode
First introduced in SQL Server 2012, greatly enhanced in 2014. A batch is roughly 1,000 rows in size and is designed to fit into the L2/L3 cache of the CPU (remember the slide on latency). Moving batches around is very efficient*:
One test showed that a regular row-mode hash join consumed about 600 instructions per row, while the batch-mode hash join needed about 85 instructions per row, and in the best case (small, dense join domain) was as low as 16 instructions per row.
* From: Enhancements To SQL Server Column Stores, Microsoft Research
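The per-row saving comes from amortising per-call overhead over many rows. A toy Python sketch of the row-at-a-time versus batch-at-a-time pattern (illustrative only, not SQL Server's operator code):

```python
def sum_row_mode(rows):
    """Row mode: one iterator call (loop iteration) per row."""
    total = 0
    for value in rows:
        total += value
    return total

def sum_batch_mode(rows, batch_size=1000):
    """Batch mode: one call hands over ~1,000 rows, sized to fit L2/L3 cache."""
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        total += sum(batch)          # tight inner loop over the whole batch
    return total

values = list(range(10_000))
assert sum_row_mode(values) == sum_batch_mode(values)
```

Both produce identical results; the batch version simply pays the per-call cost once per ~1,000 rows instead of once per row.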
Stack Walking The Database Engine
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
xperfview stackwalk.etl
How do we squeeze an entire column store index into a CPU L2/L3 cache?
Answer: it's pipelined into the CPU.
[Diagram: segments are loaded into the blob (LOB) cache; blobs are broken into batches, which are pipelined into the CPU cache]
Conceptual view . . . and what's happening in the call stack
What Difference Does Batch Mode Make ?
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
[Chart: elapsed time (s), row mode vs. batch mode; batch mode is ~12x faster at DOP 2]
What Are The Pre-Requisites For Batch Mode ?
Feature                                 | SQL Server 2012 | SQL Server 2014
Presence of column store indexes        | Yes             | Yes
Parallel execution plan                 | Yes             | Yes
No outer joins, NOT INs or UNION ALLs   | Yes             | No
Hash joins do not spill from memory     | Yes             | No
Scalar aggregates cannot be used        | Yes             | No
Row mode Hash Match Aggregate: 445,585 ms*
vs.
Batch mode Hash Match Aggregate: 78,400 ms*
* Timings are a statistical estimate
Colour column: Red, Red, Blue, Blue, Green, Green, Green
Dictionary: Lookup ID 1 = Red, 2 = Blue, 3 = Green
Segment: Lookup ID 1, run length 2; Lookup ID 2, run length 2; Lookup ID 3, run length 3
Optimizing Serial Scan Performance
Compressing data going down the column is far superior to compressing data going across the row; we also only retrieve the column data that is of interest.
Run length compression is used in order to achieve this.
SQL Server 2012 introduced column store compression; SQL Server 2014 adds more features to this.
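The Red/Blue/Green example can be sketched in a few lines of Python (a toy model of dictionary encoding plus run-length compression, not the actual on-disk segment format):

```python
def encode_column(values):
    """Dictionary-encode a column, then run-length compress the lookup IDs."""
    dictionary, ids = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary) + 1   # lookup IDs start at 1
        ids.append(dictionary[v])
    runs = []                                     # (lookup_id, run_length) pairs
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1] = (i, runs[-1][1] + 1)       # extend the current run
        else:
            runs.append((i, 1))                   # start a new run
    return dictionary, runs

colour = ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"]
dictionary, segment = encode_column(colour)
# dictionary -> {'Red': 1, 'Blue': 2, 'Green': 3}
# segment    -> [(1, 2), (2, 2), (3, 3)]
```

Seven string values collapse to a three-entry dictionary and three (ID, run length) pairs, which is why sorted, low-cardinality columns compress so well.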
SQL Server 2014 Column Store Storage Internals
[Diagram: rows are divided into row groups; each row group is split into its columns (A, B, C); each column is encoded and compressed into a segment; segments are stored as blobs]
Column Store Index Split Personality
- Inserts of 102,400 rows and over are compressed straight into column store segments.
- Inserts of fewer than 102,400 rows go to a delta store (a B-tree); an update = an insert into the delta store + an insert into the deletion bitmap.
- The tuple mover compresses full delta stores into column store segments.
- Supporting structures: local dictionary, global dictionary, deletion bitmap.
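The 102,400-row threshold reduces to a tiny routing rule. A simplification of the slide in Python (not engine code; the constant is the one quoted above):

```python
BULK_LOAD_THRESHOLD = 102_400   # rows, per the slide

def route_insert(row_count):
    """Where an insert lands: compressed segments or the delta store."""
    if row_count >= BULK_LOAD_THRESHOLD:
        return "column store segments"   # compressed directly on load
    return "delta store"                 # B-tree, drained later by the tuple mover

assert route_insert(102_400) == "column store segments"
assert route_insert(50_000) == "delta store"
```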
SELECT [ProductKey]
      ,[OrderDateKey]
      ,[DueDateKey]
      ,[ShipDateKey]
      ,[CustomerKey]
      ,[PromotionKey]
      ,[CurrencyKey] ..
INTO FactInternetSalesBig
FROM [dbo].[FactInternetSales]
CROSS JOIN master..spt_values AS a
CROSS JOIN master..spt_values AS b
WHERE a.type = 'p'
AND b.type = 'p'
AND a.number <= 80
AND b.number <= 100
What Levels Of Compression Are Achievable? Our 'Big' FactInternetSales Table
494,116,038 rows
[Chart: size (MB) by compression type, with savings of 57%, 74%, 92% and 94%]
What Levels Of Compression Are Achievable? Stack Exchange Posts* Table
[Chart: size (MB) for heap, row compression, page compression, clustered column store index, and clustered column store index with archive compression; savings of 59%, 53%, 64% and 72%]
* Posts tables from the four largest Stack Exchanges combined (Super User, Server Fault, Maths and Ubuntu)
SQL Server 2012 / 2014 Column Store Comparison
Feature                                                                       | SQL Server 2012 | SQL Server 2014
Column store indexes                                                          | Yes             | Yes
Clustered column store indexes                                                | No              | Yes
Updateable column store indexes                                               | No              | Yes
Column store archive compression                                              | No              | Yes
Columns in a column store index can be dropped                                | No              | Yes
Support for GUID, binary, datetimeoffset precision > 2, numeric precision > 18 | No             | Yes
Enhanced compression by storing short strings natively (instead of 32-bit IDs) | No             | Yes
Bookmark support (row_group_id:tuple_id)                                      | No              | Yes
Mixed row / batch mode execution                                              | No              | Yes
Optimized hash build and join in a single iterator                            | No              | Yes
Hash memory spills cause row mode execution                                   | No              | Yes
Iterators supported                                                           | Scan, filter, project, hash (inner) join and (local) hash aggregate | Yes
Column Store Index and Batch ModeTest Drive
Disclaimer: your own mileage may vary depending on your data, hardware and queries
Hardware:
- 2 x 2.0 GHz 6-core Xeon CPUs
- Hyper-threading enabled
- 22 GB memory
- RAID 0: 6 x 250 GB SATA III HDD, 10K RPM
- RAID 0: 3 x 80 GB Fusion IO
Software:
- Windows Server 2012
- SQL Server 2014 CTP 2
- AdventureWorksDW DimProduct table
- Enlarged FactInternetSales table
Test Set Up
SELECT SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig]
Sequential Scan Performance
[Chart: elapsed time (ms) by compression type; scan throughputs of 2,050 MB/s, 678 MB/s and 256 MB/s at 85%, 98% and 98% CPU respectively]
No compression: 545,761 ms* vs. page compression: 1,340,097 ms*
* All stack trace timings are a statistical estimate
[Chart: elapsed time (ms) for HDD column store, HDD column store archive, flash column store and flash column store archive; 52 MB/s at 99% CPU and 27 MB/s at 56% CPU]
Clustered column store index: 60,651 ms vs. clustered column store index with archive compression: 61,196 ms
Takeaways
- CPU used for IO consumption + CPU used for decompression < total CPU capacity: compression works for you (what most people tend to have).
- CPU used for IO consumption + CPU used for decompression > total CPU capacity: compression works against you.
- CPU used for IO consumption + CPU used for decompression = total CPU capacity: nothing to be gained or lost from using compression.
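The three cases reduce to a single comparison. A back-of-envelope helper in Python (the inputs are hypothetical CPU fractions for illustration, not measurements from the slides):

```python
def compression_verdict(cpu_for_io, cpu_for_decompression, total_capacity=1.0):
    """Compare CPU spent on IO + decompression against total CPU capacity."""
    used = cpu_for_io + cpu_for_decompression
    if used < total_capacity:
        return "compression works for you"
    if used > total_capacity:
        return "compression works against you"
    return "nothing to be gained or lost"

# e.g. a scan needing 30% of CPU for IO and 40% for decompression
assert compression_verdict(0.3, 0.4) == "compression works for you"
```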
Testing Join Scalability
SELECT p.EnglishProductName
      ,SUM([OrderQuantity])
      ,SUM([UnitPrice])
      ,SUM([ExtendedAmount])
      ,SUM([UnitPriceDiscountPct])
      ,SUM([DiscountAmount])
      ,SUM([ProductStandardCost])
      ,SUM([TotalProductCost])
      ,SUM([SalesAmount])
      ,SUM([TaxAmt])
      ,SUM([Freight])
FROM [dbo].[FactInternetSalesBig] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
We will look at the best we can do without column store indexes:
- Partitioned heap fact table with page compression for spinning disk
- Partitioned heap fact table without any compression on our flash storage
- Non-partitioned column store indexes on both types of storage, with and without archive compression
Join Scalability
[Chart: time (ms) vs. degree of parallelism (2-24) for the HDD page-compressed partitioned fact table and the flash partitioned fact table]
[Chart: time (ms) vs. degree of parallelism (2-24) for HDD column store, HDD column store archive, flash column store and flash column store archive]
A simple join between a dimension and fact table using batch mode is an order of magnitude faster than the row mode equivalent.
For flash, the cost of decompressing the column store is more than offset by:
- CPU cycle savings made by moving rows around in batches
- CPU cycle savings made through the reduction of cache misses
Takeaways
Diving Deeper into Batch Mode Scalability
[Chart: average CPU utilisation (%) and elapsed time (ms) vs. degree of parallelism (2-24)]
Wait and Spinlock Analysis At 100% CPU Utilisation
Hypothesis: could it be that main memory cannot keep up?
Wait                      Wait_S    Resource_S  Signal_S  Waits   Percentage
------------------------  --------  ----------  --------  ------  ----------
HTBUILD                   0.490000  0.477000    0.013000  138     55.30
SOS_SCHEDULER_YIELD       0.245000  0.050000    0.195000  46131   27.65
QUERY_TASK_ENQUEUE_MUTEX  0.079000  0.053000    0.026000  23      8.92
LATCH_EX                  0.036000  0.034000    0.002000  89      4.06
HTDELETE                  0.024000  0.011000    0.013000  138     2.71
Total spinlock spins = 554397
[Chart: cycles per instruction (CPI) vs. degree of parallelism (2-24)]
Going past one memory channel per physical core
Memory bandwidth is a function of:
- Memory channels
- Number of DIMMs
- DIMM speed
= Total CPU core consumption capacity
Takeaway
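As a worked example of that function (the DDR3-1600 figures below are assumed for illustration and are not from the slides):

```python
def memory_bandwidth_gb_per_s(channels, transfers_mt_per_s, bus_width_bytes=8):
    """Peak bandwidth = channels x transfer rate (MT/s) x bytes per transfer."""
    return channels * transfers_mt_per_s * bus_width_bytes / 1000.0

# 4 channels of DDR3-1600 on a 64-bit bus: 4 * 1600 * 8 / 1000 = 51.2 GB/s
print(memory_bandwidth_gb_per_s(4, 1600))
```

Once the batch-mode scan saturates this figure, adding further cores raises CPI rather than throughput, which matches the CPI chart above 12 cores.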
Further Reading
- Enhancements To Column Store Indexes (SQL Server 2014), Microsoft Research
- SQL Server Clustered Columnstore Tuple Mover, Remus Rusanu
- SQL Server Columnstore Indexes at TechEd 2013, Remus Rusanu
- The Effect of CPU Caches and Memory Access Patterns, Thomas Kejser
Thanks To My Reviewer and Contributor
Thomas Kejser
Former SQL CAT member and CTO of Livedrive
http://uk.linkedin.com/in/wollatondba
Contact Details
ChrisAdkin8
Questions ?