SQL Server Engine Batch Mode and CPU Architectures DBA Level 400

Upload: chris-adkin

Post on 12-Dec-2014


DESCRIPTION

This deck provides an in-depth (level 400) overview of how the SQL Server batch mode execution run time leverages modern CPU architectures.

TRANSCRIPT

Page 1

SQL Server Engine Batch Mode and CPU Architectures

DBA Level 400

Page 2

About me

An independent SQL consultant. A user of SQL Server from version 2000 onwards, with 12+ years of experience. I have a passion for understanding how the database engine works at a deep level.

Page 3

“Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.”

Page 4

A Simple Atypical Data Warehouse Query

11,426 ms

1,095,500,000 rows

1,798 MB in size

Page 5

Same Query, But With A Subtly Different Column Store

1,095,500,000 rows

8,555 MB in size

6,060 ms

Larger column store, faster query?! Could it be related to the way CPUs work?

Page 6

Elapsed Time (ms) / Degree of Parallelism

[Chart: elapsed time (0–80,000 ms) against degree of parallelism (2–24) for the non-sorted and sorted column stores.]

Page 7

Percentage CPU Consumption / Degree of Parallelism

[Chart: percentage CPU utilisation (0–60%) against degree of parallelism (2–24) for the non-sorted and sorted column stores.]

Page 8

Wait Statistics Analysis

These stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data. CXPACKET waits can be discounted 99.9% of the time. Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. The wait statistics shed little light on why only 60% CPU utilisation is achievable.

Page 9

CPU Utilisation / DOP For The ‘Sorted’ Column Store, High Resolution

DOP:               2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
CPU utilisation %: 5  8 10 13 15 18 20 23 25 28 30 33 35 38 40 43 46 48 50 53 54 58 59

Roughly 5% CPU utilisation per core for the 1st thread to use each core (as expected), giving approximately 60% core utilisation overall.

Page 10

Modern CPU Architecture

[Diagram: a CPU die with multiple cores on a bi-directional ring bus. Each core has an L0 uop cache, a 32 KB L1 instruction cache, a 32 KB L1 data cache and a 256 KB unified L2 cache; the cores share an L3 cache. The ‘un-core’ provides the memory controller, QPI links, IO TLB and power and clock circuitry.]

A system-on-chip (SOC) design with CPU cores as the basic building block.

Utility services are provisioned by the ‘un-core’ part of the CPU die.

Four-level cache hierarchy.

Page 11

How Modern CPUs Work

[Diagram: the front end (cache, fetch and decode) feeds the back end (execute). Batch mode focusses on this area.]

Decoded instructions are micro-operations (uops).

Non-decoded instructions are complex x86/64 instructions.

Appendices A and B cover this in greater detail.

Page 12

CPU Pipeline Architecture

[Diagram: a pipeline running through the CPU, with allocation at the front end and retirement at the back end.]

A ‘pipeline’ made of logical slots runs through the processor.

The front end can allocate up to four micro-ops per clock cycle.

The back end can retire up to four micro-ops per clock cycle.

Empty slots, or ‘bubbles’, can indicate front/back end pressure.

The KPIs clock cycles per instruction (CPI), front end bound and back end bound are derived from these facts.

Page 13

The Database Engine Is Layered Also . . .

Language processing: Sqllang.dll

Runtime: Sqlmin.dll (execution and storage engine), Sqltst.dll (expression service), QDS.dll (query data store, new to SQL 2014)

SQL OS: Sqlos.dll, Sqldk.dll

Page 14

CPU Cache Access Latencies in Clock Cycles

Access type                        Latency (clock cycles)
L1 cache, sequential access         4
L1 cache, in-page random access     4
L1 cache, full random access        4
L2 cache, sequential access        11
L2 cache, in-page random access    11
L2 cache, full random access       11
L3 cache, sequential access        14
L3 cache, in-page random access    18
L3 cache, full random access       38
Main memory                       167

Batch mode is about working in the 4 to 38 clock cycle range and NOT the 167 cycle “CPU stall” range.

Page 15

A Basic NUMA Architecture

[Diagram: two NUMA nodes (Node 0 and Node 1), each with four cores, per-core L1 and L2 caches and a shared L3 cache. Each node has fast local memory access; access to the other node's memory is a remote memory access.]

Page 16

Four and Eight Node NUMA QPI Topologies (Nehalem i7 onwards)

[Diagram: four- and eight-socket QPI interconnect topologies, with IO hubs attached to the sockets.]

With 18-core Xeons in the offing, these topologies will become increasingly rare.

Page 17

NUMA Node Remote Memory Access Latency

An additional 20% overhead when accessing ‘foreign’ memory! (from coreinfo)

Page 18

CPU Stalls and Batch Mode: Memory Access Patterns Are Key

[Diagram: a segment scan probing a hash table of key/value pairs.]

Sequential scans make it easy for the pre-fetcher to determine the need to perform read-aheads.

Random memory access will make the pre-fetcher see little value in performing read-aheads.

The ability of the pre-fetcher to do its job determines the likelihood of the data you require being in one of the CPU caches.

Page 19

OLTP Workloads, CPU Stalls and Hyper-Threading (Nehalem i7 onwards)

[Diagram: an index seek on an n-row B-tree misses the core's L1, L2 and L3 caches, a last level cache miss.]

Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory.

The ‘dead’ CPU cycles incurred by the stall give the physical core the opportunity to run the 2nd hyper-thread.

Page 20

The Rationale Behind SQL OS

SQL OS Layer: One Scheduler Per Logical Processor

[Diagram: SQL OS schedulers mapped onto the cores and caches of a socket.]

The OS has no concept of SQL resources such as latches.

SQL OS schedulers act as virtual processors, turning context switches into soft context switches.

SQL OS schedules threads by prioritizing L2 cache hits and reuse over ‘fairness’.

Page 21

Putting Batch Mode and How CPUs Work Together

Page 22

Test Setup

Two 6-core 2.0 GHz CPUs (Sandy Bridge)

48 GB quad-channel 1,333 MHz DDR3 memory

Hyper-threading enabled, unless specified otherwise.

A warm large object cache was used in all tests to remove storage as a factor.

Page 23

Windows Performance Analysis Tool Stack

Weight is sampled CPU time across all physical cores. At the top of the stack (above), SQL Server has consumed 24.22% of the weight. Further down the stack, 14.45% of the weight is in sqlmin.dll!CBpagHashTable::FAggregateBatch.

Page 24

Where Is The Bottleneck In The Plan ?

[Execution plan diagram showing control flow and data flow.]

The stack trace is indicating that the bottleneck is right here.

Page 25

Hypothesis

Column store created on the heap: hash probes cause random memory access; the hash table is likely to be at the high latency end of the cache hierarchy.

Column store created on the clustered index: hash probes cause sequential memory access; the hash table is likely to be at the low latency end of the cache hierarchy.

[Diagram: segment scans probing hash key/value tables.]

Page 26

Introducing Intel VTune Amplifier XE

Investigating what happens at the CPU cache, clock cycle and instruction level requires tools outside of the standard set that ships with Windows and SQL Server.

VTune Amplifier uses hardware event sampling to determine what is happening on the CPU.

Refer to Appendix D for an overview of what “General exploration” provides.

Page 27

Hash (Aggregate) Probe: Random vs Sequential Access Efficiency

181,578,272,367 versus 466,000,699 clock cycles!

Page 28

Last Level Cache Misses / Degree of Parallelism

DOP      Non-sorted       Sorted
 2       13,200,924    3,000,210
 4       30,902,163    1,200,084
 6      161,411,298   16,203,164
 8    1,835,828,499   29,102,037
10    2,069,544,858   34,802,436
12    4,580,720,628   35,102,457
14    2,796,495,741   48,903,413
16    3,080,615,628   64,204,494
18    3,950,376,507   63,004,410
20    4,419,593,391   85,205,964
22    4,952,446,647   68,404,788
24    5,311,271,763   72,605,082

Page 29

Segment Sizes Per Column For The Two Column Stores

Page 30

Takeaways

The aggregate size of a column store does not tell the entire story as to how well a query will perform using it:

Compressing the column used for the hash aggregate probes such that it fits inside the CPU cache, and

turning random memory access on hash probes into sequential memory access,

leads to savings in CPU cycles which outweigh the cost of scanning a column store which is larger than one based on the same non-pre-sorted data.

Batch mode hash joins leverage skew in order to use the CPU cache better (explained in the next two slides).

Page 31

Row Mode Hash Join

[Diagram: probe and build inputs repartitioned through exchanges into a hash table partitioned across NUMA nodes.]

Data skew reduces parallelism by increasing the level of repartition activity: skew works against us. Repartitioning is expensive due to the workspace buffer management overheads involved.

Page 32

Batch Mode Hash Join

[Diagram: threads on each CPU probing a shared set of batches B1 to BN.]

No expensive repartitioning overhead: skew works for us. Hash keys are more likely to be in the L2/L3 cache, as the skew increases the probability of the same keys being hit.

Page 33

Does The Stream Aggregate Perform Any Better ?

Even with the ‘Sorted’ column store, performance is killed by a huge row mode sort prior to the stream aggregate.

The above query takes seven minutes and nine seconds to run.

Page 34

NUMA Architectures and Thread Scheduling

[Diagram: NUMA Node 0 and Node 1.]

There are 24 logical processors and a DOP of 12. How does SQLOS choose to schedule the threads: on one NUMA node, or does it split them between the two?

Page 35

Hyper-Threading And CPU Core Scheduling

With 24 logical processors and a DOP of 6, how does SQLOS schedule hyper-threads in relation to physical cores?

[Diagram: CPU sockets 0 and 1, cores 0 to 5.]

Page 36

Avoiding CPU Core Starvation

Ensure a consistent flow of data from physical disk to server CPUs.

[Diagram: data flow from the storage processor through SAN ports, SAN fabric and HBA to the CPU.]

The “Fast Track” methodology involves the creation of balanced architectures that prevent CPU core starvation. The same thing applies to the world of “in memory”.

Page 37

Making Efficient Use Of The CPU In The “In Memory” World

[Diagram: data flowing through the CPU front end and back end.]

Back end bound: no uops are delivered due to a lack of resources at the back end of the pipeline (port saturation).

Front end bound: the front end delivers fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls).

Page 38

CPU KPI / Degree of Parallelism

[Chart: KPI values (0–0.8) for CPI, front end bound and back end bound against degree of parallelism (2–24).]

Refer to Appendix C for the formulae from which these metrics are derived.

Page 39

Which Parts Of The Database Engine Are Suffering Back End Pressure ?

Results obtained for a degree of parallelism of 24.

Page 40

Port Saturation: The Excessive Demand For An Execution Unit

[Diagram: the execution ports of a Sandy Bridge core, including the 256-bit FMUL/blend unit, exercised by CBagAggregateExpression::TryAggregateUsingQE_Pure.]

Page 41

Port Saturation Analysis

Port   Hit ratio
0      0.46
1      0.45
2      0.16
3      0.17
4      0.10
5      0.55

0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60% we may start to see this on ports 0, 1 and 5.

Page 42

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Scalar instruction, C = A + B:

    1 + 2 = 3

SIMD instruction, Vector C = Vector A + Vector B:

    [1 2 3 4] + [2 3 4 5] = [3 5 7 9]

Page 43

Does The SQL Server Database Engine Leverage SIMD Instructions ?

AVX 1.0 (Sandy Bridge) offers 2x peak floating point performance.

Haswell offers more flexibility, including integer vectorisation (AVX 2).

Skylake introduces wide 512-bit vectorisation (AVX 3.2, now known as AVX-512).

Page 44

Conclusions

Page 45

Memory is incredibly nuanced; where the memory is and how it is accessed are significant factors in software performance.

Batch mode aims to eliminate the huge clock cycle penalties paid when accessing main memory: memory is the new disk.

Batch mode in SQL Server 2014 CU2 is not scalable; some bottleneck which cannot be identified through SQL Server itself leads to a 60% CPU utilisation ceiling.

The work carried out to date on the database engine has helped to minimise CPU stalls; this has pushed the bottleneck onto the CPU's back end. Microsoft now needs to address this, and leveraging SIMD technology could help in this area.

Page 46

Questions?

Page 48

Appendices

Page 49

Appendix A: Instruction Execution And The CPU Front / Back Ends

[Diagram: the front end (branch predict, cache, fetch, decode, decoded instruction buffer) feeding the back end (execute, reorder and retire).]

Page 50

Appendix B - The CPU Front / Back Ends In Detail

[Detailed diagram of the CPU front end and back end.]

Page 51

Appendix C - CPU Pressure Points, Important Calculations

Front end bound (smaller is better) = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * clock ticks)

Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks)

Retiring = UOPS_RETIRED.RETIRE_SLOTS / (4 * clock ticks)

Back end bound (ideally, should = 1 - Retiring) = 1 - (Front end bound + Bad speculation + Retiring)

Page 52

Appendix D - VTune Amplifier General Exploration

An illustration of what the “General exploration” analysis capability of the tool provides.