SQL Server Engine Batch Mode and CPU Architectures DBA Level 400

Upload: chris-adkin

Post on 12-Dec-2014


DESCRIPTION

This deck provides an in-depth (level 400) overview of how the SQL Server batch mode execution run time leverages modern CPU architectures.

TRANSCRIPT

Page 1

SQL Server Engine Batch Mode and CPU Architectures

DBA Level 400

Page 2

About me

An independent SQL consultant. A user of SQL Server from version 2000 onwards, with 12+ years of experience. I have a passion for understanding how the database engine works at a deep level.

Page 3

“Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.”

Page 4

A Simple Atypical Data Warehouse Query

11,426 ms

1,095,500,000 rows

1,798 MB in size

Page 5

Same Query, But With A Subtly Different Column Store

1,095,500,000 rows

8,555 MB in size

6,060 ms

Larger column store, faster query?! Could it be related to the way CPUs work?

Page 6

Elapsed Time (ms) / Degree of Parallelism

[Chart: elapsed time (0–80,000 ms) against degree of parallelism (2–24) for the non-sorted and sorted column stores.]

Page 7

Percentage CPU Consumption / Degree of Parallelism

[Chart: percentage CPU utilisation (0–60%) against degree of parallelism (2–24) for the non-sorted and sorted column stores.]

Page 8

Wait Statistics Analysis

These stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data. CXPACKET waits can be discounted 99.9% of the time. Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. The wait statistics shed little light on why only 60% CPU utilisation is achievable.

Page 9

CPU Utilisation / DOP For The ‘Sorted’ Column Store, High Resolution

DOP:               2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
CPU utilisation %: 5  8 10 13 15 18 20 23 25 28 30 33 35 38 40 43 46 48 50 53 54 58 59

Roughly 5% CPU utilisation per core for the 1st thread to use each core (as expected), giving approximately 60% core utilisation overall.

Page 10

Modern CPU Architecture

[Diagram: a CPU die with multiple cores on a bi-directional ring bus. Each core has an L0 uop cache, a 32 KB L1 instruction cache, a 32 KB L1 data cache and a 256 KB unified L2 cache; the cores share an L3 cache. The ‘un-core’ provides the memory controller, QPI links, IO TLB and power and clock circuitry.]

A system-on-chip (SOC) design with CPU cores as the basic building block.

Utility services are provisioned by the ‘un-core’ part of the CPU die.

Four-level cache hierarchy.

Page 11

How Modern CPUs Work

[Diagram: the front end (cache, fetch and decode) feeds the back end (execute). Batch mode focusses on this area.]

Decoded instructions are micro-operations (uops).

Non-decoded instructions are complex x86/64 instructions.

Appendices A and B cover this in greater detail.

Page 12

CPU Pipeline Architecture

[Diagram: a pipeline running through the CPU, with allocation at the front end and retirement at the back end.]

A ‘pipeline’ made of logical slots runs through the processor.

The front end can allocate up to four micro-ops per clock cycle.

The back end can retire up to four micro-ops per clock cycle.

Empty slots, or ‘bubbles’, can indicate front/back end pressure.

The KPIs clock cycles per instruction (CPI), front end bound and back end bound are derived from these facts.

Page 13

The Database Engine Is Layered Also . . .

Language processing: Sqllang.dll

Runtime: Sqlmin.dll (execution and storage engine), Sqltst.dll (expression service), QDS.dll (query data store, new to SQL 2014)

SQL OS: Sqlos.dll, Sqldk.dll

Page 14

CPU Cache Access Latencies in Clock Cycles

Access type                        Latency (clock cycles)
L1 cache, sequential access         4
L1 cache, in-page random access     4
L1 cache, full random access        4
L2 cache, sequential access        11
L2 cache, in-page random access    11
L2 cache, full random access       11
L3 cache, sequential access        14
L3 cache, in-page random access    18
L3 cache, full random access       38
Main memory                       167

Batch mode is about working in the 4 to 38 clock cycle range and NOT the 167 cycle “CPU stall” range.

Page 15

A Basic NUMA Architecture

[Diagram: two NUMA nodes (Node 0 and Node 1), each with four cores, per-core L1 and L2 caches and a shared L3 cache. Each node has fast local memory access; access to the other node's memory is a remote memory access.]

Page 16

Four and Eight Node NUMA QPI Topologies (Nehalem i7 onwards)

[Diagram: four- and eight-socket QPI interconnect topologies, with IO hubs attached to the sockets.]

With 18-core Xeons in the offing, these topologies will become increasingly rare.

Page 17

NUMA Node Remote Memory Access Latency

An additional 20% overhead when accessing ‘foreign’ memory! (from coreinfo)

Page 18

CPU Stalls and Batch Mode: Memory Access Patterns Are Key

[Diagram: a segment scan probing a hash table of key/value pairs.]

Sequential scans make it easy for the pre-fetcher to determine the need to perform read-aheads.

Random memory access will make the pre-fetcher see little value in performing read-aheads.

The ability of the pre-fetcher to do its job determines the likelihood of the data you require being in one of the CPU caches.

Page 19

OLTP Workloads, CPU Stalls and Hyper-Threading (Nehalem i7 onwards)

[Diagram: an index seek on an n-row B-tree misses the core's L1, L2 and L3 caches, a last level cache miss.]

Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory.

The ‘dead’ CPU cycles incurred by the stall give the physical core the opportunity to run the 2nd hyper-thread.

Page 20

The Rationale Behind SQL OS

SQL OS Layer: One Scheduler Per Logical Processor

[Diagram: SQL OS schedulers mapped onto the cores and caches of a socket.]

The OS has no concept of SQL resources such as latches.

SQL OS schedulers act as virtual processors, turning context switches into soft context switches.

SQL OS schedules threads by prioritizing L2 cache hits and reuse over ‘fairness’.

Page 21

Putting Batch Mode and How CPUs Work Together

Page 22

Test Setup

Two 6-core 2.0 GHz CPUs (Sandy Bridge)

48 GB quad-channel 1,333 MHz DDR3 memory

Hyper-threading enabled, unless specified otherwise.

A warm large object cache was used in all tests to remove storage as a factor.

Page 23

Windows Performance Analysis Tool Stack

Weight is sampled CPU time across all physical cores. At the top of the stack (above), SQL Server has consumed 24.22% of the weight. Further down the stack, 14.45% of the weight is in sqlmin.dll!CBpagHashTable::FAggregateBatch.

Page 24

Where Is The Bottleneck In The Plan ?

[Execution plan diagram showing control flow and data flow.]

The stack trace is indicating that the bottleneck is right here.

Page 25

Hypothesis

Column store created on the heap: hash probes cause random memory access; the hash table is likely to be at the high latency end of the cache hierarchy.

Column store created on the clustered index: hash probes cause sequential memory access; the hash table is likely to be at the low latency end of the cache hierarchy.

[Diagram: segment scans probing hash key/value tables.]

Page 26

Introducing Intel VTune Amplifier XE

Investigating what happens at the CPU cache, clock cycle and instruction level requires tools outside of the standard set that ships with Windows and SQL Server.

VTune Amplifier uses hardware event sampling to determine what is happening on the CPU.

Refer to Appendix D for an overview of what “General exploration” provides.

Page 27

Hash (Aggregate) Probe: Random vs Sequential Access Efficiency

181,578,272,367 versus 466,000,699 clock cycles!

Page 28

Last Level Cache Misses / Degree of Parallelism

DOP      Non-sorted       Sorted
 2       13,200,924    3,000,210
 4       30,902,163    1,200,084
 6      161,411,298   16,203,164
 8    1,835,828,499   29,102,037
10    2,069,544,858   34,802,436
12    4,580,720,628   35,102,457
14    2,796,495,741   48,903,413
16    3,080,615,628   64,204,494
18    3,950,376,507   63,004,410
20    4,419,593,391   85,205,964
22    4,952,446,647   68,404,788
24    5,311,271,763   72,605,082

Page 29

Segment Sizes Per Column For The Two Column Stores

Page 30

Takeaways

The aggregate size of a column store does not tell the entire story as to how well a query will perform using it:

Compressing the column used for the hash aggregate probes such that it fits inside the CPU cache, and

turning random memory access on hash probes into sequential memory access,

leads to savings in CPU cycles which outweigh the cost of scanning a column store which is larger than one based on the same non-pre-sorted data.

Batch mode hash joins leverage skew in order to use the CPU cache better (explained in the next two slides).

Page 31

Row Mode Hash Join

[Diagram: probe and build inputs repartitioned through exchanges into a hash table partitioned across NUMA nodes.]

Data skew reduces parallelism by increasing the level of repartition activity: skew works against us. Repartitioning is expensive due to the workspace buffer management overheads involved.

Page 32

Batch Mode Hash Join

[Diagram: threads on each CPU probing a shared set of batches B1 to BN.]

No expensive repartitioning overhead: skew works for us. Hash keys are more likely to be in the L2/L3 cache, as the skew increases the probability of the same keys being hit.

Page 33

Does The Stream Aggregate Perform Any Better ?

Even with the ‘Sorted’ column store, performance is killed by a huge row mode sort prior to the stream aggregate.

The above query takes seven minutes and nine seconds to run.

Page 34

NUMA Architectures and Thread Scheduling

[Diagram: NUMA Node 0 and Node 1.]

There are 24 logical processors and a DOP of 12. How does SQLOS choose to schedule the threads: on one NUMA node, or does it split them between the two?

Page 35

Hyper-Threading And CPU Core Scheduling

With 24 logical processors and a DOP of 6, how does SQLOS schedule hyper-threads in relation to physical cores?

[Diagram: CPU sockets 0 and 1, cores 0 to 5.]

Page 36

Avoiding CPU Core Starvation

Ensure a consistent flow of data from physical disk to server CPUs.

[Diagram: data flow from the storage processor through SAN ports, SAN fabric and HBA to the CPU.]

The “Fast Track” methodology involves the creation of balanced architectures that prevent CPU core starvation. The same thing applies to the world of “in memory”.

Page 37

Making Efficient Use Of The CPU In The “In Memory” World

[Diagram: data flowing through the CPU front end and back end.]

Back end bound: no uops are delivered due to a lack of resources at the back end of the pipeline (port saturation).

Front end bound: the front end delivers fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls).

Page 38

CPU KPI / Degree of Parallelism

[Chart: KPI values (0–0.8) for CPI, front end bound and back end bound against degree of parallelism (2–24).]

Refer to Appendix C for the formulae from which these metrics are derived.

Page 39

Which Parts Of The Database Engine Are Suffering Back End Pressure ?

Results obtained for a degree of parallelism of 24.

Page 40

Port Saturation: The Excessive Demand For An Execution Unit

[Diagram: the execution ports of a Sandy Bridge core, including the 256-bit FMUL/blend unit, exercised by CBagAggregateExpression::TryAggregateUsingQE_Pure.]

Page 41

Port Saturation Analysis

Port   Hit ratio
0      0.46
1      0.45
2      0.16
3      0.17
4      0.10
5      0.55

0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60% we may start to see this on ports 0, 1 and 5.

Page 42

Lowering Clock Cycles Per Instruction By Leveraging SIMD

Scalar instruction, C = A + B:

    1 + 2 = 3

SIMD instruction, Vector C = Vector A + Vector B:

    [1 2 3 4] + [2 3 4 5] = [3 5 7 9]

Page 43

Does The SQL Server Database Engine Leverage SIMD Instructions ?

AVX 1.0 (Sandy Bridge) offers 2x peak floating point performance.

Haswell offers more flexibility, including integer vectorisation (AVX 2).

Skylake introduces wide 512-bit vectorisation (AVX 3.2, now known as AVX-512).

Page 44

Conclusions

Page 45

Memory is incredibly nuanced; where the memory is and how it is accessed are significant factors in software performance.

Batch mode aims to eliminate the huge clock cycle penalties paid when accessing main memory: memory is the new disk.

Batch mode in SQL Server 2014 CU2 is not scalable; some bottleneck which cannot be identified through SQL Server itself leads to a 60% CPU utilisation ceiling.

The work carried out to date on the database engine has helped to minimise CPU stalls; this has pushed the bottleneck onto the CPU's back end. Microsoft now needs to address this, and leveraging SIMD technology could help in this area.

Page 46

Questions?

Page 48

Appendices

Page 49

Appendix A: Instruction Execution And The CPU Front / Back Ends

[Diagram: the front end (branch predict, cache, fetch, decode, decoded instruction buffer) feeding the back end (execute, reorder and retire).]

Page 50

Appendix B - The CPU Front / Back Ends In Detail

[Detailed diagram of the CPU front end and back end.]

Page 51

Appendix C - CPU Pressure Points, Important Calculations

Front end bound (smaller is better) = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * clock ticks)

Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks)

Retiring = UOPS_RETIRED.RETIRE_SLOTS / (4 * clock ticks)

Back end bound (ideally, should = 1 - Retiring) = 1 - (Front end bound + Bad speculation + Retiring)

Page 52

Appendix D - VTune Amplifier General Exploration

An illustration of what the “General exploration” analysis capability of the tool provides.