Leveraging Memory in SQL Server
TRANSCRIPT
About Me
Leveraging Memory With SQL Server (Level 400)
15+ years of database experience
Speaker at the last three SQLBits and at PASS events around Europe
Some of my material on spinlocks is referenced by SQLskills
Page life expectancy
Understanding different forms of memory pressure through: system health events, pending memory grants, memory grant mis-estimates, plan cache bloat
Memory pressure in virtualized environments: balloon drivers etc.
“Old World” Memory Wisdom
What memory is and isn’t: DRAM, NVRAM, Flash etc
How to encourage CPU cache-friendly hash join behavior
How large pages work
Synchronization primitives and memory
Locality, locality, locality !!!
However, We Are Going To Go “Off-Piste” !!!
The Basics
Level 300
Myth Busting: DRAM and NAND Flash Are The Same !
Flash latency 15 to 97 µs = 0.000015 seconds (best case)
DRAM latency 100 ns = 0.000000100 seconds
Also NAND flash is not byte addressable
A Modern Computer: A 200-Foot Overview
[Diagram: multi-core CPU; each core has an L0 µop cache, 32KB L1 instruction and data caches, and a 256KB L2 unified cache; all cores share an L3 cache over a bi-directional ring bus, which connects to the memory bus]
Your CPU Has Its Own Memory Hierarchy
[Diagram: two NUMA nodes, each with four cores and per-core L1 and L2 caches plus a shared L3 cache; memory access within a node is local, access to the other node's memory is remote]
Introducing NUMA
An additional 20% overhead when accessing ‘foreign’ memory ! (from coreinfo)
Remote Memory Access – No Free Lunch Here !
The Foreign Memory Penalty As Experienced By The Engine
From Linchi Shea SQLBLOGS 2012
Does removing the overhead of latching, locking and interpreting the SQL language make the overhead of foreign memory access noticeable ?
Why Do We Have NUMA In The First Place ?
The “old world” of uniform memory access: all CPUs share a single memory bus
Bus saturation !!!
SQL Server High Level Memory Architecture
[Diagram: one memory node per CPU socket; memory clerks allocate memory from their memory node]
sys.dm_os_nodes
sys.dm_os_memory_clerks
sys.dm_os_memory_nodes
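As a sketch of how these DMVs fit together (assuming SQL Server 2012 or later, where `pages_kb` replaced the older allocation columns), the query below totals clerk memory per memory node:

```sql
-- Memory consumed by each clerk type, broken down by memory node.
-- There is one memory node per NUMA node, plus node 64 for the DAC.
SELECT mc.memory_node_id,
       mc.[type],
       SUM(mc.pages_kb) AS pages_kb
FROM   sys.dm_os_memory_clerks AS mc
GROUP  BY mc.memory_node_id, mc.[type]
ORDER  BY pages_kb DESC;
```

A skew of one clerk type towards a single node is often the first visible symptom of a NUMA-unfriendly workload.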
Knobs and Dials For Controlling NUMA and Memory
Min and max memory setting
Lock pages in memory privilege
CPU affinity mask
Trace flag 8048: upgrade memory partitioning to CPU level (SQL 2016 default)
Trace flag 8015: disable NUMA at SQL OS level
Trace flag 845: use lock pages in memory for SQL Server Standard Edition
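Trace flags such as these are normally set as `-T` startup parameters so they apply from service start; for quick experiments they can also be toggled globally, for example:

```sql
-- Enable trace flag 8048 globally (memory partitioning at CPU level).
-- For production use, add -T8048 as a startup parameter instead, so
-- the setting survives a service restart.
DBCC TRACEON (8048, -1);

-- Verify which trace flags are currently active
DBCC TRACESTATUS (-1);
```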
SQL Server Memory Myth Busting
Might help when not hitting a good PLE (page life expectancy) per node for an OLTP-style application
For a data warehouse / OLAP-style application, focus on being able to fit the largest partition in memory
More memory may mean slower-clocked memory
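Per-node PLE can be read from the Buffer Node performance counters; a minimal sketch:

```sql
-- Page life expectancy per NUMA node (Buffer Node object),
-- alongside the instance-wide value (Buffer Manager object).
SELECT [object_name],
       instance_name AS node,
       cntr_value    AS ple_seconds
FROM   sys.dm_os_performance_counters
WHERE  counter_name = 'Page life expectancy'
ORDER  BY [object_name], instance_name;
```

A healthy instance-wide PLE can hide one starved node, which is why the per-node view matters.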
How you access memory matters. Where you access memory matters.
[Chart, reconstructed as a table: access latency in CPU cycles]
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
Main Memory Is Not As Fast As We Might Think !!!
The Database Engine Is Not Always CPU Cache Friendly
Take the loop join for example . . .
Crawling A Tree In Memory
What Happens When Memory Is Scarce ?
[Chart: available hash memory (MB)]
There Are Things We Can Do To Leverage The CPU Cache !!!
[Chart: elapsed time (ms) vs. degree of parallelism (2 to 24) for a non-sorted column store and a sorted column store]
Advanced Topics
Level 400
System on chip architecture
Multi level memory hierarchy
Integrated memory and PCI controllers
Utility services provisioned by the ‘un-core’ part of the die
[Diagram: cores with L0 µop, 32KB L1 instruction/data and 256KB L2 unified caches around a shared L3 cache and bi-directional ring bus; the ‘un-core’ hosts the memory controller, QPI links, PCI 2.0, TLB, and power and clock services]
The Modern Intel CPU . . . Again
Memory Cache Lines
[Diagram: three OperationData allocations laid out across 64-byte cache lines]
Cache Lines And CPU Cache Sets
Level 1: 8 way associative
Level 2: 8 way associative
Level 3: 16 way associative
All Memory Access Is CPU Intensive
Takeaway point: we want to stay “On socket” !!!
How We Access Memory Matters !!!
[Diagram: in the old world, IO arrives via main memory before reaching the CPU caches; in the new world with Data Direct IO, IO is placed directly into the L3 cache]
Everything Should Go Via Main Memory, Right ?
Main memory is not as fast as you might think !!!
[Chart: single socket IO performance, transactions/sec for 2×, 4×, 6× and 8× 10GbE network interfaces, comparing Xeon 5600 and Xeon E5]
What Data-Direct IO Gives Us
How Large Memory Pages Work
[Diagram: address translation path; a hit in the DTLB (1st level) or STLB (2nd level) costs tens of CPU cycles, while a miss that walks the page translation table in memory costs 160+ CPU cycles]
The Look Aside Buffer With Large Pages
[Diagram: with large pages, the TLB holds 32 × 2MB page entries]
128KB of logical-to-physical memory mapping coverage is increased to 64MB !!!
Fewer trips off the CPU to the page table
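One way to check whether large pages are actually in use is a sketch against sys.dm_os_process_memory (available from SQL Server 2008 onwards):

```sql
-- Non-zero large_page_allocations_kb indicates the engine is using
-- large (2MB) pages; locked_page_allocations_kb reflects the
-- "lock pages in memory" privilege being exercised.
SELECT physical_memory_in_use_kb,
       locked_page_allocations_kb,
       large_page_allocations_kb
FROM   sys.dm_os_process_memory;
```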
. . . And You Think I’m An Uber-Geek ?
“Dr Bandwidth” from the Intel Developer Zone
The Difference Large Pages Make
Large pages: 29% increase in page lookups / s
[Diagram: advanced NUMA topologies; four- and eight-socket configurations with CPUs connected to varying numbers of IO hubs]
Advanced NUMA Topologies
Information courtesy of Joe Chang
SQLOS Checking For NUMA Locality Under The Covers
Why Is Buffer Pool Pressure Bad ?
17.41MB column store vs. 51.7MB column store: which is the fastest ?
Which Statement Has The Lowest Elapsed Time ?
Hash agg lookup weight 65,329.87
Column Store scan weight 28,488.73
Using Non Pre-Sorted Data – Call Stack
Control flow
Data flow
The call stack indicates that the bottleneck is right here
How Queries Are Executed
Hash agg lookup weight: now 275.00, before 65,329.87
Column store scan weight: now 45,764.07, before 28,488.73
Hash probes resulting in sequential memory access = CPU savings > cost of scanning an enlarged column store
Using Pre-Sorted Data – Call Stack
[Chart: last level cache (LLC) misses vs. degree of parallelism (2 to 24), non-sorted vs. sorted]
Last Level Cache saturation point
Dip because worker threads 13 and 14 have the LLC of CPU 1 to themselves
What It Boils Down To – CPU Stalls !
SQL Server Memory Myth Busting
Memory only matters for the major memory pools and query plan iterator memory grants
Latches Versus Spinlocks
A task will spin until it can acquire the spinlock it is after
For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task's thread to reach the head of the runnable queue.
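Spinlock pressure can be observed through sys.dm_os_spinlock_stats; a sketch:

```sql
-- Spinlocks ranked by raw spin count; a high spins_per_collision
-- together with many backoffs indicates contention worth investigating.
SELECT TOP (10)
       name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM   sys.dm_os_spinlock_stats
ORDER  BY spins DESC;
```

Because the counters are cumulative since instance start, sample before and after a workload and diff the values.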
How Spinlocks Work
We have to yield the scheduler at some stage !
SQL 2008 R2 Introduced Exponential Back Off
[Diagram: repeated spin_acquire attempts from cores on two sockets cause the cache entry holding the spinlock to be transferred back and forth between the sockets' L3 caches]
Spinlocks and Memory
[Diagram: two 10-core sockets. 18 insert threads on NUMA node 0: 73 s. The same 18 threads on NUMA node 1: 125 s]
Which CPU Socket Does The Insert Run The Fastest On ?
18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms
Vs.
18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms
What Does Windows Performance Toolkit Have To Say ?
[Diagram: cache line transfers; core to core on the same socket costs 34 CPU cycles, core to core on different sockets costs 100 CPU cycles]
The CPU Cycle Cost Of Cache Line Transfers
The In-Memory OLTP Hash Indexes: Think Buckets
Smaller bucket counts = better cache line reuse + reduced TLB thrashing + reduced hash table cache out
Larger bucket counts = reduced cache line reuse + increased TLB thrashing + less hash bucket scanning for lookups
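The trade-off above is controlled through the BUCKET_COUNT clause; a sketch follows (the table and column names are illustrative), together with the DMV that reports how full the buckets actually are:

```sql
-- Hypothetical memory-optimised table; BUCKET_COUNT should be sized
-- close to the expected number of distinct key values.
CREATE TABLE dbo.SensorReading
(
    ReadingId BIGINT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    Payload   VARCHAR(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Inspect bucket usage: many empty buckets waste cache/TLB reach,
-- long chains mean extra bucket scanning on lookups.
SELECT total_bucket_count,
       empty_bucket_count,
       avg_chain_length,
       max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats;
```

The engine rounds the requested bucket count up to the next power of two, which is why the counts in the chart below are all powers of two.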
Is There A Hash Index Bucket Count Sweet Spot ?
[Chart: insert rate for 10 threads vs. hash index bucket count; bucket counts from 524,288 to 67,108,864 produced insert rates ranging from roughly 530,000 to 8,450,000 rows/s]
NUMA Locality and The In-Memory OLTP Engine
SQL Server 2016 RC3
Singleton inserts into a memory optimised table with a hash index
2 sockets, 10 cores per socket
Measuring the effect of moving the CPU affinity mask around
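Moving the affinity mask around can be scripted; a sketch using ALTER SERVER CONFIGURATION, where the core numbers are illustrative for a 2-socket, 10-cores-per-socket box:

```sql
-- Pin SQL Server's schedulers to the 10 cores of NUMA node 0...
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 9;

-- ...or pin them a whole NUMA node at a time:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;

-- Revert to letting Windows schedule on all CPUs:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```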
Questions ?
My Contact Details
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Addendum: Windows Performance Toolkit Basics
Wait Time + Service Time: what is happening here ?
Wait Time Is Well Understood, Service Time However…
Introducing Windows Performance Toolkit
Remember to turn the paging executive off !
CPU analysis: where our CPU cycles are going
Wait analysis: what threads are waiting on
Deferred procedure call (DPC) / interrupt service request (ISR) analysis
What Does Windows Performance Toolkit Allow Us To See ?
xperf -on base -stackwalk profile
Run query
xperf -d stackwalk.etl
Open stackwalk.etl in WPA
Collecting An Event Trace For Windows
A call stack is a stack data structure that stores information about the active subroutines of a computer program.
What Is A Call Stack ?
But for two DLLs, SQLOS would run on bare metal
You should only be interested in CPU and DPC/ISR analysis unless significant waits on PREEMPTIVE_OS_ wait events are prevalent
What Available CPU Stats Are We Interested In ?
Database Engine
Language processing and optimisation: sqllang.dll
Runtime: sqlmin.dll, sqltst.dll, qds.dll, hekaton.dll, <in-memory-table.dll>, <natively-compiled-obj.dll>
SQLOS: sqldk.dll, sqlos.dll
What Do The .DLLs In The Call Stack Represent ?
Where Is The CPU Burned In The Legacy Engine ?
sqlmin.dll: query execution, latching, spin locking, locking, log writing, lazy writing, IO
sqldk.dll sqltst.dll sqlos.dll qds.dll
A debug symbol expresses which programming-language constructs generated a specific piece of machine code in a given executable module
If the debug symbols exist, they will be on the symbol server pointed to by WPA by default
What Is A Debug Symbol ?
Investigating CPU Saturation
1. Load ETL file
2. Load symbols
3. Open graph explorer
4. Drag ‘Computation’ onto the analysis canvas and select graph and table
Computation Columns Of Interest
Weight (in view): the sampled CPU time in ms across all CPU cores
% Weight: sampled CPU time as a percentage of all CPU time available during the entire sampling period