Leveraging Memory In SQL Server


Leveraging Memory With SQL Server – Level 400

About Me

15+ years of database experience
Speaker at the last three SQLBits conferences and at PASS events around Europe
Some of my material on spinlocks is referenced by SQLskills

"Old World" Memory Wisdom

Page life expectancy
Understanding different forms of memory pressure through: system health events, pending memory grants (a DMV sketch for both follows below)
Memory grant mis-estimates
Plan cache bloat
Memory pressure in virtualized environments: balloon drivers etc.
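A minimal sketch of checking the first two of these from the DMVs. The Buffer Node object reports page life expectancy per NUMA node, and a NULL grant_time marks a still-pending memory grant; both DMVs and columns are standard, though exact counter object names carry an instance prefix.

-- Per-NUMA-node page life expectancy (Buffer Node rows; instance_name is the node number)
SELECT [object_name], instance_name AS numa_node, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';

-- Queries whose memory grants are still pending (grant_time is NULL)
SELECT session_id, requested_memory_kb, wait_time_ms
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL;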

However, We Are Going To Go "Off-Piste" !!!

What memory is and isn't: DRAM, NVRAM, flash etc.
How to encourage CPU cache-friendly hash join behaviour
How large pages work
Synchronization primitives and memory
Locality, locality, locality !!!

The Basics

Level 300

Myth Busting: DRAM and NAND Flash Are The Same !

Flash latency 15~97 ms = 0.015 seconds (best case)
DRAM latency 100 ns = 0.000000100 seconds
Also, NAND flash is not byte-addressable

Modern Computer, A 200 Foot Overview

[Diagram: CPU cores, each with an L0 uop cache, a 32KB L1 instruction cache, a 32KB L1 data cache and a 256KB unified L2 cache, sharing an L3 cache over a bi-directional ring bus that connects to the memory bus]

Your CPU Has Its Own Memory Hierarchy

Introducing NUMA

[Diagram: NUMA node 0 and NUMA node 1, each a four-core socket where every core has private L1 and L2 caches and shares an L3 cache; each node reaches its own memory via a local memory access and the other node's memory via a remote memory access]

Remote Memory Access – No Free Lunch Here !

An additional 20% overhead when accessing 'foreign' memory ! (from coreinfo)
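A quick way to see the NUMA layout SQLOS has built is the sys.dm_os_nodes DMV from the deck's own list; a minimal sketch (filtering out the dedicated admin connection node is just housekeeping):

-- NUMA nodes as seen by SQLOS
SELECT node_id, node_state_desc, memory_node_id, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';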

Why Do We Have NUMA In The First Place ?

The "old world" of uniform memory access: a single shared memory bus connecting all CPUs leads to bus saturation !!!

SQL Server High Level Memory Architecture

[Diagram: one memory node per CPU socket, each with its own memory clerks]

sys.dm_os_nodes
sys.dm_os_memory_clerks
sys.dm_os_memory_nodes
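A minimal sketch tying two of these together: memory consumption per clerk per memory node (the pages_kb column assumes SQL Server 2012 or later):

-- Memory consumed by each clerk on each memory node
SELECT memory_node_id, [type] AS clerk_type, SUM(pages_kb) / 1024 AS memory_mb
FROM sys.dm_os_memory_clerks
GROUP BY memory_node_id, [type]
ORDER BY memory_node_id, memory_mb DESC;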

Knobs and Dials For Controlling NUMA and Memory

Min and max server memory settings (a configuration sketch follows below)
Lock pages in memory privilege
CPU affinity mask
Trace flag 8048: upgrade memory partitioning to CPU level (the SQL Server 2016 default)
Trace flag 8015: disable NUMA at the SQLOS level
Trace flag 845: use lock pages in memory with SQL Server Standard Edition
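A hedged sketch of the first knob (the values here are illustrative, not a recommendation):

-- Floor and cap the memory SQL Server will use
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 16384;
EXEC sp_configure 'max server memory (MB)', 57344;
RECONFIGURE;

-- Note: trace flags such as 8048 and 8015 are startup trace flags; add -T8048
-- to the service startup parameters rather than relying on DBCC TRACEON,
-- since memory partitioning is decided when SQLOS initialises.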

SQL Server Memory Myth Busting

More memory might help when an OLTP style application is not hitting a good PLE per node
For a data warehouse / OLAP style application, focus on being able to fit the largest partition in memory
Caveat: more memory may mean slower clocked memory

How you access memory matters. Where you access memory matters.

Main Memory Is Not As Fast As We Might Think !!!

Access latency, in CPU cycles:

L1 cache, sequential access            4
L1 cache, in-page random access        4
L1 cache, full random access           4
L2 cache, sequential access           11
L2 cache, in-page random access       11
L2 cache, full random access          11
L3 cache, sequential access           14
L3 cache, in-page random access       18
L3 cache, full random access          38
Main memory                          167

The Database Engine Is Not Always CPU Cache Friendly

Take the loop join for example . . .

Crawling A Tree In Memory

What Happens When Memory Is Scarce ?

[Chart: available hash memory (MB)]

There Are Things We Can Do To Leverage The CPU Cache !!!

[Chart: elapsed time (ms, 0 to 80,000) against degree of parallelism (2 to 24) for a non-sorted column store versus a sorted column store]
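The deck does not show the build script for the "sorted" column store; a hedged sketch of one common way to produce it (dbo.FactSales and CustomerId are hypothetical names): impose physical order with a clustered rowstore index first, then rebuild it in place as a columnstore, keeping MAXDOP at 1 so parallel builds do not interleave the sorted rows.

-- Seed the physical order on the probe / group key
CREATE CLUSTERED INDEX ccsi ON dbo.FactSales (CustomerId);

-- Rebuild the rowstore index as a clustered columnstore, preserving the order
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON dbo.FactSales
WITH (DROP_EXISTING = ON, MAXDOP = 1);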

Advanced Topics

Level 400

The Modern Intel CPU . . . Again

System on chip architecture
Multi-level memory hierarchy
Integrated memory and PCI controllers
Utility services provisioned by the 'un-core' part of the die

[Diagram: cores, each with an L0 uop cache, 32KB L1 instruction cache, 32KB L1 data cache and 256KB unified L2 cache, joined by a bi-directional ring bus to the shared L3 cache and to the un-core: memory controller, TLB, PCI 2.0, QPI links, power and clock]

Memory Cache Lines

[Diagram: three new OperationData() allocations laid out across 64-byte cache lines]

Cache Lines And CPU Cache Sets

Level 1: 8-way associative
Level 2: 8-way associative
Level 3: 16-way associative

All Memory Access Is CPU Intensive

Takeaway point: we want to stay "on socket" !!!

How We Access Memory Matters !!!

Everything Should Go Via Main Memory, Right ?

[Diagram: in the old world, inbound IO lands in main memory before a core touches it; in the new world with Data Direct IO, it is delivered straight into the L3 cache]

Main memory is not as fast as you might think !!!

What Data-Direct IO Gives Us

[Chart: single socket IO performance, transactions/sec (millions) for 2, 4, 6 and 8 x 10GbE ports, Xeon 5600 versus Xeon E5]

How Large Memory Pages Work

[Diagram: a virtual address lookup hits the DTLB (1st level) and then the STLB (2nd level) at a cost of tens of CPU cycles; missing both means a 160+ CPU cycle walk of the page translation table via the memory controller]

The Look Aside Buffer With Large Pages

[Diagram: with large pages the TLB holds 32 x 2MB page entries; 128KB of logical-to-physical memory mapping coverage is increased to 64MB !!!]

Fewer trips off the CPU to the page table

. . . And You Think I'm An Uber-Geek ?

"Dr Bandwidth" from the Intel Developer Zone

The Difference Large Pages Make

Large pages: a 29% increase in page lookups/s
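To verify that large pages (startup trace flag 834 plus the lock pages in memory privilege are the usual prerequisites) actually took effect, a minimal sketch:

-- Did SQL Server get large page and locked page allocations ?
SELECT large_page_allocations_kb, locked_page_allocations_kb
FROM sys.dm_os_process_memory;

-- SQL Server 2016 SP1+ also exposes the memory model directly
SELECT sql_memory_model_desc FROM sys.dm_os_sys_info;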

Advanced NUMA Topologies

[Diagram: four-socket and eight-socket NUMA topologies, with groups of CPU sockets hanging off shared IO hubs]

Information courtesy of Joe Chang

SQLOS Checking For NUMA Locality Under The Covers

Why Is Buffer Pool Pressure Bad ?

Which Statement Has The Lowest Elapsed Time ?

A 17.41MB column store vs. a 51.7MB column store – which is the fastest ?

Using Non Pre-Sorted Data – Call Stack

Hash agg lookup weight: 65,329.87
Column store scan weight: 28,488.73

How Queries Are Executed

Control flow
Data flow

The call stack indicates that the bottleneck is right here

Using Pre-Sorted Data – Call Stack

Hash agg lookup weight: now 275.00, before 65,329.87
Column store scan weight: now 45,764.07, before 28,488.73

Hash probes resulting in sequential memory access = CPU savings greater than the cost of scanning an enlarged column store

What It Boils Down To – CPU Stalls !

[Chart: last level cache misses (up to 6,000,000,000) against degree of parallelism (2 to 24), non-sorted versus sorted; the non-sorted curve shows a last level cache saturation point, and a dip because worker threads 13 and 14 have the LLC of CPU 1 to themselves]

SQL Server Memory Myth Busting

Myth: memory only matters for the major memory pools and query plan iterator memory grants

Latches Versus Spinlocks

A task will spin until it can acquire the spinlock it is after
For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task's thread to reach the head of the runnable queue
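A minimal sketch for spotting hot spinlocks with the standard DMV (cumulative since instance start):

-- Spinlocks burning the most cycles
SELECT TOP (10) name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;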

How Spinlocks Work

We have to yield the scheduler at some stage !

SQL 2008 R2 Introduced Exponential Back Off

Spinlocks and Memory

[Diagram: threads on different cores and sockets each issue spin_acquire on an int; every acquisition transfers the cache entry holding the spinlock from core to core and socket to socket]

Which CPU Socket Does The Insert Run The Fastest On ?

[Diagram: NUMA node 0 and NUMA node 1, two 10-core CPU sockets, each with a shared L3 cache]

Faster here (NUMA node 0) . . . or faster here (NUMA node 1) ?

18 threads on one socket: 73 s; 18 threads on the other: 125 s

What Does Windows Performance Toolkit Have To Say ?

18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms
Vs.
18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms

The CPU Cycle Cost Of Cache Line Transfers

[Diagram: spin_acquire forces the cache line holding the spinlock to move between cores: 34 CPU cycles core to core on the same socket, 100 CPU cycles core to core on different sockets]

The In-Memory OLTP Hash Indexes: Think Buckets

Smaller bucket counts = better cache line reuse + reduced TLB thrashing + reduced hash table cache-out
Larger bucket counts = reduced cache line reuse + increased TLB thrashing + less hash bucket scanning for lookups
(a table definition sketch follows below)

[Diagram: rows hashed into a lookup table of buckets]
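Bucket counts are fixed when the index is created, so they have to be chosen up front. A hedged sketch (table name, column and bucket count are hypothetical; assumes a memory-optimised filegroup already exists):

-- Memory-optimised table with an explicit hash index bucket count
CREATE TABLE dbo.HotInserts
(
    Id INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    Payload VARCHAR(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);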

Is There A Hash Index Bucket Count Sweet Spot ?

[Chart: insert rate for 10 threads against hash index bucket count. The three largest bucket counts tested (16,777,216 / 33,554,432 / 67,108,864) sustained roughly 530,447 to 545,851 inserts/s, while the four smaller bucket counts (524,288 / 1,048,576 / 2,097,152 / 4,194,304) sustained roughly 7,911,392 to 8,445,945 inserts/s]
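After loading, bucket sizing can be sanity-checked with the hash index stats DMV; a minimal sketch (long average chains suggest too few buckets, a high empty-bucket ratio suggests too many):

SELECT OBJECT_NAME(hs.object_id) AS table_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs;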

NUMA Locality And The In-Memory OLTP Engine

SQL Server 2016 RC3
Singleton inserts into a memory-optimised table with a hash index
2 sockets, 10 cores per socket
Measuring the effect of moving the CPU affinity mask around (an affinity script sketch follows below)
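The affinity mask moves can be scripted; a sketch of the two placements on this hypothetical 2 x 10 core box:

-- Pin SQL Server to the first socket (CPUs 0 to 9)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 9;

-- Revert to letting SQLOS schedule across both sockets
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;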


Questions ?


Addendum: Windows Performance Toolkit Basics

Wait Time Is Well Understood, Service Time However . . .

[Diagram: total elapsed time = wait time + service time; what is happening during the service time ?]

Introducing Windows Performance Toolkit

Remember to turn the paging executive off !

What Does Windows Performance Toolkit Allow Us To See ?

CPU analysis: where our CPU cycles are going
Wait analysis: what threads are waiting on
Deferred procedure call / interrupt service request (DPC/ISR) analysis

Collecting An Event Trace For Windows

xperf -on base -stackwalk profile
Run query
xperf -d stackwalk.etl
Open the captured stackwalk.etl in WPA

What Is A Call Stack ?

A call stack is a stack data structure that stores information about the active subroutines of a computer program.

What Available CPU Stats Are We Interested In ?

But for two DLLs, SQLOS would run on bare metal
You should only be interested in CPU and DPC/ISR analysis unless significant waits on PREEMPTIVE_OS_ wait events are prevalent

What Do The .DLLs In The Call Stack Represent ?

Database engine layers:
Language processing and optimisation: sqllang.dll
Runtime: sqlmin.dll, sqltst.dll, qds.dll, hekaton.dll, <in-memory-table.dll>, <natively-compiled-obj.dll>
SQLOS: sqldk.dll, sqlos.dll

Where Is The CPU Burned In The Legacy Engine ?

sqlmin.dll: query execution, latching, spin locking, locking, log writing, lazy writing, IO
Also in the stack: sqldk.dll, sqltst.dll, sqlos.dll, qds.dll

What Is A Debug Symbol ?

A debug symbol expresses which programming-language constructs generated a specific piece of machine code in a given executable module
If the debug symbols exist, they will be on the symbol server that WPA points to by default

Investigating CPU Saturation

1. Load the ETL file
2. Load symbols
3. Open the graph explorer
4. Drag 'Computation' onto the analysis canvas and select graph and table

Computation Columns Of Interest

Weight (in view): the sampled CPU time in ms across all CPU cores
% Weight: sampled CPU time as a percentage of all CPU time available during the entire sampling period
