Leveraging Memory in SQL Server
TRANSCRIPT
About Me
Leveraging Memory With SQL Server (Level 400)
15+ years of database experience
Speaker at the last three SQLBits and at PASS events around Europe
Some of my material on spinlocks is referenced by SQLskills
Page life expectancy
Understanding different forms of memory pressure through: system health events, pending memory grants, memory grant mis-estimates, plan cache bloat
Memory pressure in virtualized environments: balloon drivers etc.
“Old World” Memory Wisdom
What memory is and isn’t: DRAM, NVRAM, Flash etc
How to encourage CPU cache-friendly hash join behavior
How large pages work
Synchronization primitives and memory
Locality, locality, locality !!!
However, We Are Going To Go “Off-Piste” !!!
The Basics
Level 300
Myth Busting: DRAM and NAND Flash Are The Same !
Flash latency 15 to 97 µs = 0.000015 seconds (best case)
DRAM latency 100 ns = 0.000000100 seconds
Also NAND flash is not byte addressable
A Modern Computer: A 200-Foot Overview
[Diagram: multi-core CPU; each core has an L0 µop cache, 32KB L1 instruction and data caches, and a 256KB L2 unified cache; all cores share an L3 cache over a bi-directional ring bus, which connects to the memory bus]
Your CPU Has Its Own Memory Hierarchy
[Diagram: two NUMA nodes, each with four cores and per-core L1 and L2 caches plus a shared L3 cache; memory access within a node is local, access to the other node's memory is remote]
Introducing NUMA
An additional 20% overhead when accessing ‘foreign’ memory ! (from coreinfo)
Remote Memory Access – No Free Lunch Here !
The Foreign Memory Penalty As Experienced By The Engine
From Linchi Shea SQLBLOGS 2012
Does removing the overhead of latching, locking and interpreting the SQL language make the overhead of foreign memory access noticeable ?
Why Do We Have NUMA In The First Place ?
The “old world” of uniform memory access: all CPUs share a single memory bus
Bus saturation !!!
SQL Server High Level Memory Architecture
[Diagram: one memory node per CPU socket; memory clerks allocate memory from their memory node]
sys.dm_os_nodes
sys.dm_os_memory_clerks
sys.dm_os_memory_nodes
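As a sketch of how these DMVs fit together (assuming SQL Server 2012 or later, where `pages_kb` replaced the older allocation columns), the query below totals clerk memory per memory node:

```sql
-- Memory consumed by each clerk type, broken down by memory node.
-- There is one memory node per NUMA node, plus node 64 for the DAC.
SELECT mc.memory_node_id,
       mc.[type],
       SUM(mc.pages_kb) AS pages_kb
FROM   sys.dm_os_memory_clerks AS mc
GROUP  BY mc.memory_node_id, mc.[type]
ORDER  BY pages_kb DESC;
```

A skew of one clerk type towards a single node is often the first visible symptom of a NUMA-unfriendly workload.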
Knobs and Dials For Controlling NUMA and Memory
Min and max memory setting
Lock pages in memory privilege
CPU affinity mask
Trace flag 8048: upgrade memory partitioning to CPU level (SQL 2016 default)
Trace flag 8015: disable NUMA at SQL OS level
Trace flag 845: use lock pages in memory for SQL Server Standard Edition
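Trace flags such as these are normally set as `-T` startup parameters so they apply from service start; for quick experiments they can also be toggled globally, for example:

```sql
-- Enable trace flag 8048 globally (memory partitioning at CPU level).
-- For production use, add -T8048 as a startup parameter instead, so
-- the setting survives a service restart.
DBCC TRACEON (8048, -1);

-- Verify which trace flags are currently active
DBCC TRACESTATUS (-1);
```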
SQL Server Memory Myth Busting
Might help when not hitting a good PLE (page life expectancy) per node for an OLTP-style application
For a data warehouse / OLAP-style application, focus on being able to fit the largest partition in memory
More memory may mean slower-clocked memory
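Per-node PLE can be read from the Buffer Node performance counters; a minimal sketch:

```sql
-- Page life expectancy per NUMA node (Buffer Node object),
-- alongside the instance-wide value (Buffer Manager object).
SELECT [object_name],
       instance_name AS node,
       cntr_value    AS ple_seconds
FROM   sys.dm_os_performance_counters
WHERE  counter_name = 'Page life expectancy'
ORDER  BY [object_name], instance_name;
```

A healthy instance-wide PLE can hide one starved node, which is why the per-node view matters.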
How you access memory matters. Where you access memory matters.
[Chart, reconstructed as a table: access latency in CPU cycles]
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
Main Memory Is Not As Fast As We Might Think !!!
The Database Engine Is Not Always CPU Cache Friendly
Take the loop join for example . . .
Crawling A Tree In Memory
What Happens When Memory Is Scarce ?
[Chart: available hash memory (MB)]
There Are Things We Can Do To Leverage The CPU Cache !!!
[Chart: elapsed time (ms) vs. degree of parallelism (2 to 24) for a non-sorted column store and a sorted column store]
Advanced Topics
Level 400
System on chip architecture
Multi level memory hierarchy
Integrated memory and PCI controllers
Utility services provisioned by the ‘un-core’ part of the die
[Diagram: cores with L0 µop, 32KB L1 instruction/data and 256KB L2 unified caches around a shared L3 cache and bi-directional ring bus; the ‘un-core’ hosts the memory controller, QPI links, PCI 2.0, TLB, and power and clock services]
The Modern Intel CPU . . . Again
Memory Cache Lines
[Diagram: three OperationData allocations laid out across 64-byte cache lines]
Cache Lines And CPU Cache Sets
Level 1: 8 way associative
Level 2: 8 way associative
Level 3: 16 way associative
All Memory Access Is CPU Intensive
Takeaway point: we want to stay “On socket” !!!
How We Access Memory Matters !!!
[Diagram: in the old world, IO arrives via main memory before reaching the CPU caches; in the new world with Data Direct IO, IO is placed directly into the L3 cache]
Everything Should Go Via Main Memory, Right ?
Main memory is not as fast as you might think !!!
[Chart: single socket IO performance, transactions/sec for 2×, 4×, 6× and 8× 10GbE network interfaces, comparing Xeon 5600 and Xeon E5]
What Data-Direct IO Gives Us
How Large Memory Pages Work
[Diagram: address translation path; a hit in the DTLB (1st level) or STLB (2nd level) costs tens of CPU cycles, while a miss that walks the page translation table in memory costs 160+ CPU cycles]
The Look Aside Buffer With Large Pages
[Diagram: with large pages, the TLB holds 32 × 2MB page entries]
128KB of logical-to-physical memory mapping coverage is increased to 64MB !!!
Fewer trips off the CPU to the page table
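One way to check whether large pages are actually in use is a sketch against sys.dm_os_process_memory (available from SQL Server 2008 onwards):

```sql
-- Non-zero large_page_allocations_kb indicates the engine is using
-- large (2MB) pages; locked_page_allocations_kb reflects the
-- "lock pages in memory" privilege being exercised.
SELECT physical_memory_in_use_kb,
       locked_page_allocations_kb,
       large_page_allocations_kb
FROM   sys.dm_os_process_memory;
```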
. . . And You Think I’m An Uber-Geek ?
“Dr Bandwidth” from the Intel Developer Zone
The Difference Large Pages Make
Large pages: 29% increase in page lookups / s
[Diagram: advanced NUMA topologies; four- and eight-socket configurations with CPUs connected to varying numbers of IO hubs]
Advanced NUMA Topologies
Information courtesy of Joe Chang
SQLOS Checking For NUMA Locality Under The Covers
Why Is Buffer Pool Pressure Bad ?
17.41MB column store vs. 51.7MB column store: which is the fastest ?
Which Statement Has The Lowest Elapsed Time ?
Hash agg lookup weight 65,329.87
Column Store scan weight 28,488.73
Using Non Pre-Sorted Data – Call Stack
Control flow
Data flow
The call stack indicates that the bottleneck is right here
How Queries Are Executed
Hash agg lookup weight: now 275.00, before 65,329.87
Column store scan weight: now 45,764.07, before 28,488.73
Hash probes resulting in sequential memory access = CPU savings > cost of scanning an enlarged column store
Using Pre-Sorted Data – Call Stack
[Chart: last level cache (LLC) misses vs. degree of parallelism (2 to 24), non-sorted vs. sorted]
Last Level Cache saturation point
Dip because worker threads 13 and 14 have the LLC of CPU 1 to themselves
What It Boils Down To – CPU Stalls !
SQL Server Memory Myth Busting
Memory only matters for the major memory pools and query plan iterator memory grants
Latches Versus Spinlocks
A task will spin until it can acquire the spinlock it is after
For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task's thread to reach the head of the runnable queue.
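Spinlock pressure can be observed through sys.dm_os_spinlock_stats; a sketch:

```sql
-- Spinlocks ranked by raw spin count; a high spins_per_collision
-- together with many backoffs indicates contention worth investigating.
SELECT TOP (10)
       name,
       collisions,
       spins,
       spins_per_collision,
       backoffs
FROM   sys.dm_os_spinlock_stats
ORDER  BY spins DESC;
```

Because the counters are cumulative since instance start, sample before and after a workload and diff the values.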
How Spinlocks Work
We have to yield the scheduler at some stage !
SQL 2008 R2 Introduced Exponential Back Off
[Diagram: repeated spin_acquire attempts from cores on two sockets cause the cache entry holding the spinlock to be transferred back and forth between the sockets' L3 caches]
Spinlocks and Memory
[Diagram: two 10-core sockets. 18 insert threads on NUMA node 0: 73 s. The same 18 threads on NUMA node 1: 125 s]
Which CPU Socket Does The Insert Run The Fastest On ?
18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms
Vs.
18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms
What Does Windows Performance Toolkit Have To Say ?
[Diagram: cache line transfers; core to core on the same socket costs 34 CPU cycles, core to core on different sockets costs 100 CPU cycles]
The CPU Cycle Cost Of Cache Line Transfers
The In-Memory OLTP Hash Indexes: Think Buckets
Smaller bucket counts = better cache line reuse + reduced TLB thrashing + reduced hash table cache out
Larger bucket counts = reduced cache line reuse + increased TLB thrashing + less hash bucket scanning for lookups
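The trade-off above is controlled through the BUCKET_COUNT clause; a sketch follows (the table and column names are illustrative), together with the DMV that reports how full the buckets actually are:

```sql
-- Hypothetical memory-optimised table; BUCKET_COUNT should be sized
-- close to the expected number of distinct key values.
CREATE TABLE dbo.SensorReading
(
    ReadingId BIGINT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    Payload   VARCHAR(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Inspect bucket usage: many empty buckets waste cache/TLB reach,
-- long chains mean extra bucket scanning on lookups.
SELECT total_bucket_count,
       empty_bucket_count,
       avg_chain_length,
       max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats;
```

The engine rounds the requested bucket count up to the next power of two, which is why the counts in the chart below are all powers of two.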
Is There A Hash Index Bucket Count Sweet Spot ?
[Chart: insert rate for 10 threads vs. hash index bucket count; bucket counts from 524,288 to 67,108,864 produced insert rates ranging from roughly 530,000 to 8,450,000 rows/s]
NUMA Locality and The In-Memory OLTP Engine
SQL Server 2016 RC3
Singleton inserts into a memory optimised table with a hash index
2 sockets, 10 cores per socket
Measuring the effect of moving the CPU affinity mask around
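Moving the affinity mask around can be scripted; a sketch using ALTER SERVER CONFIGURATION, where the core numbers are illustrative for a 2-socket, 10-cores-per-socket box:

```sql
-- Pin SQL Server's schedulers to the 10 cores of NUMA node 0...
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 9;

-- ...or pin them a whole NUMA node at a time:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;

-- Revert to letting Windows schedule on all CPUs:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```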
Questions ?
My Contact Details
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Addendum: Windows Performance Toolkit Basics
Wait Time + Service Time: what is happening here ?
Wait Time Is Well Understood, Service Time However…
Introducing Windows Performance Toolkit
Remember to turn the paging executive off !
CPU analysis: where our CPU cycles are going
Wait analysis: what threads are waiting on
Deferred procedure call (DPC) / interrupt service request (ISR) analysis
What Does Windows Performance Toolkit Allow Us To See ?
xperf -on base -stackwalk profile
Run query
xperf -d stackwalk.etl
Open stackwalk.etl in WPA
Collecting An Event Trace For Windows
A call stack is a stack data structure that stores information about the active subroutines of a computer program.
What Is A Call Stack ?
But for two DLLs, SQLOS would run on bare metal
You should only be interested in CPU and DPC/ISR analysis unless significant waits on PREEMPTIVE_OS_ wait events are prevalent
What Available CPU Stats Are We Interested In ?
Database Engine
Language processing and optimisation: sqllang.dll
Runtime: sqlmin.dll, sqltst.dll, qds.dll, hekaton.dll, <in-memory-table.dll>, <natively-compiled-obj.dll>
SQLOS: sqldk.dll, sqlos.dll
What Do The .DLLs In The Call Stack Represent ?
Where Is The CPU Burned In The Legacy Engine ?
sqlmin.dll: query execution, latching, spin locking, locking, log writing, lazy writing, IO
sqldk.dll sqltst.dll sqlos.dll qds.dll
A debug symbol expresses which programming-language constructs generated a specific piece of machine code in a given executable module
If the debug symbols exist, they will be on the symbol server pointed to by WPA by default
What Is A Debug Symbol ?
Investigating CPU Saturation
1. Load ETL file
2. Load symbols
3. Open graph explorer
4. Drag ‘Computation’ onto the analysis canvas and select graph and table
Computation Columns Of Interest
Weight (in view): the sampled CPU time in ms across all CPU cores
% Weight: sampled CPU time as a percentage of all CPU time available during the entire sampling period