System Architecture: Big Iron (NUMA)
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session: “Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session: “The Data Loading Performance Guide”
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R. Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the Field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing
[Diagram: four CPUs on a shared system bus to the MCH (memory controller hub), with ICH and PXH I/O hubs]
In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image, and each processor can access all memory.
SMP makes no reference to memory architecture?
Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which in turn is not to be confused with AMD HyperTransport (also HT).
Non-Uniform Memory Access
[Diagram: four nodes, each with a memory controller and four CPUs on a shared bus or crossbar, linked to each other by node controllers]
NUMA architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory + node controllers
2) Nodes connected by shared bus, crossbar, or ring
Traditionally, 8-way+ systems
Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware
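Windows and SQL Server expose the node layout at runtime. A minimal sketch of checking what the engine sees, assuming SQL Server 2008 or later (not from the original deck):

-- CPU/scheduler layout per NUMA node as seen by SQL Server
SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes;

-- Memory reserved per NUMA node
SELECT memory_node_id, virtual_address_space_reserved_kb
FROM sys.dm_os_memory_nodes;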
AMD Opteron
[Diagram: four Opteron processors linked by HyperTransport, with HT2100/HT1100 I/O hubs]
Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns? Actual behavior is more complicated because of snooping (cache coherency traffic).
Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior! For practical purposes it behaves like an SMP system.
8-way Opteron System Architecture
The Opteron processor (prior to Magny-Cours) has 3 HyperTransport links. Note that in the 8-way layout the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT for IO; CPUs 1 & 7 require 3 hops to reach each other.
[Diagram: 8-way Opteron topology, CPU0–CPU7]
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture
Intel Nehalem generation processors have QuickPath Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4, so a glue-less 8-way is possible.
NUMA Local and Remote Memory
Local memory is closer than remote
Physical access time is shorter
What is the actual access time, with the cache coherency requirement?
HT Assist – Probe Filter
Part of the L3 cache is used as a directory cache
ZDNET
Source Snoop Coherency
From HP PREMA Architecture whitepaper:
All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980 G7
From HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizing inter-processor coherency communication and reducing latency to local memory. (*Only cache tags, not cache data.)
HP ProLiant DL980 Architecture
Node controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000
Agent – Remote Ownership Tag + L4 cache tags
64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less)
Connect two 4-socket Nodes to make 8-way system
OS Memory Models
[Diagram: two 4-node memory layouts showing how sequential addresses map to nodes – interleaved across all nodes versus contiguous within each node]
SUMA (Sufficiently Uniform Memory Access): memory is interleaved across nodes
NUMA: memory is interleaved within a node first; the memory stripe is then spanned across nodes
Windows OS NUMA Support
Memory models:
SUMA – Sufficiently Uniform Memory Access
NUMA – separate memory pools by node
[Diagram: logical processor/memory numbering for 4 nodes under the two models – under SUMA, memory is striped across NUMA nodes; under NUMA, each node's memory is contiguous]
Memory Model Example: 4 Nodes
SUMA memory model: memory access is uniformly distributed – 25% of memory accesses local, 75% remote
NUMA memory model: goal is better than 25% local node access
True local access time also needs to be faster; cache coherency may increase local access time
Architecting for NUMA
Web determines port for each user by group (but should not be by geography!)
Affinitize port to NUMA node
Each node access localized data (partition?)
OS may allocate substantial chunk from Node 0?
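SQL Server supports mapping TCP ports to NUMA nodes by appending a node affinity mask to the port number in the TCP/IP network configuration. A hedged sketch of the idea, using the illustrative ports from the diagram below (the masks are hex bitmaps of nodes):

-- SQL Server Configuration Manager > Network Configuration > TCP/IP:
--   TCP Port: 1440[0x1],1441[0x2],1442[0x4],...,1447[0x80]
-- Connections to each port are then served by schedulers on that node;
-- the app tier connects each user group to "its" port, e.g.
-- Server=appdb,1440 for the group assigned to Node 0.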
End-to-end affinity:
[Diagram: user groups North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW map to TCP ports 1440–1447, to NUMA nodes 0–7 (CPU sets 0-0/0-1 … 7-0/7-1), and to table/memory partitions NE, MidA, SE, Cen, Tex, Mnt, Cal, PNW]
App Server → TCP Port → CPU → Memory → Table
HP-UX LORA
HP-UX – Not Microsoft Windows
Locality-Optimized Resource Alignment
12.5% Interleaved Memory
87.5% NUMA node Local Memory
System Tech Specs
Memory pricing: 8GB DIMM $400 ea (18 x 8GB = 144GB, $7,200; 64 x 8GB = 512GB, $26K); 16GB DIMM $1,100 ea (12 x 16GB = 192GB, $13K; 64 x 16GB = 1TB, $70K)

Processors        Cores/socket  Total cores  DIMMs  PCI-E G2      Max memory  Base
2 x Xeon X56x0         6            12         18   5 x8+, 1 x4   192G*       $7K
4 x Opteron 6100      12            48         32   5 x8, 1 x4    512G        $14K
4 x Xeon X7560         8            32         64   4 x8, 6 x4†   1TB         $30K
8 x Xeon X7560         8            64        128   9 x8, 5 x4‡   2TB         $100K

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB. † Dell R910 and HP DL580G7 have different PCI-E. ‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots.
Software Stack
Operating System
Windows Server 2003 RTM, SP1 – network limitations (default); Scalable Networking Pack (KB 912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only)
Breaks 64 logical processor limit
NUMA IO enhancements?
Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server; don't try to do DW on SQL Server 2000
Impacts OLTP
Search: MSI-X
SQL Server Version
SQL Server 2000 – serious disk IO limitations (1GB/sec?)
Problematic parallel execution plans
SQL Server 2005 (fixed most S2K problems)
64-bit on X64 (Opteron and Xeon)
SP2 – performance improvement 10%(?)
SQL Server 2008 & R2 – compression, filtered indexes, etc.
Star join, Parallel query to partitioned table
Configuration
SQL Server startup parameters: -E, trace flags 834, 836, 2301
Auto_Date_Correlation: Order date < A, Ship date > A
Implied: Order date > A−C, Ship date < A+C
Port Affinity – mostly OLTP
Dedicated processor for the log writer?
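A hedged sketch of how these settings are applied (the database name is illustrative; check the documentation for what each trace flag does before enabling it):

-- Startup parameters, set via SQL Server Configuration Manager:
--   -E      allocate more extents per file
--   -T834   large-page allocations for the buffer pool
--   -T2301  advanced decision-support optimizations
--   -T836   see the relevant KB article
-- Date correlation is a per-database option:
ALTER DATABASE MyDW SET DATE_CORRELATION_OPTIMIZATION ON;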
Storage Performance for Data Warehousing
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Storage
Organization Structure
In many large IT departments, DB and storage are in separate groups
Storage usually has its own objectives: bring all storage into one big system under full management (read: control)
Storage as a Service, in the Cloud: one size fits all needs
Usually have zero DB knowledge
Of course we do high bandwidth, 600MB/sec good enough for you?
Data Warehouse Storage
OLTP – Throughput with Fast Response
DW – flood the queues for maximum throughput
Do not use shared storage for a data warehouse! Storage system vendors like to give the impression that the SAN is a magical, immensely powerful box that can meet all your needs: just tell us how much capacity you need and don't worry about anything else. My advice: stay away from shared storage controlled by a different team.
Nominal and Net Bandwidth
PCI-E Gen 2 – 5 Gbit/s signaling (8b/10b encoded): x8 = 5GB/s nominal, 4GB/s net; x4 = 2GB/s net
SAS 6Gbit/s x4 port: 3GB/s nominal, ~2.2GB/s net?
Fibre Channel 8 Gbit/s nominal: ~780MB/s point-to-point, ~680MB/s from host through SAN to back-end loop
SAS RAID controller, x8 PCI-E G2, 2 x4 6G ports: ~2.8GB/s (depends on the controller, and will change!)
Storage – SAS Direct-Attach
Many Fat Pipes
Very Many Disks
Balance by pipe bandwidth
Don’t forget fat network pipes
Option A: 24-disks in one enclosure for each x4 SAS port. Two x4 SAS ports per controller
Option B: Split enclosure over 2 x4 SAS ports, 1 controller
[Diagram: four RAID controllers in PCI-E x8 slots, each with two x4 SAS ports fanning out to 24-disk enclosures; two PCI-E x4 slots hold dual 10GbE NICs]
Storage – FC/SAN
PCI-E x8 Gen 2 Slot with quad-port 8Gb FC
If 8Gb quad-port is not supported, consider system with many x4 slots, or consider SAS!
SAN systems typically offer 3.5in 15-disk enclosures. Difficult to get high spindle count with density.
1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
[Diagram: dual- and quad-port 8Gb FC HBAs in PCI-E x8 and x4 slots, many 8Gb FC links to SAN disk enclosures; dual 10GbE NICs in PCI-E x4 slots]
Storage – SSD/HDD Hybrid
Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID1 pairs for logs
Storage enclosures typically have 12 disks per channel and can only support the bandwidth of a few SSDs. Use the remaining bays for extra HDD storage; there is no point expending valuable SSD space on backups and flat files.
No RAID w/SSD?
[Diagram: four SAS controllers in PCI-E x8 slots, each x4 SAS port driving enclosures that mix a few SSDs per channel with HDDs; dual 10GbE NICs in PCI-E x4 slots]
SSD
Current: mostly 3Gbps SAS/SATA SDD
Some 6Gbps SATA SSD
Fusion IO – direct PCI-E Gen2 interface
320GB-1.2TB capacity, 200K IOPS, 1.5GB/s
No RAID? An HDD is fundamentally a single point of failure; an SSD could be built with redundant components.
HP reported problems with SSD on RAID controllers; Fujitsu did not?
Big DW Storage – iSCSI
Are you nuts?
Storage Configuration - Arrays
Shown: two 12-disk arrays per 24-disk enclosure
Options: between 6-16 disks per array
SAN systems may recommend R10 4+4 or R5 7+1
Very many spindles; a comment on MetaLUN
Data Consumption Rate: Xeon
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)
Data consumption rate is much higher for current generation Nehalem and Westmere processors than Core 2 referenced in Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.
Processors    Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz   Mem GB  SF    Generation
2 Xeon 5355        8         85.4   5sp2    1,165.5      145.7       2.66    64    100   Conroe
2 Xeon 5570        8         42.2   8sp1    2,073.5      259.2       2.93   144    100   Nehalem
2 Xeon 5680       12         21.0   8r2     4,166.7      347.2       3.33   192    100   Westmere
4 Xeon 7560       32         37.2   8r2     7,056.5      220.5       2.26   640    300   Neh.-EX
8 Xeon 7560       64        183.8   8r2    14,282        223.2       2.26   512   3000   Neh.-EX
Data Consumption Rate: Opteron
Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8 for Istanbul) – or is this Win/SQL 2K8 R2?
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)
Processors   Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz  Mem GB  SF    Generation
4 Opt 8220        8        309.7   5rtm     868.7       121.1       2.8   128    300
8 Opt 8360       32         91.4   8rtm   2,872.0        89.7       2.5   256    300   Barcelona
8 Opt 8384       32         72.5   8rtm   3,620.7       113.2       2.7   256    300   Shanghai
8 Opt 8439       48         49.0   8sp1   5,357.1       111.6       2.8   256    300   Istanbul
8 Opt 8439       48        166.9   8rtm   5,242.7       109.2       2.8   512   1000   Istanbul
2 Opt 6176       24         20.2   8r2    4,331.7       180.5       2.3   192    100   Magny-C
4 Opt 6176       48         31.8   8r2    8,254.7       172.0       2.3   512    300   Magny-C
Data Consumption Rate
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)

Processors   Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz   Mem GB  SF    Generation
2 Xeon 5355       8         85.4   5sp2   1,165.5       145.7       2.66    64    100
2 Xeon 5570       8         42.2   8sp1   2,073.5       259.2       2.93   144    100
2 Xeon 5680      12         21.0   8r2    4,166.7       347.2       3.33   192    100
2 Opt 6176       24         20.2   8r2    4,331.7       180.5       2.3    192    100   Magny-C
4 Opt 8220        8        309.7   5rtm     868.7       121.1       2.8    128    300
8 Opt 8360       32         91.4   8rtm   2,872.0        89.7       2.5    256    300   Barcelona
8 Opt 8384       32         72.5   8rtm   3,620.7       113.2       2.7    256    300   Shanghai
8 Opt 8439       48         49.0   8sp1   5,357.1       111.6       2.8    256    300   Istanbul
4 Opt 6176       48         31.8   8r2    8,254.7       172.0       2.3    512    300   Magny-C
8 Xeon 7560      64        183.8   8r2   14,282         223.2       2.26   512   3000
Storage Targets

Processors    Total cores  PCI-E x8-x4  SAS HBAs  Units/Disks  Actual BW  BW/core MB/s  Target MB/s  Units/Disks (target)
2 Xeon X5680      12          5 - 1        2         2 - 48      5 GB/s       350          4,200           4 - 96
4 Opt 6176        48          5 - 1        4         4 - 96     10 GB/s       175          8,400           8 - 192
4 Xeon X7560      32          6 - 4        6         6 - 144    15 GB/s       250          8,000          12 - 288
8 Xeon X7560      64          9 - 5       11†       10 - 240    26 GB/s       225         14,400          20 - 480

† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks
24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice
2U disk enclosure, 24 x 73GB 15K 2.5in disks: $14K ($600 per disk)
Think: Shortest path to metal (iron-oxide)
Your Storage and the Optimizer
Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8); HDD 400 IOPS per disk (big-query key lookup or loop join at high queue depth, short-stroked, possibly skip-seek); SSD 35,000 IOPS
Model      Disks  Sequential IOPS  BW (KB/s)   “Random” IOPS  Sequential:Random IO ratio
Optimizer    -         1,350           10,800         320            4.22
SAS 2x4     24       350,000        2,800,000       9,600           36.5
SAS 2x4     48       350,000        2,800,000      19,200           18.2
FC 4G       30        45,000          360,000      12,000            3.75
SSD          8       350,000        2,800,000     280,000            1.25
The SQL Server Query Optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18–36 ratio; 30 disks per 4Gb FC port roughly matches the optimizer model, while SSD is skewed in the other direction.
Data Consumption Rates
[Chart: TPC-H SF100, Queries 1, 9, 18, 21 – Xeon 5355 (5sp2), 5570 (8sp1), 5680 (8R2), Opteron 6176 (8R2)]
[Chart: TPC-H SF300, Queries 1, 9, 18, 21 – Opteron DC 2.8GHz (5rtm), QC 2.5GHz (8rtm), QC 2.7GHz (8rtm), 6C 2.8GHz (8sp1), 12C 2.3GHz (8R2), Xeon 7560 (8R2, 640GB)]
Fast Track Reference Architecture
My complaints:
Several expensive SAN systems (11 disks each), each of which must be configured independently; $1,500-2,000 amortized per disk
Too many 2-disk arrays, 2 LUNs per array, too many data files
Build indexes with MAXDOP 1 – is this brain dead?
Designed around 100MB/sec per disk – not all DW is single-scan or sequential
Scripting?
Fragmentation
Weak storage system: 1) fragmentation could degrade IO performance; 2) defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.
Powerful storage system: 3) fragmentation has very little impact; 4) defragmenting has mild impact and completes within the night-time window.
What is the correct conclusion?
[Diagram: layers – Table, File, Partition, LUN, Disk]
Operating System View of Storage
Operating System Disk View
Controller 1 Port 0 → Disk 2 (Basic, 396GB, Online); Port 1 → Disk 3 (Basic, 396GB, Online)
Controller 2 Port 0 → Disk 4 (Basic, 396GB, Online); Port 1 → Disk 5 (Basic, 396GB, Online)
Controller 3 Port 0 → Disk 6 (Basic, 396GB, Online); Port 1 → Disk 7 (Basic, 396GB, Online)
Additional disks not shown, Disk 0 is boot drive, 1 – install source?
File Layout
Each data disk (Disks 2–7) carries the same four partitions, with one file per disk in each group:
Partition 0 – file group for the big table (Files 1–6)
Partition 1 – file group for all other (small) tables (Files 1–6)
Partition 2 – tempdb (Files 1–6)
Partition 4 – backup and load (Files 1–6)
Each File Group is distributed across all data disks
Log disks not shown; tempdb shares a common pool with data
File Groups and Files
Dedicated File Group for largest table
Never defragment
One file group for all other regular tables
Load file group? Rebuild indexes to a different file group
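A minimal T-SQL sketch of this layout (database name, paths, and file counts are illustrative; per the File Layout slide, each filegroup gets one file on every data disk):

CREATE DATABASE DW
ON PRIMARY (NAME = dw_sys,  FILENAME = 'C:\sql\dw_sys.mdf'),
FILEGROUP BigTableFG    -- dedicated to the largest table, never defragment
  (NAME = big1,   FILENAME = 'D:\dw\big1.ndf'),
  (NAME = big2,   FILENAME = 'E:\dw\big2.ndf'),   -- ...one file per data disk
FILEGROUP SmallTablesFG -- all other regular tables
  (NAME = small1, FILENAME = 'D:\dw\small1.ndf'),
  (NAME = small2, FILENAME = 'E:\dw\small2.ndf')
LOG ON (NAME = dw_log,  FILENAME = 'L:\log\dw_log.ldf');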
Partitioning - Pitfalls
Common Partitioning Strategy
Partition Scheme maps partitions to File Groups
What happens in a table scan? Read first from Partition 1, then 2, then 3, …?
SQL 2008 hotfix to read from each partition in parallel? What if partitions have disparate sizes?
[Diagram: Table Partitions 1–6 map via File Groups 1–6 to Disks 2–7]
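A hedged sketch of the common strategy shown above (boundary values and names are illustrative):

CREATE PARTITION FUNCTION pfOrderKey (int)
AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000, 4000000, 5000000);

CREATE PARTITION SCHEME psOrderKey
AS PARTITION pfOrderKey TO (FG1, FG2, FG3, FG4, FG5, FG6);
-- A table created ON psOrderKey(OrderKey) puts partition N in filegroup FGN,
-- so a full scan touches the filegroups (disks) one partition at a time.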
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
So you bought a 64+ core box
Learn all about parallel execution now – all guns (cores) blazing
Negative scaling – yes, this can happen; how will you know?
Super-scaling – no, I have not been smoking pot
High degree of parallelism & small SQL – anomalies, execution plan changes, etc.
Compression – how much CPU do I pay for this?
Partitioning – great management tool, what else?
Parallel Execution Plans
Reference: Adam Machanic PASS
Execution Plan Quickie
Cost is duration in seconds on some reference platform.
IO cost for a scan: 1 = 10,800KB/s, so 810 implies 8,748,000KB.
IO in a nested loops join: 1 = 320/s, in multiples of 0.003125.
F4
Estimated Execution Plan
I/O and CPU Cost components
Index + Key Lookup - Scan
(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)
Actual CPU time (data in memory): Key Lookup 1919, 1919; Scan 8736, 8727
1,093,729 pages/1350 = 810.17 (8,748MB)
True cross-over: approx 1,400,000 rows (1 row per page)
Index + Key Lookup - Scan
8,748,000KB / 8 / 1350 = 810
(817 − 280326 × 0.0001581) / 0.003125 = 247,259 (88%)
Actual CPU time: Key Lookup 2138, 321; Scan 18622, 658
Actual Execution Plan
Note Actual Number of Rows, Rebinds, Rewinds
Row Count and Executions
For Loop Join inner source and Key Lookup, Actual Num Rows = Num of Exec × Num of Rows
Parallel Plans
Parallelism Operations
Distribute Streams – non-parallel source, parallel destination
Repartition Streams – parallel source and destination
Gather Streams – destination is non-parallel
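An easy way to watch these operators come and go is to run the same query at different MAXDOP and compare the plans (a sketch; the columns are from the TPC-H schema used throughout this deck):

SELECT L_RETURNFLAG, SUM(L_EXTENDEDPRICE)
FROM LINEITEM
GROUP BY L_RETURNFLAG
OPTION (MAXDOP 1);   -- serial plan: no parallelism operators

-- The same query with OPTION (MAXDOP 8) adds Repartition Streams
-- (to redistribute rows for the parallel aggregate) and Gather Streams.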
Parallel Execution Plans
Note: gold circle with double arrow, and parallelism operations
Parallel Scan (and Index Seek)
DOP 1 DOP 2
DOP 4 DOP 8
IO cost is the same; CPU cost is reduced by the degree of parallelism (2X, 4X, 8X), except there is no further reduction at DOP 16.
IO contributes most of the cost!
Parallel Scan 2
DOP 16
Hash Match Aggregate
CPU cost only reduces by 2X
Parallel Scan
IO Cost is the same
CPU cost reduced in proportion to degree of parallelism, last 2X excluded?
On a weak storage system, a single thread can saturate the IO channel; additional threads will not increase IO (or reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option.
The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may inhibit generating a favorable plan, i.e., the reduction is not sufficient to offset the contribution from the Parallelism operations.
A parallel execution plan is more likely on larger systems (-P to fake it?)
Actual Execution Plan - Parallel
More Parallel Plan Details
Parallel Plan - Actual
Parallelism – Hash Joins
Hash Join Cost
DOP 1 DOP 2
DOP 8
DOP 4
Search: Understanding Hash Joins – for in-memory, grace, and recursive
Hash Join Cost
CPU Cost is linear with number of rows, outer and inner source
See BOL on hash joins for the in-memory, grace, and recursive variants. IO cost is zero for small intermediate data sizes; beyond a set point (proportional to server memory?) IO is proportional to the excess data beyond the in-memory limit. Parallel plan: memory allocation is per thread!
Summary: hash join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. It does not reflect real cost?
Parallelism Repartition Streams
DOP 2 DOP 4 DOP 8
Bitmap
BOL, Optimizing Data Warehouse Query Performance Through Bitmap Filtering: “A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.”
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
Parallel Execution Plan Summary
Queries with high IO cost may show little plan cost reduction on parallel execution
Plans with high portion hash or sort cost show large parallel plan cost reduction
Parallel plans may be inhibited by high row count in Parallelism Repartition Streams
Watch out for (Parallel) Merge Joins!
Scaling Theory
Parallel Execution Strategy
Partition work into little pieces – ensures each thread has the same amount
High overhead to coordinate
Partition into big pieces – may have uneven distribution between threads
Small table join to big table
Thread for each row from small table
Partitioned table options
What Should Scale?
Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) small amount of coordination to consolidate threads
More Difficult?
Parallelizable: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) large coordination effort between threads, 4) more work… then consolidate
Partitioned Tables – No Repartition Streams
Regular table vs partitioned tables: no Repartition Streams operations!
Scaling Reality
8-way quad-core Opteron, Windows Server 2008 R2, SQL Server 2008 SP1 + HF 27
Test Queries
TPC-H SF 10 database: standard, compressed, partitioned (30)
Line Item Table SUM, 59M rows, 8.75GB
Orders Table 15M rows
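A hedged sketch of the style of test query used here (TPC-H Line Item columns; the deck does not state exactly which columns were summed):

SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM;                   -- sum 1 column
SELECT SUM(L_EXTENDEDPRICE), SUM(L_QUANTITY) FROM LINEITEM;  -- sum 2 columns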
CPU-sec
Standard: CPU-sec to SUM 1 or 2 columns in Line Item
[Charts: CPU-sec vs DOP 1–32 for Sum 1 column and Sum 2 columns – standard and compressed]
Speed Up
[Charts: speedup vs DOP 1–32 for Sum 1, Sum 2, S2 Group, S2 Join – compressed and standard]
Line Item sum 1 column
[Charts: speedup relative to DOP 1, and CPU-sec, vs DOP 1–32 – Sum 1 standard, compressed, partitioned]
Line Item Sum w/Group By
[Charts: speedup and CPU-sec vs DOP 1–32 for the Group By query – standard, compressed, hash]
Hash Join
[Charts: speedup and CPU-sec vs DOP 1–32 for the hash join query – standard, compressed, partitioned]
Key Lookup and Table Scan
[Charts: speedup and CPU-sec (1.4M rows) vs DOP 1–32 for key lookup and table scan – standard and compressed]
Parallel Execution Summary
Contention in queries w/low cost per page
Simple scan,
High Cost per Page – improves scaling!
Multiple Aggregates, Hash Join, Compression
Table Partitioning – alternative query plans
Loop Joins – broken at high DOP
Merge Join – seriously broken (parallel)
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate parallel execution plans – scaling from DOP 1, 2, 4, 8, 16, 32, etc.
Scaling with and w/o HT
Strategy for limiting DOP with multiple users
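One hedged way to cap DOP per workload is the Resource Governor (SQL Server 2008+; the group name is illustrative, and a classifier function is still needed to route sessions into it):

CREATE WORKLOAD GROUP wgBigDW WITH (MAX_DOP = 8);
ALTER RESOURCE GOVERNOR RECONFIGURE;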
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling – back off on parallelism if the gain is negligible
Fix throughput degradation with multiple users running big DW queries
For Sybase and Oracle, Throughput is close to Power or better
Test Systems
2-way quad-core Xeon 5430 2.66GHz – Windows Server 2008 R2, SQL 2008 R2
8-way dual-core Opteron 2.8GHz – Windows Server 2008 SP1, SQL 2008 SP1
8-way quad-core Opteron 2.7GHz Barcelona
Windows Server 2008 R2, SQL 2008 SP1. The 8-way systems were configured for AD – not good!
Build 2789
Test Methodology
Boot with all processors; run queries at MAXDOP 1, 2, 4, 8, etc.
Not the same as running on 1-way, 2-way, 4-way server
Interpret results with caution
References
Search Adam Machanic PASS
SQL Server Scaling on Big Iron (NUMA) Systems
TPC-H
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
DSS – 22 queries, geometric mean; 60X range in plan cost, comparable actual range
Power – single stream; tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
Observed Scaling Behaviors
Good scaling, leveling off at high DOP
Perfect Scaling ???
Super Scaling
Negative Scaling especially at high DOP
Execution plan change – completely different behavior
TPC-H Published Results
TPC-H SF 100GB
Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. Westmere and Magny-Cours, both with 192GB memory, are very close.
2-way Xeon 5355, 5570, 5680, Opt 6176
[Chart: Power, Throughput, QphH for Xeon 5355, 5570 HDD/SSD/Fusion, 5680 SSD, Opteron 6176]
TPC-H SF 300GB – 8x QC/6C & 4x12C Opteron
6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall. The 4x12C 2.3GHz improved 17% over the 8x6C 2.8GHz.
[Chart: Power, Throughput, QphH for Opteron 8360 4C, 8384 4C, 8439 6C, 6176 12C, Xeon 7560 8C]
TPC-H SF 1000
[Chart: Power, Throughput, QphH for Opteron 8439 SQL Server, Opteron 8439 Sybase, Superdome, Superdome 2]
TPC-H SF 3TB – X7460 & X7560
Nehalem-EX with 64 cores beats 96 Core 2 cores.
[Chart: Power, Throughput, QphH for 16 x X7460, 8 x X7560, POWER6]
TPC-H SF 100GB, 300GB & 3TB
[Charts recapping all three scale factors: SF100 2-way (Xeon 5355, 5570 HDD/SSD/Fusion, 5680 SSD, Opteron 6176); SF300 8x QC/6C & 4x12C (Opteron 8360/8384/8439/6176, Xeon 7560); SF 3TB (16 x X7460, 8 x X7560, 32 x POWER6) – same observations as the individual slides above]
TPC-H Published Results
SQL Server excels in Power, limited by the geometric mean and anomalies
Trails in Throughput – other DBMSs get better throughput than power
SQL Server throughput is below its Power by a wide margin
Speculation – SQL Server does not throttle back parallelism with load?
TPC-H SF100
Power      Throughput  QphH       Processors     Total cores  SQL   GHz   Mem GB  SF
23,378.0   13,381.0    17,686.7   2 Xeon 5355         8       5sp2  2.66    64    100
67,712.9   38,019.1    50,738.4   2x5570 HDD          8       8sp1  2.93   144    100
70,048.5   37,749.1    51,422.4   2x5570 SSD          8       8sp1  2.93   144    100
72,110.5   36,190.8    51,085.6   2x5570 Fusion       8       8sp1  2.93   144    100
99,426.3   55,038.2    73,974.6   2 Xeon 5680        12       8r2   3.33   192    100
94,761.5   53,855.6    71,438.3   2 Opt 6176         24       8r2   2.3    192    100
TPC-H SF300
Power       Throughput  QphH        Processors   Total cores  SQL   GHz   Mem GB  SF
25,206.4    13,283.8    18,298.5    4 Opt 8220        8       5rtm  2.8    128    300
67,287.4    41,526.4    52,860.2    8 Opt 8360       32       8rtm  2.5    256    300
75,161.2    44,271.9    57,684.7    8 Opt 8384       32       8rtm  2.7    256    300
109,067.1   76,869.0    91,558.2    8 Opt 8439       48       8sp1  2.8    256    300
129,198.3   89,547.7    107,561.2   4 Opt 6176       48       8r2   2.3    512    300
152,453.1   96,585.4    121,345.6   4 Xeon 7560      32       8r2   2.26   640    300

All of the above are HP results? A Sun result on Opt 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5.
TPC-H 1TB
Power       Throughput  QphH        Processors     Total cores  SQL    GHz   Mem GB  SF
95,789.1    69,367.6    81,367.6    8 Opt 8439         48      8R2?   2.8    512    1000
108,436.8   96,652.7    102,375.3   8 Opt 8439         48      ASE    2.8    384    1000
139,181.0   141,188.1   140,181.1   Itanium 9350       64      O11R2  1.73   512    1000
TPC-H 3TB
Power       Throughput  QphH        Processors     Total cores  SQL     GHz   Mem GB  SF
120,254.8   87,841.4    102,254.8   16 Xeon 7460       96      8r2     2.66  1024    3000
185,297.7   142,685.6   162,601.7   8 Xeon 7560        64      8r2     2.26   512    3000
142,790.7   171,607.4   156,537.3   Itanium 9350       64      Sybase  1.73   512    1000
142,790.7   171,607.4   156,537.3   POWER6             64      Sybase  5.0    512    3000
TPC-H Published Results
Power       Throughput  QphH        Processors   Total cores  SQL   GHz   Mem GB  SF
23,378      13,381      17,686.7    2 Xeon 5355       8       5sp2  2.66    64    100
72,110.5    36,190.8    51,085.6    2 Xeon 5570       8       8sp1  2.93   144    100
99,426.3    55,038.2    73,974.6    2 Xeon 5680      12       8r2   3.33   192    100
94,761.5    53,855.6    71,438.3    2 Opt 6176       24       8r2   2.3    192    100
25,206.4    13,283.8    18,298.5    4 Opt 8220        8       5rtm  2.8    128    300
67,287.4    41,526.4    52,860.2    8 Opt 8360       32       8rtm  2.5    256    300
75,161.2    44,271.9    57,684.7    8 Opt 8384       32       8rtm  2.7    256    300
109,067.1   76,869.0    91,558.2    8 Opt 8439       48       8sp1  2.8    256    300
129,198.3   89,547.7    107,561.2   4 Opt 6176       48       8r2   2.3    512    300
185,297.7   142,685.6   162,601.7   8 Xeon 7560      64       8r2   2.26   512    3000
SF100 2-way Big Queries (sec)
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21 – 5570 HDD, 5570 SSD, 5570 FusionIO, 5680 SSD, 6176 SSD]
Xeon 5570 with SATA SSD is poor on Q9, reason unknown. Both Xeon 5680 and Opteron 6176 are a big improvement over Xeon 5570.
SF100 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22 – same systems]
Xeon 5570-HDD and 5680-SSD are poor on Q12, reason unknown. Opteron 6176 is poor on Q11.
SF100 Small Queries
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20 – same systems]
Xeon 5680 and Opteron are poor on Q20. Note the limited scaling on Q2 and Q17.
SF300 32+ cores Big Queries
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21 – 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
Opteron 6176 is poor relative to 8439 on Q9 & Q13, with the same number of total cores.
SF300 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22 – same systems]
Opteron 6176 is much better than 8439 on Q11 & Q19, worse on Q12.
SF300 Small Q
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17 – same systems]
Opteron 6176 is much better on Q2, even with 8439 on the others.
SF1000
[Chart: per-query results, Q1–Q22]
SF1000
[Chart: query time for Q1, Q9, Q13, Q18, Q21 – SQL Server vs Sybase]
SF1000
[Chart: query time for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19 – SQL Server vs Sybase]
SF1000
[Chart: query time for Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22 – SQL Server vs Sybase]
SF1000 Itanium - Superdome
[Chart: per-query results, Q1–Q22]
SF 3TB – 8×7560 versus 16×7460
[Chart: per-query ratio, Q1–Q22]
Broadly 50% faster overall, 5X+ on one query (5.6X), slower on 2, comparable on 3.
64 cores: 7560 relative to POWER6
[Charts: per-query ratio Q1–Q22, and query times for big (Q1, Q9, Q13, Q18, Q21), middle (Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19), and small (Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22) queries – Uni 16x6, DL980 8x8, POWER6]
TPC-H Summary
Scaling is impressive on some SQL
Limited ability (and value) in scaling small queries
Anomalies, negative scaling
TPC-H Queries
Q1 Pricing Summary Report
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q3
Q6 Forecasting Revenue Change
Q7 Volume Shipping
Q8 National Market Share
Q9 Product Type Profit Measure
Q11 Important Stock Identification
Non-Parallel Parallel
Q12 Random IO?
Q13 – Why does Q13 have perfect scaling?
Q17 Small Quantity Order Revenue
Q18 Large Volume Customer
Non-Parallel
Parallel
Q19
Q20?
This query may get a poor execution plan
Date functions are usually written as
because the Line Item date columns are “date” type. CAST helps the DOP 1 plan, but gets a bad plan for parallel.
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Q22
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
DSS – 22 queries, geometric mean; 60X range in plan cost, comparable actual range
Power – single stream; tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
SF 10, test studies
Not valid for publication
Auto-Statistics enabled, Excludes compile time
Big Queries – Line Item Scan
Super Scaling – Mission Impossible
Small Queries & High Parallelism
Other queries, negative scaling
Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual
[Charts: plan cost @ 10GB (DOP 1–16) and actual query time in seconds (DOP 1–32) for Q1, Q9, Q13, Q18, Q21]
Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%
Plan cost says scaling is poor except for Q18; memory affects hash IO onset
Plan cost is a poor indicator of true parallelism scaling
Q18 & Q21 > 3X Q1, Q9
Big Query: Speed Up and CPU
Q13 has slightly better than perfect scaling (the 'Holy Grail')? In general, excellent scaling to DOP 8–24, weak afterwards.
[Charts: CPU time in seconds and speedup relative to DOP 1 for Q1, Q9, Q13, Q18, Q21 at DOP 1–32]
Super Scaling
Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged
CPU time = 100 sec, elapse time = 100 sec
What is the best case for DOP 2? Assuming nearly zero Repartition Streams cost:
CPU time = 100 sec, elapsed time = 50?
Super scaling: CPU time decreases going from a non-parallel to a parallel plan! (No, I have not started drinking, yet.)
[Chart: CPU time for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
Super Scaling
CPU-sec goes down from DOP 1 to 2 and higher (typically 8)
[Charts: CPU normalized to DOP 1, and speedup relative to DOP 1, for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
3.5X speedup from DOP 1 to 2 (normalized to DOP 1)
CPU and Query Time in Seconds
[Charts: CPU time and query time for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
Super Scaling Summary
Most probable cause: the Bitmap operator in the parallel plan
Bitmap Filters are great, Question for Microsoft:
Can I use Bitmap Filters in OLTP systems with non-parallel plans?
Small Queries – Plan Cost vs Actual
Queries 3 and 16 have lower plan cost than Q17, but are not included
[Charts: plan cost (DOP 1–16) and query time (DOP 1–32) for Q2, Q4, Q6, Q15, Q17, Q20]
Q4, Q6, Q17: great scaling to DOP 4, then weak. Negative scaling also occurs.
Small Queries CPU & Speedup
What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling.
[Charts: CPU time and speedup for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1–32]
Query 2 goes negative at DOP 2; Q4 is good; Q6 gets speedup, but at a CPU premium; Q17 and Q20 go negative after DOP 8.
High Parallelism – Small Queries
Why? Almost no value.
TPC-H geometric mean scoring: small queries have as much impact as large
A linear sum would weight the large queries
OLTP with 32, 64+ cores: parallelism good if super-scaling
Default max degree of parallelism 0
Seriously bad news, especially for small Q
Increase cost threshold for parallelism?
Sometimes you do get lucky
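A sketch of the two instance settings just mentioned (values are illustrative, not recommendations):

EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;         -- cap DOP
EXEC sp_configure 'cost threshold for parallelism', 25;   -- raise from default 5
RECONFIGURE;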
Q that go Negative
[Charts: query time and “speedup” for Q17, Q19, Q20, Q22 at DOP 1–32]
CPU
[Chart: CPU time for Q17, Q19, Q20, Q22 at DOP 1–32]
Other Queries – CPU & Speedup
[Charts: CPU time and speedup for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1–32]
Q3 has problems beyond DOP 2.
Other - Query Time (seconds)
[Chart: query time for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1–32]
Scaling Summary
Some queries show excellent scaling
Super-scaling, better than 2X
Sharp CPU jump on last DOP doubling
Need a strategy to cap DOP, to limit negative scaling
Especially for some smaller queries?
Other anomalies
Compression (PAGE)
Compression Overhead - Overall
[Charts: query time and CPU time, compressed relative to uncompressed, at DOP 1–32]
40% overhead for compression at low DOP, 10% overhead at max DOP???
[Charts: per-query (Q1–Q22) query time and CPU time, compressed relative to uncompressed, at DOP 1–32]
Compressed Table
LINEITEM – real data may be more compressible.
Uncompressed: 8,749,760KB, average 149 bytes per row
Compressed: 4,819,592KB, average 82 bytes per row
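A minimal sketch of enabling page compression on this table (SQL Server 2008+ syntax):

-- Estimate the benefit first:
EXEC sp_estimate_data_compression_savings 'dbo', 'LINEITEM', NULL, NULL, 'PAGE';
-- Then rebuild compressed:
ALTER TABLE LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);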
Partitioning
Orders and Line Item on Order Key
Partitioning Impact - Overall
[Charts: overall and per-query (Q1–Q22) query time and CPU time, partitioned relative to not partitioned, at DOP 1–32]
Plan for Partitioned Tables
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate parallel execution plans – scaling from DOP 1, 2, 4, 8, 16, 32, etc.
Scaling with and w/o HT
Strategy for limiting DOP with multiple users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling – back off on parallelism if the gain is negligible
Fix throughput degradation with multiple users running big DW queries
For Sybase and Oracle, Throughput is close to Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan
Non-Parallel
Parallel
Parallel plan cost is 28% lower than the scalar plan; IO is 70% of the cost, with no parallel plan cost reduction
Q9 Product Type Profit Measure
IO from 4 tables contributes 58% of plan cost; the parallel plan is 39% lower
Non-Parallel Parallel
Q9 Non-Parallel Plan
Table/index scans comprise 64%; IO from 4 tables contributes 58% of plan cost
Join sequence: Supplier, (Part, PartSupp), Line Item, Orders
Q9 Parallel Plan
Non-parallel join sequence: (Supplier), (Part, PartSupp), Line Item, Orders
Parallel join sequence: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details
Table scans comprise 64%; IO from 4 tables contributes 58% of plan cost
Q9 Parallel regular vs Partitioned
Q13 – Why does Q13 have perfect scaling?
Q18 Large Volume Customer
Non-Parallel
Parallel
Q18 Graphical Plan
Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details
Non-Parallel
Parallel
The non-parallel plan Hash Match cost is 1245 IO, 494.6 CPU. At DOP 16/32 the size is below the IO threshold, and CPU is reduced by >10X.
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Non-Parallel Parallel
Q21 Non-Parallel Plan
[Plan detail: hash joins H1, H2, H3]
Q21 Parallel
Q21
3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping
Non-Parallel Parallel
Q7 Non-Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share
Non-Parallel Parallel
Q8 Non-Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q11 Important Stock Identification
Non-Parallel Parallel
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2
Clustered index scans on Part and PartSupp have the highest cost (48% + 42%)
Q2
PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change
Not sure why this blows CPU; scalar values are pre-computed and pre-converted
Q20?
This query may get a poor execution plan
Date functions are usually written as
because the Line Item date columns are “date” type. CAST helps the DOP 1 plan, but gets a bad plan for parallel.
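A hedged illustration of the pattern being described (the deck's exact query text did not survive extraction; the dates are illustrative):

-- L_SHIPDATE is a "date" column; an untyped literal or DATEADD() expression
-- is treated as datetime. Forcing the comparison to the column's type:
SELECT SUM(L_EXTENDEDPRICE)
FROM LINEITEM
WHERE L_SHIPDATE >= CAST('1997-01-01' AS date)
  AND L_SHIPDATE <  CAST('1998-01-01' AS date);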
Q20
Q20 alternate - parallel
Statistics estimation error here
Penalty for the mistake applied here
Other Queries
Q3
Q12 Random IO?
Will this generate random IO?
Query 12 Plans
Non-Parallel
Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17
The Table Spool is a concern
Q17
the usual suspects
Q19
Q22
[Charts: per-query (Q1–Q22 and total) speedup from DOP 1 query time, and CPU relative to DOP 1, at DOP 2–32]