
Page 1: System Architecture:  Big Iron (NUMA)

System Architecture: Big Iron (NUMA)

Joe Chang, [email protected]

Page 2: System Architecture:  Big Iron (NUMA)

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools: ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring,

Profiler/Trace aggregation

Page 3: System Architecture:  Big Iron (NUMA)

Scaling SQL on NUMA Topics

OLTP – Thomas Kejser session, “Designing High Scale OLTP Systems”

Data Warehouse

Ongoing Database Development

Bulk Load – SQL CAT paper + TK session

“The Data Loading Performance Guide”

Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R. Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the Field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar

Page 4: System Architecture:  Big Iron (NUMA)

Server Systems and Architecture

Page 5: System Architecture:  Big Iron (NUMA)

Symmetric Multi-Processing

[Diagram: four CPUs on a shared system bus through the MCH, with ICH and PXH IO bridges]

SMP: processors are not dedicated to specific tasks (as in ASMP); a single OS image runs, and each processor can access all memory.

SMP makes no reference to memory architecture?

Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD HyperTransport (also HT).

Page 6: System Architecture:  Big Iron (NUMA)

Non-Uniform Memory Access

[Diagram: four nodes, each with a memory controller and four CPUs, connected by node controllers over a shared bus or crossbar]

NUMA architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory and node controllers
2) Nodes connected by shared bus, crossbar, or ring

Traditionally, 8-way+ systems

Local memory latency ~150ns, remote node memory ~300-400ns; can cause erratic behavior if the OS/code is not NUMA-aware.
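To see how the OS and SQL Server have laid out the nodes, the NUMA DMVs can be queried directly. A minimal sketch, assuming SQL Server 2008 or later (where sys.dm_os_nodes is available):

-- One row per SQLOS node: state, scheduler count, memory node, CPU mask
SELECT node_id,
       node_state_desc,
       memory_node_id,
       online_scheduler_count,
       cpu_affinity_mask
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';  -- exclude the dedicated admin connection node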

Page 7: System Architecture:  Big Iron (NUMA)

AMD Opteron

[Diagram: four Opterons fully connected by HyperTransport links, with HT2100 and HT1100 IO hubs]

Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns? Actual behavior is more complicated because of snooping (cache coherency traffic).

Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior. For practical purposes it behaves like an SMP system.

Page 8: System Architecture:  Big Iron (NUMA)

8-way Opteron System Architecture

Opteron processors (prior to Magny-Cours) have 3 HyperTransport links. Note that the top and bottom-right processors in the 8-way diagram use 2 HT links to connect to other processors and the 3rd HT for IO; CPUs 1 & 7 require 3 hops to reach each other.

[Diagram: 8-way Opteron topology, CPU0 through CPU7]

Page 9: System Architecture:  Big Iron (NUMA)

http://www.techpowerup.com/img/09-08-26/17d.jpg

Page 10: System Architecture:  Big Iron (NUMA)

Nehalem System Architecture

Intel Nehalem generation processors have Quick Path Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4, so glue-less 8-way is possible.

Page 11: System Architecture:  Big Iron (NUMA)

NUMA Local and Remote Memory

Local memory is closer than remote

Physical access time is shorter

What is the actual access time, once the cache coherency requirement is included?

Page 12: System Architecture:  Big Iron (NUMA)

HT Assist – Probe Filter: part of the L3 cache used as a directory cache

ZDNET

Page 13: System Architecture:  Big Iron (NUMA)

Source Snoop Coherency

From HP PREMA Architecture whitepaper:

All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line

Page 14: System Architecture:  Big Iron (NUMA)

DL980 G7

From the HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizing inter-processor coherency communication and reducing latency to local memory. (*Only cache tags, not cache data.)

Page 15: System Architecture:  Big Iron (NUMA)

HP ProLiant DL980 Architecture

Node controllers reduce effective memory latency.

Page 16: System Architecture:  Big Iron (NUMA)

Superdome 2 – Itanium, sx3000

Agent – Remote Ownership Tag + L4 cache tags

64M eDRAM L4 cache data

Page 17: System Architecture:  Big Iron (NUMA)

IBM x3850 X5 (Glue-less)

Connects two 4-socket nodes to make an 8-way system.

Page 18: System Architecture:  Big Iron (NUMA)
Page 19: System Architecture:  Big Iron (NUMA)

OS Memory Models

[Diagram: physical memory addresses laid out across Nodes 0-3 under the two models]

SUMA: Sufficiently Uniform Memory Access – memory interleaved across nodes.

NUMA: memory first interleaved within a node; the memory stripe is then spanned across nodes.

Page 20: System Architecture:  Big Iron (NUMA)

Windows OS NUMA Support

Memory models:
SUMA – Sufficiently Uniform Memory Access: memory is striped across the NUMA nodes
NUMA – separate memory pools by node

[Diagram: logical-to-physical memory mapping for Nodes 0-3 under each model]

Page 21: System Architecture:  Big Iron (NUMA)

Memory Model Example: 4 Nodes

SUMA memory model: memory accesses uniformly distributed; 25% of memory accesses local, 75% remote.

NUMA memory model: the goal is better than 25% local node access. True local access time also needs to be faster, and cache coherency may increase local access time.

Page 22: System Architecture:  Big Iron (NUMA)

Architecting for NUMA

Web tier determines the port for each user by group (but should not be by geography!)

Affinitize each port to a NUMA node

Each node accesses localized data (partition?)

OS may allocate a substantial chunk from Node 0?

End-to-end affinity:

App Server     TCP Port  CPU       Memory    Table
North East     1440      Node 0    0-0, 0-1  NE
Mid Atlantic   1441      Node 1    1-0, 1-1  MidA
South East     1442      Node 2    2-0, 2-1  SE
Central        1443      Node 3    3-0, 3-1  Cen
Texas          1444      Node 4    4-0, 4-1  Tex
Mountain       1445      Node 5    5-0, 5-1  Mnt
California     1446      Node 6    6-0, 6-1  Cal
Pacific NW     1447      Node 7    7-0, 7-1  PNW
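SQL Server can bind a TCP port to specific NUMA nodes by appending a node affinity mask in brackets to the port number in the TCP/IP properties (SQL Server Configuration Manager). A hedged sketch of the mapping above, assuming ports 1440-1447 on an 8-node box:

-- TCP Port property value (one entry per port, node mask in hex), a sketch:
1440[0x1],1441[0x2],1442[0x4],1443[0x8],1444[0x10],1445[0x20],1446[0x40],1447[0x80]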

Page 23: System Architecture:  Big Iron (NUMA)


Page 24: System Architecture:  Big Iron (NUMA)

HP-UX LORA

HP-UX – Not Microsoft Windows

Locality-Optimized Resource Alignment

12.5% Interleaved Memory

87.5% NUMA node Local Memory

Page 25: System Architecture:  Big Iron (NUMA)

System Tech Specs

Memory pricing: 8GB $400 ea (18 x 8GB = 144GB, $7,200; 64 x 8GB = 512GB, ~$26K); 16GB $1,100 ea (12 x 16GB = 192GB, ~$13K; 64 x 16GB = 1TB, ~$70K)

Processors        Cores/socket  DIMMs  PCI-E G2      Max memory  Total cores  Base
2 x Xeon X56x0    6             18     5 x8+, 1 x4   192GB*      12           $7K
4 x Opteron 6100  12            32     5 x8, 1 x4    512GB       48           $14K
4 x Xeon X7560    8             64     4 x8, 6 x4†   1TB         32           $30K
8 x Xeon X7560    8             128    9 x8, 5 x4‡   2TB         64           $100K

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB. † Dell R910 and HP DL580 G7 have different PCI-E layouts. ‡ ProLiant DL980 G7 can have 3 IOHs for additional PCI-E slots.

Page 26: System Architecture:  Big Iron (NUMA)
Page 27: System Architecture:  Big Iron (NUMA)

Software Stack

Page 28: System Architecture:  Big Iron (NUMA)

Operating System

Windows Server 2003 RTM, SP1: network limitations (default)

Scalable Networking Pack (912222)

Windows Server 2008

Windows Server 2008 R2 (64-bit only)

Breaks 64 logical processor limit

NUMA IO enhancements? Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server, and don't try to do DW on SQL Server 2000.

Impacts OLTP

Search: MSI-X

Page 29: System Architecture:  Big Iron (NUMA)

SQL Server Version

SQL Server 2000: serious disk IO limitations (1GB/sec?)

Problematic parallel execution plans

SQL Server 2005 (fixed most S2K problems)

64-bit on X64 (Opteron and Xeon)

SP2 – performance improvement 10%(?)

SQL Server 2008 & R2: compression, filtered indexes, etc.

Star join, Parallel query to partitioned table

Page 30: System Architecture:  Big Iron (NUMA)

Configuration

SQL Server startup parameter -E, plus trace flags 834, 836, 2301

Auto_Date_Correlation: Order date < A, Ship date > A

Implied: Order date > A-C, Ship date < A+C

Port Affinity – mostly OLTP

Dedicated processor for the log writer?
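A sketch of how these settings are applied (trace flag 834, the large-page buffer pool, must be set at startup; 2301 can also be turned on globally at runtime):

-- Startup parameters (SQL Server Configuration Manager -> SQL Server service):
--   -E                  allocate larger extent runs per file
--   -T834 -T836 -T2301  trace flags at startup
-- Runtime alternative for flags that do not require a restart:
DBCC TRACEON (2301, -1);   -- advanced decision-support optimizations, global
DBCC TRACESTATUS (-1);     -- verify which trace flags are active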

Page 31: System Architecture:  Big Iron (NUMA)

Storage Performance for Data Warehousing

Joe Chang, [email protected]

Page 32: System Architecture:  Big Iron (NUMA)

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools: ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring,

Profiler/Trace aggregation

Page 33: System Architecture:  Big Iron (NUMA)

Storage

Page 34: System Architecture:  Big Iron (NUMA)

Organization Structure

In many large IT departments, DB and storage are in separate groups.

Storage usually has its own objectives: bring all storage into one big system under full management (read: control).

Storage as a Service, in the Cloud: one size fits all needs

Usually have zero DB knowledge

Of course we do high bandwidth, 600MB/sec good enough for you?

Page 35: System Architecture:  Big Iron (NUMA)

Data Warehouse Storage

OLTP – Throughput with Fast Response

DW – flood the queues for maximum throughput

Do not use shared storage for a data warehouse! Storage system vendors like to give the impression that the SAN is a magical, immensely powerful box that can meet all your needs: just tell us how much capacity you need and don't worry about anything else. My advice: stay away from shared storage controlled by a different team.

Page 36: System Architecture:  Big Iron (NUMA)

Nominal and Net Bandwidth

PCI-E Gen 2 – 5 Gbit/s signaling: x8 = 5GB/s nominal, 4GB/s net; x4 = 2GB/s net

SAS 6Gbit/s x4 port: 3GB/s nominal, ~2.2GB/s net?

Fibre Channel 8 Gbit/s nominal: ~780MB/s point-to-point, ~680MB/s from host through SAN to back-end loop

SAS RAID controller, x8 PCI-E G2 with 2 x4 6G ports: ~2.8GB/s. Depends on the controller, and will change!
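The net figures follow from the 8b/10b encoding on these links. A worked check for PCI-E Gen 2: 5 Gbit/s x 8/10 = 4 Gbit/s = 500 MB/s per lane, so x8 = 4 GB/s net and x4 = 2 GB/s net.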

Page 37: System Architecture:  Big Iron (NUMA)

Storage – SAS Direct-Attach

Many Fat Pipes

Very Many Disks

Balance by pipe bandwidth

Don’t forget fat network pipes

Option A: 24 disks in one enclosure on each x4 SAS port, two x4 SAS ports per controller.

Option B: Split enclosure over 2 x4 SAS ports, 1 controller

[Diagram: four RAID controllers in PCI-E x8 slots, each with two x4 SAS ports to 24-disk enclosures; PCI-E x4 slots carry dual 10GbE NICs and one additional x4 SAS RAID controller]

Page 38: System Architecture:  Big Iron (NUMA)

Storage – FC/SAN

PCI-E x8 Gen 2 slot with quad-port 8Gb FC.

If an 8Gb quad-port HBA is not supported, consider a system with many x4 slots, or consider SAS!

SAN systems typically offer 3.5in 15-disk enclosures; it is difficult to get high spindle count with density.

1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?

[Diagram: dual- and quad-port 8Gb FC HBAs in PCI-E x8 and x4 slots, each FC port driving 15-disk enclosures; dual 10GbE NICs in x4 slots]

Page 39: System Architecture:  Big Iron (NUMA)

Storage – SSD / HDD Hybrid

Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID1 pairs for logs.

Storage enclosures typically have 12 disks per channel and can only support the bandwidth of a few SSDs. Use the remaining bays for extra storage on HDD; there is no point expending valuable SSD space on backups and flat files.

No RAID w/SSD?

[Diagram: SAS controllers in PCI-E x8 slots, each x4 port driving enclosures that mix SSDs with HDDs; dual 10GbE NICs in x4 slots]

Page 40: System Architecture:  Big Iron (NUMA)

SSD

Current: mostly 3Gbps SAS/SATA SSD

Some 6Gbps SATA SSD

Fusion IO – direct PCI-E Gen2 interface

320GB-1.2TB capacity, 200K IOPS, 1.5GB/s

No RAID? HDD is fundamentally a single point of failure; SSD could be built with redundant components. HP reports problems with SSD on RAID controllers; Fujitsu did not?

Page 41: System Architecture:  Big Iron (NUMA)

Big DW Storage – iSCSI

Are you nuts?

Page 42: System Architecture:  Big Iron (NUMA)

Storage Configuration – Arrays

Shown: two 12-disk arrays per 24-disk enclosure.

Options: between 6 and 16 disks per array.

SAN systems may recommend R10 4+4 or R5 7+1

Very many spindles; a comment on MetaLUN.

Page 43: System Architecture:  Big Iron (NUMA)
Page 44: System Architecture:  Big Iron (NUMA)

Data Consumption Rate: Xeon

TPC-H Query 1 Line Item scan; SF1 is 1GB (875MB in SQL 2008 with DATE)

Data consumption rate is much higher for the current generation Nehalem and Westmere processors than for the Core 2 referenced in the Microsoft FTDW document. TPC-H Q1 is more compute-intensive than the FTDW light query.

Processors   (arch)     Cores  Q1 sec  SQL   Total MB/s  MB/s/core  GHz   Mem GB  SF
2 Xeon 5355  Conroe     8      85.4    5sp2  1,165.5     145.7      2.66  64      100
2 Xeon 5570  Nehalem    8      42.2    8sp1  2,073.5     259.2      2.93  144     100
2 Xeon 5680  Westmere   12     21.0    8r2   4,166.7     347.2      3.33  192     100
4 Xeon 7560  Neh.-EX    32     37.2    8r2   7,056.5     220.5      2.26  640     300
8 Xeon 7560  Neh.-EX    64     183.8   8r2   14,282      223.2      2.26  512     3000

Page 45: System Architecture:  Big Iron (NUMA)

Data Consumption Rate: Opteron

Istanbul was expected to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8GHz for Istanbul) – or is this Win/SQL 2K8 R2?

TPC-H Query 1 Line Item scan; SF1 is 1GB (875MB in SQL 2008 with DATE)

Processors  (arch)     Cores  Q1 sec  SQL   Total MB/s  MB/s/core  GHz  Mem GB  SF
4 Opt 8220             8      309.7   5rtm  868.7       121.1      2.8  128     300
8 Opt 8360  Barcelona  32     91.4    8rtm  2,872.0     89.7       2.5  256     300
8 Opt 8384  Shanghai   32     72.5    8rtm  3,620.7     113.2      2.7  256     300
8 Opt 8439  Istanbul   48     49.0    8sp1  5,357.1     111.6      2.8  256     300
2 Opt 6176  Magny-C    24     20.2    8r2   4,331.7     180.5      2.3  192     100
4 Opt 6176  Magny-C    48     31.8    8r2   8,254.7     172.0      2.3  512     300
8 Opt 8439  Istanbul   48     166.9   8rtm  5,242.7     109.2      2.8  512     1000

Page 46: System Architecture:  Big Iron (NUMA)

Data Consumption Rate

TPC-H Query 1 Line Item scan; SF1 is 1GB (875MB in SQL 2008 with DATE)

Processors   Cores  Q1 sec  SQL   Total MB/s  MB/s/core  GHz   Mem GB  SF
2 Xeon 5355  8      85.4    5sp2  1,165.5     145.7      2.66  64      100
2 Xeon 5570  8      42.2    8sp1  2,073.5     259.2      2.93  144     100
2 Xeon 5680  12     21.0    8r2   4,166.7     347.2      3.33  192     100
2 Opt 6176   24     20.2    8r2   4,331.7     180.5      2.3   192     100
4 Opt 8220   8      309.7   5rtm  868.7       121.1      2.8   128     300
8 Opt 8360   32     91.4    8rtm  2,872.0     89.7       2.5   256     300
8 Opt 8384   32     72.5    8rtm  3,620.7     113.2      2.7   256     300
8 Opt 8439   48     49.0    8sp1  5,357.1     111.6      2.8   256     300
4 Opt 6176   48     31.8    8r2   8,254.7     172.0      2.3   512     300
8 Xeon 7560  64     183.8   8r2   14,282      223.2      2.26  512     3000

Page 47: System Architecture:  Big Iron (NUMA)

Storage Targets

Processors    Total cores  PCI-E x8-x4  SAS HBAs  Storage units/disks  Actual BW  BW/core  Target MB/s  Target units/disks
2 Xeon X5680  12           5-1          2         2-48                 5 GB/s     350      4,200        4-96
4 Opt 6176    48           5-1          4         4-96                 10 GB/s    175      8,400        8-192
4 Xeon X7560  32           6-4          6         6-144                15 GB/s    250      8,000        12-288
8 Xeon X7560  64           9-5          11†       10-240               26 GB/s    225      14,400       20-480

† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks.

24 15K disks per enclosure at 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical. 24 disks per x4 SAS port requires 50MB/sec, more achievable in practice.

2U disk enclosure, 24 x 73GB 15K 2.5in disks: $14K, $600 per disk.

Think: Shortest path to metal (iron-oxide)

Page 48: System Architecture:  Big Iron (NUMA)

Your Storage and the Optimizer

Assumptions: 2.8GB/sec per SAS 2x4 adapter (could be 3.2GB/sec per PCI-E G2 x8). HDD: 400 IOPS per disk – big-query key lookup, loop join at high queue depth, short-stroked, possibly skip-seek. SSD: 35,000 IOPS.

Model      Disks  BW (KB/s)  Sequential IOPS  “Random” IOPS  Seq-Rand IO ratio
Optimizer  -      10,800     1,350            320            4.22
SAS 2x4    24     2,800,000  350,000          9,600          36.5
SAS 2x4    48     2,800,000  350,000          19,200         18.2
FC 4G      30     360,000    45,000           12,000         3.75
SSD        8      2,800,000  350,000          280,000        1.25

The SQL Server Query Optimizer makes key-lookup versus table-scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18-36 ratio (30 disks per 4G FC port about matches the QO); SSD is in the other direction.

Page 49: System Architecture:  Big Iron (NUMA)

Data Consumption Rates

[Charts: TPC-H SF100 and SF300, Queries 1, 9, 13, 21 – SF100: Xeon 5355/5570/5680 and Opteron 6176; SF300: Opteron DC/QC/6C/12C generations and Xeon 7560]

Page 50: System Architecture:  Big Iron (NUMA)
Page 51: System Architecture:  Big Iron (NUMA)

Fast Track Reference Architecture

My complaints:

Several expensive SAN systems (11 disks each); each must be configured independently; $1,500-2,000 amortized per disk.

Too many 2-disk arrays; 2 LUNs per array means too many data files.

Build indexes with MAXDOP 1 – is this brain-dead?

Designed around 100MB/sec per disk; not all DW is single-scan or sequential.

Scripting?

Page 52: System Architecture:  Big Iron (NUMA)

Fragmentation

Weak storage system: 1) fragmentation could degrade IO performance; 2) defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.

Powerful storage system: 3) fragmentation has very little impact; 4) defragmenting has mild impact, and completes within the nighttime window.

What is the correct conclusion?

[Diagram: mapping layers – Table, File, Partition, LUN, Disk]

Page 53: System Architecture:  Big Iron (NUMA)

Operating System View of Storage

Page 54: System Architecture:  Big Iron (NUMA)

Operating System Disk View

Controller 1 Port 0 → Disk 2: Basic, 396GB, Online
Controller 1 Port 1 → Disk 3: Basic, 396GB, Online
Controller 2 Port 0 → Disk 4: Basic, 396GB, Online
Controller 2 Port 1 → Disk 5: Basic, 396GB, Online
Controller 3 Port 0 → Disk 6: Basic, 396GB, Online
Controller 3 Port 1 → Disk 7: Basic, 396GB, Online

Additional disks not shown; Disk 0 is the boot drive, Disk 1 the install source?

Page 55: System Architecture:  Big Iron (NUMA)

File Layout

Disks 2 through 7 each carry the same four partitions, with one file per disk in each file group:

Partition 0 – file group for the big table: Files 1-6
Partition 1 – file group for all other tables: Files 1-6
Partition 2 – tempdb: Files 1-6
Partition 4 – backup and load: Files 1-6

Each file group is distributed across all data disks.

Log disks not shown; tempdb shares a common pool with data.

Page 56: System Architecture:  Big Iron (NUMA)

File Groups and Files

Dedicated file group for the largest table

Never defragment

One file group for all other regular tables

Load file group? Rebuild indexes to a different file group (a sketch follows below).
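A minimal sketch of this layout, with hypothetical database, file, and path names:

-- Dedicated file group for the largest table, one file per data disk
ALTER DATABASE DW ADD FILEGROUP BigTableFG;
ALTER DATABASE DW ADD FILE
    (NAME = BigTable_1, FILENAME = 'E:\DW\BigTable_1.ndf', SIZE = 100GB),
    (NAME = BigTable_2, FILENAME = 'F:\DW\BigTable_2.ndf', SIZE = 100GB)
TO FILEGROUP BigTableFG;

-- One file group for all other regular tables
ALTER DATABASE DW ADD FILEGROUP SmallFG;

-- Rebuild an index into a different file group
CREATE UNIQUE CLUSTERED INDEX ci_BigTable ON dbo.BigTable (OrderKey)
WITH (DROP_EXISTING = ON) ON BigTableFG;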

Page 57: System Architecture:  Big Iron (NUMA)

Partitioning – Pitfalls

Common partitioning strategy: the partition scheme maps partitions to file groups (see the sketch below).

What happens in a table scan? Read first from Partition 1, then 2, then 3, …?

SQL 2008 hotfix to read from each partition in parallel? What if partitions have disparate sizes?

[Diagram: Table Partitions 1-6 mapped one-to-one to File Groups 1-6 on Disks 2-7]
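A sketch of that mapping, with hypothetical boundary values and file group names:

-- Range-right partition function on the order key
CREATE PARTITION FUNCTION pfOrderKey (bigint)
AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000, 40000000, 50000000);

-- Partition scheme: maps the six partitions to File Groups 1-6
CREATE PARTITION SCHEME psOrderKey
AS PARTITION pfOrderKey TO (FG1, FG2, FG3, FG4, FG5, FG6);

-- Table created on the scheme; each partition lands in its own file group
CREATE TABLE dbo.LineItem (
    OrderKey bigint NOT NULL,
    LineNum  int    NOT NULL
    -- remaining columns omitted
) ON psOrderKey (OrderKey);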

Page 59: System Architecture:  Big Iron (NUMA)

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools: ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring,

Profiler/Trace aggregation

Page 60: System Architecture:  Big Iron (NUMA)

So You Bought a 64+ Core Box

Learn all about parallel execution: all guns (cores) blazing

Negative scaling

Super-scaling

High degree of parallelism & small SQL

Anomalies, execution plan changes etc

Compression

Partitioning

Now

No, I have not been smoking pot

Yes, this can happen; how will you know?

How much in CPU do I pay for this?

Great management tool, what else?

Page 61: System Architecture:  Big Iron (NUMA)

Parallel Execution Plans

Reference: Adam Machanic PASS

Page 62: System Architecture:  Big Iron (NUMA)

Execution Plan Quickie

Cost is duration in seconds on some reference platform.
IO cost for a scan: 1 = 10,800KB/s; 810 implies 8,748,000KB.
IO in a Nested Loops join: 1 = 320/s, a multiple of 0.003125.

F4

Estimated Execution Plan

I/O and CPU Cost components
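Working the reference numbers through with the 8KB page size: a 1,093,729-page table costs 1,093,729 × 8 / 10,800 ≈ 810 IO units to scan, while each key lookup costs 1/320 = 0.003125 IO units per row.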

Page 63: System Architecture:  Big Iron (NUMA)

Index + Key Lookup vs Scan

(926.67 − 323,655 × 0.0001581) / 0.003125 = 280,160 (86.6%)

Actual CPU time (data in memory): Key Lookup 1919 / 1919; Scan 8736 / 8727

1,093,729 pages / 1,350 = 810.17 (8,748MB)

True crossover at approx 1,400,000 rows (1 row per page)

Page 64: System Architecture:  Big Iron (NUMA)

Index + Key Lookup vs Scan

8,748,000KB / 8 / 1,350 = 810; (817 − 280,326 × 0.0001581) / 0.003125 = 247,259 (88%)

Actual CPU time: Key Lookup 2138 / 321; Scan 18,622 / 658

Page 65: System Architecture:  Big Iron (NUMA)

Actual Execution Plan

Note Actual Number of Rows, Rebinds, Rewinds – actual versus estimated.

Page 66: System Architecture:  Big Iron (NUMA)

Row Count and Executions

For the Loop Join inner source and Key Lookup, Actual Number of Rows = Number of Executions × Number of Rows (inner source versus outer).

Page 67: System Architecture:  Big Iron (NUMA)
Page 68: System Architecture:  Big Iron (NUMA)

Parallel Plans

Page 69: System Architecture:  Big Iron (NUMA)

Parallelism Operations

Distribute Streams: non-parallel source, parallel destination

Repartition Streams: parallel source and destination

Gather Streams: destination is non-parallel
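These operators appear once a plan goes parallel. A hedged sketch for producing them on any large table (hypothetical table name):

-- Force a parallel plan and inspect the Parallelism operators in the actual plan
SELECT   O_CUSTKEY, COUNT(*)
FROM     dbo.ORDERS          -- hypothetical large table
GROUP BY O_CUSTKEY
OPTION  (MAXDOP 8);          -- expect Repartition Streams feeding the aggregate
                             -- and Gather Streams at the root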

Page 70: System Architecture:  Big Iron (NUMA)

Parallel Execution Plans

Note: gold circle with double arrow, and parallelism operations

Page 71: System Architecture:  Big Iron (NUMA)

Parallel Scan (and Index Seek)

DOP 1 DOP 2

DOP 4 DOP 8

IO cost is the same; CPU cost is reduced by the degree of parallelism (2X, 4X, 8X), except there is no further reduction at DOP 16.

IO contributes most of the cost!

Page 72: System Architecture:  Big Iron (NUMA)

Parallel Scan 2

DOP 16

Page 73: System Architecture:  Big Iron (NUMA)

Hash Match Aggregate

CPU cost only reduces by 2X

Page 74: System Architecture:  Big Iron (NUMA)

Parallel Scan

IO Cost is the same

CPU cost reduced in proportion to the degree of parallelism, last 2X excluded?

On a weak storage system, a single thread can saturate the IO channel; additional threads will not increase IO (or reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option?

The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may inhibit generating the favorable plan, i.e., the reduction is not sufficient to offset the contribution from the Parallelism operations.

A parallel execution plan is more likely on larger systems (-P to fake it?)

Page 75: System Architecture:  Big Iron (NUMA)

Actual Execution Plan – Parallel

Page 76: System Architecture:  Big Iron (NUMA)

More Parallel Plan Details

Page 77: System Architecture:  Big Iron (NUMA)

Parallel Plan – Actual

Page 78: System Architecture:  Big Iron (NUMA)

Parallelism – Hash Joins

Page 79: System Architecture:  Big Iron (NUMA)

Hash Join Cost

DOP 1 DOP 2

DOP 8

DOP 4

Search: Understanding Hash Joins – for in-memory, Grace, and recursive hash joins

Page 80: System Architecture:  Big Iron (NUMA)

Hash Join Cost

CPU Cost is linear with number of rows, outer and inner source

See BOL on hash joins for in-memory, Grace, and recursive. IO cost is zero for small intermediate data sizes; beyond a set point it is proportional to server memory(?), and IO is proportional to the excess data (beyond the in-memory limit). Parallel plan: memory allocation is per thread!

Summary: hash join plan cost depends on memory if the IO component is not zero, in which case the cost is disproportionately lower with parallel plans. Does not reflect real cost?

Page 81: System Architecture:  Big Iron (NUMA)

Parallelism Repartition Streams

DOP 2 DOP 4 DOP 8

Page 82: System Architecture:  Big Iron (NUMA)

Bitmap

BOL, Optimizing Data Warehouse Query Performance Through Bitmap Filtering: SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree; essentially, the filter performs a semi-join reduction, so only the rows in the second table that qualify for the join to the first table are processed. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter.

Page 83: System Architecture:  Big Iron (NUMA)

Parallel Execution Plan Summary

Queries with high IO cost may show little plan cost reduction on parallel execution

Plans with high portion hash or sort cost show large parallel plan cost reduction

Parallel plans may be inhibited by high row count in Parallelism Repartition Streams

Watch out for (Parallel) Merge Joins!

Page 84: System Architecture:  Big Iron (NUMA)
Page 85: System Architecture:  Big Iron (NUMA)

Scaling Theory

Page 86: System Architecture:  Big Iron (NUMA)

Parallel Execution Strategy

Partition work into little pieces: ensures each thread has the same amount

High overhead to coordinate

Partition into big pieces: may have uneven distribution between threads

Small table join to big table

Thread for each row from small table

Partitioned table options

Page 87: System Architecture:  Big Iron (NUMA)

What Should Scale?

Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) a small amount of coordination to consolidate threads.

Page 88: System Architecture:  Big Iron (NUMA)

More Difficult?

Parallelizable: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) large coordination effort between threads, 4) more work to consolidate.

Page 89: System Architecture:  Big Iron (NUMA)

Partitioned Tables – No Repartition Streams

Regular table versus partitioned tables: the partitioned-table plan has no Repartition Streams operations!

Page 90: System Architecture:  Big Iron (NUMA)

Scaling Reality: 8-way quad-core Opteron, Windows Server 2008 R2, SQL Server 2008 SP1 + HF 27

Page 91: System Architecture:  Big Iron (NUMA)

Test Queries

TPC-H SF 10 database: standard, compressed, partitioned (30)

Line Item table SUM: 59M rows, 8.75GB

Orders table: 15M rows
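The queries were of this shape (a sketch against the TPC-H schema; the deck does not show the exact text), run at each MAXDOP in turn:

-- SUM over 1 column of Line Item at a fixed degree of parallelism
SELECT SUM(L_EXTENDEDPRICE)
FROM   dbo.LINEITEM
OPTION (MAXDOP 4);

-- SUM over 2 columns roughly doubles the per-page CPU work
SELECT SUM(L_EXTENDEDPRICE), SUM(L_QUANTITY)
FROM   dbo.LINEITEM
OPTION (MAXDOP 4);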

Page 92: System Architecture:  Big Iron (NUMA)

CPU-sec

CPU-sec to SUM 1 or 2 columns in Line Item, standard versus compressed.

[Charts: CPU-sec versus DOP 1-32 for Sum 1 column and Sum 2 columns, standard and compressed tables]

Page 93: System Architecture:  Big Iron (NUMA)

Speed Up

[Charts: speedup relative to DOP 1 versus DOP 1-32 for Sum 1, Sum 2, S2 Group, S2 Join – standard and compressed tables]

Page 94: System Architecture:  Big Iron (NUMA)

Line Item Sum 1 Column

[Charts: CPU-sec and speedup relative to DOP 1 versus DOP 1-32 – standard, compressed, partitioned]

Page 95: System Architecture:  Big Iron (NUMA)

Line Item Sum w/Group By

[Charts: CPU-sec and speedup versus DOP 1-32 – standard, compressed, hash]

Page 96: System Architecture:  Big Iron (NUMA)

Hash Join

[Charts: CPU-sec and speedup versus DOP 1-32 – standard, compressed, partitioned]

Page 97: System Architecture:  Big Iron (NUMA)

Key Lookup and Table Scan

[Charts: CPU-sec (1.4M rows) and speedup versus DOP 1-32 – key lookup standard/compressed, table scan uncompressed/compressed]

Page 98: System Architecture:  Big Iron (NUMA)
Page 99: System Architecture:  Big Iron (NUMA)

Parallel Execution Summary

Contention in queries with low cost per page (simple scan)

High cost per page improves scaling!

Multiple Aggregates, Hash Join, Compression

Table Partitioning – alternative query plans

Loop Joins – broken at high DOP

Merge Join – seriously broken (parallel)

Page 100: System Architecture:  Big Iron (NUMA)

Scaling DW Summary

Massive IO bandwidth

Parallel options for data load, updates etc

Investigate parallel execution plans: scaling from DOP 1, 2, 4, 8, 16, 32, etc.

Scaling with and w/o HT

Strategy for limiting DOP with multiple users (a sketch below)
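Two places to apply such a cap – the instance-wide setting, and on SQL Server 2008+ a Resource Governor workload group (a sketch; the classifier function to route the DW logins is omitted):

-- Instance-wide cap
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8; RECONFIGURE;

-- Per-workload cap via Resource Governor
CREATE WORKLOAD GROUP DWUsers WITH (MAX_DOP = 8);
ALTER RESOURCE GOVERNOR RECONFIGURE;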

Page 101: System Architecture:  Big Iron (NUMA)

Fixes from Microsoft Needed

Contention issues in parallel execution

Table scan, Nested Loops

Better plan cost model for scaling: back off on parallelism if the gain is negligible

Fix throughput degradation with multiple users running big DW queries

In Sybase and Oracle results, Throughput is close to Power or better

Page 102: System Architecture:  Big Iron (NUMA)

Test Systems

Page 103: System Architecture:  Big Iron (NUMA)

Test Systems

2-way quad-core Xeon 5430 2.66GHz: Windows Server 2008 R2, SQL 2008 R2

8-way dual-core Opteron 2.8GHz: Windows Server 2008 SP1, SQL 2008 SP1

8-way quad-core Opteron 2.7GHz (Barcelona): Windows Server 2008 R2, SQL 2008 SP1, build 2789. The 8-way systems were configured for AD – not good!

Page 104: System Architecture:  Big Iron (NUMA)

Test Methodology

Boot with all processors; run queries at MAXDOP 1, 2, 4, 8, etc.

Not the same as running on 1-way, 2-way, 4-way server

Interpret results with caution

Page 105: System Architecture:  Big Iron (NUMA)

References

Search Adam Machanic PASS

Page 106: System Architecture:  Big Iron (NUMA)

SQL Server Scaling on Big Iron (NUMA) Systems

Joe Chang, [email protected]

TPC-H

Page 107: System Architecture:  Big Iron (NUMA)

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools: ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring,

Profiler/Trace aggregation

Page 108: System Architecture:  Big Iron (NUMA)

TPC-H

Page 109: System Architecture:  Big Iron (NUMA)

TPC-H

DSS – 22 queries, geometric mean scoring; 60X range in plan cost, and a comparable actual range

Power – single stream: tests the ability to scale parallel execution plans

Throughput – multiple streams

Scale Factor 1 – Line item data is 1GB

875MB with DATE instead of DATETIME

Only single-column indexes allowed; ad hoc

Page 110: System Architecture:  Big Iron (NUMA)

Observed Scaling Behaviors

Good scaling, leveling off at high DOP

Perfect Scaling ???

Super Scaling

Negative Scaling especially at high DOP

Execution plan change: completely different behavior

Page 111: System Architecture:  Big Iron (NUMA)
Page 112: System Architecture:  Big Iron (NUMA)

TPC-H Published Results

Page 113: System Architecture:  Big Iron (NUMA)

TPC-H SF 100GB

Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. Westmere and Magny-Cours, both with 192GB memory, are very close.

[Chart: Power, Throughput, QphH – 2-way Xeon 5355; 5570 with HDD, SSD, Fusion; 5680 SSD; Opteron 6176]

Page 114: System Architecture:  Big Iron (NUMA)

TPC-H SF 300GB: 8x QC/6C and 4x12C Opteron

6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall. 4x12C 2.3GHz improved 17% over 8x6C 2.8GHz.

[Chart: Power, Throughput, QphH – Opt 8360 4C, Opt 8384 4C, Opt 8439 6C, Opt 6176 12C, Xeon 7560 8C]

Page 115: System Architecture:  Big Iron (NUMA)

TPC-H SF 1000

[Chart: Power, Throughput, QphH – Opt 8439 SQL Server, Opt 8439 Sybase, Superdome, Superdome 2]

Page 116: System Architecture:  Big Iron (NUMA)

TPC-H SF 3TB: X7460 & X7560

Nehalem-EX with 64 cores is better than 96 Core 2 cores.

[Chart: Power, Throughput, QphH – 16 x X7460, 8 x X7560, POWER6]

Page 117: System Architecture:  Big Iron (NUMA)

TPC-H SF 100GB, 300GB & 3TB

SF100 2-way: Westmere and Magny-Cours are very close; among the 2-way Xeon 5570 results all are close, with HDD best in throughput, SATA SSD best composite, and Fusion-IO best power.

SF300 8x QC/6C & 4x12C: 6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall; 4x12C 2.3GHz improved 17% over 8x6C 2.8GHz.

SF 3TB X7460 & X7560: Nehalem-EX with 64 cores is better than 96 Core 2 cores.

[Charts: Power, Throughput, QphH at SF100, SF300, and SF 3TB]

Page 118: System Architecture:  Big Iron (NUMA)

TPC-H Published Results

SQL Server excels in Power; limited by the geometric mean and anomalies.

Trails in Throughput: other DBMS get better throughput than power, while SQL Server throughput is below Power by a wide margin.

Speculation: SQL Server does not throttle back parallelism under load?

Page 119: System Architecture:  Big Iron (NUMA)

TPC-H SF100

Power     Throughput  QphH      Processors   Cores  SQL   GHz   Mem GB  SF
23,378.0  13,381.0    17,686.7  2 Xeon 5355  8      5sp2  2.66  64      100
67,712.9  38,019.1    50,738.4  2x5570 HDD   8      8sp1  2.93  144     100
70,048.5  37,749.1    51,422.4  2x5570 SSD   8      8sp1  2.93  144     100
72,110.5  36,190.8    51,085.6  5570 Fusion  8      8sp1  2.93  144     100
99,426.3  55,038.2    73,974.6  2 Xeon 5680  12     8r2   3.33  192     100
94,761.5  53,855.6    71,438.3  2 Opt 6176   24     8r2   2.3   192     100

Page 120: System Architecture:  Big Iron (NUMA)

TPC-H SF300

Power      Throughput  QphH       Processors   Cores  SQL   GHz   Mem GB  SF
25,206.4   13,283.8    18,298.5   4 Opt 8220   8      5rtm  2.8   128     300
67,287.4   41,526.4    52,860.2   8 Opt 8360   32     8rtm  2.5   256     300
75,161.2   44,271.9    57,684.7   8 Opt 8384   32     8rtm  2.7   256     300
109,067.1  76,869.0    91,558.2   8 Opt 8439   48     8sp1  2.8   256     300
129,198.3  89,547.7    107,561.2  4 Opt 6176   48     8r2   2.3   512     300
152,453.1  96,585.4    121,345.6  4 Xeon 7560  32     8r2   2.26  640     300

All of the above are HP results? Sun result, Opt 8384 sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5.

Page 121: System Architecture:  Big Iron (NUMA)

TPC-H 1TB

Power      Throughput  QphH       Processors    Cores  SQL    GHz   Mem GB  SF
95,789.1   69,367.6    81,367.6   8 Opt 8439    48     8R2?   2.8   512     1000
108,436.8  96,652.7    102,375.3  8 Opt 8439    48     ASE    2.8   384     1000
139,181.0  141,188.1   140,181.1  Itanium 9350  64     O11R2  1.73  512     1000

Page 122: System Architecture:  Big Iron (NUMA)

TPC-H 3TB

Power      Throughput  QphH       Processors    Cores  SQL     GHz   Mem GB  SF
120,254.8  87,841.4    102,254.8  16 Xeon 7460  96     8r2     2.66  1024    3000
185,297.7  142,685.6   162,601.7  8 Xeon 7560   64     8r2     2.26  512     3000
142,790.7  171,607.4   156,537.3  Itanium 9350  64     Sybase  1.73  512     1000
142,790.7  171,607.4   156,537.3  POWER6        64     Sybase  5.0   512     3000

Page 123: System Architecture:  Big Iron (NUMA)

TPC-H Published Results

Power      Throughput  QphH       Processors   Cores  SQL   GHz   Mem GB  SF
23,378     13,381      17,686.7   2 Xeon 5355  8      5sp2  2.66  64      100
72,110.5   36,190.8    51,085.6   2 Xeon 5570  8      8sp1  2.93  144     100
99,426.3   55,038.2    73,974.6   2 Xeon 5680  12     8r2   3.33  192     100
94,761.5   53,855.6    71,438.3   2 Opt 6176   24     8r2   2.3   192     100
25,206.4   13,283.8    18,298.5   4 Opt 8220   8      5rtm  2.8   128     300
67,287.4   41,526.4    52,860.2   8 Opt 8360   32     8rtm  2.5   256     300
75,161.2   44,271.9    57,684.7   8 Opt 8384   32     8rtm  2.7   256     300
109,067.1  76,869.0    91,558.2   8 Opt 8439   48     8sp1  2.8   256     300
129,198.3  89,547.7    107,561.2  4 Opt 6176   48     8r2   2.3   512     300
185,297.7  142,685.6   162,601.7  8 Xeon 7560  64     8r2   2.26  512     3000

Page 124: System Architecture:  Big Iron (NUMA)

SF100 2-way Big Queries (sec)

[Chart: query time in seconds, Q1, Q9, Q13, Q18, Q21 – 5570 HDD, 5570 SSD, 5570 FusionIO, 5680 SSD, 6176 SSD]

Xeon 5570 with SATA SSD is poor on Q9, reason unknown. Both Xeon 5680 and Opteron 6176 are big improvements over Xeon 5570.

Page 125: System Architecture:  Big Iron (NUMA)

SF100 Middle Queries

[Chart: query time in seconds, Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22 – same five systems]

Xeon 5570-HDD and 5680-SSD are poor on Q12, reason unknown. Opteron 6176 is poor on Q11.

Page 126: System Architecture:  Big Iron (NUMA)

SF100 Small Queries

[Chart: query time in seconds, Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20 – same five systems]

Xeon 5680 and Opteron are poor on Q20. Note the limited scaling on Q2 and Q17.

Page 127: System Architecture:  Big Iron (NUMA)
Page 128: System Architecture:  Big Iron (NUMA)

SF300 32+ Cores Big Queries

[Chart: query time in seconds, Q1, Q9, Q13, Q18, Q21 – 8x8360 QC 2M, 8x8384 QC 6M, 8x8439 6C, 4x6176 12C, 4x7560 8C]

Opteron 6176 is poor relative to 8439 on Q9 & Q13, with the same number of total cores.

Page 129: System Architecture:  Big Iron (NUMA)

SF300 Middle Queries

[Chart: query time in seconds, Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22 – same five systems]

Opteron 6176 is much better than 8439 on Q11 & Q19, worse on Q12.

Page 130: System Architecture:  Big Iron (NUMA)

SF300 Small Queries

[Chart: query time in seconds, Q2, Q4, Q6, Q14, Q15, Q17 – same five systems]

Opteron 6176 is much better on Q2, even with 8439 on the others.

Page 131: System Architecture:  Big Iron (NUMA)
Page 132: System Architecture:  Big Iron (NUMA)

SF1000

[Chart: relative query times, Q1-Q22]

Page 133: System Architecture:  Big Iron (NUMA)

SF1000

[Chart: query time in seconds, big queries Q1, Q9, Q13, Q18, Q21 – SQL Server versus Sybase]

Page 134: System Architecture:  Big Iron (NUMA)

SF1000

[Chart: query time in seconds, middle queries Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19 – SQL Server versus Sybase]

Page 135: System Architecture:  Big Iron (NUMA)

SF1000

[Chart: query time in seconds, small queries Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22 – SQL Server versus Sybase]

Page 136: System Architecture:  Big Iron (NUMA)

SF1000 Itanium – Superdome

[Chart: relative query times, Q1-Q22]

Page 137: System Architecture:  Big Iron (NUMA)
Page 138: System Architecture:  Big Iron (NUMA)

SF 3TB – 8x7560 versus 16x7460

[Chart: relative query times, Q1-Q22]

Broadly 50% faster overall: 5X+ (5.6X) on one query, slower on 2, comparable on 3.

Page 139: System Architecture:  Big Iron (NUMA)

64 Cores: 7560 Relative to POWER6

[Chart: relative query times, Q1-Q22]

Page 140: System Architecture:  Big Iron (NUMA)

[Chart: query time, big queries Q1, Q9, Q13, Q18, Q21 – Unisys 16x6, DL980 8x8, POWER6]

Page 141: System Architecture:  Big Iron (NUMA)

[Charts: query time, middle queries Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19 and small queries Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22 – Unisys 16x6, DL980 8x8, POWER6]

Page 142: System Architecture:  Big Iron (NUMA)
Page 143: System Architecture:  Big Iron (NUMA)

TPC-H Summary

Scaling is impressive on some SQL

Limited ability (value) in scaling the small queries

Anomalies, negative scaling

Page 144: System Architecture:  Big Iron (NUMA)

TPC-H Queries

Page 145: System Architecture:  Big Iron (NUMA)

Q1 Pricing Summary Report

Page 146: System Architecture:  Big Iron (NUMA)

Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables; second-lowest plan cost (after Q15).

Page 147: System Architecture:  Big Iron (NUMA)

Q3

Page 148: System Architecture:  Big Iron (NUMA)

Q6 Forecasting Revenue Change

Page 149: System Architecture:  Big Iron (NUMA)

Q7 Volume Shipping

Page 150: System Architecture:  Big Iron (NUMA)

Q8 National Market Share

Page 151: System Architecture:  Big Iron (NUMA)

Q9 Product Type Profit Measure

Page 152: System Architecture:  Big Iron (NUMA)

Q11 Important Stock Identification

Non-Parallel Parallel

Page 153: System Architecture:  Big Iron (NUMA)

Q12 Random IO?

Page 154: System Architecture:  Big Iron (NUMA)

Q13 – Why does Q13 have perfect scaling?

Page 155: System Architecture:  Big Iron (NUMA)

Q17 Small Quantity Order Revenue

Page 156: System Architecture:  Big Iron (NUMA)

Q18 Large Volume Customer

Non-Parallel

Parallel

Page 157: System Architecture:  Big Iron (NUMA)

Q19

Page 158: System Architecture:  Big Iron (NUMA)

Q20?

This query may get a poor execution plan.

Date functions are usually written against the Line Item date columns, which are of the “date” type. A CAST helps the DOP 1 plan, but the parallel plan comes out bad.
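A sketch of the two forms (hypothetical predicate; the deck does not show the exact rewrite):

-- Typical form: a date function over a literal, compared to the date column
SELECT COUNT(*)
FROM   dbo.LINEITEM
WHERE  L_SHIPDATE >= DATEADD(yy, -1, '1995-01-01');

-- With an explicit CAST to date, the DOP 1 plan improves,
-- but the parallel plan can still be bad
SELECT COUNT(*)
FROM   dbo.LINEITEM
WHERE  L_SHIPDATE >= CAST('1994-01-01' AS date);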

Page 159: System Architecture:  Big Iron (NUMA)

Q21 Suppliers Who Kept Orders Waiting

Note 3 references to Line Item

Page 160: System Architecture:  Big Iron (NUMA)

Q22

Page 161: System Architecture:  Big Iron (NUMA)
Page 163: System Architecture:  Big Iron (NUMA)

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools: ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring,

Profiler/Trace aggregation

Page 164: System Architecture:  Big Iron (NUMA)

TPC-H

Page 165: System Architecture:  Big Iron (NUMA)

TPC-H

DSS – 22 queries, geometric mean scoring; 60X range in plan cost, and a comparable actual range

Power – single stream: tests the ability to scale parallel execution plans

Throughput – multiple streams

Scale Factor 1 – Line item data is 1GB

875MB with DATE instead of DATETIME

Only single-column indexes allowed; ad hoc

Page 166: System Architecture:  Big Iron (NUMA)

SF 10 Test Studies

Not valid for publication

Auto-statistics enabled; excludes compile time

Big Queries – Line Item Scan

Super Scaling – Mission Impossible

Small Queries & High Parallelism

Other queries, negative scaling

Did not apply T2301 or disallow page locks

Page 167: System Architecture:  Big Iron (NUMA)
Page 168: System Architecture:  Big Iron (NUMA)

Big Q: Plan Cost vs Actual

[Charts: plan cost at 10GB and actual query time in seconds, Q1, Q9, Q13, Q18, Q21, at DOP 1-32]

Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%.

Plan cost says scaling is poor except for Q18; memory affects the hash IO onset. Plan cost is a poor indicator of true parallelism scaling: Q18 & Q21 > 3X Q1, Q9.

Page 169: System Architecture:  Big Iron (NUMA)

Big Query: Speed Up and CPU

[Charts: CPU time in seconds and speedup relative to DOP 1, Q1, Q9, Q13, Q18, Q21, at DOP 1-32]

Q13 has slightly better than perfect scaling? (The holy grail.) In general, excellent scaling to DOP 8-24, weak afterwards.

Page 170: System Architecture:  Big Iron (NUMA)

Super Scaling

Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged

CPU time = 100 sec, elapse time = 100 sec

What is the best case for DOP 2, assuming nearly zero Repartition Streams cost?

CPU time = 100 sec, elapsed time = 50?

Super Scaling: CPU time decreases going from a non-parallel to a parallel plan! (No, I have not started drinking, yet.)

Page 171: System Architecture:  Big Iron (NUMA)

Super Scaling

[Charts: CPU normalized to DOP 1 and speedup relative to DOP 1, Q7, Q8, Q11, Q21, Q22, at DOP 1-32]

CPU-sec goes down from DOP 1 to 2 and higher (typically 8): 3.5X speedup from DOP 1 to 2 (normalized to DOP 1).

Page 172: System Architecture:  Big Iron (NUMA)

CPU and Query Time in Seconds

[Charts: CPU time and query time, Q7, Q8, Q11, Q21, Q22, at DOP 1-32]

Page 173: System Architecture:  Big Iron (NUMA)

Super Scaling Summary

Most probable cause: the Bitmap operator in the parallel plan.

Bitmap filters are great. Question for Microsoft: can I use bitmap filters in OLTP systems with non-parallel plans?

Page 174: System Architecture:  Big Iron (NUMA)

Small Queries – Plan Cost vs Actual

Queries 3 and 16 have lower plan cost than Q17, but are not included.

[Charts: plan cost and query time, Q2, Q4, Q6, Q15, Q17, Q20, at DOP 1-32]

Q4, Q6, Q17: great scaling to DOP 4, then weak. Negative scaling also occurs.

Page 175: System Architecture:  Big Iron (NUMA)

Small Queries CPU & Speedup

What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; disproportionate CPU means negative scaling.

[Charts: CPU time and speedup, Q2, Q4, Q6, Q15, Q17, Q20, at DOP 1-32]

Query 2 goes negative at DOP 2; Q4 is good; Q6 gets speedup, but at a CPU premium; Q17 and Q20 go negative after DOP 8.

Page 176: System Architecture:  Big Iron (NUMA)

High Parallelism – Small Queries

Why? Almost no value.

TPC-H geometric mean scoring: small queries have as much impact as large; a linear sum would weight toward the large queries.

OLTP with 32, 64+ cores: parallelism is good if super-scaling; the default max degree of parallelism of 0 is seriously bad news, especially for small queries.

Increase the cost threshold for parallelism? (A sketch below.)

Sometimes you do get lucky
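Raising the threshold so that small queries stay serial is one setting (the default is 5; only plans costed above the threshold are considered for parallelism):

EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25; RECONFIGURE;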

Page 177: System Architecture:  Big Iron (NUMA)

Queries That Go Negative

[Charts: query time and “speedup”, Q17, Q19, Q20, Q22, at DOP 1-32]

Page 178: System Architecture:  Big Iron (NUMA)

CPU

[Chart: CPU time, Q17, Q19, Q20, Q22, at DOP 1-32]

Page 179: System Architecture:  Big Iron (NUMA)

Other Queries – CPU & Speedup

[Charts: CPU time and speedup, Q3, Q5, Q10, Q12, Q14, Q16, at DOP 1-32]

Q3 has problems beyond DOP 2.

Page 180: System Architecture:  Big Iron (NUMA)

Other – Query Time in Seconds

[Chart: query time, Q3, Q5, Q10, Q12, Q14, Q16, at DOP 1-32]

Page 181: System Architecture:  Big Iron (NUMA)

Scaling Summary

Some queries show excellent scaling

Super-scaling, better than 2X

Sharp CPU jump on last DOP doubling

Need a strategy to cap DOP, to limit negative scaling

Especially for some smaller queries?

Other anomalies

Page 182: System Architecture:  Big Iron (NUMA)
Page 183: System Architecture:  Big Iron (NUMA)

Compression

PAGE

Page 184: System Architecture:  Big Iron (NUMA)

Compression Overhead – Overall

[Charts: query time and CPU time, compressed relative to uncompressed, at DOP 1-32]

40% overhead for compression at low DOP, 10% overhead at max DOP???

Page 185: System Architecture:  Big Iron (NUMA)

[Charts: per-query (Q1-Q22) query time and CPU time, compressed relative to uncompressed, at DOP 1-32]

Page 186: System Architecture:  Big Iron (NUMA)

Compressed Table

LINEITEM (real data may be more compressible):
Uncompressed: 8,749,760KB, average 149 bytes per row
Compressed: 4,819,592KB, average 82 bytes per row
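The compressed table comes from a PAGE-compression rebuild; a sketch, including the estimate that can be run first (SQL Server 2008+):

-- Estimate the savings before committing to a rebuild
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo', @object_name = 'LINEITEM',
     @index_id = NULL, @partition_number = NULL,
     @data_compression = 'PAGE';

-- Rebuild the table with page compression
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);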

Page 187: System Architecture:  Big Iron (NUMA)

Partitioning

Orders and Line Item on Order Key

Page 188: System Architecture:  Big Iron (NUMA)

Partitioning Impact – Overall

[Charts: query time and CPU time, partitioned relative to not partitioned, at DOP 1-32]

Page 189: System Architecture:  Big Iron (NUMA)

[Charts: per-query (Q1-Q22) query time and CPU time, partitioned relative to not partitioned, at DOP 1-32]

Page 190: System Architecture:  Big Iron (NUMA)

Plan for Partitioned Tables

Page 191: System Architecture:  Big Iron (NUMA)
Page 192: System Architecture:  Big Iron (NUMA)

Scaling DW Summary

Massive IO bandwidth

Parallel options for data load, updates etc

Investigate Parallel Execution Plans

Scaling from DOP 1, 2, 4, 8, 16, 32 etc

Scaling with and w/o HT

Strategy for limiting DOP with multiple users (a Resource Governor sketch follows)
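One possible strategy on SQL Server 2008+, sketched with hypothetical names ('wgDW', 'dw_user', fnClassify are illustrative): route the DW login into a workload group with a MAX_DOP cap. The classifier function must be created in master.

-- Workload group capping DOP for big DW queries
CREATE WORKLOAD GROUP wgDW WITH (MAX_DOP = 8);
GO
-- Classifier routes the hypothetical 'dw_user' login to the capped group
CREATE FUNCTION dbo.fnClassify() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = 'dw_user'
        RETURN 'wgDW';
    RETURN 'default';
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassify);
ALTER RESOURCE GOVERNOR RECONFIGURE;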

Page 193: System Architecture:  Big Iron (NUMA)

Fixes from Microsoft Needed

Contention issues in parallel execution

Table scan, Nested Loops

Better plan cost model for scaling: back off on parallelism if the gain is negligible

Fix throughput degradation with multiple users running big DW queries

For Sybase and Oracle, the TPC-H Throughput result is close to the Power result, or better

Page 194: System Architecture:  Big Iron (NUMA)

Query Plans

Page 195: System Architecture:  Big Iron (NUMA)

Big Queries

Page 196: System Architecture:  Big Iron (NUMA)

Q1 Pricing Summary Report

Page 197: System Architecture:  Big Iron (NUMA)

Q1 Plan

Non-Parallel

Parallel

Parallel plan cost is 28% lower than the non-parallel (scalar) plan; IO is 70% of the cost and gets no parallel plan cost reduction
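A possible reading (my inference; the slide does not show the arithmetic): if IO is ~70% of plan cost and the cost model gives IO no parallel reduction, only the ~30% CPU portion divides by DOP, so the parallel plan cost approaches 0.70 + 0.30/DOP ≈ 0.72 of the scalar cost at high DOP, consistent with the 28% figure.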

Page 198: System Architecture:  Big Iron (NUMA)
Page 199: System Architecture:  Big Iron (NUMA)

Q9 Product Type Profit Measure

IO from 4 tables contributes 58% of plan cost; the parallel plan is 39% lower

Non-Parallel Parallel

Page 200: System Architecture:  Big Iron (NUMA)

Q9 Non-Parallel Plan

Table/Index Scans comprise 64%; IO from 4 tables contributes 58% of plan cost

Join sequence: Supplier, (Part, PartSupp), Line Item, Orders
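For experimenting with join sequences like this one, a hedged sketch (TPC-H names; the presentation does not show this technique): OPTION (FORCE ORDER) makes the optimizer honor the written join order.

SELECT COUNT(*)
FROM SUPPLIER S
JOIN PARTSUPP PS ON PS.PS_SUPPKEY = S.S_SUPPKEY
JOIN PART P      ON P.P_PARTKEY   = PS.PS_PARTKEY
JOIN LINEITEM L  ON L.L_PARTKEY   = PS.PS_PARTKEY
                AND L.L_SUPPKEY   = PS.PS_SUPPKEY
JOIN ORDERS O    ON O.O_ORDERKEY  = L.L_ORDERKEY
OPTION (FORCE ORDER);  -- join order is taken as written above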

Page 201: System Architecture:  Big Iron (NUMA)

Q9 Parallel Plan

Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders

Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp

Page 202: System Architecture:  Big Iron (NUMA)

Q9 Non-Parallel Plan details

Table Scans comprise 64%; IO from 4 tables contributes 58% of plan cost

Page 203: System Architecture:  Big Iron (NUMA)

Q9 Parallel reg vs Partitioned

Page 204: System Architecture:  Big Iron (NUMA)
Page 205: System Architecture:  Big Iron (NUMA)

Q13: Why does Q13 have perfect scaling?

Page 206: System Architecture:  Big Iron (NUMA)
Page 207: System Architecture:  Big Iron (NUMA)

Q18 Large Volume Customer

Non-Parallel

Parallel

Page 208: System Architecture:  Big Iron (NUMA)

Q18 Graphical Plan

Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan

Page 209: System Architecture:  Big Iron (NUMA)

Q18 Plan Details

Non-Parallel

Parallel

Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU

At DOP 16/32 the estimated hash size falls below the IO cost threshold, and CPU cost is reduced by >10X

Page 210: System Architecture:  Big Iron (NUMA)
Page 211: System Architecture:  Big Iron (NUMA)

Q21 Suppliers Who Kept Orders Waiting

Note 3 references to Line Item

Non-Parallel Parallel

Page 212: System Architecture:  Big Iron (NUMA)

Q21 Non-Parallel Plan

[Plan diagram callouts: hash joins H1, H2, H3]

Page 213: System Architecture:  Big Iron (NUMA)

Q21 Parallel

Page 214: System Architecture:  Big Iron (NUMA)

Q21

3 full Line Item clustered index scans

Plan cost is approx 3X that of Q1, which has a single “scan”

Page 215: System Architecture:  Big Iron (NUMA)

Super Scaling

Page 216: System Architecture:  Big Iron (NUMA)

Q7 Volume Shipping

Non-Parallel Parallel

Page 217: System Architecture:  Big Iron (NUMA)

Q7 Non-Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item

Page 218: System Architecture:  Big Iron (NUMA)

Q7 Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item

Page 219: System Architecture:  Big Iron (NUMA)
Page 220: System Architecture:  Big Iron (NUMA)

Q8 National Market Share

Non-Parallel Parallel

Page 221: System Architecture:  Big Iron (NUMA)

Q8 Non-Parallel Plan

Join sequence: Part, Line Item, Orders, Customer

Page 222: System Architecture:  Big Iron (NUMA)

Q8 Parallel Plan

Join sequence: Part, Line Item, Orders, Customer

Page 223: System Architecture:  Big Iron (NUMA)
Page 224: System Architecture:  Big Iron (NUMA)

Q11 Important Stock Identification

Non-Parallel Parallel

Page 225: System Architecture:  Big Iron (NUMA)

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

Page 226: System Architecture:  Big Iron (NUMA)

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

Page 227: System Architecture:  Big Iron (NUMA)

Small Queries

Page 228: System Architecture:  Big Iron (NUMA)

Query 2 Minimum Cost Supplier

Wordy, but it only touches the small tables; second lowest plan cost (Q15 is lowest)

Page 229: System Architecture:  Big Iron (NUMA)

Q2

Clustered Index Scans on Part and PartSupp have the highest cost (48% + 42%)

Page 230: System Architecture:  Big Iron (NUMA)

Q2

PartSupp is now Index Scan + Key Lookup

Page 231: System Architecture:  Big Iron (NUMA)
Page 232: System Architecture:  Big Iron (NUMA)

Q6 Forecasting Revenue Change

Not sure why this blows up CPU; scalar values are pre-computed and pre-converted

Page 233: System Architecture:  Big Iron (NUMA)
Page 234: System Architecture:  Big Iron (NUMA)

Q20?

This query may get a poor execution plan

Date functions are usually written in the form sketched below, because the Line Item date columns are “date” type.

CAST helps the DOP 1 plan, but gives a bad plan for parallel.
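The exact expression from the slide was not captured; a hedged reconstruction of the usual pattern, with an illustrative date range:

-- Typical form: string literals implicitly converted against the date column
SELECT SUM(L_QUANTITY)
FROM LINEITEM
WHERE L_SHIPDATE >= '1997-01-01'
  AND L_SHIPDATE <  DATEADD(yy, 1, '1997-01-01');

-- With explicit CAST to date: helps the DOP 1 plan per the slide,
-- but yields a bad parallel plan
SELECT SUM(L_QUANTITY)
FROM LINEITEM
WHERE L_SHIPDATE >= CAST('1997-01-01' AS date)
  AND L_SHIPDATE <  DATEADD(yy, 1, CAST('1997-01-01' AS date));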

Page 235: System Architecture:  Big Iron (NUMA)

Q20

Page 236: System Architecture:  Big Iron (NUMA)

Q20

Page 237: System Architecture:  Big Iron (NUMA)

Q20 alternate - parallel

Statistics estimation error here

Penalty for mistake applied here

Page 238: System Architecture:  Big Iron (NUMA)

Other Queries

Page 239: System Architecture:  Big Iron (NUMA)

Q3

Page 240: System Architecture:  Big Iron (NUMA)

Q3

Page 241: System Architecture:  Big Iron (NUMA)
Page 242: System Architecture:  Big Iron (NUMA)

Q12 Random IO?

Will this generate random IO?

Page 243: System Architecture:  Big Iron (NUMA)

Query 12 Plans

Non-Parallel

Parallel

Page 244: System Architecture:  Big Iron (NUMA)

Queries that go Negative

Page 245: System Architecture:  Big Iron (NUMA)

Q17 Small Quantity Order Revenue

Page 246: System Architecture:  Big Iron (NUMA)

Q17

Table Spool is a concern

Page 247: System Architecture:  Big Iron (NUMA)

Q17

the usual suspects

Page 248: System Architecture:  Big Iron (NUMA)
Page 249: System Architecture:  Big Iron (NUMA)

Q19

Page 250: System Architecture:  Big Iron (NUMA)

Q19

Page 251: System Architecture:  Big Iron (NUMA)

Q22

Page 252: System Architecture:  Big Iron (NUMA)

Q22

Page 253: System Architecture:  Big Iron (NUMA)

[Two charts for Q1–Q22 and Total at DOP 2–32: speedup from DOP 1 query time; CPU relative to DOP 1]

Page 254: System Architecture:  Big Iron (NUMA)