System Architecture: Big Iron (NUMA)
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session: “Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session: “The Data Loading Performance Guide”
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R. Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the Field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing
[Diagram: four CPUs on a shared system bus to the MCH (memory controller hub), with ICH and PXH I/O hubs]
In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image, and each processor can access all memory.
SMP makes no reference to memory architecture?
Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which in turn is not to be confused with AMD HyperTransport (also HT).
Non-Uniform Memory Access
[Diagram: four nodes, each with a memory controller and four CPUs on a shared bus or crossbar, linked to each other by node controllers]
NUMA architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory + node controllers
2) Nodes connected by shared bus, crossbar, or ring
Traditionally, 8-way+ systems
Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware
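Windows and SQL Server expose the node layout at runtime. A minimal sketch of checking what the engine sees, assuming SQL Server 2008 or later (not from the original deck):

-- CPU/scheduler layout per NUMA node as seen by SQL Server
SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes;

-- Memory reserved per NUMA node
SELECT memory_node_id, virtual_address_space_reserved_kb
FROM sys.dm_os_memory_nodes;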
AMD Opteron
[Diagram: four Opteron processors linked by HyperTransport, with HT2100/HT1100 I/O hubs]
Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns? Actual behavior is more complicated because of snooping (cache coherency traffic).
Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior! For practical purposes it behaves like an SMP system.
8-way Opteron System Architecture
The Opteron processor (prior to Magny-Cours) has 3 HyperTransport links. Note that in the 8-way layout the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT for IO; CPUs 1 & 7 require 3 hops to reach each other.
[Diagram: 8-way Opteron topology, CPU0–CPU7]
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture
Intel Nehalem generation processors have QuickPath Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4, so a glue-less 8-way is possible.
NUMA Local and Remote Memory
Local memory is closer than remote
Physical access time is shorter
What is the actual access time, with the cache coherency requirement?
HT Assist – Probe Filter
Part of the L3 cache is used as a directory cache
ZDNET
Source Snoop Coherency
From HP PREMA Architecture whitepaper:
All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980 G7
From HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizing inter-processor coherency communication and reducing latency to local memory. (*Only cache tags, not cache data.)
HP ProLiant DL980 Architecture
Node controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000
Agent – Remote Ownership Tag + L4 cache tags
64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less)
Connect two 4-socket Nodes to make 8-way system
OS Memory Models
[Diagram: two 4-node memory layouts showing how sequential addresses map to nodes – interleaved across all nodes versus contiguous within each node]
SUMA (Sufficiently Uniform Memory Access): memory is interleaved across nodes
NUMA: memory is interleaved within a node first; the memory stripe is then spanned across nodes
Windows OS NUMA Support
Memory models:
SUMA – Sufficiently Uniform Memory Access
NUMA – separate memory pools by node
[Diagram: logical processor/memory numbering for 4 nodes under the two models – under SUMA, memory is striped across NUMA nodes; under NUMA, each node's memory is contiguous]
Memory Model Example: 4 Nodes
SUMA memory model: memory access is uniformly distributed – 25% of memory accesses local, 75% remote
NUMA memory model: goal is better than 25% local node access
True local access time also needs to be faster; cache coherency may increase local access time
Architecting for NUMA
Web determines port for each user by group (but should not be by geography!)
Affinitize port to NUMA node
Each node access localized data (partition?)
OS may allocate substantial chunk from Node 0?
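SQL Server supports mapping TCP ports to NUMA nodes by appending a node affinity mask to the port number in the TCP/IP network configuration. A hedged sketch of the idea, using the illustrative ports from the diagram below (the masks are hex bitmaps of nodes):

-- SQL Server Configuration Manager > Network Configuration > TCP/IP:
--   TCP Port: 1440[0x1],1441[0x2],1442[0x4],...,1447[0x80]
-- Connections to each port are then served by schedulers on that node;
-- the app tier connects each user group to "its" port, e.g.
-- Server=appdb,1440 for the group assigned to Node 0.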
End-to-end affinity:
[Diagram: user groups North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW map to TCP ports 1440–1447, to NUMA nodes 0–7 (CPU sets 0-0/0-1 … 7-0/7-1), and to table/memory partitions NE, MidA, SE, Cen, Tex, Mnt, Cal, PNW]
App Server → TCP Port → CPU → Memory → Table
HP-UX LORA
HP-UX – Not Microsoft Windows
Locality-Optimized Resource Alignment
12.5% Interleaved Memory
87.5% NUMA node Local Memory
System Tech Specs
Memory pricing: 8GB DIMM $400 ea (18 x 8GB = 144GB, $7,200; 64 x 8GB = 512GB, $26K); 16GB DIMM $1,100 ea (12 x 16GB = 192GB, $13K; 64 x 16GB = 1TB, $70K)

Processors        Cores/socket  Total cores  DIMMs  PCI-E G2      Max memory  Base
2 x Xeon X56x0         6            12         18   5 x8+, 1 x4   192G*       $7K
4 x Opteron 6100      12            48         32   5 x8, 1 x4    512G        $14K
4 x Xeon X7560         8            32         64   4 x8, 6 x4†   1TB         $30K
8 x Xeon X7560         8            64        128   9 x8, 5 x4‡   2TB         $100K

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB. † Dell R910 and HP DL580G7 have different PCI-E. ‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots.
Software Stack
Operating System
Windows Server 2003 RTM, SP1 – network limitations (default); Scalable Networking Pack (KB 912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only)
Breaks 64 logical processor limit
NUMA IO enhancements?
Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server; don't try to do DW on SQL Server 2000
Impacts OLTP
Search: MSI-X
SQL Server Version
SQL Server 2000 – serious disk IO limitations (1GB/sec?)
Problematic parallel execution plans
SQL Server 2005 (fixed most S2K problems)
64-bit on X64 (Opteron and Xeon)
SP2 – performance improvement 10%(?)
SQL Server 2008 & R2 – compression, filtered indexes, etc.
Star join, Parallel query to partitioned table
Configuration
SQL Server startup parameters: -E, trace flags 834, 836, 2301
Auto_Date_Correlation: Order date < A, Ship date > A
Implied: Order date > A−C, Ship date < A+C
Port Affinity – mostly OLTP
Dedicated processor for the log writer?
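A hedged sketch of how these settings are applied (the database name is illustrative; check the documentation for what each trace flag does before enabling it):

-- Startup parameters, set via SQL Server Configuration Manager:
--   -E      allocate more extents per file
--   -T834   large-page allocations for the buffer pool
--   -T2301  advanced decision-support optimizations
--   -T836   see the relevant KB article
-- Date correlation is a per-database option:
ALTER DATABASE MyDW SET DATE_CORRELATION_OPTIMIZATION ON;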
Storage Performance for Data Warehousing
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Storage
Organization Structure
In many large IT departments, DB and storage are in separate groups
Storage usually has its own objectives: bring all storage into one big system under full management (read: control)
Storage as a Service, in the Cloud: one size fits all needs
Usually have zero DB knowledge
Of course we do high bandwidth, 600MB/sec good enough for you?
Data Warehouse Storage
OLTP – Throughput with Fast Response
DW – flood the queues for maximum throughput
Do not use shared storage for a data warehouse! Storage system vendors like to give the impression that the SAN is a magical, immensely powerful box that can meet all your needs: just tell us how much capacity you need and don't worry about anything else. My advice: stay away from shared storage controlled by a different team.
Nominal and Net Bandwidth
PCI-E Gen 2 – 5 Gbit/s signaling (8b/10b encoded): x8 = 5GB/s nominal, 4GB/s net; x4 = 2GB/s net
SAS 6Gbit/s x4 port: 3GB/s nominal, ~2.2GB/s net?
Fibre Channel 8 Gbit/s nominal: ~780MB/s point-to-point, ~680MB/s from host through SAN to back-end loop
SAS RAID controller, x8 PCI-E G2, 2 x4 6G ports: ~2.8GB/s (depends on the controller, and will change!)
Storage – SAS Direct-Attach
Many Fat Pipes
Very Many Disks
Balance by pipe bandwidth
Don’t forget fat network pipes
Option A: 24-disks in one enclosure for each x4 SAS port. Two x4 SAS ports per controller
Option B: Split enclosure over 2 x4 SAS ports, 1 controller
[Diagram: four RAID controllers in PCI-E x8 slots, each with two x4 SAS ports fanning out to 24-disk enclosures; two PCI-E x4 slots hold dual 10GbE NICs]
Storage – FC/SAN
PCI-E x8 Gen 2 Slot with quad-port 8Gb FC
If 8Gb quad-port is not supported, consider system with many x4 slots, or consider SAS!
SAN systems typically offer 3.5in 15-disk enclosures. Difficult to get high spindle count with density.
1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
[Diagram: dual- and quad-port 8Gb FC HBAs in PCI-E x8 and x4 slots, many 8Gb FC links to SAN disk enclosures; dual 10GbE NICs in PCI-E x4 slots]
Storage – SSD/HDD Hybrid
Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID1 pairs for logs
Storage enclosures typically have 12 disks per channel and can only support the bandwidth of a few SSDs. Use the remaining bays for extra HDD storage; there is no point expending valuable SSD space on backups and flat files.
No RAID w/SSD?
[Diagram: four SAS controllers in PCI-E x8 slots, each x4 SAS port driving enclosures that mix a few SSDs per channel with HDDs; dual 10GbE NICs in PCI-E x4 slots]
SSD
Current: mostly 3Gbps SAS/SATA SDD
Some 6Gbps SATA SSD
Fusion IO – direct PCI-E Gen2 interface
320GB-1.2TB capacity, 200K IOPS, 1.5GB/s
No RAID? An HDD is fundamentally a single point of failure; an SSD could be built with redundant components.
HP reported problems with SSD on RAID controllers; Fujitsu did not?
Big DW Storage – iSCSI
Are you nuts?
Storage Configuration - Arrays
Shown: two 12-disk arrays per 24-disk enclosure
Options: between 6-16 disks per array
SAN systems may recommend R10 4+4 or R5 7+1
Very many spindles; a comment on MetaLUN
Data Consumption Rate: Xeon
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)
Data consumption rate is much higher for current generation Nehalem and Westmere processors than Core 2 referenced in Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.
Processors    Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz   Mem GB  SF    Generation
2 Xeon 5355        8         85.4   5sp2    1,165.5      145.7       2.66    64    100   Conroe
2 Xeon 5570        8         42.2   8sp1    2,073.5      259.2       2.93   144    100   Nehalem
2 Xeon 5680       12         21.0   8r2     4,166.7      347.2       3.33   192    100   Westmere
4 Xeon 7560       32         37.2   8r2     7,056.5      220.5       2.26   640    300   Neh.-EX
8 Xeon 7560       64        183.8   8r2    14,282        223.2       2.26   512   3000   Neh.-EX
Data Consumption Rate: Opteron
Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8 for Istanbul) – or is this Win/SQL 2K8 R2?
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)
Processors   Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz  Mem GB  SF    Generation
4 Opt 8220        8        309.7   5rtm     868.7       121.1       2.8   128    300
8 Opt 8360       32         91.4   8rtm   2,872.0        89.7       2.5   256    300   Barcelona
8 Opt 8384       32         72.5   8rtm   3,620.7       113.2       2.7   256    300   Shanghai
8 Opt 8439       48         49.0   8sp1   5,357.1       111.6       2.8   256    300   Istanbul
8 Opt 8439       48        166.9   8rtm   5,242.7       109.2       2.8   512   1000   Istanbul
2 Opt 6176       24         20.2   8r2    4,331.7       180.5       2.3   192    100   Magny-C
4 Opt 6176       48         31.8   8r2    8,254.7       172.0       2.3   512    300   Magny-C
Data Consumption Rate
TPC-H Query 1 Line Item scan; SF1 = 1GB (875MB on SQL 2008 with DATE)

Processors   Total cores  Q1 sec  SQL   Total MB/s  MB/s per core  GHz   Mem GB  SF    Generation
2 Xeon 5355       8         85.4   5sp2   1,165.5       145.7       2.66    64    100
2 Xeon 5570       8         42.2   8sp1   2,073.5       259.2       2.93   144    100
2 Xeon 5680      12         21.0   8r2    4,166.7       347.2       3.33   192    100
2 Opt 6176       24         20.2   8r2    4,331.7       180.5       2.3    192    100   Magny-C
4 Opt 8220        8        309.7   5rtm     868.7       121.1       2.8    128    300
8 Opt 8360       32         91.4   8rtm   2,872.0        89.7       2.5    256    300   Barcelona
8 Opt 8384       32         72.5   8rtm   3,620.7       113.2       2.7    256    300   Shanghai
8 Opt 8439       48         49.0   8sp1   5,357.1       111.6       2.8    256    300   Istanbul
4 Opt 6176       48         31.8   8r2    8,254.7       172.0       2.3    512    300   Magny-C
8 Xeon 7560      64        183.8   8r2   14,282         223.2       2.26   512   3000
Storage Targets

Processors    Total cores  PCI-E x8-x4  SAS HBAs  Units/Disks  Actual BW  BW/core MB/s  Target MB/s  Units/Disks (target)
2 Xeon X5680      12          5 - 1        2         2 - 48      5 GB/s       350          4,200           4 - 96
4 Opt 6176        48          5 - 1        4         4 - 96     10 GB/s       175          8,400           8 - 192
4 Xeon X7560      32          6 - 4        6         6 - 144    15 GB/s       250          8,000          12 - 288
8 Xeon X7560      64          9 - 5       11†       10 - 240    26 GB/s       225         14,400          20 - 480

† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks
24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice
2U disk enclosure, 24 x 73GB 15K 2.5in disks: $14K ($600 per disk)
Think: Shortest path to metal (iron-oxide)
Your Storage and the Optimizer
Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8); HDD 400 IOPS per disk (big-query key lookup or loop join at high queue depth, short-stroked, possibly skip-seek); SSD 35,000 IOPS
Model      Disks  Sequential IOPS  BW (KB/s)   “Random” IOPS  Sequential:Random IO ratio
Optimizer    -         1,350           10,800         320            4.22
SAS 2x4     24       350,000        2,800,000       9,600           36.5
SAS 2x4     48       350,000        2,800,000      19,200           18.2
FC 4G       30        45,000          360,000      12,000            3.75
SSD          8       350,000        2,800,000     280,000            1.25
The SQL Server Query Optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18–36 ratio; 30 disks per 4Gb FC port roughly matches the optimizer model, while SSD is skewed in the other direction.
Data Consumption Rates
[Chart: TPC-H SF100, Queries 1, 9, 18, 21 – Xeon 5355 (5sp2), 5570 (8sp1), 5680 (8R2), Opteron 6176 (8R2)]
[Chart: TPC-H SF300, Queries 1, 9, 18, 21 – Opteron DC 2.8GHz (5rtm), QC 2.5GHz (8rtm), QC 2.7GHz (8rtm), 6C 2.8GHz (8sp1), 12C 2.3GHz (8R2), Xeon 7560 (8R2, 640GB)]
Fast Track Reference Architecture
My complaints:
Several expensive SAN systems (11 disks each), each of which must be configured independently; $1,500-2,000 amortized per disk
Too many 2-disk arrays, 2 LUNs per array, too many data files
Build indexes with MAXDOP 1 – is this brain dead?
Designed around 100MB/sec per disk – not all DW is single-scan or sequential
Scripting?
Fragmentation
Weak storage system: 1) fragmentation could degrade IO performance; 2) defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.
Powerful storage system: 3) fragmentation has very little impact; 4) defragmenting has mild impact and completes within the night-time window.
What is the correct conclusion?
[Diagram: layers – Table, File, Partition, LUN, Disk]
Operating System View of Storage
Operating System Disk View
Controller 1 Port 0 → Disk 2 (Basic, 396GB, Online); Port 1 → Disk 3 (Basic, 396GB, Online)
Controller 2 Port 0 → Disk 4 (Basic, 396GB, Online); Port 1 → Disk 5 (Basic, 396GB, Online)
Controller 3 Port 0 → Disk 6 (Basic, 396GB, Online); Port 1 → Disk 7 (Basic, 396GB, Online)
Additional disks not shown, Disk 0 is boot drive, 1 – install source?
File Layout
Each data disk (Disks 2–7) carries the same four partitions, with one file per disk in each group:
Partition 0 – file group for the big table (Files 1–6)
Partition 1 – file group for all other (small) tables (Files 1–6)
Partition 2 – tempdb (Files 1–6)
Partition 4 – backup and load (Files 1–6)
Each File Group is distributed across all data disks
Log disks not shown; tempdb shares a common pool with data
File Groups and Files
Dedicated File Group for largest table
Never defragment
One file group for all other regular tables
Load file group? Rebuild indexes to a different file group
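A minimal T-SQL sketch of this layout (database name, paths, and file counts are illustrative; per the File Layout slide, each filegroup gets one file on every data disk):

CREATE DATABASE DW
ON PRIMARY (NAME = dw_sys,  FILENAME = 'C:\sql\dw_sys.mdf'),
FILEGROUP BigTableFG    -- dedicated to the largest table, never defragment
  (NAME = big1,   FILENAME = 'D:\dw\big1.ndf'),
  (NAME = big2,   FILENAME = 'E:\dw\big2.ndf'),   -- ...one file per data disk
FILEGROUP SmallTablesFG -- all other regular tables
  (NAME = small1, FILENAME = 'D:\dw\small1.ndf'),
  (NAME = small2, FILENAME = 'E:\dw\small2.ndf')
LOG ON (NAME = dw_log,  FILENAME = 'L:\log\dw_log.ldf');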
Partitioning - Pitfalls
Common Partitioning Strategy
Partition Scheme maps partitions to File Groups
What happens in a table scan? Read first from Partition 1, then 2, then 3, …?
SQL 2008 hotfix to read from each partition in parallel? What if partitions have disparate sizes?
[Diagram: Table Partitions 1–6 map via File Groups 1–6 to Disks 2–7]
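A hedged sketch of the common strategy shown above (boundary values and names are illustrative):

CREATE PARTITION FUNCTION pfOrderKey (int)
AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000, 4000000, 5000000);

CREATE PARTITION SCHEME psOrderKey
AS PARTITION pfOrderKey TO (FG1, FG2, FG3, FG4, FG5, FG6);
-- A table created ON psOrderKey(OrderKey) puts partition N in filegroup FGN,
-- so a full scan touches the filegroups (disks) one partition at a time.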
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
So you bought a 64+ core box
Learn all about parallel execution now – all guns (cores) blazing
Negative scaling – yes, this can happen; how will you know?
Super-scaling – no, I have not been smoking pot
High degree of parallelism & small SQL – anomalies, execution plan changes, etc.
Compression – how much CPU do I pay for this?
Partitioning – great management tool, what else?
Parallel Execution Plans
Reference: Adam Machanic PASS
Execution Plan Quickie
Cost is duration in seconds on some reference platform.
IO cost for a scan: 1 = 10,800KB/s, so 810 implies 8,748,000KB.
IO in a nested loops join: 1 = 320/s, in multiples of 0.003125.
F4
Estimated Execution Plan
I/O and CPU Cost components
Index + Key Lookup - Scan
(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)
Actual CPU time (data in memory): Key Lookup 1919, 1919; Scan 8736, 8727
1,093,729 pages/1350 = 810.17 (8,748MB)
True cross-over: approx 1,400,000 rows (1 row per page)
Index + Key Lookup - Scan
8,748,000KB / 8 / 1350 = 810
(817 − 280326 × 0.0001581) / 0.003125 = 247,259 (88%)
Actual CPU time: Key Lookup 2138, 321; Scan 18622, 658
Actual Execution Plan
Note Actual Number of Rows, Rebinds, Rewinds
Row Count and Executions
For Loop Join inner source and Key Lookup, Actual Num Rows = Num of Exec × Num of Rows
Parallel Plans
Parallelism Operations
Distribute Streams – non-parallel source, parallel destination
Repartition Streams – parallel source and destination
Gather Streams – destination is non-parallel
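An easy way to watch these operators come and go is to run the same query at different MAXDOP and compare the plans (a sketch; the columns are from the TPC-H schema used throughout this deck):

SELECT L_RETURNFLAG, SUM(L_EXTENDEDPRICE)
FROM LINEITEM
GROUP BY L_RETURNFLAG
OPTION (MAXDOP 1);   -- serial plan: no parallelism operators

-- The same query with OPTION (MAXDOP 8) adds Repartition Streams
-- (to redistribute rows for the parallel aggregate) and Gather Streams.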
Parallel Execution Plans
Note: gold circle with double arrow, and parallelism operations
Parallel Scan (and Index Seek)
DOP 1 DOP 2
DOP 4 DOP 8
IO cost is the same; CPU cost is reduced by the degree of parallelism (2X, 4X, 8X), except there is no further reduction at DOP 16.
IO contributes most of the cost!
Parallel Scan 2
DOP 16
Hash Match Aggregate
CPU cost only reduces by 2X
Parallel Scan
IO Cost is the same
CPU cost reduced in proportion to degree of parallelism, last 2X excluded?
On a weak storage system, a single thread can saturate the IO channel; additional threads will not increase IO (or reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option.
The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may inhibit generating a favorable plan, i.e., the reduction is not sufficient to offset the contribution from the Parallelism operations.
A parallel execution plan is more likely on larger systems (-P to fake it?)
Actual Execution Plan - Parallel
More Parallel Plan Details
Parallel Plan - Actual
Parallelism – Hash Joins
Hash Join Cost
DOP 1 DOP 2
DOP 8
DOP 4
Search: Understanding Hash Joins – for in-memory, grace, and recursive
Hash Join Cost
CPU Cost is linear with number of rows, outer and inner source
See BOL on hash joins for the in-memory, grace, and recursive variants. IO cost is zero for small intermediate data sizes; beyond a set point (proportional to server memory?) IO is proportional to the excess data beyond the in-memory limit. Parallel plan: memory allocation is per thread!
Summary: hash join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. It does not reflect real cost?
Parallelism Repartition Streams
DOP 2 DOP 4 DOP 8
Bitmap
BOL, Optimizing Data Warehouse Query Performance Through Bitmap Filtering: “A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.”
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
Parallel Execution Plan Summary
Queries with high IO cost may show little plan cost reduction on parallel execution
Plans with high portion hash or sort cost show large parallel plan cost reduction
Parallel plans may be inhibited by high row count in Parallelism Repartition Streams
Watch out for (Parallel) Merge Joins!
Scaling Theory
Parallel Execution Strategy
Partition work into little pieces – ensures each thread has the same amount
High overhead to coordinate
Partition into big pieces – may have uneven distribution between threads
Small table join to big table
Thread for each row from small table
Partitioned table options
What Should Scale?
Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) small amount of coordination to consolidate threads
More Difficult?
Parallelizable: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) large coordination effort between threads, 4) more work… then consolidate
Partitioned Tables – No Repartition Streams
Regular table vs partitioned tables: no Repartition Streams operations!
Scaling Reality
8-way quad-core Opteron, Windows Server 2008 R2, SQL Server 2008 SP1 + HF 27
Test Queries
TPC-H SF 10 database: standard, compressed, partitioned (30)
Line Item Table SUM, 59M rows, 8.75GB
Orders Table 15M rows
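A hedged sketch of the style of test query used here (TPC-H Line Item columns; the deck does not state exactly which columns were summed):

SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM;                   -- sum 1 column
SELECT SUM(L_EXTENDEDPRICE), SUM(L_QUANTITY) FROM LINEITEM;  -- sum 2 columns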
CPU-sec
Standard: CPU-sec to SUM 1 or 2 columns in Line Item
[Charts: CPU-sec vs DOP 1–32 for Sum 1 column and Sum 2 columns – standard and compressed]
Speed Up
[Charts: speedup vs DOP 1–32 for Sum 1, Sum 2, S2 Group, S2 Join – compressed and standard]
Line Item sum 1 column
[Charts: speedup relative to DOP 1, and CPU-sec, vs DOP 1–32 – Sum 1 standard, compressed, partitioned]
Line Item Sum w/Group By
[Charts: speedup and CPU-sec vs DOP 1–32 for the Group By query – standard, compressed, hash]
Hash Join
[Charts: speedup and CPU-sec vs DOP 1–32 for the hash join query – standard, compressed, partitioned]
Key Lookup and Table Scan
[Charts: speedup and CPU-sec (1.4M rows) vs DOP 1–32 for key lookup and table scan – standard and compressed]
Parallel Execution Summary
Contention in queries w/low cost per page
Simple scan,
High Cost per Page – improves scaling!
Multiple Aggregates, Hash Join, Compression
Table Partitioning – alternative query plans
Loop Joins – broken at high DOP
Merge Join – seriously broken (parallel)
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate parallel execution plans – scaling from DOP 1, 2, 4, 8, 16, 32, etc.
Scaling with and w/o HT
Strategy for limiting DOP with multiple users
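One hedged way to cap DOP per workload is the Resource Governor (SQL Server 2008+; the group name is illustrative, and a classifier function is still needed to route sessions into it):

CREATE WORKLOAD GROUP wgBigDW WITH (MAX_DOP = 8);
ALTER RESOURCE GOVERNOR RECONFIGURE;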
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling – back off on parallelism if the gain is negligible
Fix throughput degradation with multiple users running big DW queries
For Sybase and Oracle, Throughput is close to Power or better
Test Systems
2-way quad-core Xeon 5430 2.66GHz – Windows Server 2008 R2, SQL 2008 R2
8-way dual-core Opteron 2.8GHz – Windows Server 2008 SP1, SQL 2008 SP1
8-way quad-core Opteron 2.7GHz Barcelona
Windows Server 2008 R2, SQL 2008 SP1. The 8-way systems were configured for AD – not good!
Build 2789
Test Methodology
Boot with all processors; run queries at MAXDOP 1, 2, 4, 8, etc.
Not the same as running on 1-way, 2-way, 4-way server
Interpret results with caution
References
Search Adam Machanic PASS
SQL Server Scaling on Big Iron (NUMA) Systems
TPC-H
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
DSS – 22 queries, geometric mean; 60X range in plan cost, comparable actual range
Power – single stream; tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
Observed Scaling Behaviors
Good scaling, leveling off at high DOP
Perfect Scaling ???
Super Scaling
Negative Scaling especially at high DOP
Execution plan change – completely different behavior
TPC-H Published Results
TPC-H SF 100GB
Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. Westmere and Magny-Cours, both with 192GB memory, are very close.
2-way Xeon 5355, 5570, 5680, Opt 6176
[Chart: Power, Throughput, QphH for Xeon 5355, 5570 HDD/SSD/Fusion, 5680 SSD, Opteron 6176]
TPC-H SF 300GB – 8x QC/6C & 4x12C Opteron
6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall. The 4x12C 2.3GHz improved 17% over the 8x6C 2.8GHz.
[Chart: Power, Throughput, QphH for Opteron 8360 4C, 8384 4C, 8439 6C, 6176 12C, Xeon 7560 8C]
TPC-H SF 1000
[Chart: Power, Throughput, QphH for Opteron 8439 SQL Server, Opteron 8439 Sybase, Superdome, Superdome 2]
TPC-H SF 3TB – X7460 & X7560
Nehalem-EX with 64 cores beats 96 Core 2 cores.
[Chart: Power, Throughput, QphH for 16 x X7460, 8 x X7560, POWER6]
TPC-H SF 100GB, 300GB & 3TB
[Charts recapping all three scale factors: SF100 2-way (Xeon 5355, 5570 HDD/SSD/Fusion, 5680 SSD, Opteron 6176); SF300 8x QC/6C & 4x12C (Opteron 8360/8384/8439/6176, Xeon 7560); SF 3TB (16 x X7460, 8 x X7560, 32 x POWER6) – same observations as the individual slides above]
TPC-H Published Results
SQL Server excels in Power, limited by the geometric mean and anomalies
Trails in Throughput – other DBMSs get better throughput than power
SQL Server throughput is below its Power by a wide margin
Speculation – SQL Server does not throttle back parallelism with load?
TPC-H SF100
Power      Throughput  QphH       Processors     Total cores  SQL   GHz   Mem GB  SF
23,378.0   13,381.0    17,686.7   2 Xeon 5355         8       5sp2  2.66    64    100
67,712.9   38,019.1    50,738.4   2x5570 HDD          8       8sp1  2.93   144    100
70,048.5   37,749.1    51,422.4   2x5570 SSD          8       8sp1  2.93   144    100
72,110.5   36,190.8    51,085.6   2x5570 Fusion       8       8sp1  2.93   144    100
99,426.3   55,038.2    73,974.6   2 Xeon 5680        12       8r2   3.33   192    100
94,761.5   53,855.6    71,438.3   2 Opt 6176         24       8r2   2.3    192    100
TPC-H SF300
Power       Throughput  QphH        Processors   Total cores  SQL   GHz   Mem GB  SF
25,206.4    13,283.8    18,298.5    4 Opt 8220        8       5rtm  2.8    128    300
67,287.4    41,526.4    52,860.2    8 Opt 8360       32       8rtm  2.5    256    300
75,161.2    44,271.9    57,684.7    8 Opt 8384       32       8rtm  2.7    256    300
109,067.1   76,869.0    91,558.2    8 Opt 8439       48       8sp1  2.8    256    300
129,198.3   89,547.7    107,561.2   4 Opt 6176       48       8r2   2.3    512    300
152,453.1   96,585.4    121,345.6   4 Xeon 7560      32       8r2   2.26   640    300

All of the above are HP results? A Sun result on Opt 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5.
TPC-H 1TB
Power       Throughput  QphH        Processors     Total cores  SQL    GHz   Mem GB  SF
95,789.1    69,367.6    81,367.6    8 Opt 8439         48      8R2?   2.8    512    1000
108,436.8   96,652.7    102,375.3   8 Opt 8439         48      ASE    2.8    384    1000
139,181.0   141,188.1   140,181.1   Itanium 9350       64      O11R2  1.73   512    1000
TPC-H 3TB
Power       Throughput  QphH        Processors     Total cores  SQL     GHz   Mem GB  SF
120,254.8   87,841.4    102,254.8   16 Xeon 7460       96      8r2     2.66  1024    3000
185,297.7   142,685.6   162,601.7   8 Xeon 7560        64      8r2     2.26   512    3000
142,790.7   171,607.4   156,537.3   Itanium 9350       64      Sybase  1.73   512    1000
142,790.7   171,607.4   156,537.3   POWER6             64      Sybase  5.0    512    3000
TPC-H Published Results
Power       Throughput  QphH        Processors   Total cores  SQL   GHz   Mem GB  SF
23,378      13,381      17,686.7    2 Xeon 5355       8       5sp2  2.66    64    100
72,110.5    36,190.8    51,085.6    2 Xeon 5570       8       8sp1  2.93   144    100
99,426.3    55,038.2    73,974.6    2 Xeon 5680      12       8r2   3.33   192    100
94,761.5    53,855.6    71,438.3    2 Opt 6176       24       8r2   2.3    192    100
25,206.4    13,283.8    18,298.5    4 Opt 8220        8       5rtm  2.8    128    300
67,287.4    41,526.4    52,860.2    8 Opt 8360       32       8rtm  2.5    256    300
75,161.2    44,271.9    57,684.7    8 Opt 8384       32       8rtm  2.7    256    300
109,067.1   76,869.0    91,558.2    8 Opt 8439       48       8sp1  2.8    256    300
129,198.3   89,547.7    107,561.2   4 Opt 6176       48       8r2   2.3    512    300
185,297.7   142,685.6   162,601.7   8 Xeon 7560      64       8r2   2.26   512    3000
SF100 2-way Big Queries (sec)
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21 – 5570 HDD, 5570 SSD, 5570 FusionIO, 5680 SSD, 6176 SSD]
Xeon 5570 with SATA SSD is poor on Q9, reason unknown. Both Xeon 5680 and Opteron 6176 are a big improvement over Xeon 5570.
SF100 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22 – same systems]
Xeon 5570-HDD and 5680-SSD are poor on Q12, reason unknown. Opteron 6176 is poor on Q11.
SF100 Small Queries
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20 – same systems]
Xeon 5680 and Opteron are poor on Q20. Note the limited scaling on Q2 and Q17.
SF300 32+ cores Big Queries
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21 – 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
Opteron 6176 is poor relative to 8439 on Q9 & Q13, with the same number of total cores.
SF300 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22 – same systems]
Opteron 6176 is much better than 8439 on Q11 & Q19, worse on Q12.
SF300 Small Q
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17 – same systems]
Opteron 6176 is much better on Q2, even with 8439 on the others.
SF1000
[Chart: per-query results, Q1–Q22]
SF1000
[Chart: query time for Q1, Q9, Q13, Q18, Q21 – SQL Server vs Sybase]
SF1000
[Chart: query time for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19 – SQL Server vs Sybase]
SF1000
[Chart: query time for Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22 – SQL Server vs Sybase]
SF1000 Itanium - Superdome
[Chart: per-query results, Q1–Q22]
SF 3TB – 8×7560 versus 16×7460
[Chart: per-query ratio, Q1–Q22]
Broadly 50% faster overall, 5X+ on one query (5.6X), slower on 2, comparable on 3.
64 cores: 7560 relative to POWER6
[Charts: per-query ratio Q1–Q22, and query times for big (Q1, Q9, Q13, Q18, Q21), middle (Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19), and small (Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22) queries – Uni 16x6, DL980 8x8, POWER6]
TPC-H Summary
Scaling is impressive on some SQL
Limited ability (and value) in scaling small queries
Anomalies, negative scaling
TPC-H Queries
Q1 Pricing Summary Report
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q3
Q6 Forecasting Revenue Change
Q7 Volume Shipping
Q8 National Market Share
Q9 Product Type Profit Measure
Q11 Important Stock Identification
Non-Parallel Parallel
Q12 Random IO?
Q13 – Why does Q13 have perfect scaling?
Q17 Small Quantity Order Revenue
Q18 Large Volume Customer
Non-Parallel
Parallel
Q19
Q20?
This query may get a poor execution plan
Date functions are usually written as
because the Line Item date columns are “date” type. CAST helps the DOP 1 plan, but gets a bad plan for parallel.
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Q22
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools: ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
DSS – 22 queries, geometric mean; 60X range in plan cost, comparable actual range
Power – single stream; tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
SF 10, test studies
Not valid for publication
Auto-Statistics enabled, Excludes compile time
Big Queries – Line Item Scan
Super Scaling – Mission Impossible
Small Queries & High Parallelism
Other queries, negative scaling
Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual
[Charts: plan cost @ 10GB (DOP 1–16) and actual query time in seconds (DOP 1–32) for Q1, Q9, Q13, Q18, Q21]
Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%
Plan cost says scaling is poor except for Q18; memory affects hash IO onset
Plan cost is a poor indicator of true parallelism scaling
Q18 & Q21 > 3X Q1, Q9
Big Query: Speed Up and CPU
Q13 has slightly better than perfect scaling (the 'Holy Grail')? In general, excellent scaling to DOP 8–24, weak afterwards.
[Charts: CPU time in seconds and speedup relative to DOP 1 for Q1, Q9, Q13, Q18, Q21 at DOP 1–32]
Super Scaling
Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged
CPU time = 100 sec, elapse time = 100 sec
What is the best case for DOP 2? Assuming nearly zero Repartition Streams cost:
CPU time = 100 sec, elapsed time = 50?
Super scaling: CPU time decreases going from a non-parallel to a parallel plan! (No, I have not started drinking, yet.)
[Chart: CPU time for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
Super Scaling
CPU-sec goes down from DOP 1 to 2 and higher (typically 8)
[Charts: CPU normalized to DOP 1, and speedup relative to DOP 1, for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
3.5X speedup from DOP 1 to 2 (normalized to DOP 1)
CPU and Query Time in Seconds
[Charts: CPU time and query time for Q7, Q8, Q11, Q21, Q22 at DOP 1–32]
Super Scaling Summary
Most probable cause: the Bitmap operator in the parallel plan
Bitmap Filters are great, Question for Microsoft:
Can I use Bitmap Filters in OLTP systems with non-parallel plans?
Small Queries – Plan Cost vs Actual
Queries 3 and 16 have lower plan cost than Q17, but are not included
[Charts: plan cost (DOP 1–16) and query time (DOP 1–32) for Q2, Q4, Q6, Q15, Q17, Q20]
Q4, Q6, Q17: great scaling to DOP 4, then weak. Negative scaling also occurs.
Small Queries CPU & Speedup
What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling.
[Charts: CPU time and speedup for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1–32]
Query 2 goes negative at DOP 2; Q4 is good; Q6 gets speedup, but at a CPU premium; Q17 and Q20 go negative after DOP 8.
High Parallelism – Small Queries
Why? Almost no value.
TPC-H geometric mean scoring: small queries have as much impact as large
A linear sum would weight the large queries
OLTP with 32, 64+ cores: parallelism good if super-scaling
Default max degree of parallelism 0
Seriously bad news, especially for small Q
Increase cost threshold for parallelism?
Sometimes you do get lucky
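A sketch of the two instance settings just mentioned (values are illustrative, not recommendations):

EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;         -- cap DOP
EXEC sp_configure 'cost threshold for parallelism', 25;   -- raise from default 5
RECONFIGURE;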
Q that go Negative
[Charts: query time and “speedup” for Q17, Q19, Q20, Q22 at DOP 1–32]
CPU
[Chart: CPU time for Q17, Q19, Q20, Q22 at DOP 1–32]
Other Queries – CPU & Speedup
[Charts: CPU time and speedup for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1–32]
Q3 has problems beyond DOP 2.
Other - Query Time (seconds)
[Chart: query time for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1–32]
Scaling Summary
Some queries show excellent scaling
Super-scaling, better than 2X
Sharp CPU jump on last DOP doubling
Need a strategy to cap DOP, to limit negative scaling
Especially for some smaller queries?
Other anomalies
Compression (PAGE)
Compression Overhead - Overall
[Charts: query time and CPU time, compressed relative to uncompressed, at DOP 1–32]
40% overhead for compression at low DOP, 10% overhead at max DOP???
[Charts: per-query (Q1–Q22) query time and CPU time, compressed relative to uncompressed, at DOP 1–32]
Compressed Table
LINEITEM – real data may be more compressible.
Uncompressed: 8,749,760KB, average 149 bytes per row
Compressed: 4,819,592KB, average 82 bytes per row
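A minimal sketch of enabling page compression on this table (SQL Server 2008+ syntax):

-- Estimate the benefit first:
EXEC sp_estimate_data_compression_savings 'dbo', 'LINEITEM', NULL, NULL, 'PAGE';
-- Then rebuild compressed:
ALTER TABLE LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);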
Partitioning
Orders and Line Item on Order Key
Partitioning Impact - Overall
[Charts: overall and per-query (Q1–Q22) query time and CPU time, partitioned relative to not partitioned, at DOP 1–32]
Plan for Partitioned Tables
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate parallel execution plans – scaling from DOP 1, 2, 4, 8, 16, 32, etc.
Scaling with and w/o HT
Strategy for limiting DOP with multiple users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling – back off on parallelism if the gain is negligible
Fix throughput degradation with multiple users running big DW queries
For Sybase and Oracle, Throughput is close to Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan
Non-Parallel
Parallel
Parallel plan cost is 28% lower than the scalar plan; IO is 70% of the cost, with no parallel plan cost reduction
Q9 Product Type Profit Measure
IO from 4 tables contributes 58% of plan cost; the parallel plan is 39% lower
Non-Parallel Parallel
Q9 Non-Parallel Plan
Table/index scans comprise 64%; IO from 4 tables contributes 58% of plan cost
Join sequence: Supplier, (Part, PartSupp), Line Item, Orders
Q9 Parallel Plan
Non-parallel join sequence: (Supplier), (Part, PartSupp), Line Item, Orders
Parallel join sequence: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details
Table scans comprise 64%; IO from 4 tables contributes 58% of plan cost
Q9 Parallel regular vs Partitioned
Q13 – Why does Q13 have perfect scaling?
Q18 Large Volume Customer
Non-Parallel
Parallel
Q18 Graphical Plan
Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details
Non-Parallel
Parallel
The non-parallel plan Hash Match cost is 1245 IO, 494.6 CPU. At DOP 16/32 the size is below the IO threshold, and CPU is reduced by >10X.
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Non-Parallel Parallel
Q21 Non-Parallel Plan
[Plan detail: hash joins H1, H2, H3]
Q21 Parallel
Q21
3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping
Non-Parallel Parallel
Q7 Non-Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share
Non-Parallel Parallel
Q8 Non-Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q11 Important Stock Identification
Non-Parallel Parallel
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2
Clustered index scans on Part and PartSupp have the highest cost (48% + 42%)
Q2
PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change
Not sure why this blows CPU; scalar values are pre-computed and pre-converted
Q20?
This query may get a poor execution plan
Date functions are usually written as
because the Line Item date columns are “date” type. CAST helps the DOP 1 plan, but gets a bad plan for parallel.
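A hedged illustration of the pattern being described (the deck's exact query text did not survive extraction; the dates are illustrative):

-- L_SHIPDATE is a "date" column; an untyped literal or DATEADD() expression
-- is treated as datetime. Forcing the comparison to the column's type:
SELECT SUM(L_EXTENDEDPRICE)
FROM LINEITEM
WHERE L_SHIPDATE >= CAST('1997-01-01' AS date)
  AND L_SHIPDATE <  CAST('1998-01-01' AS date);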
Q20
Q20 alternate - parallel
Statistics estimation error here
Penalty for the mistake applied here
Other Queries
Q3
Q12 Random IO?
Will this generate random IO?
Query 12 Plans
Non-Parallel
Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17
The Table Spool is a concern
Q17
the usual suspects
Q19
Q22
[Charts: per-query (Q1–Q22 and total) speedup from DOP 1 query time, and CPU relative to DOP 1, at DOP 2–32]