Parallel Query Executionin SQL Server
Craig Freedman
Software Design Engineer
SQL Server Query Team
Outline
• Why parallel query execution?
• Basics of parallel query execution
• Degree of parallelism (DOP)
• The parallelism operator
• Parallel query plan examples
• What to look out for …
• Q&A
Why Parallel Query Execution?
• Improves response time for big queries
• Completely transparent to the end user
• Appropriate for data warehouse workloads(small numbers of big queries)
• Parallelism is not free– Adds overhead to query execution
– May reduce overall server throughput
• Inappropriate for OLTP workloads(large numbers of small queries)
Scalability Terminology
Perfect linear scalability:
• 2x CPUs = ½ Response Time
• 2x Data + 2x CPUs = Same Response Time
Data Size # CPUs Response Time
Speed Up Same Data More CPUs Reduced RT
Scale Up More Data More CPUs Same RT
Basics of Parallel Query Execution
1. Horizontally partition input data
2. Distribute partitions among CPUs
3. Simultaneously perform the same operationon each partition
“Single Instruction Multiple Data”
Advantages:
• Excellent scalability: threads operate independently
• Easy to adjust the degree of parallelism (DOP)
Parallelism Example
Thread 1
Aggregate
Thread 2
Aggregate
314159
2682 Evens
Odds5359
NOT Pipeline Parallelism
• Parallelism limited by the number of operators
• Operators are dependent on one another
• Does not scale well
Thread 2
Hash Aggregate
Thread 1
Scan
Choosing a Parallel Plan
• Must be running on a multi-processor system
• Optimizer makes a cost based decision whether to generate a parallel plan
Configuration Option Setting
affinity mask Must enable more than one processor
max degree of parallelism Must be greater than one(or overridden by MAXDOP query hint)
cost threshold for parallelism Query cost must exceed threshold (seconds)
max worker threads Must have enough threads to run query
Degree of Parallelism (DOP)
• The number of threads per operatorNOT the total number of threads
• Total number of threads may exceed DOP
• Optimizer estimates DOP for costing
• Actual DOP selected at query startup
Choosing the Degree of Parallelism
Processors
• Number of physical processors
• Limited by “affinity mask” configuration option
MAXDOP
• MAXDOP query hint
• “max degree of parallelism” configuration option
Threads
• “max worker threads” configuration option
• Reduce DOP (or run serially) if insufficient threads available
CPU Utilization
• SQL 2000:
– Any thread runs on any CPU, and …
– There may be more threads than DOP, so …
– Reducing DOP may not reduce CPU usage
• SQL 2005:
– # of schedulers per query limited to DOP, so …
– Query never uses more than DOP CPUs, and …
– Reducing DOP does reduce CPU usage
The Parallelism Operator
• aka Exchange
• Moves (packets of) rows between threads
• Really two operators: producer and consumer
• Producer and consumer execute independently:
– Producers push data to consumers
– Consumers may have to wait for data from producers
– Flow control keeps producers from getting too far ahead of consumers
Producer Thread 1
Producer Thread 2
The Parallelism Operator
Consumer Thread 1
Consumer Thread 2
Types of Parallelism Operators
Type # Producer Threads # Consumer Threads
Gather Streams > 1 (DOP) 1
Repartition Streams > 1 (DOP) > 1 (DOP)
Distribute Streams 1 > 1 (DOP)
Gather Streams Repartition Streams Distribute Streams
Types of Parallelism Operators
Partitioning Type Description
Broadcast Send all rows to all consumers.
Hash Determine where to send each row by evaluating a hash function on one or more columns in the row.
Round Robin Send each packet of rows to the next consumer in sequence.
Demand Send the next row to the next consumer that asks.Used by partitioned tables only. Pull based data flow!
Range Determine where to send each row by evaluating a range function on one column in the row. Used by index builds and stats queries.
Routing of rows between producers and consumers
Types of Parallelism Operators
Order preserving (aka merging) exchange
7531
8642
Non-order preserving (aka non-merging) exchange
87654321
86427531
7531
8642
Parallel Query Plan Examples
Parallel Scan
• Most operators are unaware of parallelism;Parallel scan is an exception
• Threads work together to scan entire table
• Server dynamically assigns pages or rows to threads
• Automatically adjusts to the number of threads
• Automatically adjusts for skew or load imbalances
Thread 1 Thread 2 Thread 3 Thread 4
Parallel Nested Loops Join
• Distribute outer rows to any inner thread:– Generally no exchange needed (e.g., parallel scan)– Occasionally need a round-robin exchange
• Inner side (almost) always runs serially andprocesses a single outer row at a time
SELECT * FROM T1 JOIN T2 ON T1.B = T2.A WHERE T1.A < 100 Parallel Scan
100 Serial Scans
100 Rows
Parallel Nested Loops Join
• Distribute outer rows to any inner thread:– Generally no exchange needed (e.g., parallel scan)– Occasionally need a round-robin exchange
• Inner side (almost) always runs serially andprocesses a single outer row at a time
SELECT * FROM T1 JOIN T2 ON T1.B = T2.A WHERE T1.A < 100 Parallel Scan
100 Serial Scans
100 Rows
Inner Side Parallelism
• Normally, the inner side of a nested loops join runs serially• Exception:
– Single outer row and no correlated parameters– May in some rare cases yield parallelism on the inner side
of a nested loops join
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B WHERE T2.A = 0
Broadcast PartitionOne Row!
Parallel Scan!
Parallel Join
Inner Side Parallelism
• Normally, the inner side of a nested loops join runs serially• Exception:
– Single outer row and no correlated parameters– May in some rare cases yield parallelism on the inner side
of a nested loops join
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B WHERE T2.A = 0
Broadcast PartitionOne Row!
Parallel Scan!
Parallel Join
Inner Side Parallelism
IF EXISTS (SELECT * FROM T1 WHERE T1.A = 0)PRINT 'Hello World!'
One Row!
Exchange and Parallel Scan!
Serial Join
Parallel Hash Join
• Hash build and probe rows that may jointo the same HJ instance
• Merge join uses same approach
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B
Parallel ScanHash Partition
Parallel Hash Join
• Hash build and probe rows that may jointo the same HJ instance
• Merge join uses same approach
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B
Parallel ScanHash Partition
Bitmap Filtering
• Useful when build input has a selective predicate1. Create a bitmap using build input2. Filter probe input using bitmap
• Also used by some parallel merge joins
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B WHERE T1.B < 100000
Create Bitmap
Filter with Bitmap
Bitmap Filtering
• Useful when build input has a selective predicate1. Create a bitmap using build input2. Filter probe input using bitmap
• Also used by some parallel merge joins
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B WHERE T1.B < 100000
Create Bitmap
Filter with Bitmap
Broadcast Hash Join
• If build is small, it may not hash partition welland may lead to skew problems
• Instead broadcast build rows to all hash join instancesand probe at any hash join instance
• Also eliminates an extra exchange!
SELECT * FROM T1 JOIN T2 ON T1.A = T2.B WHERE T1.B < 100
Parallel Scan
Broadcast Partition
No Exchange!
Few Rows
Merging Exchange
• Needed for parallel query with ORDER BY
• Also used by merge join and stream aggregate
• Risk of intra-query parallel deadlocks!
SELECT * FROM T1 ORDER BY T1.A
Merging Exchange
Merging Exchange
• Needed for parallel query with ORDER BY
• Also used by merge join and stream aggregate
• Risk of intra-query parallel deadlocks!
SELECT * FROM T1 ORDER BY T1.A
Merging Exchange
Parallel Aggregation
• Hash rows that belong to the same groupto the same aggregate instance
SELECT T1.A, MIN(T1.B) FROM T1 GROUP BY T1.A
Hash Partition
Local/Global (Partial) Aggregation
• If number of groups is small, data may not hash partition welland may lead to skew problems
• Instead split into two aggregates:– Local aggregate computes partial results for all groups– Global aggregate computes final results
• Also eliminates exchange below the local aggregate
Local Aggregate
SELECT MIN(T1.A) FROM T1
Many RowsNo Exchange!
Few Rows
Global Aggregate
What to look out for …
What to look out for …
• Nested loops join parallelism may be limited due to:– Too few outer rows (e.g., partitioned table with two partitions)– Poor distribution of rows to pages (e.g., skewed parallel scan)– Showplan XML shows the number of rows processed per thread
• Merging exchanges– May not scale as well as non-merging exchanges– Parallel deadlocks (rare – especially with SQL Server 2005)
• Inserts, updates, and deletes are serial (except for index build)– Application level parallelization can help, but …– Cannot manually parallelize bulk load without giving up bulk logging
• Some features, operators, and intrinsics may force serial plans or serial zones (and bottlenecks) within a parallel plan
• On SQL Server 2000, reducing DOP may not reduce CPU usage
Non-Parallelizable “Stuff”
Forces a serial zone (within a parallel plan):
• System table scans • Sequence functions • TOP
• “Backward” scans • Recursive queries • TVFs
• Global scalar aggregate • Multi-consumer spool
Forces a serial plan:
• All TSQL UDFs
• CLR UDFs with data access
• Miscellaneous built-ins such as:OBJECT_ID(), ERROR_NUMBER(), @@TRANCOUNT, …
• Dynamic cursors
Serial zones may lead to bottlenecks and reduced performance
Intra-Query Parallel Deadlocks
• Deadlock among the threads executing a parallel query• Three scenarios:
– Two merging exchanges separated by order preserving operators– A merge join with merging exchanges on both inputs– Merging exchange above index seek with multiple ranges
• Deadlock detector recognizes intra-query deadlocks and tries to resolve, but it’s SLOW!
• To diagnose, check whether all threads within the session are waiting on wait type CXPACKET– On SQL Server 2005, use sys.dm_os_waiting_tasks– On SQL Server 2000, use sysprocesses
• All but a few rare cases fixed in SQL Server 2005:– Merge join with mismatched join key types– Index seek with multiple ranges
Branch 1 Thread 1
Branch 1 Thread 2
Root Thread
Parallel Deadlock Example
Branch 2 Thread 1
Branch 2 Thread 2
Both exchanges are merging
No rows flow through here
Merging blocks without at least one row from each input
Too many rows flow through here
Blogs
• My blog:http://blogs.msdn.com/craigfr
• Query Team blog: http://blogs.msdn.com/sqlqueryprocessing
• SQL Customer Advisory Team blog: http://blogs.msdn.com/sqlcat
Q&A
© 2007 Microsoft Corporation. All rights reserved.Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the dateof this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part
of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION