TRANSCRIPT
WIN310: Scaling Up Your Applications
Rui Hu ([email protected])
Software Design Engineer, Windows Clustering
Scale Out & Enterprise Servers Group, Windows Division, Microsoft Corporation
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Scalability
“A near-linear increase in performance as resources are linearly added to a system.”
Disks · Processors · Memory · I/O Hardware · Machines
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Barriers To Scalability
The latency problem (hardware): a growing, general problem with computer hardware components
The parallelism problem (hardware/software): sharing system resources without impeding system performance
Memory Hierarchy
Level 0: Registers and internal caches in the CPU
Level 1: External caches (SRAM)
Level 2: Main memory (DRAM)
Level 3: Disk storage (magnetic)
Memory Hierarchy
Five parameters define a memory technology and its organization:
Access time
Memory capacity
Cost per byte
Data transfer rate (bandwidth)
Unit of transfer
The Latency Problem
Inter-component latencies are increasing
[Chart: performance increase (%) of CPU, RAM, and disk, 1992–2002 — CPU performance grows fastest, RAM more slowly, and disk slowest, so the speed gaps between components keep widening.]
The Latency Problem
Everything runs at different speeds
Let’s consider relative speeds, substituting clocks for seconds:
Zero a register (1 sec)
No register (~6) but L0 hit (2 secs)
L0 (~8k) miss but L1 hit (6 secs)
L1 (1Mb) miss but RAM hit (~1-3 mins)
Random read from disk (~1 year)
Show Me The Cache
Hiding the latency problem
How does the hardware cope? Caches: fast (but small) local memory
Cache lines (32, 64, 128 bytes)
Caches are highly dependent on locality
Poor locality has a significant cost
True random access spells disaster
False sharing
Good memory layout is very helpful
Dense packing (i.e., low overhead)
Cache alignment with minimal spanning
Show Me The Cache (Cont.)
Three possible causes of cache incoherence:
Writes by different processors into their cached copies of the same cache line in memory, asynchronously
Process migration among multiple processors without alerting each other
I/O operations bypassing the owners of cached copies
Show Me The Cache (Cont.)
Cache incoherence caused by sharing of writable data
[Diagram: before the update, P1 and P2 both cache X. After P1 updates X to X': with write-through, memory holds X' but P2 still caches the stale X; with write-back, memory and P2 both still hold the stale X.]
Show Me The Cache (Cont.)
Cache incoherence caused by process migration
[Diagram: before migration, P1 and P2 both cache X. With write-through, the process migrates to P2 and writes Y, so P2 and memory hold Y while P1 still caches the stale X. With write-back, the process updates X to X' on P1 without flushing, migrates, and P2 reads the stale X from memory.]
Show Me The Cache (Cont.)
Cache coherence protocols:
Update policy
(More commonly) invalidate policy
The programmer can assume that an effective cache coherence protocol is present in the system, although it will impact the performance of the system.
Show Me The Cache (Cont.)Show Me The Cache (Cont.)
Key characteristic: caches are Key characteristic: caches are organized in blocks of contiguous organized in blocks of contiguous locations.locations.
Why blocks are used in caches: Why blocks are used in caches: temporal localitytemporal locality
Major disadvantage of using blocks in Major disadvantage of using blocks in caches: different processors access caches: different processors access different parts of a block. (False different parts of a block. (False sharing) sharing)
Latency Tolerance Techniques
Locality of distributed data structures

/*** Method (1) Array of Structures ***/
typedef struct { double real, image; } COMPLEX;
COMPLEX data[N1][N2][N3];

/*** Method (2) Separate Arrays ***/
double data_real[N1][N2][N3];
double data_image[N1][N2][N3];
Latency Tolerance Techniques (Cont.)
Improving data locality by loop transformation

/*** matrix multiplication C=A*B ***/
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
Latency Tolerance Techniques (Cont.)
Improving data locality by loop transformation (Cont.)

/*** matrix multiplication C=A*B ***/
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            c[i][j] += a[i][k] * b[k][j];
Barriers To Scalability
The latency problem (hardware): a growing, general problem with computer hardware components
The parallelism problem (hardware/software): sharing system resources without impeding system performance
The Parallelism Problem
How to share machine resources…
Without causing another process or thread to wait
While continuing to hide the system latency (without perturbing the caches)
The system should always be doing useful work, as fast as possible, maximizing resource utilization
Multiple-Context Processors
Allow a processor to switch from one context to another when a long-latency operation is encountered.
[Diagram: processors (P) and memory modules (M) connected through an interconnect]
Multiple-Context Processors (Cont.)
Interleave the execution of several contexts in order to reduce idle time without increasing the magnitude of the switching time.
Context-switching policies:
Switch on cache miss
Switch on every load
Switch on every instruction
Switch on a block of instructions
The Parallelism Problem
Amdahl’s Law (thread safe != scalable)
Architectural latency compounds the problem
Deadlock / livelock
Bus saturation
Amdahl’s Law
Overall speedup is bounded by the serial fraction of the workload, no matter how many processors are added.
Latency Is Deadly
Avoid blocking API calls
Minimize context switching
Minimize user/kernel mode transitions
Scalability Performance
How does your application perform as the degree of parallelism increases?
[Chart: ideal performance scales linearly with parallelism; expected performance falls somewhat below ideal; actual performance typically trails both and eventually levels off.]
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Why Scale Up Now?
Reduce your application’s exposure to hardware latency
Find your application’s degree of scalability and its limits
Paradigm shift imminent (we all become SMP developers)
Server consolidation
Reduced management cost
More flexibility to changes in load
Writing Scalable Code
Some new software directions
Event-based programming
No longer just for GUIs
Software needs to react to hardware the way it responds to user mouse clicks
New software ideas borrowed from hardware
Out-of-order execution
Async I/O and completion ports
Parallelism
Multiple outstanding high-latency events
Waiting for all events, not just one
Software pipelines
Batching and cache locality
Share threads using thread pools
Processes And Threads
Avoid using many processes
Limit threads to “one per processor”
Use shared memory and I/O completion ports for IPC
I/O Completion Ports
CreateIoCompletionPort Win32 API
“The goal of a server is to incur as few context switches as possible by avoiding unnecessary [thread] blocking while at the same time maximizing parallelism… the application must have a way to activate another thread when a client thread blocks on I/O” – Inside Windows 2000, MSPress
Removes ‘thread block’
Use for I/O operations
Keep the concurrency limit at the number of processors in the system
If using sockets, use Winsock2 with I/O completion ports
I/O Completion Port (Cont.)
The most complex kernel object offered by Win32.
The kernel creates five different data structures for an I/O completion port.
I/O Completion Port (Cont.)
Device List
Each record contains: hDevice, dwCompletionKey
Entry is added when: CreateIoCompletionPort is called.
Entry is removed when: the device handle is closed.
I/O Completion Port (Cont.)
I/O Completion Queue (FIFO)
Each record contains: dwBytesTransferred, pOverlapped, dwError
Entry is added when: an I/O request completes, or PostQueuedCompletionStatus is called.
Entry is removed when: the completion port removes an entry from the Waiting Thread Queue.
I/O Completion Port (Cont.)
Waiting Thread Queue (LIFO)
Each record contains: dwThreadId
Entry is added when: a thread calls GetQueuedCompletionStatus.
Entry is removed when: the I/O completion queue is not empty and the number of running threads is less than the maximum number of concurrent threads.
I/O Completion Port (Cont.)
Release Thread List
Each record contains: dwThreadId
Entry is added when: the completion port wakes a thread in the Waiting Thread Queue, or a paused thread wakes up.
Entry is removed when: the thread again calls GetQueuedCompletionStatus, or the thread calls a function that suspends itself.
I/O Completion Port (Cont.)
Paused Thread List
Each record contains: dwThreadId
Entry is added when: a released thread calls a function that suspends itself.
Entry is removed when: the suspended thread wakes up.
Asynchronous I/O
Take a look at the Windows asynchronous I/O APIs
Use these APIs with I/O completion ports
FILE_FLAG_OVERLAPPED (CreateFile)
Memory Management
Be sensitive to cache line size
Align your structures on cache line boundaries
Use pools of commonly allocated sizes
Consider using a smart memory allocation scheme
Avoid many small allocations and de-allocations
Take a look at .NET Server’s new Low Fragmentation Heap (LFH)
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Future Directions
SMT / SMP is coming
Soon (within the next 2 years) new commodity machines will be SMT-based; HyperThreading Xeons are here today
CC-NUMA is coming
Scaling massive
.NET Web Services are a great foundation to support the next step in application performance – the massively parallel application
Inter-component latencies will get worse (and worse, and worse…)
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Scaling Wide
64-bit Windows
Running out of address space (4GB) can present a scalability bottleneck
Running out of system resources (e.g., TCP/IP connection handles) can present a scalability bottleneck
Being in RAM reduces the latency problem
Improves parallel execution due to reduced latency
Why Does 64-Bit Matter? Because…
Processor innovation is inevitable
Faster floating point and clock-speed headroom
Faster access to vast amounts of memory
Performance improvements in CPU microcode
64-bit computing is inevitable
Enables new classes of applications
Provides new headroom for scale-up scenarios
Why Does 64-Bit Windows Matter? Because…
It enables a new set of scenarios
Workstation
Mechanical design & analysis
Digital content creation (rendering)
Financial
Server
Computation-intensive applications
Large databases (BI/OLAP)
Large websites (caching, SSL)
The hardware architecture has taken an evolutionary step forward
Windows continues to evolve…
Early Adopter ‘Target Scenarios’
Scenarios recommended for early adopters
Computation-intensive applications
Simulations
Engineering, design
Other floating-point-intensive applications
Large databases
Data warehousing
Business intelligence
Web serving
Large caching
Web hosting
Secure communications
File and print server cluster environments
Scenarios tested by OEMs/ISVs
Memory Comparison

Address Space      64-bit Windows   32-bit Windows
Virtual Memory     16 TB            4 GB
Paging File        512 TB           16 TB
Hyperspace         8 GB             4 MB
Paged Pool         128 GB           470 MB
Non-Paged Pool     128 GB           256 MB
System Cache       1 TB             1 GB
Scaling For Any Environment
Scale Up – Scale Out
[Diagram: hardware scalability on one axis, software and infrastructure scalability on the other]
SHV Platforms
64-way SMP, 64-bit Itanium
Perf, NUMA enhancements
Hardware reliability (machine check)
Availability (multipath I/O, clustering)
High-density servers
Clustering, load balancing
Web farm session state
Application Center management
Scalable clusters
Developer Support
Development environment virtually identical to Win32®
Short learning curve makes porting easy (testing is the greater challenge)
Allows a single source base for both 32- and 64-bit environments
Simplifies porting
Reduces development costs
Software Development Kit (SDK) and Driver Development Kit (DDK) provide the necessary tools
Dev Tools
Platform SDK
Compiler (currently cross), libraries
http://www.microsoft.com/msdownload/platformsdk/sdkupdate/
Visual Studio® 64-bit release
Visual Studio 7.0 (.NET) 32-bit RTM plus “some delta”
Command line
nmake /f makefile using the pre-defined build environment
Visual Studio 6
Use the tools on http://msdevlab.msftlabs.com/build
Visual Studio .NET 32-bit (Visual Studio 7)
Use the tools on http://msdevlab.msftlabs.com/build
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Final Thoughts
Computers will increase their degree of parallelism over the coming 2-3 years
Customers will expect a linear increase in performance, and certainly no decrease in performance
Develop scalability and performance test suites for your applications – know your limits!
Consider testing your applications
On high-end SMP systems with a high degree of parallelism
In the 64-bit environment
If you have any questions, please continue the discussion in the Microsoft Chinese newsgroups.
Join the Microsoft Chinese newsgroups: http://www.microsoft.com/china/community
© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.