TRANSCRIPT
WIN310: Scaling Up Your Applications
Rui Hu ([email protected])
Software Design Engineer, Windows Clustering
Scale Out & Enterprise Servers Group, Windows Division, Microsoft Corporation
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Scalability
“A near-linear increase in performance as resources are linearly added to a system.”
Disks · Processors · Memory · I/O Hardware · Machines
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Barriers To Scalability
The latency problem (hardware): a growing, general problem with computer hardware components
The parallelism problem (hardware/software): sharing system resources without impeding system performance
Memory Hierarchy
Level 0: Registers and internal caches in the CPU
Level 1: External caches (SRAM)
Level 2: Main memory (DRAM)
Level 3: Disk storage (magnetic)
Memory Hierarchy
Five parameters define a memory technology and its organization:
Access time
Memory capacity
Cost per byte
Data transfer rate (bandwidth)
Unit of transfer
The Latency Problem
Inter-component latencies are increasing
[Chart: performance increase (%) of CPU, RAM, and disk, 1992–2002 — CPU performance grows fastest, RAM more slowly, and disk slowest, so the speed gaps between components keep widening.]
The Latency Problem
Everything runs at different speeds
Let’s consider relative speeds, substituting clocks for seconds:
Zero a register (1 sec)
No register (~6) but L0 hit (2 secs)
L0 (~8k) miss but L1 hit (6 secs)
L1 (1Mb) miss but RAM hit (~1-3 mins)
Random read from disk (~1 year)
Show Me The Cache
Hiding the latency problem
How does the hardware cope? Caches: fast (but small) local memory
Cache lines (32, 64, 128 bytes)
Caches are highly dependent on locality
Poor locality has a significant cost
True random access spells disaster
False sharing
Good memory layout is very helpful
Dense packing (i.e., low overhead)
Cache alignment with minimal spanning
Show Me The Cache (Cont.)
Three possible causes of cache incoherence:
Writes by different processors into their cached copies of the same cache line in memory, asynchronously
Process migration among multiple processors without alerting each other
I/O operations bypassing the owners of cached copies
Show Me The Cache (Cont.)
Cache incoherence caused by sharing of writable data
[Diagram: before the update, P1 and P2 both cache X. After P1 updates X to X': with write-through, memory holds X' but P2 still caches the stale X; with write-back, memory and P2 both still hold the stale X.]
Show Me The Cache (Cont.)
Cache incoherence caused by process migration
[Diagram: before migration, P1 and P2 both cache X. With write-through, the process migrates to P2 and writes Y, so P2 and memory hold Y while P1 still caches the stale X. With write-back, the process updates X to X' on P1 without flushing, migrates, and P2 reads the stale X from memory.]
Show Me The Cache (Cont.)
Cache coherence protocols:
Update policy
(More commonly) invalidate policy
The programmer can assume that an effective cache coherence protocol is present in the system, although it will impact the performance of the system.
Show Me The Cache (Cont.)Show Me The Cache (Cont.)
Key characteristic: caches are Key characteristic: caches are organized in blocks of contiguous organized in blocks of contiguous locations.locations.
Why blocks are used in caches: Why blocks are used in caches: temporal localitytemporal locality
Major disadvantage of using blocks in Major disadvantage of using blocks in caches: different processors access caches: different processors access different parts of a block. (False different parts of a block. (False sharing) sharing)
Latency Tolerance Techniques
Locality of distributed data structures

/*** Method (1) Array of Structures ***/
typedef struct { double real, image; } COMPLEX;
COMPLEX data[N1][N2][N3];

/*** Method (2) Separate Arrays ***/
double data_real[N1][N2][N3];
double data_image[N1][N2][N3];
Latency Tolerance Techniques (Cont.)
Improving data locality by loop transformation

/*** matrix multiplication C=A*B ***/
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
Latency Tolerance Techniques (Cont.)
Improving data locality by loop transformation (Cont.)

/*** matrix multiplication C=A*B ***/
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            c[i][j] += a[i][k] * b[k][j];
Barriers To Scalability
The latency problem (hardware): a growing, general problem with computer hardware components
The parallelism problem (hardware/software): sharing system resources without impeding system performance
The Parallelism Problem
How to share machine resources…
Without causing another process or thread to wait
While continuing to hide the system latency (without perturbing the caches)
The system should always be doing useful work, as fast as possible, maximizing resource utilization
Multiple-Context Processors
Allow a processor to switch from one context to another when a long-latency operation is encountered.
[Diagram: processors (P) and memory modules (M) connected through an interconnect]
Multiple-Context Processors (Cont.)
Interleave the execution of several contexts in order to reduce idle time without increasing the magnitude of the switching time.
Context-switching policies:
Switch on cache miss
Switch on every load
Switch on every instruction
Switch on a block of instructions
The Parallelism Problem
Amdahl’s Law (thread safe != scalable)
Architectural latency compounds the problem
Deadlock / livelock
Bus saturation
Amdahl’s Law
Overall speedup is bounded by the serial fraction of the workload, no matter how many processors are added.
Latency Is Deadly
Avoid blocking API calls
Minimize context switching
Minimize user/kernel mode transitions
Scalability Performance
How does your application perform as the degree of parallelism increases?
[Chart: ideal performance scales linearly with parallelism; expected performance falls somewhat below ideal; actual performance typically trails both and eventually levels off.]
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Why Scale Up Now?
Reduce your application’s exposure to hardware latency
Find your application’s degree of scalability and its limits
Paradigm shift imminent (we all become SMP developers)
Server consolidation
Reduced management cost
More flexibility to changes in load
Writing Scalable Code
Some new software directions
Event-based programming
No longer just for GUIs
Software needs to react to hardware the way it responds to user mouse clicks
New software ideas borrowed from hardware
Out-of-order execution
Async I/O and completion ports
Parallelism
Multiple outstanding high-latency events
Waiting for all events, not just one
Software pipelines
Batching and cache locality
Share threads using thread pools
Processes And Threads
Avoid using many processes
Limit threads to “one per processor”
Use shared memory and I/O completion ports for IPC
I/O Completion Ports
CreateIoCompletionPort Win32 API
“The goal of a server is to incur as few context switches as possible by avoiding unnecessary [thread] blocking while at the same time maximizing parallelism… the application must have a way to activate another thread when a client thread blocks on I/O” – Inside Windows 2000, MSPress
Removes ‘thread block’
Use for I/O operations
Keep the concurrency limit at the number of processors in the system
If using sockets, use Winsock2 with I/O completion ports
I/O Completion Port (Cont.)
The most complex kernel object offered by Win32.
The kernel creates five different data structures for an I/O completion port.
I/O Completion Port (Cont.)
Device List
Each record contains: hDevice, dwCompletionKey
Entry is added when: CreateIoCompletionPort is called.
Entry is removed when: the device handle is closed.
I/O Completion Port (Cont.)
I/O Completion Queue (FIFO)
Each record contains: dwBytesTransferred, pOverlapped, dwError
Entry is added when: an I/O request completes, or PostQueuedCompletionStatus is called.
Entry is removed when: the completion port removes an entry from the Waiting Thread Queue.
I/O Completion Port (Cont.)
Waiting Thread Queue (LIFO)
Each record contains: dwThreadId
Entry is added when: a thread calls GetQueuedCompletionStatus.
Entry is removed when: the I/O completion queue is not empty and the number of running threads is less than the maximum number of concurrent threads.
I/O Completion Port (Cont.)
Release Thread List
Each record contains: dwThreadId
Entry is added when: the completion port wakes a thread in the Waiting Thread Queue, or a paused thread wakes up.
Entry is removed when: the thread again calls GetQueuedCompletionStatus, or the thread calls a function that suspends itself.
I/O Completion Port (Cont.)
Paused Thread List
Each record contains: dwThreadId
Entry is added when: a released thread calls a function that suspends itself.
Entry is removed when: the suspended thread wakes up.
Asynchronous I/O
Take a look at the Windows asynchronous I/O APIs
Use these APIs with I/O completion ports
FILE_FLAG_OVERLAPPED (CreateFile)
Memory Management
Be sensitive to cache line size
Align your structures on cache line boundaries
Use pools of commonly allocated sizes
Consider using a smart memory allocation scheme
Avoid many small allocations and de-allocations
Take a look at .NET Server’s new Low Fragmentation Heap (LFH)
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Future Directions
SMT / SMP is coming
Soon (within the next 2 years) new commodity machines will be SMT-based; HyperThreading Xeons are here today
CC-NUMA is coming
Scaling massive
.NET Web Services are a great foundation to support the next step in application performance – the massively parallel application
Inter-component latencies will get worse (and worse, and worse…)
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Scaling Wide
64-bit Windows
Running out of address space (4GB) can present a scalability bottleneck
Running out of system resources (e.g., TCP/IP connection handles) can present a scalability bottleneck
Being in RAM reduces the latency problem
Improves parallel execution due to reduced latency
Why Does 64-Bit Matter? Because…
Processor innovation is inevitable
Faster floating point and clock-speed headroom
Faster access to vast amounts of memory
Performance improvements in CPU microcode
64-bit computing is inevitable
Enables new classes of applications
Provides new headroom for scale-up scenarios
Why Does 64-Bit Windows Matter? Because…
It enables a new set of scenarios
Workstation
Mechanical design & analysis
Digital content creation (rendering)
Financial
Server
Computation-intensive applications
Large databases (BI/OLAP)
Large websites (caching, SSL)
The hardware architecture has taken an evolutionary step forward
Windows continues to evolve…
Early Adopter ‘Target Scenarios’
Scenarios recommended for early adopters
Computation-intensive applications
Simulations
Engineering, design
Other floating-point-intensive applications
Large databases
Data warehousing
Business intelligence
Web serving
Large caching
Web hosting
Secure communications
File and print server cluster environments
Scenarios tested by OEMs/ISVs
Memory Comparison

Address Space      64-bit Windows   32-bit Windows
Virtual Memory     16 TB            4 GB
Paging File        512 TB           16 TB
Hyperspace         8 GB             4 MB
Paged Pool         128 GB           470 MB
Non-Paged Pool     128 GB           256 MB
System Cache       1 TB             1 GB
Scaling For Any Environment
Scale Up – Scale Out
[Diagram: hardware scalability on one axis, software and infrastructure scalability on the other]
SHV Platforms
64-way SMP, 64-bit Itanium
Perf, NUMA enhancements
Hardware reliability (machine check)
Availability (multipath I/O, clustering)
High-density servers
Clustering, load balancing
Web farm session state
Application Center management
Scalable clusters
Developer Support
Development environment virtually identical to Win32®
Short learning curve makes porting easy (testing is the greater challenge)
Allows a single source base for both 32- and 64-bit environments
Simplifies porting
Reduces development costs
Software Development Kit (SDK) and Driver Development Kit (DDK) provide the necessary tools
Dev Tools
Platform SDK
Compiler (currently cross), libraries
http://www.microsoft.com/msdownload/platformsdk/sdkupdate/
Visual Studio® 64-bit release
Visual Studio 7.0 (.NET) 32-bit RTM plus “some delta”
Command line
nmake /f makefile using the pre-defined build environment
Visual Studio 6
Use the tools on http://msdevlab.msftlabs.com/build
Visual Studio .NET 32-bit (Visual Studio 7)
Use the tools on http://msdevlab.msftlabs.com/build
Agenda
What is scalability, why is it important?
What are the barriers to scalability?
How do you write scalable code?
Future directions
Scaling wide: 64-bit Windows®
Final thoughts
Final Thoughts
Computers will increase their degree of parallelism over the coming 2-3 years
Customers will expect a linear increase in performance, and certainly no decrease in performance
Develop scalability and performance test suites for your applications – know your limits!
Consider testing your applications
On high-end SMP systems with a high degree of parallelism
In the 64-bit environment
If you have any questions, please continue the discussion in the Microsoft Chinese newsgroups.
Join the Microsoft Chinese newsgroups: http://www.microsoft.com/china/community
© 2002 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.