
DEV441

Writing Faster Managed Code

Jonathan Hawkins, Lead Program Manager, Common Language Runtime

Outline

Introduction and design patterns

Managed code performance issues

Cost model

Tools

Wrap-up

Slow Software is Bad: Don’t Ship It

Symptoms

Locked UI – splash screen, wait cursor

Bad citizenship – paging, CPU utilization

Scalability – server farms

Ultimate causes

Inattentive engineering

Bad design – bad architecture, interfaces, data structures, algorithms

Waste not

Premature optimization...

Design Patterns: Faster Code, Smaller Data

Measure it – time and space

Speedup techniques: cache, batch, precompute, defer

Smarter recalc: incremental, progressive, background

Smaller data

Don’t hoard; size appropriately

Arrays vs. links; frugal interfaces
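As one illustration of the "cache" technique above, here is a minimal sketch that remembers the result of a costly computation so it runs once per distinct input. The `ResultCache` class, the `ExpensiveSquare` stand-in, and the `ComputeCalls` counter are hypothetical names invented for this example, and `Dictionary<TKey, TValue>` assumes a framework version with generics. The sketch is not thread-safe.

```csharp
using System.Collections.Generic;

// Hypothetical example type; not from the talk.
public static class ResultCache
{
    private static readonly Dictionary<int, long> cache = new Dictionary<int, long>();

    // Counts how often the expensive path actually runs.
    public static int ComputeCalls;

    // Stand-in for any costly computation.
    private static long ExpensiveSquare(int n)
    {
        ComputeCalls++;
        return (long)n * n;
    }

    // Returns the cached value when present; computes and stores it otherwise.
    public static long Get(int n)
    {
        long value;
        if (!cache.TryGetValue(n, out value))
        {
            value = ExpensiveSquare(n);
            cache[n] = value;
        }
        return value;
    }
}
```

Calling `ResultCache.Get(7)` twice runs the computation once. A cache like this trades space for time, so the "don't hoard" advice still applies: bound its size or let entries expire.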

Performance Anti-Patterns: Think (Twice)

Waiting on remote data

XML

Excessive OOP

Ignorance and Apathy

Not measuring

Not setting performance goals

Perf Process Patterns: “That which gets measured gets done”

Perf budgets, goals, requirements

Perf unit tests, regression tests

Process of “constant” improvement: measuring, tracking, refining, trend lines

Perf culture

Users: perf as a key feature

Devs: perf as a correctness issue


Why Managed Code?

Programmer productivity

Goodbye, corrupt heap debugging

Target modern requirements

FX: ++clean, ++consistent, ++streamlined

Better apps sooner

Performance barrier to adoption?

Real – improves with each release

Perceived – “blame it on managed code”

Reality – it’s “pedal to the metal”

The Challenge of Writing Fast Managed Code

We’re all newbies!

Learning how

May not be learning how much things cost

Everything is easier...

The Knowledge: Ildasm, debuggers, CLR Profiler, profilers, timing, vadump, events, Rotor

Managed Code: Close to the Machine

Not your father’s bytecode interpreter: Source → IL → native (JIT or NGEN)

Optimizing JIT compiler: constant folding; constant and copy propagation; common subexpression elimination; code motion of loop invariants; dead store and dead code elimination; register allocation; method inlining; loop unrolling (small loops/small bodies)

.NET Framework 1.1 NGEN does same opts

Disabled when debugging

Managed Data: Automatic Storage Management

Fast new; fast garbage collection

GC traces and compacts reachable object graph

>50 million objects per second

Generational GC Heaps

Gen0 – new objects – cache sized; fast GC

Gen1 – objects that survived a GC of gen0

Gen2 – objects that survived a GC of gen1 or gen2

Large object heap

Server GC: cache affinitive; concurrent; ASP.NET/hosted

Managed data costs space & time over its lifetime

Managed Data: Best Practices

Often performance == allocation profile

Short lived objects are cheap (not free)

Try not to churn old objects

Inspect with CLR Profiler

GC “gotchas”

Keeping refs to “dead” object graphs

Caches; weak references

Pinning

Boxing

Finalization ...
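To make the boxing "gotcha" concrete, here is a small sketch contrasting a collection that boxes every `int` with a plain `int[]` that does not. `BoxingDemo` and its method names are hypothetical and chosen only for illustration. Both methods return the same sum; the difference is the allocation profile, which a tool such as CLR Profiler would show.

```csharp
using System.Collections;

// Hypothetical example type; not from the talk.
public static class BoxingDemo
{
    // ArrayList stores object references, so each Add allocates a boxed int.
    public static long SumBoxed(int n)
    {
        ArrayList list = new ArrayList(n);
        for (int i = 0; i < n; i++)
        {
            list.Add(i);            // boxes i: one small heap object per element
        }
        long sum = 0;
        foreach (object o in list)
        {
            sum += (int)o;          // unboxes
        }
        return sum;
    }

    // int[] stores the values inline: one allocation for the whole array.
    public static long SumUnboxed(int n)
    {
        int[] values = new int[n];
        for (int i = 0; i < n; i++)
        {
            values[i] = i;
        }
        long sum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            sum += values[i];
        }
        return sum;
    }
}
```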

Managed Data: Finalization and the Dispose Pattern

Finalization: ~C(): non-deterministic resource cleanup

GC; object unreferenced; promote; queue finalizer

Costs – retains object and its objects; finalizer thread; bookkeeping; call

Use rarely; use Dispose Pattern

Implement IDisposable

Call GC.SuppressFinalize

Hold few obj fields and null them out ASAP

Dispose early; try/finally; C# using
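The bullets above can be sketched as follows. This is a minimal illustration, not the talk's own code: `NativeBuffer` and `DisposeDemo` are hypothetical names, and the unmanaged block from `Marshal.AllocHGlobal` simply stands in for any resource that needs explicit release.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical type for illustration: owns one block of unmanaged memory.
public sealed class NativeBuffer : IDisposable
{
    private IntPtr memory;

    public NativeBuffer(int bytes)
    {
        memory = Marshal.AllocHGlobal(bytes);
    }

    public bool IsDisposed
    {
        get { return memory == IntPtr.Zero; }
    }

    // Deterministic cleanup: release now, then tell the GC that the
    // finalizer no longer needs to run for this object.
    public void Dispose()
    {
        Release();
        GC.SuppressFinalize(this);
    }

    // Safety net only: runs on the finalizer thread if Dispose was never called.
    ~NativeBuffer()
    {
        Release();
    }

    private void Release()
    {
        if (memory != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(memory);
            memory = IntPtr.Zero;
        }
    }
}

public static class DisposeDemo
{
    public static bool Run()
    {
        NativeBuffer buffer = new NativeBuffer(256);
        using (buffer)
        {
            // Work with the buffer here.
        }
        return buffer.IsDisposed;   // Dispose ran when the using block exited
    }
}
```

The type holds a single field and no references to other objects, so even if it does reach the finalization queue it retains very little.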

Managed Code: Threading and Synchronization

Use the ThreadPool

Easy, self tuning, good citizen

QueueUserWorkItem, BeginInvoke

lock() – not cheap

Granularity trade-off – concurrency vs. cost

Scales much better in .NET 1.1

Consider Interlocked.Exchange, ReaderWriterLock
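A minimal sketch of these points, assuming a simple shared counter as the workload: work is queued with `ThreadPool.QueueUserWorkItem`, and the same increment is done once under `lock` and once with `Interlocked.Increment`. The `Counters` class and its members are hypothetical names for this example. Both approaches produce correct totals; the interlocked version avoids taking a monitor for a single-word update.

```csharp
using System.Threading;

// Hypothetical example type; not from the talk.
public static class Counters
{
    private static readonly object gate = new object();
    private static int lockedCount;
    private static int interlockedCount;

    // Runs workItems pool work items, each incrementing both counters perItem times.
    // Returns { lockedCount, interlockedCount }.
    public static int[] Run(int workItems, int perItem)
    {
        lockedCount = 0;
        interlockedCount = 0;
        if (workItems <= 0)
        {
            return new int[] { 0, 0 };
        }

        int pending = workItems;
        ManualResetEvent done = new ManualResetEvent(false);

        for (int t = 0; t < workItems; t++)
        {
            ThreadPool.QueueUserWorkItem(delegate(object state)
            {
                for (int i = 0; i < perItem; i++)
                {
                    lock (gate) { lockedCount++; }                 // monitor enter/exit
                    Interlocked.Increment(ref interlockedCount);   // single atomic operation
                }
                // The last work item to finish signals completion.
                if (Interlocked.Decrement(ref pending) == 0)
                {
                    done.Set();
                }
            });
        }

        done.WaitOne();
        done.Close();
        return new int[] { lockedCount, interlockedCount };
    }
}
```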

Managed Code: Reflection

Slower and larger than direct use

Prefer is/as to typeof() ==

Member lookup/enum slow but cached

Reflective invoke is quite slow

Lookup, overload res., security, stack frame

Activator.CreateInstance too

Prefer MethodInfo.Invoke to Type.InvokeMember

Beware of code that uses reflection

Late binding in VB.NET, use Option Explicit On and Option Strict On
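The sketch below contrasts the reflective and direct forms of two operations from this slide: a type test and a method call. `Shape`, `Square`, and `ReflectionDemo` are hypothetical types made up for the example. It does not measure anything; it only shows which form is which, with the `MethodInfo` looked up once and kept rather than fetched on every call.

```csharp
using System;
using System.Reflection;

// Hypothetical example types; not from the talk.
public class Shape
{
    public virtual double Area() { return 0.0; }
}

public class Square : Shape
{
    private readonly double side;
    public Square(double side) { this.side = side; }
    public override double Area() { return side * side; }
}

public static class ReflectionDemo
{
    // Reflective type test: compares Type objects, and matches the exact type only.
    public static bool IsSquareByTypeof(object o)
    {
        return o != null && o.GetType() == typeof(Square);
    }

    // Preferred: the is operator compiles to an isinst check.
    public static bool IsSquareByIs(object o)
    {
        return o is Square;
    }

    // Member lookup is slow, so do it once and keep the result.
    private static readonly MethodInfo areaMethod = typeof(Shape).GetMethod("Area");

    // Reflective invoke: argument packaging, checks, and a boxed return value.
    public static double AreaViaReflection(Shape s)
    {
        return (double)areaMethod.Invoke(s, null);
    }

    // Direct virtual call: the form to prefer whenever the type is known.
    public static double AreaDirect(Shape s)
    {
        return s.Area();
    }
}
```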

Managed Code: P/Invoke and COM Interop

Efficient, but frequent calls add up

Costs depend on marshaling

Primitive types and arrays of same: cheap

Others, not; e.g. Unicode to ANSI strings

COM interop – learn threading models

Avoid STA threaded components

Avoid calling or being callable via IDispatch

Mitigate interop call costs: chunky interfaces; move to managed code


C/C++ Cost Models: The Gut-Feel Cost of a Line of Code

C – close to the machine

WYWIWYG; int = * + call → instructions

C++ (OOP)

C features: same cost

New features: additional, hidden costs

Ctors; SI, MI, VI; virtual; PMs; EH; RTTI

What does a function cost?

Towards a Managed Code Cost Model

Optimized native code

C features: similar cost?

OOP features: similar cost?

Let’s measure it

Simple timing loops, unrolled some

Modified to prevent CSE/dead code elim.

50 ms each (2^18 to 2^30 iterations)

Measured on 1.1 GHz P-III laptop, Win XP

Disclaimers: uncertainty, subj. to change

Costs: Math

Avg (ns)  Primitive     Avg (ns)  Primitive
 1.0      int add        1.3      float add
 1.0      int sub        1.4      float sub
 2.7      int mul        2.0      float mul
35.9      int div       27.7      float div
 2.1      int shift
 2.1      long add       1.5      double add
 2.1      long sub       1.5      double sub
34.2      long mul       2.1      double mul
50.1      long div      27.7      double div
 5.1      long shift

Nicely optimized and run at full native code speed

Costs: Method Calls

Inlining – !virtual, small, simple, no try

Instance method call-site null this check

Virtual – like C++: (*this->MT[m])(…)

Interface – quadruple indirect: (*this->MT->itfmap[i]->MT[m])(…)

Disclaimers: !inlining, branch prediction, arguments

Avg (ns)  Primitive               Avg (ns)  Primitive
 0.2      inlined static call      5.4      virtual call
 6.1      static call              6.6      interface call
 1.1      inlined instance call
 6.8      instance call

Costs: Construction

class A { int a; }     // L1
class B : A { int b; } // L2
class C : B { int c; } // L3 etc.

Allocation / management / GC cost

Value types: “0”

Ref types: fast but ~proportional to size

Construction cost

All fields 0-initialized

Small ctors can be inlined

Larger ctors incur up to 1 call/level

Avg (ns)  Primitive         Avg (ns)  Primitive
 2.6      new valtype L1    22.9      new rt ctor L1
 6.4      new valtype L3    32.7      new rt ctor L3
22.0      new reftype L1    28.6      new rt no-inl L1
30.2      new reftype L3    50.6      new rt no-inl L3

Costs: Casts and IsInsts

Safe, secure, verifiable → type safety

Cast may throw exception

IsInst will not – is and as operators

Up casts always safe and free

Down casts incur a helper function call

Avg (ns)  Primitive              Avg (ns)  Primitive
 0.4      cast up 1               0.8      isinst up 1
 0.3      cast down 0             0.8      isinst down 0
 8.9      cast down 1             6.3      isinst down 1
 9.8      cast (up 2) down 1     10.7      isinst (up 2) down 1
 8.7      cast down 3             6.1      isinst down 3
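For readers less familiar with the two forms, here is a small sketch of the behavior described on this slide: a failed cast throws, a failed `as` returns null, and an up cast needs no check. The types `Animal` and `Dog` and the `CastDemo` class are hypothetical, introduced only for this example.

```csharp
using System;

// Hypothetical example types; not from the talk.
public class Animal { }
public class Dog : Animal { }

public static class CastDemo
{
    // Up cast: always safe, no runtime check required.
    public static Animal UpCast(Dog dog)
    {
        return dog;
    }

    // Down cast with as (isinst): yields null instead of throwing.
    public static bool IsDogViaAs(Animal animal)
    {
        Dog dog = animal as Dog;
        return dog != null;
    }

    // Down cast with a cast expression (castclass): throws on mismatch.
    public static bool CastToDogThrows(Animal animal)
    {
        try
        {
            Dog dog = (Dog)animal;
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```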

Costs: Write Barriers

Gen0 GC: trace roots and gen0 only?

Could miss gen0 refs from gen1/gen2

Write barrier notes obj ref field stores

contact.address = newAddress;

Tracks refs to newer gen objects

Not needed for locals, non-objects

Incurs a helper function call

Avg (ns)  Primitive
 6.4      write barrier
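The following sketch annotates which stores fall into each category from this slide. The `Contact` and `Address` types echo the slide's `contact.address = newAddress;` line, but their definitions and the `WriteBarrierDemo` class are assumptions made for illustration. The comments describe the slide's rule of thumb, not a measured result.

```csharp
// Hypothetical example types modeled on the slide's contact.address example.
public class Address
{
    public string City;
}

public class Contact
{
    public Address address;   // object reference field
    public int age;           // non-object field
}

public static class WriteBarrierDemo
{
    public static Contact Update(Contact contact, Address newAddress)
    {
        Address local = newAddress;   // store to a local: no write barrier
        contact.age = 42;             // store of a non-object value: no write barrier
        contact.address = local;      // object reference stored into a heap object: write barrier
        return contact;
    }
}
```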

Costs: Array Bounds Checks

Avg (ns)  Min (ns)  Primitive
 1.9       1.9      load int array elem
 1.9       1.9      store int array elem
 2.5       2.5      load obj array elem
16.0      16.0      store obj array elem

For productivity and runtime integrity

Checks index against array.Length

Inlined, optimized – inexpensive

Range check elimination: for (i=0; i < a.Length; i++) …a[i]…

Helper call for store object array elt.

Bounds check, type check, write barrier
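A short sketch of both points, using hypothetical names (`BoundsCheckDemo` and its methods). The first method uses the loop shape from the slide, where the bound is the array's own `Length`. The second shows why an object array store needs a type check at all: array covariance lets a `string[]` be viewed as `object[]`, so the runtime has to verify each stored element.

```csharp
using System;

// Hypothetical example type; not from the talk.
public static class BoundsCheckDemo
{
    // Loop bounded by a.Length: the shape the slide gives for range check elimination.
    public static int Sum(int[] a)
    {
        int sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i];
        }
        return sum;
    }

    // Storing into an object array is type checked at run time.
    public static bool MismatchedStoreThrows()
    {
        object[] slots = new string[1];   // array covariance: legal at compile time
        try
        {
            slots[0] = 42;                // a boxed int is not a string
            return false;
        }
        catch (ArrayTypeMismatchException)
        {
            return true;
        }
    }
}
```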

Summary: A Managed Code Cost Model

Like C/C++, close to the machine

~1 ns: int, float = * + - * ==

~6 ns: calls (perfectly predicted)

Unlike C++

~20-40 ns: new small obj + gen0 GC, box

~6-8 ns: casts, write barriers

~16 ns: object[] stores

Reflection? Think 100 times slower

(Get Real) Consider Computer Architecture

Cache misses, page faults

1983: 1 MIPS; 300 ns DRAM; 25 ms disk

2003: 10 BOPS; 100 ns DRAM; 10 ms disk

“Branch-predicting out-of-order superscalar trace-cache RISC w/ 3L data caches”

Issue 10,000 ops in 1 μs – or 10 DRAM reads

Full cache miss – 1,000 ops

Page fault – 100 M ops!

100 ns full cache miss >> any CLR op’n

Locality of reference matters
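The operation counts on this slide follow from the 2003 figures by simple arithmetic, worked here in code. `OpsBudget` is a hypothetical helper; the only inputs are the slide's own numbers (10 billion operations per second, 100 ns DRAM, 10 ms disk).

```csharp
// Hypothetical helper; the constants come from the slide's 2003 figures.
public static class OpsBudget
{
    // 10 BOPS: ten billion operations per second.
    public const long OpsPerSecond = 10000000000L;

    // Operations that could have been issued during a stall of the given length.
    public static long OpsDuring(long nanoseconds)
    {
        return OpsPerSecond * nanoseconds / 1000000000L;
    }
}
```

With these inputs, `OpsBudget.OpsDuring(1000)` (1 μs) gives 10,000, `OpsDuring(100)` (one full cache miss to DRAM) gives 1,000, and `OpsDuring(10000000)` (one 10 ms page fault) gives 100,000,000, matching the slide.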


Tools

Inspectors: Ildasm, debuggers – beware “debug mode”

Measurers: code profilers, perfmon, CLR Profiler, vadump

Simple timing loops

[...InteropServices.DllImport("KERNEL32")]
private static extern bool QueryPerformanceCounter(ref long lpCount);

[...InteropServices.DllImport("KERNEL32")]
private static extern bool QueryPerformanceFrequency(ref long lpFreq);

Rotor
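A simple timing loop along the lines described in the cost model section might look like the sketch below. It is an assumption-laden illustration, not the talk's harness: it uses `System.Diagnostics.Stopwatch`, which arrived in framework versions later than the 1.1 release discussed here, in place of the raw `QueryPerformanceCounter` calls, and `TimingLoop`, `sink`, and the measured loop body are hypothetical. It doubles the iteration count from 2^18 up to 2^30 until a run lasts at least 50 ms, and stores the result in a static field so the loop is not removed as dead code.

```csharp
using System.Diagnostics;

// Hypothetical timing harness; not the talk's own code.
public static class TimingLoop
{
    // The loop result is written here so the JIT cannot discard the loop body.
    private static int sink;

    // Returns the average time per iteration in nanoseconds for an int add loop.
    public static double NanosecondsPerIteration(out int iterationsUsed)
    {
        for (int iterations = 1 << 18; ; iterations <<= 1)
        {
            int x = 0;
            Stopwatch watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                x += i;   // operation under test; int overflow wraps silently by default
            }
            watch.Stop();
            sink = x;

            // Accept the run once it is long enough to time reliably,
            // or when the iteration cap is reached.
            if (watch.ElapsedMilliseconds >= 50 || iterations == 1 << 30)
            {
                iterationsUsed = iterations;
                return watch.Elapsed.TotalMilliseconds * 1e6 / iterations;
            }
        }
    }
}
```

Run it from a release build without a debugger attached, since the slides note that JIT optimizations are disabled when debugging. The figure includes loop overhead, so treat it as a rough gut-feel number rather than a precise cost.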


In Conclusion… The Secret to Faster Managed Code

“There is no magic faster pixie dust!”

Look in the mirror

You have the power and the responsibility

Mantra: Set goals, measure, understand the platform

Resources

Managed code perf papers at MSDN .NET Developer Center [http://msdn.microsoft.com/netframework/]

“GC Basics and Performance Hints”

“Writing Faster Managed Code: Know What Things Cost”

CLR Profiler (same site)

SSCLI [http://msdn.microsoft.com/net/sscli]

Stutz et al., Shared Source CLI Essentials

Community Resources

http://www.microsoft.com/communities/default.mspx

Most Valuable Professional (MVP): http://www.mvp.support.microsoft.com/

Newsgroups: Converse online with Microsoft Newsgroups, including Worldwide. http://www.microsoft.com/communities/newsgroups/default.mspx

User Groups: Meet and learn with your peers. http://www.microsoft.com/communities/usergroups/default.mspx


© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
