DEV441
Writing Faster Managed Code
Jonathan Hawkins
Lead Program Manager
Common Language Runtime
Outline
Introduction and design patterns
Managed code performance issues
Cost model
Tools
Wrap-up
Slow Software is Bad: Don’t Ship It
Symptoms:
Locked UI – splash screen, wait cursor
Bad citizenship – paging, CPU utilization
Scalability – server farms
Ultimate causes:
Inattentive engineering
Bad design – bad architecture, interfaces, data structures, algorithms
Waste not
Premature optimization...
Design Patterns: Faster Code, Smaller Data
Measure it – time and space
Speedup techniques:
Cache, batch, precompute, defer
Smarter recalc:
Incremental, progressive, background
Smaller data:
Don’t hoard; size appropriately
Arrays vs. links; frugal interfaces
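The "cache, precompute" technique can be sketched in C# as a small memoizing wrapper. This is only an illustration under stated assumptions: `MemoCache`, `Get`, and `Misses` are hypothetical names, the sketch uses generics (which arrived after the .NET 1.1 release the talk targets), and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the talk): remembers results of a pure,
// expensive function so each distinct key is computed only once.
public sealed class MemoCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _cache = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _compute;

    // Number of times the underlying function actually ran.
    public int Misses { get; private set; }

    public MemoCache(Func<TKey, TValue> compute)
    {
        if (compute == null) throw new ArgumentNullException("compute");
        _compute = compute;
    }

    public TValue Get(TKey key)
    {
        TValue value;
        if (!_cache.TryGetValue(key, out value))
        {
            Misses++;                 // cache miss: pay the full cost once
            value = _compute(key);
            _cache[key] = value;
        }
        return value;                 // cache hit: one dictionary lookup
    }
}
```

For example, `new MemoCache<int, long>(n => (long)n * n)` computes each square once. The cache here is unbounded, which runs against the "don't hoard" advice on the same slide; a real cache would need a size limit or an eviction policy.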
Performance Anti-Patterns
Think (Twice):
Waiting on remote data
XML
Excessive OOP
Ignorance and Apathy:
Not measuring
Not setting performance goals
Perf Process Patterns
“That which gets measured gets done”
Perf budgets, goals, requirements
Perf unit tests, regression tests
Process of “constant” improvement:
Measuring, tracking, refining, trend lines
Perf culture:
Users: perf as a key feature
Devs: perf as a correctness issue
Outline
Introduction and design patterns
Managed code performance issues
Cost model
Tools
Wrap-up
Why Managed Code?
Programmer productivity:
Goodbye, corrupt heap debugging
Target modern requirements
FX: ++clean, ++consistent, ++streamlined
Better apps sooner
Performance barrier to adoption?
Real – improves with each release
Perceived – “blame it on managed code”
Reality – it’s “pedal to the metal”
The Challenge of Writing Fast Managed Code
We’re all newbies!
Learning how
May not be learning how much things cost
Everything is easier...
The Knowledge:
Ildasm, debuggers, CLR Profiler, profilers, timing, vadump, events, Rotor
Managed Code: Close to the Machine
Not your father’s bytecode interpreter:
Source → IL → native (JIT or NGEN)
Optimizing JIT compiler:
Constant folding; constant and copy propagation; common subexpression elimination; code motion of loop invariants; dead store and dead code elimination; register allocation; method inlining; loop unrolling (small loops/small bodies)
.NET Framework 1.1 NGEN does same opts
Disabled when debugging
Managed Data: Automatic Storage Management
Fast new; fast garbage collection:
GC traces and compacts reachable object graph
>50 million objects per second
Generational GC heaps:
Gen0 – new objects – cache sized; fast GC
Gen1 – objects that survived a GC of gen0
Gen2 – objects that survived a GC of gen1 or gen2
Large object heap
Server GC:
Cache affinitive; concurrent; ASP.NET/hosted
Managed data costs space & time over its lifetime
Managed Data: Best Practices
Often performance == allocation profile:
Short lived objects are cheap (not free)
Try not to churn old objects
Inspect with CLR Profiler
GC “gotchas”:
Keeping refs to “dead” object graphs
Caches; weak references
Pinning
Boxing
Finalization ...
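The boxing "gotcha" can be illustrated with a short C# sketch. It assumes generic collections (`List<int>`), which were not available in the .NET 1.1 release the talk covers; `BoxingDemo` and its method names are made up for this example.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical demo class (not from the talk).
public static class BoxingDemo
{
    // ArrayList stores object references, so every Add boxes the int into
    // a new heap object and every read unboxes it again.
    public static long SumBoxed(int count)
    {
        ArrayList list = new ArrayList(count);
        for (int i = 0; i < count; i++) list.Add(i);   // boxing allocation
        long sum = 0;
        foreach (object o in list) sum += (int)o;      // unboxing
        return sum;
    }

    // List<int> stores the ints directly in its backing array, so there is
    // no per-element allocation.
    public static long SumUnboxed(int count)
    {
        List<int> list = new List<int>(count);
        for (int i = 0; i < count; i++) list.Add(i);
        long sum = 0;
        foreach (int v in list) sum += v;
        return sum;
    }
}
```

Both methods return the same sum; the difference is in the allocation profile, which is the kind of thing the slide suggests inspecting with CLR Profiler.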
Managed Data: Finalization and the Dispose Pattern
Finalization: ~C(): non-deterministic resource cleanup
GC; object unref’d; promote; queue finalizer
Costs – retains object and its objects; finalizer thread; bookkeeping; call
Use rarely; use Dispose Pattern:
Implement IDisposable
Call GC.SuppressFinalize
Hold few obj fields and null them out ASAP
Dispose early; try/finally; C# using
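A minimal sketch of the Dispose pattern described on this slide follows. The class name `UnmanagedBuffer` and its `IsDisposed` property are hypothetical; the block of unmanaged memory from `Marshal.AllocHGlobal` merely stands in for whatever scarce resource a real class would own.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical resource wrapper (not from the talk).
public class UnmanagedBuffer : IDisposable
{
    private IntPtr _handle;   // the only field: keep finalizable objects small

    public UnmanagedBuffer(int bytes)
    {
        _handle = Marshal.AllocHGlobal(bytes);
    }

    public bool IsDisposed
    {
        get { return _handle == IntPtr.Zero; }
    }

    // Deterministic cleanup path.
    public void Dispose()
    {
        ReleaseHandle();
        GC.SuppressFinalize(this);   // cleanup is done, so skip finalization
    }

    // Safety net only: runs on the finalizer thread if Dispose was never called.
    ~UnmanagedBuffer()
    {
        ReleaseHandle();
    }

    private void ReleaseHandle()
    {
        if (_handle != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_handle);
            _handle = IntPtr.Zero;   // makes a second Dispose call harmless
        }
    }
}
```

Callers would normally write `using (UnmanagedBuffer buf = new UnmanagedBuffer(1024)) { ... }`, which the C# compiler expands into the try/finally form the slide mentions.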
Managed Code: Threading and Synchronization
Use the ThreadPool:
Easy, self tuning, good citizen
QueueUserWorkItem, BeginInvoke
lock() – not cheap:
Granularity trade-off – concurrency vs. cost
Scales much better in .NET 1.1
Consider Interlocked.Exchange, ReaderWriterLock
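As a rough sketch of these two recommendations together, the following C# queues work to the `ThreadPool` and updates a shared counter with `Interlocked` operations instead of `lock()`. The class and method names are hypothetical, and the lambda syntax is newer than the .NET 1.1 release discussed in the talk. The slide names `Interlocked.Exchange`; this sketch uses the sibling methods `Interlocked.Increment` and `Interlocked.Decrement`.

```csharp
using System.Threading;

// Hypothetical demo class (not from the talk).
public static class PoolCounter
{
    // Runs 'workItems' callbacks on the thread pool; each one bumps a shared
    // counter 'incrementsPerItem' times without taking a lock.
    public static int CountInParallel(int workItems, int incrementsPerItem)
    {
        if (workItems <= 0) return 0;

        int total = 0;
        int pending = workItems;
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            for (int w = 0; w < workItems; w++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    for (int i = 0; i < incrementsPerItem; i++)
                        Interlocked.Increment(ref total);   // atomic add, no lock

                    // The last work item to finish wakes the caller.
                    if (Interlocked.Decrement(ref pending) == 0)
                        done.Set();
                });
            }
            done.WaitOne();
        }
        return total;
    }
}
```

With a plain `total++` in place of `Interlocked.Increment`, updates could be lost under contention. Whether an interlocked operation or a coarser lock is cheaper for a given workload is something to measure rather than assume.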
Managed Code: Reflection
Slower and larger than direct use
Prefer is/as to typeof() ==
Member lookup/enum slow but cached
Reflective invoke is quite slow:
Lookup, overload resolution, security, stack frame
Activator.CreateInstance too
Prefer MethodInfo.Invoke to Type.InvokeMember
Beware of code that uses reflection:
Late binding in VB.NET; use Option Explicit On and Option Strict On
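The three ways of calling a method that this slide compares can be sketched as follows. `ReflectionDemo` and its method names are hypothetical; `string.ToUpper` is used only because it is a convenient parameterless method.

```csharp
using System;
using System.Reflection;

// Hypothetical demo class (not from the talk).
public static class ReflectionDemo
{
    // Member lookup is done once here and the MethodInfo is reused.
    private static readonly MethodInfo ToUpperMethod =
        typeof(string).GetMethod("ToUpper", Type.EmptyTypes);

    // Type.InvokeMember repeats name lookup and overload resolution on every call.
    public static string UpperViaInvokeMember(string s)
    {
        return (string)typeof(string).InvokeMember(
            "ToUpper",
            BindingFlags.InvokeMethod | BindingFlags.Public | BindingFlags.Instance,
            null, s, new object[0]);
    }

    // MethodInfo.Invoke on a cached MethodInfo skips the lookup but still pays
    // for argument packaging and the reflective call itself.
    public static string UpperViaCachedMethodInfo(string s)
    {
        return (string)ToUpperMethod.Invoke(s, null);
    }

    // Direct call: the baseline that the reflective forms are slower than.
    public static string UpperDirect(string s)
    {
        return s.ToUpper();
    }
}
```

All three return the same string; only the cost differs. The relative costs would need to be measured on the runtime in use.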
Managed Code: P/Invoke and COM Interop
Efficient, but frequent calls add up
Costs depend on marshaling:
Primitive types and arrays of same: cheap
Others, not; e.g. Unicode to ANSI strings
COM interop – learn threading models:
Avoid STA threaded components
Avoid calling or being callable via IDispatch
Mitigate interop call costs:
Chunky interfaces; move to managed code
Outline
Introduction and design patterns
Managed code performance issues
Cost model
Tools
Wrap-up
C/C++ Cost Models: The Gut-Feel Cost of a Line of Code
C – close to the machine:
WYWIWYG; int = * + call → instructions
C++ (OOP):
C features: same cost
New features: additional, hidden costs:
Ctors; SI, MI, VI; virtual; PMs; EH; RTTI
What does a function cost?
Towards a Managed Code Cost Model
Optimized native code:
C features: similar cost?
OOP features: similar cost?
Let’s measure it:
Simple timing loops, unrolled some
Modified to prevent CSE/dead code elim.
50 ms each (2^18 to 2^30 iterations)
Measured on 1.1 GHz P-III laptop, Win XP
Disclaimers: uncertainty, subject to change
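A simple timing loop in the spirit of this slide might look like the sketch below. It is not the harness used for the talk's measurements: `TimingLoop` and `NanosecondsPerCall` are hypothetical names, `Stopwatch` is a later addition to the framework standing in for a raw high-resolution counter, and the loop calls a delegate rather than unrolling the primitive inline, so the delegate call itself contributes to every figure it reports.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the talk).
public static class TimingLoop
{
    // Doubles the iteration count until one run lasts at least
    // 'minMilliseconds', then returns the average nanoseconds per call.
    public static double NanosecondsPerCall(Action body, int minMilliseconds)
    {
        body();   // warm-up call so JIT compilation is not timed

        long iterations = 1 << 10;
        while (true)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (long i = 0; i < iterations; i++) body();
            sw.Stop();

            if (sw.ElapsedMilliseconds >= minMilliseconds || iterations >= (1L << 30))
                return sw.Elapsed.TotalMilliseconds * 1e6 / iterations;

            iterations *= 2;
        }
    }
}
```

A call such as `TimingLoop.NanosecondsPerCall(() => { sink += 1; }, 50)`, where `sink` is a variable the caller reads afterwards, keeps the measured work from being removed as dead code. As the slide warns, results vary by machine and run, and should be built in release mode without a debugger attached.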
Costs: Math
Avg (ns)  Primitive     Avg (ns)  Primitive
1.0       int add       1.3       float add
1.0       int sub       1.4       float sub
2.7       int mul       2.0       float mul
35.9      int div       27.7      float div
2.1       int shift
2.1       long add      1.5       double add
2.1       long sub      1.5       double sub
34.2      long mul      2.1       double mul
50.1      long div      27.7      double div
5.1       long shift
Nicely optimized and run at full native code speed
Costs: Method Calls
Inlining – !virtual, small, simple, no try
Instance method call-site null this check
Virtual – like C++: (*this->MT[m])(…)
Interface – quadruple indirect: (*this->MT->itfmap[i]->MT[m])(…)
Disclaimers: !inlining, branch prediction, arguments
Avg (ns)  Primitive               Avg (ns)  Primitive
0.2       inlined static call     5.4       virtual call
6.1       static call             6.6       interface call
1.1       inlined instance call
6.8       instance call
Costs: Construction
class A { int a; }      // L1
class B : A { int b; }  // L2
class C : B { int c; }  // L3 etc.
Allocation / management / GC cost
Value types: “0”
Ref types: fast but ~proportional to size
Construction cost
All fields 0-initialized
Small ctors can be inlined
Larger ctors incur up to 1 call/level
Avg (ns)  Primitive         Avg (ns)  Primitive
2.6       new valtype L1    22.9      new rt ctor L1
6.4       new valtype L3    32.7      new rt ctor L3
22.0      new reftype L1    28.6      new rt no-inl L1
30.2      new reftype L3    50.6      new rt no-inl L3
Costs: Casts and IsInsts
Safe, secure, verifiable → type safety
Cast may throw exception
IsInst will not – is and as operators
Up casts always safe and free
Down casts incur a helper function call
Avg (ns)  Primitive             Avg (ns)  Primitive
0.4       cast up 1             0.8       isinst up 1
0.3       cast down 0           0.8       isinst down 0
8.9       cast down 1           6.3       isinst down 1
9.8       cast (up 2) down 1    10.7      isinst (up 2) down 1
8.7       cast down 3           6.1       isinst down 3
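The difference between a cast and the `as` operator (IsInst) can be shown with a short C# sketch. The types `BaseShape` and `RoundShape` and the class `CastDemo` are invented for this example.

```csharp
using System;

// Hypothetical types (not from the talk).
public class BaseShape { }
public class RoundShape : BaseShape { public double Radius = 1.0; }

public static class CastDemo
{
    // Down cast: checked at run time; throws InvalidCastException
    // when 's' is not actually a RoundShape.
    public static double RadiusByCast(BaseShape s)
    {
        return ((RoundShape)s).Radius;
    }

    // 'as' performs the same type test but yields null instead of throwing.
    public static double RadiusByAs(BaseShape s)
    {
        RoundShape r = s as RoundShape;
        return r != null ? r.Radius : 0.0;
    }

    // Up cast: always safe, needs no run-time check.
    public static BaseShape Widen(RoundShape r)
    {
        return r;
    }
}
```

A pattern worth avoiding is testing with `is` and then casting, since that performs the type check twice; a single `as` followed by a null test does the work once.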
Costs: Write Barriers
Gen0 GC: trace roots and gen0 only?
Could miss gen0 refs from gen1/gen2
Write barrier notes obj ref field stores:
contact.address = newAddress;
Tracks refs to newer gen objects
Not needed for locals, non-objects
Incurs a helper function call
Avg (ns)  Primitive
6.4       write barrier
Costs: Array Bounds Checks
Avg (ns)  Min (ns)  Primitive
1.9       1.9       load int array elem
1.9       1.9       store int array elem
2.5       2.5       load obj array elem
16.0      16.0      store obj array elem
For productivity and runtime integrity
Checks index against array.Length
Inlined, optimized – inexpensive
Range check elimination:
for (i=0; i < a.Length; i++) … a[i] …
Helper call for store object array element:
Bounds check, type check, write barrier
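The loop shape that makes range check elimination possible can be contrasted with one that may not qualify. This is a sketch with hypothetical names; whether the check is actually removed depends on the JIT version, so the comments describe what may happen rather than what is guaranteed.

```csharp
// Hypothetical demo class (not from the talk).
public static class BoundsCheckDemo
{
    // The loop bound is a.Length itself, the form shown on the slide, so the
    // JIT can prove every a[i] is in range and may drop the per-access check.
    public static long SumCanonical(int[] a)
    {
        long sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i];
        return sum;
    }

    // The bound comes from a separate variable, so the JIT cannot relate it
    // to a.Length as easily and may keep the check on each access.
    public static long SumWithSeparateBound(int[] a, int count)
    {
        long sum = 0;
        for (int i = 0; i < count; i++) sum += a[i];
        return sum;
    }
}
```

Both methods compute the same result when `count` equals `a.Length`. Inspecting the generated native code in a debugger, with optimizations enabled, is the way to confirm what a particular runtime does.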
Summary: A Managed Code Cost Model
Like C/C++, close to the machine:
~1 ns: int, float = * + - * ==
~6 ns: calls (perfectly predicted)
Unlike C++:
~20-40 ns: new small obj + gen0 GC, box
~6-8 ns: casts, write barriers
~16 ns: object[] stores
Reflection? Think 100 times slower
(Get Real) Consider Computer Architecture
Cache misses, page faults
1983: 1 MIPS; 300 ns DRAM; 25 ms disk
2003: 10 BOPS; 100 ns DRAM; 10 ms disk
“Branch-predicting out-of-order superscalar trace-cache RISC w/ 3L data caches”
Issue 10,000 ops in 1 μs – or 10 DRAM reads
Full cache miss – 1,000 ops
Page fault – 100 M ops!
100 ns full cache miss >> any CLR op’n
Locality of reference matters
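The operation counts on this slide follow from the 2003 figures by simple multiplication. The worked arithmetic below assumes that "BOPS" means billions of operations per second and that a page fault costs one 10 ms disk access.

```latex
\begin{align*}
10\ \text{BOPS} &= 10^{10}\ \text{ops/s} \\
1\ \mu\text{s} \times 10^{10}\ \text{ops/s} &= 10^{4}\ \text{ops}
  \quad \text{(the same time as } 10 \times 100\ \text{ns DRAM reads)} \\
100\ \text{ns} \times 10^{10}\ \text{ops/s} &= 10^{3}\ \text{ops}
  \quad \text{(one full cache miss)} \\
10\ \text{ms} \times 10^{10}\ \text{ops/s} &= 10^{8}\ \text{ops}
  \quad \text{(one page fault, i.e. 100 M ops)}
\end{align*}
```

Against the cost model's 1 to 40 ns primitives, a single 100 ns cache miss outweighs almost any individual CLR operation, which is the slide's point about locality.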
Outline
Introduction and design patterns
Managed code performance issues
Cost model
Tools
Wrap-up
Tools
Inspectors:
Ildasm, debuggers – beware “debug mode”
Measurers:
Code profilers, perfmon, CLR Profiler, vadump
Simple timing loops:
[...InteropServices.DllImport("KERNEL32")]
private static extern bool QueryPerformanceCounter(ref long lpCount);
[...InteropServices.DllImport("KERNEL32")]
private static extern bool QueryPerformanceFrequency(ref long lpFreq);
Rotor
Outline
Introduction
Managed code performance issues
Cost model
Tools
Wrap-up
In Conclusion… The Secret to Faster Managed Code
“There is no magic faster pixie dust!”
Look in the mirror
You have the power and the responsibility
Mantra: Set goals, measure, understand the platform
Resources
Managed code perf papers at MSDN .NET Developer Center [http://msdn.microsoft.com/netframework/]
“GC Basics and Performance Hints”
“Writing Faster Managed Code: Know What Things Cost”
CLR Profiler (same site)
SSCLI [http://msdn.microsoft.com/net/sscli]
Stutz et al., Shared Source CLI Essentials
Community Resources
http://www.microsoft.com/communities/default.mspx
Most Valuable Professional (MVP)
http://www.mvp.support.microsoft.com/
Newsgroups: converse online with Microsoft Newsgroups, including Worldwide
http://www.microsoft.com/communities/newsgroups/default.mspx
User Groups: meet and learn with your peers
http://www.microsoft.com/communities/usergroups/default.mspx
© 2003 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.