Building Big: Lessons learned from Windows Azure customers – Part Two
Mark Simms (@mabsimms), Principal Program Manager, Microsoft
Simon Davies (@simongdavies), Windows Azure Technical Specialist, Microsoft
Session 3-030



Page 2: Building Big: Lessons learned from Windows Azure  customers – Part Two

Session Objectives
Designing large-scale services requires careful design and architecture choices. This session will explore customer deployments on Azure and illustrate the key choices, tradeoffs and learnings.
Two-part session:
• Part 1: Building for Scale
• Part 2: Building for Availability

Page 3: Building Big: Lessons learned from Windows Azure  customers – Part Two

Other Great Sessions
This session will focus on architecture and design choices for delivering highly available services. If this isn't a compelling topic, there are many other great sessions happening right now!

Room | Level | Title | Presenter
Nexus/Normandy | 300 | Designing awesome XAML apps in Visual Studio and Blend for Windows 8 and Windows Phone 8 | Jeffrey Ferman
Trident/Thunder | 300 | Developing Mobile Solutions with Windows Azure Part II | Nick Harris, Chris Risner
Odyssey | 200 | Desktop apps: WPF 4.5 and Visual Studio 2012 | Pete Brown (DPE)
Magellan | 200 | WP8 HTML5/IE10 for Developers | Rick Xu, Jorge Peraza

Page 4: Building Big: Lessons learned from Windows Azure  customers – Part Two

Agenda
Building Big – the availability challenge:
• Everything will fail – design for failure
• Get insight – instrument everything

Page 5: Building Big: Lessons learned from Windows Azure  customers – Part Two

Designing and Deploying Internet-Scale Services
James Hamilton, https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf

Part 1: Design for Scale
• Partition the service
• Support geo-distribution
• Optimize for density

Part 2: Design for Availability
Design for failure:
• Do not trust underlying components
• Decouple components
• Avoid single points of failure
Instrument everything:
• Implement inter-service monitoring and alerting
• Instrument for production testing
• Configurable logging

Page 6: Building Big: Lessons learned from Windows Azure  customers – Part Two

What are the 9’s?

Page 7: Building Big: Lessons learned from Windows Azure  customers – Part Two

The Hard Reality of the 9’s

Page 8: Building Big: Lessons learned from Windows Azure  customers – Part Two

Design for Failure
Given enough time and pressure, everything fails. How will your application behave?
• Gracefully handle failure modes, continue to deliver value
• Not so gracefully…
Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention

Page 9: Building Big: Lessons learned from Windows Azure  customers – Part Two

Failure Scope
• Node: individual nodes may fail – connectivity issues (transient failures), hardware failures, configuration and code errors
• Service: entire services may fail – service dependencies (internal and external)
• Region: regions may become unavailable – connectivity issues, acts of nature

Page 10: Building Big: Lessons learned from Windows Azure  customers – Part Two

Node Failures
Use fault-handling frameworks that recognize transient errors: CloudFX, or the Patterns & Practices Transient Fault Handling Application Block (P+P TFH).
Apply appropriate retry and backoff policies.
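The slide's guidance – recognize transient faults and back off between retries – can be sketched as follows. This is an illustrative Python sketch, not the CloudFX or P+P API; `is_transient` is a stand-in for the error-code inspection a real framework performs.

```python
import time

def is_transient(exc: Exception) -> bool:
    # Stand-in classifier: a real fault-handling framework inspects
    # SQL error codes, HTTP status codes, socket errors, etc.
    return isinstance(exc, (TimeoutError, ConnectionError))

def with_retries(call, retry_count=3, delay=0.05, backoff="linear"):
    """Retry `call` on transient faults; re-raise enduring ones immediately."""
    for attempt in range(retry_count + 1):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == retry_count:
                raise  # enduring fault, or retries exhausted
            if backoff == "linear":
                time.sleep(delay * (attempt + 1))
            else:
                time.sleep(delay * (2 ** attempt))
```

An enduring fault (e.g. bad credentials) surfaces immediately, while a transient fault is retried a bounded number of times.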

Page 11: Building Big: Lessons learned from Windows Azure  customers – Part Two

Don’t do this – why?

Page 12: Building Big: Lessons learned from Windows Azure  customers – Part Two

Sample Retry Policies

Platform | Context | Sample target (max e2e latency) | "Fast First" | Retry count | Delay | Backoff
SQL Database | Synchronous (e.g. render web page) | 200 ms | Yes | 3 | 50 ms | Linear
SQL Database | Asynchronous (e.g. process queue item) | 60 seconds | No | 4 | 5 s | Exponential
Azure Cache | Synchronous (e.g. render web page) | 100 ms | Yes | 3 | 10 ms | Linear
Azure Cache | Asynchronous (e.g. process queue item) | 500 ms | Yes | 3 | 100 ms | Exponential
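These policies can also be captured as data so they stay configurable rather than hard-coded. A minimal Python sketch; the structure and key names are illustrative, not a real library's schema ("fast first" is taken to mean the first retry fires immediately):

```python
# The slide's sample policies as data; times in seconds.
RETRY_POLICIES = {
    ("sql_database", "sync"):  {"max_latency_s": 0.2,  "fast_first": True,  "retries": 3, "delay_s": 0.05, "backoff": "linear"},
    ("sql_database", "async"): {"max_latency_s": 60.0, "fast_first": False, "retries": 4, "delay_s": 5.0,  "backoff": "exponential"},
    ("azure_cache", "sync"):   {"max_latency_s": 0.1,  "fast_first": True,  "retries": 3, "delay_s": 0.01, "backoff": "linear"},
    ("azure_cache", "async"):  {"max_latency_s": 0.5,  "fast_first": True,  "retries": 3, "delay_s": 0.1,  "backoff": "exponential"},
}

def delays(policy):
    """Sleep intervals the policy implies; 'fast first' retries immediately."""
    out = []
    for attempt in range(1, policy["retries"] + 1):
        if attempt == 1 and policy["fast_first"]:
            out.append(0.0)
        elif policy["backoff"] == "linear":
            out.append(policy["delay_s"] * attempt)
        else:
            out.append(policy["delay_s"] * 2 ** (attempt - 1))
    return out
```

Keeping policy parameters as data makes it easy to use different targets for synchronous (user-facing) and asynchronous (queue-draining) paths, as the table recommends.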

Page 13: Building Big: Lessons learned from Windows Azure  customers – Part Two

Decoupling Components
Too much retry, and too much trust of the downstream service: at some point, your request is blocking the line. Fail gracefully, and get out of the queue!

[Chart: Web Request Response Latency – average latency vs. response latency over time]

Page 14: Building Big: Lessons learned from Windows Azure  customers – Part Two

Decoupling Components
• Leverage asynchronous I/O. Beware – not all apparently async calls are "purely" async.
• Ensure that all external service calls are bounded. Bound the overall call latency (including retries); beware of thread-pool pressure.
• Beware of convoy effects on failure recovery. Trying too hard to catch up can flood newly recovered services.
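The "bound every external call" point can be sketched with a timeout wrapper. Illustrative Python/asyncio; `call_service` is a hypothetical stand-in for a slow downstream dependency:

```python
import asyncio

async def call_service():
    # Hypothetical downstream dependency that has hung.
    await asyncio.sleep(5)
    return "payload"

async def bounded_call(timeout_s=0.1):
    """Bound the overall call latency so a slow dependency cannot
    keep our request blocking the line."""
    try:
        return await asyncio.wait_for(call_service(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # fail gracefully and get out of the queue

result = asyncio.run(bounded_call())
```

The caller gets a bounded worst case (here 100 ms) and can degrade gracefully, instead of inheriting whatever latency the dependency happens to have.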

Page 15: Building Big: Lessons learned from Windows Azure  customers – Part Two

Service-Level Failures
Entire services will have outages: SQL Azure and Windows Azure Storage carry SLAs < 100%. External services may be unavailable or unreachable.
The application needs to work around these:
• Return a failure code to the user ("please try again later")
• Queue and try later ("we've received your order…")
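The "queue and try later" option can be sketched as follows. Illustrative Python; a production system would use a durable queue such as Azure Queue storage, not an in-memory deque:

```python
from collections import deque

pending = deque()  # stand-in for a durable queue

def submit_order(order, service_up: bool):
    """Accept the user's work even when the downstream service is out."""
    if service_up:
        return "processed"
    pending.append(order)  # defer: a worker drains this once the service recovers
    return "we've received your order..."
```

The user gets an acknowledgement either way; only the completion time changes when the dependency is down.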

Page 16: Building Big: Lessons learned from Windows Azure  customers – Part Two

Region-Level Failure
• Regional failure will occur
• Load needs to be spread over multiple regions
• Route around failures

Page 17: Building Big: Lessons learned from Windows Azure  customers – Part Two

Digimarc
• Digital watermarks
• Mobile integration
• 8 datacentres

Page 18: Building Big: Lessons learned from Windows Azure  customers – Part Two

Example Distribution with Traffic Manager
Global load does not necessarily give uniform distribution.

Page 19: Building Big: Lessons learned from Windows Azure  customers – Part Two

Information Publishing with Azure Traffic Manager
• Hosted service(s) per data centre
• Each service is autonomous – services independently receive or pull data from the source
• Azure Traffic Manager can direct traffic to the "nearest" service
• Use probing to determine service health

[Diagram: Azure Traffic Manager directing source data across Region 1, Region 2 and Region 3, each running a Web Role, Worker Role, Cache Role, DB and Azure Storage]

Page 20: Building Big: Lessons learned from Windows Azure  customers – Part Two

Service Insight
Deep and detailed data is needed for management, monitoring, alerting and failure diagnosis. Capture, transport, storage and analysis of this data requires careful design.

Page 21: Building Big: Lessons learned from Windows Azure  customers – Part Two

Characterizing Insight

Page 22: Building Big: Lessons learned from Windows Azure  customers – Part Two

Build and Buy (or Rent)
No "one size fits all" for all perspectives at scale: near real-time monitoring & alerting, deep diagnostics, long-term trending.
Use a mix of platform components and services: Windows Azure Diagnostics, application logging, the Azure portal, 3rd-party services.

Page 23: Building Big: Lessons learned from Windows Azure  customers – Part Two

New Relic
• Free / $24 / $149 pricing model (per month, per server)
• Agent installation on the server (role instance)
• Hooks the application via the Profiling API

Page 24: Building Big: Lessons learned from Windows Azure  customers – Part Two

AppDynamics
• Free -> $979.00 (6 agents)
• Agent-based, hooking the profiling API
• Cross-instance correlation

Page 25: Building Big: Lessons learned from Windows Azure  customers – Part Two

OpsTera
• Leverages Windows Azure Diagnostics (WAD) data
• Graphing, alerts, auto-scaling

Page 26: Building Big: Lessons learned from Windows Azure  customers – Part Two

PagerDuty
• On-call scheduling, alerting and incident management
• $9 / $18 per user per month
• Integration with monitoring tools (e.g. New Relic and others), an HTTP API, and email

Page 27: Building Big: Lessons learned from Windows Azure  customers – Part Two

Windows Azure Diagnostics (WAD)
• Azure platform service (agent) for collection and distribution of telemetry
• Standard structured storage formats (perf counters, events)
• Code- or XML-driven configuration
• Partially dynamic (post an updated configuration file to blob store)

Page 28: Building Big: Lessons learned from Windows Azure  customers – Part Two

Windows Azure Diagnostics (WAD)
Data sources and their storage destinations:
• Perf counters -> WAD Performance Counters Table
• Windows events -> WAD Windows Event Logs Table
• Diagnostic events -> WAD Logs Table
• IIS log files and failed-request logs -> wad-iis-logfiles and wad-iis-failedreqlogfiles (blob)
• Crash dumps -> wad-crash-dumps (blob)

Page 29: Building Big: Lessons learned from Windows Azure  customers – Part Two

Limitations of Default Configuration

Page 30: Building Big: Lessons learned from Windows Azure  customers – Part Two

Understanding Azure Table Store
• Azure table storage is the target for performance counter and application log data
• General maximum throughput is 1000 entities / partition / table
• Performance counters:
  • Use part of the timestamp as the partition key (which limits the number of concurrent entity writes)
  • Each partition key is 60 seconds wide; entities are written asynchronously in bulk
  • The more entities in a partition (i.e. the number of performance counter entries × the number of role instances), the slower the queries
• Impact: to maintain acceptable read performance, large-scale sites may need to:
  • Increase the performance counter collection period (1 minute -> 5 minutes)
  • Decrease the number of log records written into the activity table (by raising the filtering level – WARN or ERROR, no INFO)
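To make the 60-second partition concrete: every entity logged in the same minute shares one partition key, so one minute of counters from all role instances lands in a single partition. A Python sketch; the `"0" + ticks` key format is an assumption based on the commonly observed WAD table layout, not a documented contract:

```python
from datetime import datetime, timezone

# Assumed WAD-style partition key: "0" + .NET ticks, rounded down to the minute.
EPOCH_TICKS = 621355968000000000  # .NET ticks at 1970-01-01T00:00:00Z
TICKS_PER_SECOND = 10_000_000

def wad_partition_key(ts: datetime) -> str:
    ticks = EPOCH_TICKS + int(ts.timestamp()) * TICKS_PER_SECOND
    minute_ticks = ticks - (ticks % (60 * TICKS_PER_SECOND))  # 60-second-wide bucket
    return "0" + str(minute_ticks)
```

Because reads and writes within one minute all hit the same partition, adding instances or counters widens each partition rather than spreading the load, which is why the slide recommends lengthening the collection period.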

Page 31: Building Big: Lessons learned from Windows Azure  customers – Part Two

Managing the Deluge
Per-application-server data sources: IIS logs, application logs, performance counters. Split into two paths:
• High-value data: filter, aggregate, publish. Consumers: generate alerts, display dashboards, operational intelligence.
• High-volume data: batch, partition, archive. Consumers: data mining / analysis, historical trends, root cause analysis.

Page 32: Building Big: Lessons learned from Windows Azure  customers – Part Two

Extending the Experience
• Add high-bandwidth (chunky) logging and telemetry channels for verbose data logging
• Capture tracing via core System.Diagnostics (or log4net, NLog, etc.) with:
  • WARN/ERROR -> table storage
  • VERBOSE/INFO -> blob storage
• Run-time configurable logging channels to enable selective verbose logging to table (e.g. just log database information)
• Leverage the features of the core Diagnostic Monitor
• Use custom directory monitoring to copy files to blob storage
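The WARN/ERROR-to-table vs. VERBOSE/INFO-to-blob split is standard level-based routing. A Python `logging` sketch; the list "sinks" stand in for table and blob storage (in .NET this would be System.Diagnostics trace listeners or NLog targets):

```python
import logging

table_sink = []  # stands in for low-volume, high-value table storage
blob_sink = []   # stands in for high-volume blob storage

class SinkHandler(logging.Handler):
    """Append formatted records to an in-memory sink (illustrative)."""
    def __init__(self, sink):
        super().__init__()
        self.sink = sink
    def emit(self, record):
        self.sink.append(self.format(record))

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

high_value = SinkHandler(table_sink)
high_value.setLevel(logging.WARNING)  # WARN/ERROR -> "table"

high_volume = SinkHandler(blob_sink)
high_volume.setLevel(logging.DEBUG)
high_volume.addFilter(lambda r: r.levelno < logging.WARNING)  # VERBOSE/INFO -> "blob"

logger.addHandler(high_value)
logger.addHandler(high_volume)
```

One logger, two channels: alerts and dashboards read the small high-value stream, while the verbose stream is archived cheaply for later analysis.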

Page 33: Building Big: Lessons learned from Windows Azure  customers – Part Two

Extending Diagnostics
The standard WAD flow, extended with verbose channels:
• Perf counters -> WAD Performance Counters Table; Windows events -> WAD Windows Event Logs Table; diagnostic events -> WAD Logs Table
• IIS log files and failed-request logs -> wad-iis-logfiles and wad-iis-failedreqlogfiles; crash dumps -> wad-crash-dumps
• Verbose perf counters -> verbose perf-counter logs (blob)
• Verbose events -> verbose event logs (blob)

Page 34: Building Big: Lessons learned from Windows Azure  customers – Part Two

Logging and Retry with CloudFX
• Handling transient failures
• Logging transient failures
• Logging all external API calls, with timing
• Logging the full exception (not just .ToString())

Page 35: Building Big: Lessons learned from Windows Azure  customers – Part Two

Demo: Multiple Logging Channels using NLog and WAD

Page 36: Building Big: Lessons learned from Windows Azure  customers – Part Two

Logging Configuration
Traditional .NET log configuration (System.Diagnostics) is hard-coded against System.Configuration (app.config/web.config) – an anti-pattern for Azure deployment. Leverage an external configuration store (e.g. Service Configuration or blob storage) for run-time dynamic configuration.
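The externalized-configuration idea can be sketched as a periodic reload. Illustrative Python; `fetch_config` is a hypothetical stand-in for reading blob storage or the Azure Service Configuration:

```python
import json
import logging

logger = logging.getLogger("app")

def fetch_config() -> str:
    """Hypothetical stand-in; a real implementation would poll blob storage
    or the Service Configuration rather than return a constant."""
    return json.dumps({"log_level": "WARNING"})

def apply_logging_config():
    """Called periodically (e.g. on a timer), so operators can turn
    verbose logging on or off without redeploying the role."""
    cfg = json.loads(fetch_config())
    logger.setLevel(getattr(logging, cfg["log_level"]))
```

Because the level is re-read at run time, an operator can flip a deployed service into verbose mode during an incident and back out afterwards, which app.config cannot do without a redeploy.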

Page 37: Building Big: Lessons learned from Windows Azure  customers – Part Two

Recap and Resources
Building big:
• The availability challenge
• Design for failure
• Get insight into everything

Resources:
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services
• TODO: failsafe doc link

Page 38: Building Big: Lessons learned from Windows Azure  customers – Part Two

Resources
• Follow us on Twitter @WindowsAzure
• Get started: www.windowsazure.com/build

Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions

Page 39: Building Big: Lessons learned from Windows Azure  customers – Part Two

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.