gail warren director, online services microsoft corporation session code: cos201
TRANSCRIPT
A Day in the Life: Running Hosted Services in the Cloud
Gail Warren
Director, Online Services
Microsoft Corporation
SESSION CODE: COS201
Agenda
Business Productivity Online (BPO)
Carrier-class Data Centers
World-class Security
World-class Architecture
Best-of-Breed Hardware
Operational Best Practices
World-class Support
Microsoft’s Significant Investment
Carrier-Class Data Centers
Microsoft is making a significant investment in building online compute capacity
Microsoft has more than 10 and fewer than 100 global data centers, ranging from 1 megawatt to 60+ megawatts of power
Some of the data centers are massive, roughly the size of 9–10 football fields, and contain enough wire to wrap around the earth several times
Carrier-Class Data Centers
Features
Dual power feeds
Multiple generators
Battery backup
Dual power to each rack
Computer controlled cooling
Microsoft Online Thinks About Security from 3 Perspectives:
World-class Security
1. Secure from the ground up
Carrier-class data centers
Nine layers of security protecting your data
Secure development life cycle
2. Secure in knowing your data will be there when you need it
Operational best practices
Complete N+1 redundancy
Best-of-breed hardware
3. Security through peace of mind
Audited by third parties
Internal audits
Dedicated SOC
24x7 support any time you need help
Financially backed service level agreements (SLAs)
Service Security
It starts with the data center
Data Center within a Data Center
Motion sensors
24x7 secured access
Biometric controlled access systems
Video camera surveillance
Security breach alarms
Service Security
Then we add multiple layers of logical security…
Filtering Routers
Firewalls
Intrusion Detection
Separate Data Networks
Penetration testing
Scanning and monitoring
AV
Configuration/patch
Host Security (hardened operating system)
Application-Level Countermeasures
Application Authentication
Authentication to Data
Data
Service Security
Data Centers are SAS 70 and ISO 27001 certified
Service is SAS 70 certified
Service is ISO 27001 certified
FISMA targeted for 2010
Customers own their data…our job is to protect it
Security
Risk Management
Privacy
Data
Service Security
Data hygiene supported by multi-layer antivirus and spam filtering
Highly secure data access for users via HTTPS
Geo-redundant data centers certified to SAS 70 and ISO 27001
BPO Capacity and Reliability
Capacity Management
Continuous capacity review
Buffer capacity for unexpected load
Capacity modeling drives deployment of capacity at least 3 months ahead of forecast need
N+1 Redundancy Throughout
Network
Storage
Servers
Result: 99.9%+ reliability
Financially backed SLA
World-class Architecture
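To put the 99.9% figure in concrete terms, the sketch below converts an availability percentage into a downtime budget. The slide does not say over what period the SLA is measured, so the 30-day period used here is an assumption, and the function name is illustrative only.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the minutes of downtime allowed in a period at a given availability.

    period_days defaults to 30, which is an assumption; the slide does not
    state the measurement window for the SLA.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    # Roughly 43 minutes per 30-day period at 99.9% availability.
    print(round(downtime_budget_minutes(99.9), 1))
```

Under that assumption, 99.9% leaves about 43 minutes of downtime in a 30-day period, which is why discovery and restoration times matter so much in the operational slides that follow.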
BPO Logical Architecture
Best-of-Breed Hardware
Servers
Dual power supplies
Dual network interfaces
Full lights-out management capabilities
Storage
RAID 1 + 5
Optimized for performance and availability
Disk to disk to disk backup
Network
Full failover capabilities
N+1 throughout the network stack
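The N+1 idea mentioned for the network stack (and for network, storage, and servers on the capacity slide) can be sketched as simple sizing arithmetic: provision the N units needed for peak load, plus one spare, so that losing any single unit still leaves enough capacity. The function names and the load figures in the usage example are hypothetical and are not taken from the presentation.

```python
import math


def units_for_n_plus_1(peak_load: float, unit_capacity: float, spares: int = 1) -> int:
    """Units to provision: N to carry peak load, plus the given number of spares."""
    if peak_load < 0 or unit_capacity <= 0 or spares < 0:
        raise ValueError("peak_load >= 0, unit_capacity > 0 and spares >= 0 are required")
    n = math.ceil(peak_load / unit_capacity)
    return n + spares


def survives_single_failure(units: int, peak_load: float, unit_capacity: float) -> bool:
    """True if the remaining units can still carry peak load after one unit fails."""
    return (units - 1) * unit_capacity >= peak_load


if __name__ == "__main__":
    # Hypothetical numbers: 900 units of load, 250 per server.
    units = units_for_n_plus_1(900, 250)
    print(units, survives_single_failure(units, 900, 250))
```

The check function makes the design goal explicit: the deployment is sized so that any one component can fail without the service dropping below required capacity.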
Operational Best Practices
Operations practices based on Information Technology Infrastructure Library (ITIL) / Microsoft® Operations Framework (MOF)
Change management
Incident management
Problem management
Dedicated Service Operations Center (SOC)
Focused on BPO
Experts in online collaboration services
Dedicated service administration team
ISO 27001 certified operational procedures
Monitoring
Significant investment in tools to ensure the service is available 24x7 and that, if there are problems, we know as soon as possible
Complete monitoring suite
Microsoft® System Center Operations Manager (SCOM)
Transaction monitors around the world
Holistic network monitoring
Security monitoring
Custom-built tools to provide further insight
Custom Microsoft® Operations Manager (MOM) packs
Synthetic transactions
Incident Management
Issue discovery
Monitoring
Syntx (synthetic transactions)
Customer reported
Operations monitoring infrastructure
Issue handling
Issue documentation
Issue escalation
Service restoration
Issue Discovery – Monitoring
System Event monitoring with heavy tuning for what goes to the console, using a failure-mode approach
Review how the components could fail
Build rules for each failure mode
Build knowledge for each failure mode to drive quicker resolutions
One can never predict all failure modes, so a closed-loop system is a necessity. If we have an outage without a failure-mode alert, we treat it as a bug and drive it until we have a corresponding rule and TSG (Technical Support Guide) for that specific failure mode in place.
Heavy customizations on top of SCOM platforms. For example:
Transactions added to SCOM specific to mailflow and administrative services
Currently ~20K unique rules for the service
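The closed-loop approach described on this slide can be sketched as a small catalog that pairs each known failure mode with a rule and a TSG, and opens a bug whenever an outage occurs without a matching alert. This is a minimal illustration, not the SCOM rule model: the class names, rule IDs, and TSG references are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FailureMode:
    name: str      # e.g. "mail queue backlog" (illustrative)
    rule_id: str   # hypothetical ID of the monitoring rule for this mode
    tsg: str       # hypothetical reference to the matching TSG


@dataclass
class FailureModeCatalog:
    modes: Dict[str, FailureMode] = field(default_factory=dict)
    open_bugs: List[str] = field(default_factory=list)

    def register(self, mode: FailureMode) -> None:
        """Add a failure mode together with its rule and TSG."""
        self.modes[mode.name] = mode

    def review_outage(self, failure_name: str, alerts_fired: Set[str]) -> str:
        """Close the loop after an outage: covered by an alert, or a gap to fix."""
        mode = self.modes.get(failure_name)
        if mode is not None and mode.rule_id in alerts_fired:
            return f"covered: follow {mode.tsg}"
        if mode is None:
            # Unknown failure mode: needs a new rule and a new TSG.
            self.open_bugs.append(f"add rule and TSG for '{failure_name}'")
        else:
            # Known failure mode whose rule stayed silent.
            self.open_bugs.append(f"rule {mode.rule_id} did not fire for '{failure_name}'")
        return "gap: bug opened"
```

For example, after registering `FailureMode("mail queue backlog", "R-001", "TSG-001")`, reviewing an outage named "disk full" with no alerts fired records a bug, which is the point at which a new rule and TSG would be written.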
Issue Discovery – Syntx
What are the capabilities of the service that end users consume?
E.g., search SharePoint, create a list, post a document, search for a document that was posted yesterday, etc.
How do we emulate the consumption of those capabilities?
Code that emulation = “synthetics”
Run synthetics every X minutes
Alert if the capability is not performing within specifications
Expose synthetic success/failure and performance data for trending
Monitor DIPs and VIPs from the LAN
Monitor VIPs from the internet
Ideally, two alerts for every issue:
Synthetic alert telling us that the capability is impacted
Failure mode alert telling us what happened
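A synthetic of the kind described above can be sketched as a function that exercises one capability, times it, and flags an alert when the call fails or runs outside its specification. This is a minimal sketch under stated assumptions: the probe is any caller-supplied function, the threshold is a hypothetical per-capability limit, and scheduling ("every X minutes") is left to whatever runs the function.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class SyntheticResult:
    capability: str    # name of the emulated end-user capability
    succeeded: bool    # did the probe complete without raising?
    duration_s: float  # measured run time, kept for trending
    alert: bool        # True when the capability is outside specification


def run_synthetic(capability: str, probe: Callable[[], None],
                  max_duration_s: float) -> SyntheticResult:
    """Run one synthetic transaction and decide whether it should alert."""
    start = time.perf_counter()
    try:
        probe()  # hypothetical emulation, e.g. "post a document"
        succeeded = True
    except Exception:
        succeeded = False
    duration = time.perf_counter() - start
    alert = (not succeeded) or duration > max_duration_s
    return SyntheticResult(capability, succeeded, duration, alert)
```

Every result is returned, whether or not it alerts, so success/failure and timing data can be stored for trending as well as for paging.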
Issue Discovery – Customer
Despite monitoring and syntx, customers do find and report errors to our Support organization
Continuous Improvement
If a service event is missed by monitoring, a bug is opened and tracked to resolution
Issue Discovery – Infrastructure
Geo-redundant Tier 1 team and SOC Leads
Console, email, and phone monitored 24x7x365
SOC Leads (Ops Managers) are also 24x7x365
Geo-redundant SCOM infrastructure
Alerts to console
Geo-redundant synthetic monitoring infrastructure (separate from SCOM)
Synthetic alerts currently go to email
We will integrate the alert stream into the console, but we will always want visibility outside of the console for resiliency
Issue Documentation
Issues are logged into a tool called Product Studio (specific database is “Service Delivery Escalation” or SDE)
Issue Escalation
Emails to critical teams within Microsoft Online Services are automatically triggered for all escalations entered in SDE
Issue Escalation
For high-severity issues, pagers are triggered and phone bridges are spun up to work on immediate service restoration
Issue Escalation
Emails to critical teams within Microsoft Online Services are sent out every 30 minutes until service is restored
Linked bugs are opened in SDE for any follow-up work items
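The 30-minute update cadence can be sketched as a schedule of send times between the moment an escalation is opened and the moment service is restored. This sketch assumes the first email goes out when the escalation is entered; the function name and the timestamps in the example are hypothetical, and nothing here reflects how SDE actually triggers mail.

```python
from datetime import datetime, timedelta
from typing import List


def update_times(opened: datetime, restored: datetime,
                 interval_minutes: int = 30) -> List[datetime]:
    """Times at which status emails go out: at open, then every interval
    until the service is restored."""
    if restored < opened:
        raise ValueError("restored must not be earlier than opened")
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    step = timedelta(minutes=interval_minutes)
    times = [opened]          # assumption: first email is sent on escalation
    t = opened + step
    while t < restored:
        times.append(t)
        t += step
    return times


if __name__ == "__main__":
    opened = datetime(2010, 6, 7, 10, 0)
    restored = datetime(2010, 6, 7, 11, 10)
    for t in update_times(opened, restored):
        print(t.strftime("%H:%M"))
```

For an escalation opened at 10:00 and restored at 11:10, the sketch yields updates at 10:00, 10:30, and 11:00.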
Customer View
Sample RSS feed
Problem Management Processes
Present Microsoft Online Services Problem Management processes:
Issue-to-Problem escalation flow
Minimize repeat occurrences (incidents & alerts)
Build a better service (continuous improvement)
Present Microsoft Online Services Service Intelligence Processes:
What is SI?
Sample Reports
How is the data used to improve service health?
Issue-to-Problem Escalation
Issues are logged into a tool called Product Studio
Issue-to-Problem Escalation Flow
Questions asked of each issue:
Are there coding changes required?
Are there configuration changes required?
Are there infrastructure changes required?
Are there operational changes required?
Are there short-term preventative measures required while a longer-term solution is put in place?
Was the issue caught by monitoring?
Was the issue responded to correctly?
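The questions asked of each issue can be sketched as a review record whose answers are turned into a list of follow-up work items. The field names and item labels below are illustrative assumptions; the presentation does not describe how the answers are recorded in Product Studio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IssueReview:
    # One field per question on the slide; defaults mean "no follow-up needed".
    code_change: bool = False
    config_change: bool = False
    infrastructure_change: bool = False
    operational_change: bool = False
    short_term_mitigation: bool = False
    caught_by_monitoring: bool = True
    responded_correctly: bool = True


def follow_up_items(review: IssueReview) -> List[str]:
    """Translate review answers into follow-up work items (hypothetical labels)."""
    items: List[str] = []
    if review.code_change:
        items.append("code change")
    if review.config_change:
        items.append("configuration change")
    if review.infrastructure_change:
        items.append("infrastructure change")
    if review.operational_change:
        items.append("operational change")
    if review.short_term_mitigation:
        items.append("short-term preventative measure")
    if not review.caught_by_monitoring:
        items.append("monitoring gap")
    if not review.responded_correctly:
        items.append("response process review")
    return items
```

An issue that needed a configuration fix and was not caught by monitoring would therefore produce two items, one for the fix and one for the missing alert.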
Service Intelligence - Definition
Business Intelligence vs. Service Intelligence
Let customers focus on their business while we focus on our service and resources
BI pulls data from the SI platform
“Any metric from any data source”
Availability, Incidents, Alerts, TTR, TTE
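Two of the metrics listed, availability and TTR, can be sketched from a list of incident records. The slide does not expand TTR or TTE, so this sketch treats TTR as the time from incident start to service restoration, which is an assumption, and leaves TTE out. The record shape is hypothetical, and incidents are assumed not to overlap.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Incident:
    started: datetime
    restored: datetime


def availability_pct(incidents: List[Incident],
                     period_start: datetime, period_end: datetime) -> float:
    """Percentage of the period with no incident (incidents assumed not to overlap)."""
    period_s = (period_end - period_start).total_seconds()
    if period_s <= 0:
        raise ValueError("period_end must be after period_start")
    down_s = 0.0
    for inc in incidents:
        # Clip each incident to the reporting period.
        start = max(inc.started, period_start)
        end = min(inc.restored, period_end)
        if end > start:
            down_s += (end - start).total_seconds()
    return 100.0 * (1.0 - down_s / period_s)


def mean_ttr_minutes(incidents: List[Incident]) -> float:
    """Mean minutes from incident start to restoration (TTR under this sketch's assumption)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_s = sum((i.restored - i.started).total_seconds() for i in incidents)
    return total_s / len(incidents) / 60.0
```

With one 30-minute and one 60-minute incident in a 30-day period, the sketch reports a mean TTR of 45 minutes and an availability just under 99.8%.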
Minimize Repeat Occurrences
Look for trends
Target preventative actions
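Looking for trends in repeat occurrences can be sketched as counting how often each cause appears across incidents and alerts, then listing the causes that recur. The cause labels are hypothetical inputs; how causes are actually categorized is not described in the presentation.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeat_offenders(causes: Iterable[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (cause, occurrences) pairs seen at least min_count times,
    most frequent first, as candidates for preventative action."""
    counts = Counter(causes)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical cause labels taken from a period's incident records.
    sample = ["disk full", "cert expiry", "disk full", "dns", "disk full", "cert expiry"]
    print(repeat_offenders(sample))
```

Sorting by frequency puts the causes with the most repeat incidents at the top, which is where preventative work is likely to pay off first.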
Build a Better Service
Diagram labels: MOM Alert, Syntx Alert, Customer Report; Bug in SDE; Operational Process Change, Code Change, Configuration Change, Infrastructure Change (+Bug ×4); Monitor & Measure Impact
World-Class Support
Dedicated BPO Support organization
Deep service knowledge
Tightly aligned with operations and development organizations
Promotes faster resolution times
Ensures the voice of the customer is heard
24x7 Phone Support and Electronic Support
Support requests can be entered directly into the Service Portal
Continuously updated Knowledge Base articles
Track Resources
Read more about Microsoft Online Services – www.microsoft.com/online
Sign up for a 30-Day Trial of the Business Productivity Online Suite: https://mocp.microsoftonline.com
Use Promo Code TENA2010
Continue the conversation
Microsoft Online Services Team Blog – http://blogs.technet.com/msonline
Facebook Fan Page – http://www.facebook.com/MicrosoftOnlineServices
YouTube Channel – http://www.youtube.com/user/msonlineservices
Twitter – http://twitter.com/msonline
Resources
Sessions On-Demand & Community – www.microsoft.com/teched
Microsoft Certification & Training Resources – www.microsoft.com/learning
Resources for IT Professionals – http://microsoft.com/technet
Resources for Developers – http://microsoft.com/msdn
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
JUNE 7-10, 2010 | NEW ORLEANS, LA