TRANSCRIPT
Critical Infrastructure and Operations for Delivering Secure, Enterprise-Class Software Services
Gail Warren, Director, Online Operations, Microsoft (ITS213)
Agenda
Business Productivity Online (BPO)
Carrier-class Data Centers
World-class Security
World-class Architecture
Best-of-Breed Hardware
Operational Best Practices
World-class Support
Microsoft’s Significant Investment
Microsoft is making a huge investment in data center and network capacity
There are currently 13 global data centers that use 70 megawatts of power. By the end of 2009, there will be 20 data centers that use 180 megawatts of power.
The data centers are massive, roughly the size of 9–10 football fields, with significant network capacity (primary facilities all maintain at least OC-192 capacity).
Carrier-Class Data Centers
Features
Multiple generators
Dual power feeds
Battery backup
Dual power to each rack
Computer-controlled cooling
Carrier-Class Data Centers
Map regions: North America, Central and South America, Europe, Asia, Africa, Australia
Map legend: current locations, future location
Microsoft Online Thinks About Security from 3 Perspectives:
1. Secure from the ground up
Carrier-class data centers
Multiple layers of security protecting your data
Secure development life cycle
2. Secure in knowing your data will be there when you need it
Operational best practices
Complete N+1 redundancy
Best-of-breed hardware
3. Security through peace of mind
Audited by a third party
Internal audits
Dedicated service administration resources
24x7 support any time you need help
Financially backed service level agreements (SLAs)
World-class Security
Service Security
It starts with the data center
Data Center within a Data Center
Motion sensors
24x7 secured access
Biometric controlled access systems
Video camera surveillance
Security breach alarms
Service Security
Then we add multiple layers of logical security…
Filtering routers
Firewalls
Intrusion detection
Separate data networks
Penetration testing
Scanning and monitoring
AV
Configuration/patch
Host security (hardened operating system)
Application-level countermeasures
Application authentication
Authentication to data
Data
Service Security
CyberTrust: leading security certification provider
CyberTrust provides both application and physical security validation
4 of the 5 largest banks use CyberTrust
Certifies more than 95% of all information security software
What they found:
“…not discover a single device with any high-severity vulnerabilities … I can comfortably say in three years of conducting internal scans I have never seen an internal scan without any high-severity vulnerabilities” (CyberTrust)
Service Security
Data hygiene supported by multi-layer antivirus and spam filtering
Highly secure data access for users via HTTPS
Geo-redundant data centers certified with SAS 70 and ISO 27001
World-class Architecture
BPO Logical Architecture
1. Administration through a defined set of tools promotes availability and security
2. Rich set of tools available to IT professionals to promote visibility into the service
3. Integrated with world-class services such as Live Meeting and Exchange Hosted Services
4. Significant investment in monitoring and management
Diagram labels: HMC MPS/MPF; HMC namespaces; Providers; SharePoint; OCS; Exchange; Provisioning web service; BPO Admin portal; AD; Live Meeting; Syndicated services; BPO-specific KB; Alert publishing; Ticket mgmt; Customer-centric service health; Service object model; Sign-in service; Customer premise; IT generalist; End user; Service applet SSO and client config; Service administration; OLS interface; Trials VL; Deployment and configuration; Service monitoring; Audit collection; Performance logging and collection; Capacity mgmt; Patch mgmt; Backups; EHS
BPO Physical Overview
All Services Protected by Microsoft® Forefront™
Location
Service runs in isolation
A data center located within a data center
Services
Administration and user portals
SharePoint®
Instant messaging (IM)
Web conferencing
Availability
Each service runs with complete N+1 redundancy within the data center
Multiple data copies to protect against data loss
Full service geo-replication for disaster recovery
BPO Capacity and Reliability
Capacity Management
Continuous capacity review
Buffer capacity for unexpected load
Capacity modeling implements capacity at least 3 months in advance of forecast
N+1 Redundancy Throughout
Network
Storage
Servers
Result: 99.9%+ reliability
Financially backed SLA
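To make the two figures on this slide concrete, here is a minimal Python sketch of what a 99.9% target allows in downtime and of what N+1 means for capacity. The function names, the 30-day month, and the example server counts are illustrative assumptions, not values from the presentation.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)


def is_n_plus_one(units: int, unit_capacity: float, peak_load: float) -> bool:
    """True if the peak load still fits after any single unit fails."""
    return units >= 2 and (units - 1) * unit_capacity >= peak_load


if __name__ == "__main__":
    # Downtime allowed by a 99.9% target over a 30-day month and over a year.
    print(f"{downtime_budget_minutes(99.9):.1f} minutes per 30-day month")
    print(f"{downtime_budget_minutes(99.9, 365) / 60:.2f} hours per year")
    # Hypothetical pool: 4 servers of 100 units each carrying a peak of 300.
    print(is_n_plus_one(units=4, unit_capacity=100, peak_load=300))
```

At 99.9%, the budget works out to roughly 43 minutes in a 30-day month, which is why the slide pairs the target with redundancy at every layer.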
Best-of-Breed Hardware
Servers
Dual power supplies
Dual network interfaces
Full lights-out management capabilities
Storage
RAID 1 + 5
Optimized for performance and availability
Disk to disk to disk backup
Network
Full failover capabilities
N+1 throughout the network stack
Operational Best Practices
Operations practices based on Information Technology Infrastructure Library (ITIL) / Microsoft® Operations Framework (MOF)
Change management
Incident management
Problem management
Dedicated Service Operations Center (SOC)
Focused on BPO
Experts in online collaboration services
Dedicated service administration team
ISO 27001 aligned operational procedures
Monitoring
Significant investment in tools to ensure the service is there 24x7, and if there are problems, we know ASAP
Complete monitoring suite
Microsoft® System Center Operations Manager
Transaction monitors around the world
Holistic network monitoring
Security monitoring
Custom-built tools to provide further insight
Custom Microsoft® Operations Manager (MOM) packs
Synthetic transactions
Incident Management
Issue discovery
Monitoring
Syntx
Customer reported
Operations monitoring infrastructure
Issue handling
Issue documentation
Issue escalation
Service restoration
Issue Discovery – Monitoring
System event monitoring with heavy tuning for what goes to the console, using a failure-mode approach
Review how the components could fail
Build rules for each failure mode
Build knowledge for each failure mode to drive quicker resolutions
One can never predict all failure modes, so a closed-loop system is a necessity. If we have an outage without a failure-mode alert, we treat it as a bug and drive it until we have a corresponding rule and TSG (Technical Support Guide) for that specific failure mode in place.
Heavy customizations on top of SCOM platforms. For example:
Transactions added to SCOM specific to mail flow and administrative services
Currently ~20K unique rules for the service
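The closed loop described on this slide can be sketched in a few lines of Python. This is an illustration only: the class names, the substring matching, and the bug text are assumptions made for the example and do not represent how SCOM rules or the team's tooling actually work.

```python
from dataclasses import dataclass, field


@dataclass
class FailureModeRule:
    failure_mode: str   # how a component can fail, e.g. "mail queue backing up"
    event_pattern: str  # text that identifies this failure in an event (assumed)
    tsg: str            # pointer to the Technical Support Guide for this mode


@dataclass
class Monitor:
    rules: list = field(default_factory=list)
    bugs: list = field(default_factory=list)

    def match(self, event: str) -> list:
        """Return the rules whose pattern appears in the event text."""
        return [rule for rule in self.rules if rule.event_pattern in event]

    def review_outage(self, outage_id: str, events: list) -> list:
        """Closed loop: an outage that raised no failure-mode alert becomes a bug."""
        alerts = [rule for event in events for rule in self.match(event)]
        if not alerts:
            self.bugs.append(f"{outage_id}: no failure-mode alert; add rule and TSG")
        return alerts
```

Each bug opened this way is closed only when a new rule and its guide exist, so the rule set grows with every missed failure mode.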
Issue Discovery – Syntx
What are the capabilities of the service that end users consume?
E.g., search SharePoint, create a list, post a document, search for a document that was posted yesterday, etc.
How do we emulate the consumption of those capabilities?
Code that emulation = “synthetics”
Run synthetics every X minutes
Alert if the capability is not performing within specifications
Expose synthetic success/failure and performance data for trending
Monitor DIPs and VIPs from LAN
Monitor VIPs from internet
Ideally, two alerts for every issue:
Synthetic alert telling us that the capability is impacted
Failure mode alert telling us what happened
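A synthetic, as described above, is code that exercises one end-user capability, times it, and alerts when it fails or runs outside its specification. The Python sketch below shows that shape under stated assumptions: `run_synthetic`, `SyntheticResult`, and the threshold parameter are hypothetical names for illustration, and the action passed in stands for whatever call would really exercise the service.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SyntheticResult:
    name: str
    succeeded: bool
    seconds: float
    alert: Optional[str]  # None when the capability is within specification


def run_synthetic(name: str, action: Callable[[], bool], max_seconds: float,
                  clock: Callable[[], float] = time.perf_counter) -> SyntheticResult:
    """Run one emulated user action and decide whether to alert."""
    start = clock()
    try:
        ok = bool(action())
    except Exception:
        ok = False  # any error counts as the capability being unavailable
    elapsed = clock() - start
    if not ok:
        alert = f"{name}: capability failed"
    elif elapsed > max_seconds:
        alert = f"{name}: took {elapsed:.2f}s, limit {max_seconds:.2f}s"
    else:
        alert = None
    return SyntheticResult(name, ok, elapsed, alert)
```

A scheduler would call this every X minutes for each capability and store every result, not only the alerts, so that success rates and timings can be trended.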
Issue Discovery – Customer
Despite monitoring and syntx, customers do find and report errors to our Support organization
Continuous Improvement
If a service event is missed by monitoring, a bug is opened and tracked for resolution
Issue Discovery – Infrastructure
Geo-redundant Tier 1 team and SOC Leads
Console, email, and phone monitored 24x7x365
SOC Leads (Ops Managers) are also 24x7x365
Geo-redundant SCOM infrastructure
Alerts to console
Geo-redundant synthetic monitoring infrastructure (separate from SCOM)
Synthetic alerts go to email currently
We will integrate the alert stream into the console, but we will always want visibility outside of the console for resiliency
Issue Documentation
Issues are logged into a tool called Product Studio (specific database is “Service Delivery Escalation” or SDE)
Issue Escalation
Emails are automatically triggered for all escalations entered in SDE
For high-severity issues, pagers are triggered and phone bridges are spun up to work on immediate service restoration
Emails sent out every 30 minutes until service is restored
Linked bugs opened in SDE for any follow-up work items
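The escalation rules above (email for every escalation, pagers and a phone bridge for high severity, updates every 30 minutes until restoration) can be expressed as a small Python sketch. The severity label "high", the channel names, and the choice to send the first update 30 minutes after the escalation is opened are assumptions for illustration; the slides do not specify them.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)


def notification_channels(severity: str) -> list:
    """Every escalation triggers email; high severity adds pager and bridge."""
    channels = ["email"]
    if severity == "high":
        channels += ["pager", "phone bridge"]
    return channels


def update_times(opened: datetime, restored: datetime) -> list:
    """Times at which status emails fall due before service is restored."""
    times = []
    due = opened + UPDATE_INTERVAL
    while due < restored:
        times.append(due)
        due += UPDATE_INTERVAL
    return times
```

For an escalation opened at 10:00 and restored at 11:45, this yields updates at 10:30, 11:00, and 11:30.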
Customer View
Provide customer with service state
Mail
SharePoint
Really Simple Syndication (RSS) feeds
Sample RSS feed
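The sample feed shown on the slide is not reproduced in this transcript. As a stand-in, the Python sketch below reads service state from a small RSS 2.0 document using the standard library. The feed content, the "Service: state" title convention, and the function name are hypothetical, invented only to show how a customer might consume such a feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the real feed layout is not given in the presentation.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Service health (hypothetical example)</title>
    <item><title>Mail: service degraded</title></item>
    <item><title>SharePoint: normal service</title></item>
  </channel>
</rss>"""


def service_states(feed_xml: str) -> dict:
    """Map each service name to its reported state, one feed item per service."""
    root = ET.fromstring(feed_xml)
    states = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        service, _, state = title.partition(":")
        states[service.strip()] = state.strip()
    return states


if __name__ == "__main__":
    for service, state in service_states(SAMPLE_FEED).items():
        print(f"{service}: {state}")
```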
Problem Management Processes
Present Microsoft Online Services Problem Management processes:
Issue-to-Problem escalation flow
Minimize repeat occurrences (incidents & alerts)
Build a better service (continuous improvement)
Present Microsoft Online Services Service Intelligence processes:
What is SI?
Sample reports
How is the data used to improve service health?
Issue-to-Problem Escalation
Issues are logged into a tool called Product Studio
Issue-to-Problem Escalation Flow
Questions asked of each issue:
Are there coding changes required?
Are there configuration changes required?
Are there infrastructure changes required?
Are there operational changes required?
Are there short-term preventative measures required while a longer-term solution is put in place?
Was the issue caught by monitoring?
Was the issue responded to correctly?
Service Intelligence – Definition
Business Intelligence vs. Service Intelligence
Let customers focus on their business while we focus on our service and resources
BI pulls data from the SI platform
“Any metric from any datasource”
Availability, Incidents, Alerts, TTR, TTE
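Two of the metrics listed above can be derived directly from incident records. The Python sketch below computes them under explicit assumptions: that TTR means time to restore (the slide does not expand the acronym), that each incident is a pair of detection and restoration times, and that incidents neither overlap nor extend outside the reporting period. The function names are illustrative.

```python
from datetime import datetime, timedelta


def total_downtime(incidents: list) -> timedelta:
    """Sum of (restored - detected) over all incidents."""
    return sum((restored - detected for detected, restored in incidents), timedelta())


def mean_ttr(incidents: list):
    """Mean time to restore, or None when there were no incidents."""
    if not incidents:
        return None
    return total_downtime(incidents) / len(incidents)


def availability_pct(incidents: list, period_start: datetime, period_end: datetime) -> float:
    """Percentage of the period during which the service was not in an incident."""
    period = period_end - period_start
    return 100.0 * (1 - total_downtime(incidents) / period)
```

Tracking both per service and per month gives the trend data that the problem management process uses to target preventative work.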
Minimize Repeat Occurrences
Look for trends
Target preventative actions
Build a Better Service
Flow diagram labels: MOM Alert; Syntx Alert; Customer Report; Bug in SDE; Operational Process Change (+Bug); Code Change (+Bug); Configuration Change (+Bug); Infrastructure Change (+Bug); Monitor & Measure Impact
World-Class Support
Dedicated BPO Support organization
Deep service knowledge
Tightly aligned with operations and development organizations
Promotes faster resolution times
Ensures the voice of the customer is heard
24x7 Phone Support and Electronic Support
Support requests can be entered directly into the Service Portal
Continuously updated Knowledge Base articles
Question & Answer
www.microsoft.com/teched
Sessions On-Demand & Community
http://microsoft.com/technet
Resources for IT Professionals
http://microsoft.com/msdn
Resources for Developers
www.microsoft.com/learning
Microsoft Certification & Training Resources
Resources
Related Content
Breakout Sessions
• UNC203 - 11/09/2009 09:00-10:15 [Cyril Sultan] Implementing and Administering Microsoft Online Services
• OFS209 - 11/10/2009 17:00-18:15 [Kimmo Forss] SharePoint Online Overview
• SIA08-IS - 11/11/2009 10:45-12:00 [Mike Chan] Security Services in the Cloud
• UNC205 - 11/11/2009 17:30-18:45 [Cyril Sultan] Tips and Tricks for Planning, Deploying, and Troubleshooting the Office Live Meeting Service
• UNC310 - 11/12/2009 13:30-14:45 [David Anderson] Migrating Data, Co-Existence, and Directory Synchronization with Microsoft Online Services
• ITS213 - 11/12/2009 17:00-18:15 [Gail Warren] Critical Infrastructure and Operations for Delivering Secure, Enterprise-Class Software Services
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.