no outages: maintenance without impact · improving the maintenance timeline no risk doing things...
TRANSCRIPT
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
No Outages:Maintenance without Impact
Troy Anthony, Ian Cookson, Carol ColrainOracle Database DevelopmentOctober, 2018
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
2
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Continuous Availability
Are your building blocks in place?
Automating Maintenance
Patching OJVM
Customer Story - Epsilon
1
2
3
4
3
5
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
From High Availability to Continuous Availability
4
• Minimizes downtime
• In-flight work is lost
• Rolling maintenance at DB
• Predictable runtime performance
• Errors may be visible
• Designed for single failure
• Basic HA building blocks
• No downtime for users
• In-flight work is preserved
• Maintenance is hidden
• Predictable performance
• Errors visible only if unrecoverable
• Designed for multiple failures
• Builds on top of HA
High Availability Continuous Availability
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Continuous Availability
Continuous Availability is not Absolute Availability.
Probable outages and maintenance events at the database level are masked from the application, which continues to operate with no errors and within the specified response time objectives while processing these events.
Key points:
• Planned maintenance and likely unplanned outages are hidden from applications
• There is neither data loss nor data inconsistency, guaranteed
• Majority of work (% varies by customer) completes within recovery time SLA
• May appear as a slightly delayed execution
Many customers are achieving Continuous Availability Today
5
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
There is no reason for users to see downtime during scheduled database maintenance
• Service is unavailable
• Application owners unable to agree maintenance windows
• Long running jobs see errors
• DBA’s and engineers work off hours
• Application and middleware components need to be restarted
6
No Outages: Maintenance without Impact
Family Holidays website is down"Our online transaction services are currently unavailable. Our server may be temporarily down or we may be performing routine maintenance functions scheduled every Sunday from 12 a.m. to 5 a.m. (Eastern Standard Time). We apologize for any inconvenience."
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
How Do We Solve This?
• Move work to different instance/database with no errors reported to applications
• Transparent to applications and mid-tiers
• Support all server side maintenance
– Patches, PSUs, upgrades, repairs, changes, unplug/plug, migration, expansion, h/w replacement
• Configure once, same for all commands
7
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Are Your Building Blocks in Place?
8
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
What Needs to be Configured?
• RAC or RAC One, Active Data Guard, GoldenGate, GDSWhich Server Stack for me?
• All databases on Flex ASMFlex ASM
• Services for Location TransparencyServices
• Connections appear continuous Continuous Connections
• FAN or 18c Database for drainingDraining
• Application Continuity Inflight work continues
• Drain in a timely mannerSLA’s
9
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Draining and Failover Locally – Switchover between sites
1010
Active Data Guard– Scheduled switchover– Data Protection, DR– Query Offload
Data Guard – Scheduled switchover– Data Protection, DR
GoldenGate– Scheduled switchover– Active-active replication– Heterogeneous
Sharding– Massive OLTP– Scheduled switchover– Active-active replication– Heterogeneous
Fast Application Notification– Notify draining, failover, load balancing
Transparent Application Continuity– Application HA
Global Data Services – Cross Site Placement
RAC– Online Rolling Maintenance– Scalability– Server HA
RAC One– Online Rolling Maintenance– Server HA
Production Site
Drain within RAC
Switchover ADG
Drain within RAC
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: Use Services for Location Transparency
• Regardless of location, application keeps the name
• Moving, reshaping, prioritizing controls how a service is offered
• Batch and OLTP separated
• DB and PDB names for admin only
11
Node 1
RAC instance
Node 2
RAC instance
OLTP service
Batch service
Services provide a “dial in number” for your application
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: Drain sessions before maintenance
Drain connections where applications do not notice
• Web requests
• Transaction boundaries
• Connection tests
• Connection pools
• Custom rules
12
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Moving Work Since Oracle Database 10.2
Repeat for each service allowing time to drain
• Stop servicesrvctl stop service –db .. -instance .. -service ..
• Relocate service srvctl relocate service –db .. -service .. –oldinst .. –newinst
srvctl relocate service –db .. -service .. –currentnode.. –targetnode
Wait for sessions to drain…
For remaining sessions, stop transactionalexec DBMS_APP_CONT.disconnect_session( ‘… your service ..‘,
DBMS_APP_CONT.POST_TRANSACTION);
Stop the instances using your preferred tool
13
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 14
Drain and Rebalance
TIP: Use Oracle Pools – Full Lifecycle
FAN Planned
Drain & rebalance
Applications using …
Oracle – WebLogic Active GridLink, UCP, ODP.NET managed and unmanaged, OCI Session Pool, Tuxedo, CMAN TDM
3rd party App Servers using UCP: IBM WebSphere, Apache Tomcat, NEC WebOTX, Red Hat JBoss, Spring
DBA Step srvctl [relocate|stop] service (no –force)
Sessions Drain
Immediately new work is redirected
Gradually
Active sessions are released when returned to pools
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: UCP with other Java-based Application Servers
15
A simple data source replacement
• IBM WebSphere
• Apache Tomcat
• NEC WebOTX
• Red Hat JBoss
• Spring
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: Configure TNS for Continuous Connections
16
Everything Oracle uses FAN
JDBC Universal Connection Pool
OCI/OCCI driver
ODP.NET Unmanaged Provider (OCI)
ODP.NET Managed Provider (C#)
OCI Session Pool
WebLogic Active GridLink
Tuxedo
JDBC Thin Driver
CMAN in Traffic Director mode
FAN is Auto-Configured
DESCRIPTION =
(CONNECT_TIMEOUT=90)
(RETRY_COUNT=20)(RETRY_DELAY=3)
(TRANSPORT_CONNECT_TIMEOUT=3)
(ADDRESS_LIST =
(LOAD_BALANCE=on)
( ADDRESS = (PROTOCOL = TCP)
(HOST=primary-scan)(PORT=1521)))
(ADDRESS_LIST =
(LOAD_BALANCE=on)
(ADDRESS = (PROTOCOL = TCP)
(HOST=standby-scan)
(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=gold)))
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: Are You Receiving FAN Notifications?• Create a FAN callout in GRID_HOME/racg/usrco
• Download FANwatcher from OTN oracle.com/goto/ac
18
FANwatcher..
VERSION=1.0 event_type=SERVICEMEMBER service=orcl_swing_pdb2 instance=orcl1 database=orcl db_domain= host=sun01 status=down reason=USER timestamp=2014-07-30 12:02:51 timezone=-07:00
VERSION=1.0 event_type=SERVICEMEMBER service=orcl_swing_pdb10 instance=orcl1 database=orcl db_domain= host=sun01 status=down reason=USER timestamp=2014-07-30 12:02:52 timezone=-07:00
VERSION=1.0 event_type=SERVICE service=orcl_swing_pdb10 database=orcl db_domain= host=sun01 status=down reason=USER
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Plan B
The application “tests” the connection
Database responds connection is bad
New work continues on another connection
Connection Tests, All Drivers
19
Application or mid-tier tests connections
Draining by the Oracle Database
Tip: enable in DBMS_APP_CONT
view in DBA_CONNECTION_TESTS
Stop or relocate services/PDBs
Database marks sessions to drain
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 20
Application Server Test Name Connection Test to DB
Oracle WebLogic –Generic & Multi data sources
TestConnectionsOnReserve
TestConnectionsOnCreate
isUsable
SQL – SELECT 1 FROM DUAL
Oracle WebLogic Active GridLink
embedded isUsable
IBM WebSphere PreTest Connections SQL - SELECT 1 FROM DUAL
Red Hat JBoss check-valid-connection-sql SQL - SELECT COUNT(*) FROM DUAL
Apache TomCatTestonBorrow
TestonReleaseSQL - SELECT 1 FROM DUAL
ODP.NET Unmanaged Connection.status OCI_ATTR_SERVER_STATUS
TIP: Enable Connection Tests for Application Servers
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Application Condition Connection Test to DB
eBusiness Suite Connection borrowed from WeblogicTestConnectionsOnReserve
with "BEGIN NULL; END;"
Fusion Applications Connection returned to WebLogic and C++
pools and checked
TestConnectionsOnReserve
with isValid
OCIPing
OCI_ATTR_SERVER_STATUS
Siebel Connection requested OCI_ATTR_SERVER_STATUS
Peoplesoft Connection requested OCIPing
Customer Custom pool with Meta data table
Checks status every 60sOCI_ATTR_SERVER_STATUS
TIP: Enable Connection Tests for Applications
2121
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
End of Request
22
Application or mid-tier ends a request
Stop or relocate services/PDBs
Database & Driver marks sessions to drain
Draining by the Oracle Database and Drivers
Tip: enable in DBMS_APP_CONT
view in DBA_CONNECTION_TESTS
Plan B continued
Request boundaries sent by all Oracle 12c Pools & JDK9
18c Database or the 18c Drivers close connection after end request
New work continues on another connection
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
•Group operations pdb,
instance, node, or database
•New service attributes
drain_timeout (seconds)
stopoption (immediate,
transactional)
23
DBA Operations Simplified
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
DBA commands: Group commands
Relocate all services by database/node/pdb
srvctl relocate service -database -instance -drain_timeout.. –stopoption
[immediate|transactional]
srvctl relocate service -node . -drain_timeout.. –stopoption
srvctl stop service –pdb . -drain_timeout.. –stopoption
Start/Stop everything at a node
srvctl stop service -node <node_name> -drain_timeout.. –stopoption
srvctl stop database -node <node_name> -drain_timeout.. –stopoption
Data Guard Switchover
Switchover to <db_resource_name> [wait [xx]];
24
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 25
Drain… Connect… Failover
Time
TPM
Node 1
Node 2
Drain Timeout
(1) Drain Node 1
(2) Connect to Node 2
(4) Replay in-flight requests
(3) Terminate in-flight requests
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Hides errors, timeouts, and maintenance
No application knowledge or changes to use
Rebuilds session state & in-flight transactions
Adapts as applications change: protected for the future
12c use Application Continuity
TIP: Use Transparent Application Continuity 18c
26
DatabaseRequest
Errors/Timeouts hidden
TAC
My application has not drained on time
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: Some Applications have Built-in Read-Only Failover
27
Example Applications Set Once and Done
Siebel
Recommended TNS + STOPOPTION = Transactional
Drain_Timeout = n
PeopleSoft
JD Edwards
Informatica FAN is autoconfigured for unplanned
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
With stopoption transactional:
• COMMIT_OUTCOME=true for Transaction Guard
• FAILOVER_RESTORE = LEVEL1
TIP: TAF SELECT with Transaction Guard – 12.2
28
DatabaseRequest
Disconnects
TAF + TG restore a new session + common states
If you have TAF SELECT now
For applications that do not change session state after connect
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Align Application Timeouts
29
TIP: Nothing to do when
Active/Active RAC
Switchover
DG Broker
RAC Primary RAC Standby
Stop or Relocate
Service
to
Drain Work
Drain Work
Switchover
TIP: Application Timeout
Drain + Switch to DG
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Automating Maintenance
30
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Improving the Maintenance Timeline
No RiskDoing things outside the maintenance window is the preferred solution for necessary steps such as• real-time checks• create gold image• out of place distribution• switch service• drain service
Performance RiskDoing things during this phase is acceptable for any necessary steps that cannot be done ahead of time e.g.• final pre-check• restart on new home
Maintenance Window
Move tasks to the left
Eliminate StepsEliminating unnecessary
steps is the best solution whenever possible such as backup existing home
Shrink as much as possible
Preparation
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Software Maintenance in the Time of Cloud
• Challenges• Complex, Error-prone Process• Maintaining standardization• Minimizing Downtime• Takes too much time and effort
OpenStackCloud
ProprietaryCloud
Datacenter
DatacenterDatacenter
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 3310/29/2018
Rapid Home Provisioning
Local target Connected Targets 11.2, 12.1, 12.2,
18c
RHP Server
Gold Image Repository
11.2.0.4
12.1.0.2
12.2.0.1
18c
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 34
RHP Support
Database & Grid Infrastructure
11.2.0.3.
11.2.0.4.
12.1.0.2
12.2.0.1
18c
VM VM
VM VM
VM VM
VM VM
• Single Instance• Oracle Restart• Oracle RAC One• Oracle RAC
BM
Non-CDB CDB/PDB
VM
• Generic Software
• Data Guard Aware
• Customizable
Multi-OS
Rapid Home Provisioning
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
OJVM Patching
35
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
OJVM PSU – Flow chart
36
RAC Rolling Install Process for the "Oracle JavaVM Component Database PSU" (OJVM PSU) Patches (Doc ID 2217053.1)
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
TIP: OJVM Rolling
• RAC will coordinate the apply of the OJVM patch across all nodes in a rolling fashion
• Non-OJVM work not affected
• RAC tracks sessions that use OJVM in the Database
• Phased approach: prepare, execute update, post-update
• During the execute update phase OJVM work is blocked
– < 10 seconds brownout
An Automated Approach to patching the OJVM
37
Non-OJVM
OJVM
Prepared Prepared
Exec Update
Blocked
RAC Rolling
©2014 Epsilon. Private & Confidential
Hiding Scheduled Maintenance at Epsilon
38
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
39
• Epsilon is comprised of three divisions that deliver a global platform for customer experience marketing
• More than 3,500 associates and 37 offices worldwide
• Largest permission-based e-mailer in the world, delivering over 40 billion emails annually
• World’s leading source of data with information covering over 250 million consumers and 22 million businesses
• More than 2,000 global clients, including 26 of the Fortune 100
9 out of 10 Top Banks
8 out of 10 Top Retailers
The Top 10 Pharmaceutical Companies
The Top 10 Automotive Companies
Epsilon at a Glance
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
40
• New client opportunity with extreme performance and availability requirements
• Real time POS integration with over 10000 sites
• Decrease time to market
• Real-time monitoring and reporting of system performance and health
• Need to run OLTP, batch and reporting workload concurrently against real time
data without impacting user experience
• Support over 2000 real-time transactions per second with SLA of less than 100 ms
and 99.95% availability
• Less than 8 hours RPO and RTO
High Level Business Requirements
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
41
• Dedicated connection model from application server to database
• Application was using ODAC 11gR5
• Database hardware platform using HP DL980 server with PCIe Flash storage and Hitachi storage array
• Database version 11gR2 (11.2.0.3)
Previous State and Challenges
• No draining was available using dedicated connection model
• Every planned maintenance/unplanned outage needs application layer restart –a major pain point
• Every application executable needs FAN notification port in odp.net
• Due to large number of dedicated connection , application server CPU utilization was high
• No support for commit outcome in case of failure in 11g
• Running mixed workload was not very user friendly on non-engineered system as there is no IO prioritization in tradition storage array
• Premier support of 11.2.0.3 was coming to an end ( Aug 2015 )
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
42
• Used connection pool instead of dedicated connection
• Application is now using ODAC 12cR3
• Database hardware platform using Exadata X4-2 half rack
• Database version 12cR1 ( 12.1.0.2 BP10 )
Current State and Resolutions
• Connection pool drains quickly after receiving FAN events
• No longer application server restart required for planned maintenance or unplanned outage of oracle stack – a big relief
• Only one port required for ONS remote communication ( 6200 ) in 12c instead of many ports in 11g
• CPU utilization reduced by 40% in application servers after connection pool implementation
• Fast Application Notification and Application Continuity provides a robust error handling mechanism
• Exadata is ideal platform for running mixed workload with both CPU and IO prioritization handling
• Moved to a supported release good for next 3 years
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
43
Technology Stack
• Exadata X4-2 half rack for both Primary and DR site
• Web servers
• Application server uses :
• ODP.NET ( ODAC 12cR3 )
• WebLogic server using Active Gridlink
• Oracle Database 12c ( 12.1.0.2 )
Real Application Cluster (RAC )
Data Guard
Fast Application Notification ( FAN )
Application Continuity ( AC )
Transaction Guard ( TG )
• Database backup uses ZFS Backup Appliance (ZS3-BA )
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
44
Web Server cluster
Application Server Cluster
ODAC 12cR3
Web Service Call
OLTP BatchRead
only
OLTP
Service
Batch
Service
RRAC Node 1
12.1.0.2
RRAC Node 2
12.1.0.2
Primary Site
Primary DB
Web Server cluster
Application Server Cluster
ODAC 12cR3
OLTP BatchRead
only
Report
Service
RRAC Node 1
12.1.0.2
RRAC Node 2
12.1.0.2
Standby Site
Standby DB
Normal Operation : Application service placement
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
45
Web Server cluster
Application Server Cluster
ODAC 12cR3
Web Service Call
OLTP BatchRead
only
OLTP
Service
Batch
Service
RAC Node 1
12.1.0.2
RAC Node 2
12.1.0.2
Primary Site
Primary DB
Web Server cluster
Application Server Cluster
ODAC 12cR3
OLTP BatchRead
only
Report
Service
RRAC Node 1
12.1.0.2
RRAC Node 2
12.1.0.2
Standby Site
Standby DB
Scheduled maintenance: Application service placement
Fast
Application
Notification
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
46
Web Server cluster
Application Server Cluster
ODAC 12cR3
Web Service Call
OLTP BatchRead
only
OLTP
Service
Batch
Service
RAC Node 1
12.1.0.2
RAC Node 2
12.1.0.2
Primary Site
Primary DB
Web Server cluster
Application Server Cluster
ODAC 12cR3
OLTP BatchRead
only
Report
Service
RRAC Node 1
12.1.0.2
RRAC Node 2
12.1.0.2
Standby Site
Standby DB
Scheduled maintenance: Application service placement
Fast
Application
Notification
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
47
Case study – Transaction response time
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
48
Case study – Overall transaction response time
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
49
Business Benefits
• Scheduled maintenance of Oracle technology stack can be done
without disrupting business user experience.
• Application restart is no longer required for scheduled maintenance
which is a major relief.
• Usage of connection pool reduces CPU utilization of middle tier servers
by 40%
• Database and operating system can be patched periodically
without taking any system downtime ( meet security compliance as
well as uptime SLA )
• We have similar success for Java based applications using WebLogic
Server Active Gridlink to hide scheduled maintenance operation without
any application code change
©2014 E
psilo
n D
ata
Managem
ent, L
LC
. Priv
ate
& C
onfid
entia
l
50
More Successes
• Application Continuity for ODP.NET to hide unplanned outages and
scheduled maintenance without any application code change
• Application Continuity for Java with WebLogic Server Active Gridlink to
hide unplanned outages and scheduled maintenance without any
application code change
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 51