HP IT-Symposium 2006
www.decus.de 1
Page 1
“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development,
release, and timing of any features or functionality described in this document remains at the sole discretion of Oracle. This document in any form, software or printed matter, contains proprietary information
that is the exclusive property of Oracle. This document and information contained herein may not be disclosed, copied,
reproduced or distributed to anyone outside Oracle without prior written consent of Oracle. This document is not part of your license agreement nor can it be incorporated into any contractual agreement
with Oracle or its subsidiaries or affiliates.
Automatic Failover across sites with
Data Guard Fast-Start Failover
DECUS, Duesseldorf 2006
Larry M. Carpenter
Principal Product Manager
High Availability & Disaster Recovery
Oracle USA
Agenda
• A quick look at Data Guard
• How do users perform Failover today?
• Fast-Start Failover – An Overview
• Fast-Start Failover – The Details
• Client Failover
• User Experiences
A quick look at Data Guard
What is Oracle Data Guard?
• Oracle’s Disaster Recovery solution for Oracle.
• A Feature of Oracle Database Enterprise Edition
• Automates the creation and maintenance of one or more transactionally consistent Standby databases.
• Provides comprehensive role management.
• Role transitions
  • Standby to Primary and back to Standby
  • For planned and unplanned outages
A Data Guard Configuration
• Managed as a single configuration
• Primary and standby databases can be Real Application Clusters or single-instance Oracle
• Up to nine standby databases supported in a single configuration
[Diagram: a Primary Database at the Primary Site and Standby Databases at Standby Site A and Standby Site B, with the Broker]
How do users perform Failover today?
Failover Implications
• Faster is better - Downtime is bad
  • If manual intervention is required, the time it takes to notify administrative staff can be lengthy
• Reliability is a must-have
  • Correct procedure for failover must be followed to meet data loss (recovery point) objective
• Simplicity is preferred
  • Determining if failure condition warrants failover adds time & complexity to the failover process
Best Practice: Add Standby Redo Logs
• A separate pool of log file groups on a standby site
• Used just like the online redo logs on a primary
• Requires local archiving on the standby database
• Must be the same size as the primary database online redo logs and at least as many in number, but more is better
• Cannot be assigned to a thread in 9i
• Are required for Zero Data Loss configurations as of Oracle Database 10g Release 1.
SRL Architecture
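[Diagram: redo from the primary database reaches RFS processes on the standby (labelled "RFS LGWR" and "ARCH RFS"), the Standby Redo Logs, ARCH, and the Archived Redo Logs, feeding physical and logical standby databases; parts of the diagram are marked "New! 10g"]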
Benefits
• Better Performance
  • Standby redo logs are pre-allocated files
  • Can reside on raw devices
• Better Protection
  • Can have multiple members
  • If primary database failure occurs, redo data written to standby redo logs can be fully recovered.
Failover
• Failover needed when switchover not possible
  • i.e. the primary is gone!
• Same basic steps as switchover but some processing might be done manually
• Remember!
  • Don’t plan for DR by expecting to be able to return to the Primary and ‘get’ something.
  • You won’t be able to ‘get’ anything.
Choose a Standby
• Choose a standby site with the most up-to-date redo information
  • If the Primary was ‘protected’ then one site must have the ‘latest’ redo information in its standby redo logs
  • Archive log file ‘GAPs’ at this site must be resolved from the other standby sites
• Choose a physical standby and the other standby databases will come along if possible.
• Choose a logical standby and none of the other standby databases can come along.
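The selection rule above can be sketched in a few lines of Python. This is only an illustrative model, not anything Data Guard provides: the `Standby` record, the `last_received_scn` field, and the site names are hypothetical stand-ins for whatever the DBA uses to judge which site holds the latest redo.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    name: str
    kind: str               # "physical" or "logical"
    last_received_scn: int  # hypothetical measure of how current the redo is


def choose_failover_target(standbys):
    """Return the standby holding the most recent redo."""
    if not standbys:
        raise ValueError("no surviving standby databases")
    # On a tie, prefer a physical standby so the other standbys may follow.
    return max(standbys, key=lambda s: (s.last_received_scn, s.kind == "physical"))


def standbys_that_follow(target, standbys):
    """Others may follow a physical target; none can follow a logical one."""
    if target.kind == "logical":
        return []
    return [s.name for s in standbys if s.name != target.name]
```

Calling `choose_failover_target` with the surviving sites and then `standbys_that_follow` on the result shows what the choice costs: picking a logical standby leaves every other standby behind.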
Physical Standby Failover
Failover to a Physical Standby
[Diagram: Primary Database and Physical Standby Database; steps run on the standby]
1. ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
2. ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
3. RESTART DATABASE
Step 1 Improvements
• In Oracle Database 10g Release 2 (10.2.0.1)
• New FORCE keyword
  • RECOVER MANAGED STANDBY DATABASE FINISH FORCE;
• The new FORCE option stops active RFS processes on the target standby database so the failover will proceed immediately, without waiting for network connections to time out, once logs have been applied.
• SQLNET.EXPIRE_TIME no longer necessary.
Step 3 Improvements
• In Oracle Database 10g Release 2 (10.2.0.1)
• No longer necessary to restart the Standby for it to become the Primary, just do an:
  • ALTER DATABASE OPEN;
• Requires that the standby was not opened read only since it was last started.
• Speeds up failover time considerably.
Failover to a Physical Standby
[Diagram: Primary Database and Physical Standby Database; steps as of 10.2.0.1]
1. ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;
2. ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
3. ALTER DATABASE OPEN;
Logical Standby Failover
Failover to a Logical Standby
[Diagram: Primary Database and Logical Standby Database; steps run on the standby]
1. ALTER DATABASE STOP LOGICAL STANDBY APPLY;
2. ALTER DATABASE START LOGICAL STANDBY APPLY FINISH;
3. ALTER DATABASE STOP LOGICAL STANDBY APPLY;
4. ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE;
Reduced Steps
• In Oracle Database 10g Release 2 (10.2.0.1)
• Apply finish and failover now in one command.
• No longer necessary to start or stop the Apply.
Failover to a Logical Standby
[Diagram: Primary Database and Logical Standby Database; single step as of 10.2.0.1]
1. ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE FINISH APPLY;
Using the Data Guard Broker
One Step always!
• Log in to DGMGRL by connecting to any surviving database in the configuration
• And execute the failover!
  • DGMGRL> FAILOVER TO <database>;
• You still need to decide which standby to use as the target of the failover!
What about Grid Control?
Fast-Start Failover
Eliminate the Manual Steps!
Remember the Requirements?
• Faster is better - Downtime is bad?
  • Site failover time measured in seconds
    • Not minutes
  • Failover is automatic, no manual intervention
• Reliability is a must-have?
  • Eliminates human error
• Simplicity is preferred?
  • Automatically determines if failover criteria met
  • Original primary database is automatically reinstated as a new standby database.
Fast-Start Failover Architecture
• Primary Database
• Target Standby Database
• Observer Process
[Diagram: a database at the Primary Site, a database at the Standby Site, and the Observer]
Fast-Start Failover
The Details
Fast-Start Failover Requirements
• Primary and Standby are managed by the Data Guard Broker
• Primary database must be in Maximum Availability mode
• Primary and standby must have Flashback Database enabled
• Observer host must have DGMGRL utility installed and must have Oracle Net connectivity to both the primary and standby
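As a rough aid, the four requirements above can be written as a pre-flight checklist. The sketch below is a plain Python model: the dictionary keys are hypothetical names chosen for illustration, and nothing here queries a real database or the Broker.

```python
def fsfo_blockers(config):
    """Return the unmet Fast-Start Failover prerequisites (empty list if ready).

    `config` is a hypothetical dict describing the configuration; the key
    names are assumptions made for this sketch.
    """
    blockers = []
    if not config.get("broker_managed"):
        blockers.append("primary and standby must be managed by the Broker")
    if config.get("protection_mode") != "MaxAvailability":
        blockers.append("primary must be in Maximum Availability mode")
    if not (config.get("primary_flashback") and config.get("standby_flashback")):
        blockers.append("Flashback Database must be enabled on both databases")
    if not (config.get("observer_reaches_primary")
            and config.get("observer_reaches_standby")):
        blockers.append("Observer needs Oracle Net connectivity to both databases")
    return blockers
```

An empty result means every listed requirement is satisfied in the model; each string otherwise names one slide bullet that still needs attention.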
Setting it up using Grid Control
Using the Broker Directly
DGMGRL command line interface
Set the Target and Threshold
• Configure
  • “FastStartFailoverTarget” is the “DB_UNIQUE_NAME” of the target standby database. Using DGMGRL:
    • DGMGRL> EDIT DATABASE 'North_Sales' SET PROPERTY FastStartFailoverTarget = 'DR_Sales';
  • “FastStartFailoverThreshold” is the number of seconds the Observer attempts to reconnect to the primary database before initiating fast-start failover
    • DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 45;
Enable and Startup
• Enable
  • Can be done before or after the Observer is started
  • DGMGRL> ENABLE FAST_START FAILOVER;
• Start
  • Log in to the Observer host
  • DGMGRL> START OBSERVER;
  • Control is not returned to the user until the observer is stopped
  • Specify the -logfile parameter on the DGMGRL command line so that output generated while acting as the observer is not lost.
Post Failover
• Reinstate After Failover
  • Auto reinstatement of old primary as a new standby will happen when the original Primary database is available again.
  • Can also be performed manually
    • DGMGRL> REINSTATE DATABASE <database>;
How does it work?
Fast-Start Failover
[Diagram: Primary Site, Standby Site, Observer]
1. Data Guard in steady state – transmitting redo
2. Observer monitoring state of the configuration
Fast-Start Failover
[Diagram: Primary Site, Standby Site, Observer]
3. Disaster strikes the primary – connections lost
Fast-Start Failover
[Diagram: Primary Site, Standby Site, Observer]
4. Observer <=> primary connection times out (timeout threshold configurable)
5. Observer asks target standby if it is ready to fail over
6. Observer begins Fast-Start Failover
Fast-Start Failover
[Diagram: Observer and the new Primary Site]
7. Target standby automatically becomes new primary
Fast-Start Failover
[Diagram: Observer, Standby Site, Primary Site]
8. After old primary is repaired, Observer re-establishes connection
9. Observer automatically reinstates old primary to be a new standby
10. Redo transmission starts from new primary to new standby
When is a Fast-Start Failover Triggered?
• Primary Site Failure
• Primary Database Conditions:
  • Instance Failure
    • Last surviving instance if RAC
  • Shutdown abort of the last available instance
  • Datafiles taken offline due to I/O errors
    • Threshold ignored when performing a failover due to offline datafiles
When is a Fast-Start Failover Triggered?
• Network Related Conditions:
  • Failover occurs only if both the link between the primary and the Observer and the link between the primary and the standby are down
  • Requires a connection between the Observer and the standby so that the Observer can confirm that the configuration is in a synchronized state
  • By ensuring that at least two fast-start failover partners are present, conditions such as split-brain scenarios are avoided
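The network-related conditions can be read as a single boolean rule. The following Python sketch models only that rule as stated on the slide; the function and argument names are hypothetical, and the real Observer logic is internal to Data Guard and is not reproduced here. The datafile-offline case, where the threshold is ignored, is deliberately left out.

```python
def should_fail_over(observer_sees_primary, standby_sees_primary,
                     observer_sees_standby, standby_synchronized,
                     seconds_disconnected, threshold_seconds):
    """Model of the network-related fast-start failover conditions."""
    # If either partner can still reach the primary, do nothing.
    if observer_sees_primary or standby_sees_primary:
        return False
    # The Observer must reach the standby to confirm the configuration state.
    if not observer_sees_standby:
        return False
    # Only a synchronized standby is a valid target in this model.
    if not standby_synchronized:
        return False
    # Wait out the configured threshold before acting.
    return seconds_disconnected >= threshold_seconds
```

Requiring two of the three partners to agree is what keeps an isolated Observer or an isolated standby from promoting itself on its own view of the network.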
Fast-Start Failover Monitoring
• Monitor current state of configuration via FS_FAILOVER_STATUS column of V$DATABASE
  • SYNCHRONIZED – Primary and Standby are in sync
  • UNSYNCHRONIZED – Standby does not have all of the primary database redo
• Monitor the Observer via the FS_FAILOVER_OBSERVER_PRESENT column of the V$DATABASE view
Reinstatement after Fast-Start Failover
• Any attempt to start old primary will stop at the mount state thus preventing split brain
• Once Observer sees the old primary is at the mount state, reinstatement is begun
• The old primary is automatically reinstated as the new standby using flashback database
• Once reinstated and synchronized then a switchover can occur if desired – returning all systems to their original roles
Best Practices – Primary Database
• Maximum Availability Protection Mode
  • Redo Transport: LGWR SYNC AFFIRM
  • Synchronous Redo Shipping . . . but
    • Primary is not affected by network or standby outages
    • Set net_timeout parameter to override TCP timeout
• Configure Flashback Database
  • Set DB_FLASHBACK_RETENTION_TARGET = 10 minutes
Note: If Flashback Database also serves to protect against user error and corruption, set a longer flashback retention period, long enough to achieve those goals.
Best Practices – Network Transport
• Tune OS & network parameters
  • Set SDU=32K
  • Tune network parameters that affect network buffer sizes and queue lengths
• Ensure sufficient network bandwidth for maximum database redo rate + other activities
Refer to Primary Site and Network Configuration Best Practices
http://www.oracle.com/technology/deploy/availability/pdf/MAA_DG_NetBestPrac.pdf
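The bandwidth bullet amounts to simple arithmetic, sketched below in Python. The function names and the sample figures are assumptions for illustration only; the peak redo rate has to come from your own measurements, and the sketch ignores protocol overhead.

```python
def required_mbits_per_sec(peak_redo_bytes_per_sec, other_traffic_mbits=0.0):
    """Bandwidth needed for peak redo plus other traffic, in Mbits/sec."""
    redo_mbits = peak_redo_bytes_per_sec * 8 / 1_000_000
    return redo_mbits + other_traffic_mbits


def link_is_sufficient(peak_redo_bytes_per_sec, other_traffic_mbits,
                       measured_throughput_mbits):
    """True when the measured link throughput covers the requirement."""
    needed = required_mbits_per_sec(peak_redo_bytes_per_sec, other_traffic_mbits)
    return measured_throughput_mbits >= needed
```

For example, a hypothetical peak redo rate of 5,000,000 bytes/sec with 10 Mbits/sec of other traffic needs 50 Mbits/sec, so the comparison should use the throughput the link actually achieves rather than its nominal rating.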
Impact of Network Tuning
[Chart: network throughput in Mbits/sec, Oracle MAA test result. Tuned: 937; Default: 10.8]
Best Practices – Standby Database
• Use Standby Redo Logs
• Use Real-time Apply
• Configure Flashback Database
  • Set DB_FLASHBACK_RETENTION_TARGET = 10 (minutes)
• Optimize Apply Performance using MAA Best Practices for:
  • Redo Apply (physical standby): Data Guard Redo Apply and Media Recovery
  • SQL Apply (logical standby): Oracle Database 10g Data Guard SQL Apply
• MAA Home Page on OTN: http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
Best Practices – Observer
• Install in a separate location from Primary & Standby data centers
  • Do not locate the Observer at or near the primary site
  • Proximity to the standby site is preferred, but far enough away to be isolated from events that typically impact the standby site
• Oracle Client Administrator install is all that is required for Observer install
• If using Enterprise Manager, also install the Enterprise Manager Agent
Best Practices – Setting FastStartFailoverThreshold
• Failover occurs when the Observer and the standby have both lost contact with the primary for the specified time (seconds)
• Recommended settings:
  • Single-instance primary with a low-latency, reliable network = 10 – 15 seconds
  • Single-instance primary with a high-latency network over a WAN = 30 – 45 seconds
  • RAC primary = (misscount + reconfiguration time) + 20 – 40 seconds
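The recommended settings translate directly into a small lookup. The Python sketch below simply encodes the three cases listed above; the function name and the `primary_kind` labels are hypothetical, and the misscount and reconfiguration time must be taken from your own cluster.

```python
def recommended_threshold_range(primary_kind, misscount=0, reconfiguration_time=0):
    """Return (low, high) seconds for FastStartFailoverThreshold.

    primary_kind is one of "single_lan", "single_wan", or "rac"
    (labels invented for this sketch).
    """
    if primary_kind == "single_lan":
        return (10, 15)
    if primary_kind == "single_wan":
        return (30, 45)
    if primary_kind == "rac":
        base = misscount + reconfiguration_time
        return (base + 20, base + 40)
    raise ValueError(f"unknown primary kind: {primary_kind}")
```

For a RAC primary with an assumed misscount of 30 seconds and a reconfiguration time of 10 seconds, the range works out to 60 to 80 seconds, which leaves the cluster time to recover from a node failure before a site failover is considered.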
Best Practices – Multiple Standbys
• Ensure data protection at all times by maintaining a 2nd Data Guard standby at a remote location
  • When regulatory & business requirements mandate that data be protected at all times
• At failover time, the remote standby automatically becomes a standby for the new primary
  • New primary must have begun as a physical standby
• Configure the remote standby for Maximum Performance
  • Eliminates overhead of WAN network latency
  • Recommended redo transport is LGWR ASYNC
Best Practices – HA & DR
• Use RAC & Data Guard Together
  • The best possible combination of HA & DR
  • Scalable, flexible, secure
  • Foundation for MAA
Client Failover
Oracle Database 10g Release 1 vs Oracle Database 10g Release 2
Oracle Database 10g Release 1 Client Failover
[Diagram: Primary Site, Standby Site, JDBC/OCI Clients]
1. Primary site failure
2. Both FAN ONS (JDBC) and OCI clients wait for TCP timeout
Oracle Database 10g Release 1 Client Failover
[Diagram: Failed Primary Site, New Primary Site, JDBC/OCI Clients]
3. Data Guard manual failover is executed, standby database transitions to primary role
4. FAN ONS (JDBC) and OCI clients are NOT notified of new primary cluster
5. Clients are redirected manually using TAF or some other mechanism
6. Old primary database is rebuilt from a new backup
A Sample Hardware Solution
[Diagram labels: Internet, Cisco Distributed Directory, a Cisco Local Directory at each site, Application Servers at each site, RAC Primary and RAC Standby linked by Data Guard]
A DNS Solution
[Diagram labels: Internet, DNS Server, Application Servers at each site, RAC Primary and RAC Standby linked by Data Guard]
A TNS Method
• Use the ‘service_names’ parameter so that whichever database is currently the Primary registers the service name the applications look for.
• Works with Dynamic registering of the database with the listener when the database is mounted.
• Requires that the parameter be changed accordingly.
• Requires a 2nd tnsnames entry for Data Guard transport services.
On the Primary System
• Configure the listener and start it.
• LISTENER.ORA

LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = Primary)(PORT = 1521))
  )

• Status

Listener Parameter File   /private2/oracle/OraHome/network/admin/listener.ora
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=Primary)(PORT=1521)))
Services Summary...
Service "payroll.us.oracle.com" has 1 instance(s).
Service "payrollDR.us.oracle.com" has 1 instance(s).
The command completed successfully
On the Standby System
• Configure the listener and start it.
• LISTENER.ORA

LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = Standby)(PORT = 1521))
  )

• Status

Listener Parameter File   /private2/oracle/OraHome92/network/admin/listener.ora
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=Standby)(PORT=1521)))
Services Summary...
Service "payrollDR.us.oracle.com" has 1 instance(s).
The command completed successfully
Primary System TNS
• Primary tnsnames.ora

PAYROLLDR =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = Standby.us.oracle.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = payrollDR.us.oracle.com)
    )
  )

PAYROLL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = Standby.us.oracle.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = Primary.us.oracle.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = payroll.us.oracle.com)
    )
  )
Standby System TNS
• Standby tnsnames.ora

PAYROLLDR =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = Primary.us.oracle.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = payrollDR.us.oracle.com)
    )
  )

PAYROLL =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = Standby.us.oracle.com)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = Primary.us.oracle.com)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = payroll.us.oracle.com)
    )
  )
After Switchover or Failover
• Reset the service_names parameter
  • New Primary (the old standby):
    ALTER SYSTEM SET SERVICE_NAMES='payroll,payrollDR';
  • New standby (the old primary):
    ALTER SYSTEM SET SERVICE_NAMES='payrollDR';
• The LOG_ARCHIVE_DEST definitions point to each other using payrollDR
• The application client systems only have the Payroll definition so they try system 1 first then system 2.
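To see why resetting the service names redirects clients, the address-list behaviour can be modelled in Python. This is a simplified sketch, not Oracle Net: the `connect` function and the `listeners` dictionary are hypothetical, and a missing host in the dictionary stands in for an unreachable system.

```python
def connect(address_list, service_name, listeners):
    """Try each host in order; return the first one offering the service.

    `listeners` maps host name to the set of services registered with its
    listener. A host absent from the map models an unreachable system.
    """
    for host in address_list:
        services = listeners.get(host)
        if services is not None and service_name in services:
            return host
    raise ConnectionError(f"no listener offers {service_name}")


# Address order taken from the PAYROLL entry: Standby first, then Primary.
PAYROLL_ADDRESSES = ["Standby.us.oracle.com", "Primary.us.oracle.com"]
```

Before a role change only the primary's listener has the payroll service, so the client skips the standby and lands on the primary; once the new primary registers payroll, the same unchanged client entry resolves to it.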
A TNS setup
[Diagram labels: Primary Site, Secondary Site, Users, Trading, Operational Data Store, DBLINKS, Data Guard]
What about RAC?
• Multiple addresses are required for the Primary RAC nodes to facilitate node failover
• Cannot really have multiple Primary RAC hosts and then Standby RAC hosts in the same connect string.
  • Would require too many connect timeouts to get to the standby
• Need a better, more proactive method
• Let’s talk about Oracle Database 10g Release 2
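The "too many connect timeouts" point is easy to quantify with a small model. In the Python sketch below, the host names and the 60-second timeout are made-up illustration values; the function only adds up one timeout per unreachable address tried before the first reachable one.

```python
def seconds_until_connected(address_list, reachable_hosts, connect_timeout_seconds):
    """Total time spent on timeouts before reaching a live host.

    Returns None if no address in the list is reachable.
    """
    waited = 0
    for host in address_list:
        if host in reachable_hosts:
            return waited
        # Each dead address costs one full connect timeout.
        waited += connect_timeout_seconds
    return None
```

With four primary RAC nodes listed ahead of the standby nodes and an assumed 60-second timeout, a client waits 240 seconds after a site failure before it even tries the standby, which is the motivation for a notification-based method.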
Oracle Database 10g Release 2 Improved Client Failover
[Diagram: Failed Primary Site, New Primary Site, JDBC/OCI Clients, Observer]
1. Observer detects failure, executes database failover when threshold is exceeded
2. DB_ROLE_CHANGE trigger fires: enables primary service, updates Oracle Net alias to point to new primary host, restarts JDBC mid-tier clients, calls any other application or pre-failover steps (the user writes the trigger code)
3. DB_DOWN event is sent to FAN OCI clients
4. Both FAN ONS (JDBC) and OCI clients drop connections and re-attach to the new primary
5. Upon restart, the old primary database is reinstated automatically by Fast-Start Failover
Client Failover Components
• Connect Time Failover
  • Redirects failed connection requests to a secondary listener
• Transparent Application Failover (TAF)
  • Client applications automatically reconnect to a database if the original connection fails.
• Fast Application Notification (FAN)
  • Provides quick notification when a resource (an instance, service, node, or database) fails.
Client Failover Components
• Fast Connection Failover (FCF)
  • Provides fast failover of database connections by allowing you to configure FAN-integrated clients to automatically subscribe to FAN HA events.
• DB_ROLE_CHANGE system event
  • Fired when the primary database is first opened after a Data Guard role transition has occurred.
• DB_DOWN Event
  • Fired by the Broker after a failover
DB_ROLE_CHANGE Event
• New DB_ROLE_CHANGE system event fires.
• A Trigger written around DB_ROLE_CHANGE event can automatically:
  • Enable primary service name
  • Modify LDAP or other naming methods
  • Restart JDBC mid-tier clients
  • Start user applications
• Happens at all role changes.
• All details with examples are described in the MAA Best Practices paper published on OTN
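A Sample Trigger

SQL> CREATE OR REPLACE TRIGGER set_rc_svc
  AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
  role VARCHAR(30);
BEGIN
  -- Find out which role this database now holds
  SELECT DATABASE_ROLE INTO role FROM V$DATABASE;
  IF role = 'PRIMARY' THEN
    -- New primary: start the service the applications connect to
    DBMS_SERVICE.START_SERVICE('payroll');
    BEGIN
      -- Run the external script that updates LDAP
      dbms_scheduler.create_job(
        job_name   => 'change_ldap',
        job_type   => 'executable',
        job_action => '/u01/oracle/10.2.0/bin/change_ldap.sh',
        enabled    => TRUE);
    END;
    BEGIN
      -- Run the external script that publishes events to clients
      dbms_scheduler.create_job(
        job_name   => 'publish_events',
        job_type   => 'executable',
        job_action => '/u01/oracle/10.2.0/bin/cfo.sh',
        enabled    => TRUE);
    END;
  ELSE
    -- Not the primary: make sure the application service is stopped
    DBMS_SERVICE.STOP_SERVICE('payroll');
  END IF;
END;
/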
Data Guard Broker Client Redirection
• New DB_DOWN event is posted after the new primary is open
  • Event notifies FAN OCI clients that the old primary is down
  • Clients reconnect to the new primary/service
  • Done via AQ notifications
• Occurs only during a Broker Failover
  • A Fast-Start Failover
  • Manual “Failover to <database>” command
Fast-Start Failover
How Available? Amazon.com
How Fast? Fannie Mae
How Simple? Thomson Legal & Regulatory
How Reliable? Airbus Deutschland GmbH
How Available? - Amazon.com
The capability of fast, guaranteed zero-data-loss failover with Fast-Start Failover in Oracle Data Guard takes the availability of an Oracle database platform to new levels. Our initial tests running Oracle Database 10g Release 2, show that Fast-Start Failover offers a magnitude of improvement in availability.
Rajesh Sheth
Manager, Database Engineering
Amazon.com
How Fast? - Fannie Mae
Fast-start Failover takes the DBA off the critical path. Database failover is automatic. Data Guard can now address recovery time objectives measured in seconds.
Ranjit Singh Veen
Manager, Enterprise Systems Management
Fannie Mae
How Simple? - Thomson Legal & Regulatory
Fast-start failover testing has shown great potential. The original primary database can be reinstated as a new standby in less than 5 minutes once the initial failure has been corrected.
Thomson Legal & Regulatory
How Reliable? - Airbus
Failover executed automatically without manual intervention in less than a minute. This was much faster than a cold failover using third party cluster technology. With Data Guard, Airbus can achieve continuous data protection and high levels of availability using a standard feature of the Oracle Database.
Werner Kawollek Application Management Operations Airbus Deutschland GmbH
Questions & Answers