A Deep Dive into ASM Redundancy in Exadata


Upload: emre-baransel

Post on 25-Jul-2015


Page 1: A Deep Dive into ASM Redundancy in Exadata

1 – Overview

2 – Failure

3 – Second Failure

4 – Usable Space

5 – ASMCMD "lsdg" Output

Emre Baransel – Advanced Support Engineer, Employee ACE – Oracle

A Deep Dive into ASM Redundancy in Exadata

Page 2: A Deep Dive into ASM Redundancy in Exadata

Storage Servers Notation

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

We’ll consider 3 storage servers in the examples.

Page 3: A Deep Dive into ASM Redundancy in Exadata

Disks on Storage Servers

[Diagram: disks numbered 1 to 12 on Storage Server 1, Storage Server 2 and Storage Server 3]

Page 4: A Deep Dive into ASM Redundancy in Exadata

Physical Disks

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3; label: PHYSICAL DISK]

Page 5: A Deep Dive into ASM Redundancy in Exadata

Logical Partitions/Diskgroups

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3; labels: SYSTEM PARTITIONS, DBFS DG, RECO DG, DATA DG]

Page 6: A Deep Dive into ASM Redundancy in Exadata

Grid Disks (Partitions)

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3; labels: GRID/ASM DISKS, SYSTEM PARTITIONS, DBFS DG, RECO DG, DATA DG]

Page 7: A Deep Dive into ASM Redundancy in Exadata

Disks Usage Notation

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3; labels: SYSTEM PARTITIONS, DBFS DG, RECO DG, DATA DG]

Page 8: A Deep Dive into ASM Redundancy in Exadata

Normal Redundancy Diskgroups

[Diagram: NORMAL REDUNDANCY across FAILGROUP 1, FAILGROUP 2, FAILGROUP 3]

Page 9: A Deep Dive into ASM Redundancy in Exadata

High Redundancy Diskgroups

[Diagram: HIGH REDUNDANCY across FAILGROUP 1, FAILGROUP 2, FAILGROUP 3]
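The difference between the two redundancy levels can be illustrated with a small model. This is only a sketch: it assumes two copies of each extent for normal redundancy and three for high redundancy, each copy in a different failgroup, and it uses a simple rotating placement that is purely hypothetical (ASM selects the actual disks internally). The names `place_extent` and `survives` are invented for this illustration.

```python
# Illustrative model only: real ASM decides extent placement internally.
COPIES = {"NORMAL": 2, "HIGH": 3}  # mirror copies kept per extent


def place_extent(extent_no, failgroups, redundancy):
    """Return the failgroups that hold the copies of one extent."""
    copies = COPIES[redundancy]
    if copies > len(failgroups):
        raise ValueError("not enough failgroups for this redundancy level")
    start = extent_no % len(failgroups)
    # Rotate through the failgroup list so no two copies share a failgroup.
    return [failgroups[(start + i) % len(failgroups)] for i in range(copies)]


def survives(extent_no, failgroups, redundancy, failed):
    """True if at least one copy lives outside the failed failgroups."""
    placement = place_extent(extent_no, failgroups, redundancy)
    return any(fg not in failed for fg in placement)


FAILGROUPS = ["FAILGROUP 1", "FAILGROUP 2", "FAILGROUP 3"]
print(place_extent(0, FAILGROUPS, "NORMAL"))
print(place_extent(0, FAILGROUPS, "HIGH"))
```

In this model, losing two failgroups at once can destroy every copy of a normal redundancy extent, while a high redundancy extent still has one copy left.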

Page 10: A Deep Dive into ASM Redundancy in Exadata

Types of Failures

- Disk Failure
  - transient disk failure
  - physical disk failure
- Storage Server Failure

This presentation examines failures in groups, in order to provide clarity. There may be exceptional cases.

Page 11: A Deep Dive into ASM Redundancy in Exadata

Transient Disk Failures

[Diagram: TRANSIENT FAILURE (OFFLINE) on a disk; Storage Server 1, Storage Server 2, Storage Server 3; labels: SYSTEM PARTITIONS, DBFS DG, RECO DG, DATA DG]

Page 12: A Deep Dive into ASM Redundancy in Exadata

Transient Disk Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3; label: FAILURE CORRECTED or NEW DISK]

FAILURE CORRECTED OR DISK REPLACED BEFORE DISK_REPAIR_TIME EXPIRES

Page 13: A Deep Dive into ASM Redundancy in Exadata

Transient Disk Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

DISK IS RESYNCED WITH ASM FAST MIRROR RESYNC

Page 14: A Deep Dive into ASM Redundancy in Exadata

Transient Disk Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

IF DISK_REPAIR_TIME IS EXCEEDED, ASM DROPS THE DISKS AND REBALANCES DATA IF THERE IS ENOUGH SPACE
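The transient failure flow on these slides amounts to a simple decision, sketched below. The function name and the outcome labels are invented for this illustration, and treating "offline for exactly DISK_REPAIR_TIME" as expired is an assumption; the slides do not say what happens when there is not enough space, so that case is only flagged.

```python
from datetime import timedelta


def transient_failure_outcome(offline_for, disk_repair_time, enough_free_space):
    """Map a transient disk outage to the outcome described on the slides."""
    if offline_for < disk_repair_time:
        # Disk came back (or was replaced) in time: only changed extents are resynced.
        return "RESYNC"
    if enough_free_space:
        # Timer expired: ASM drops the disk and rebalances onto the remaining disks.
        return "DROP_AND_REBALANCE"
    # Not covered in detail by the slides; flagged here rather than modelled.
    return "NOT_ENOUGH_SPACE"


default_repair_time = timedelta(minutes=216)  # 3.6 hours, the default
print(transient_failure_outcome(timedelta(hours=1), default_repair_time, True))
print(transient_failure_outcome(timedelta(hours=4), default_repair_time, True))
```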

Page 15: A Deep Dive into ASM Redundancy in Exadata

DISK_REPAIR_TIME Attribute

• DISK_REPAIR_TIME is a diskgroup attribute.

• Default is 3.6 hours.

• Example: alter diskgroup data set attribute 'disk_repair_time' = '4.5h';

• Altering the DISK_REPAIR_TIME attribute has no effect on disks that are already offline.
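When comparing repair-time settings, it can help to normalise values such as '3.6h' or '4.5h' to minutes. The helper below is a hypothetical utility, not part of any Oracle tool; it assumes the value always carries an explicit `h` (hours) or `m` (minutes) suffix and uses `Decimal` to avoid floating-point surprises.

```python
from decimal import Decimal


def repair_time_minutes(value):
    """Convert a DISK_REPAIR_TIME string such as '3.6h' or '90m' to minutes."""
    text = value.strip().lower()
    if not text or text[-1] not in ("h", "m"):
        raise ValueError("expected a value ending in 'h' or 'm': %r" % value)
    number = Decimal(text[:-1])
    # Hours are scaled to minutes; minute values pass through unchanged.
    return number * 60 if text[-1] == "h" else number


print(repair_time_minutes("3.6h"))  # the default setting
print(repair_time_minutes("4.5h"))  # the value used in the example statement
```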

Page 16: A Deep Dive into ASM Redundancy in Exadata

Physical Disk Failures

[Diagram: PHYSICAL DISK FAILURE; Storage Server 1, Storage Server 2, Storage Server 3; labels: SYSTEM PARTITIONS, DBFS DG, RECO DG, DATA DG]

Page 17: A Deep Dive into ASM Redundancy in Exadata

Physical Disk Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

ASM DOESN’T WAIT FOR DISK_REPAIR_TIME; IT DROPS THE DISK AND REBALANCES DATA IF THERE IS ENOUGH SPACE

(Pro-Active Disk Quarantine - 11.2.1.3.1)

Page 18: A Deep Dive into ASM Redundancy in Exadata

Physical Disk Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

WHEN THE DISK IS REPLACED, GRID DISKS ARE CREATED AND A SECOND REBALANCE STARTS AUTOMATICALLY

Page 19: A Deep Dive into ASM Redundancy in Exadata

Auto Disk Management

AUTO DISK MANAGEMENT feature in EXADATA

Exadata Automation Manager (XDMG) initiates automation tasks and monitors all configured storage cells for state changes.

Exadata Automation Worker (XDWK) performs automation tasks requested by XDMG.

_AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature, set this parameter to FALSE. Range of values: TRUE [default] or FALSE.

_AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.

_AUTO_MANAGE_MAX_ONLINE_TRIES controls the maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.

NOTE:1484274.1 - Auto disk management feature in Exadata

Page 20: A Deep Dive into ASM Redundancy in Exadata

Storage Server Failures

[Diagram: one storage server marked FAILED; Storage Server 1, Storage Server 2, Storage Server 3]

Page 21: A Deep Dive into ASM Redundancy in Exadata

Storage Server Failures

• WHEN A STORAGE SERVER FAILS, IT MEANS THE FAILURE OF THE WHOLE FAILGROUP IN ASM

• ASM DOES NOT DROP DISKS BEFORE DISK_REPAIR_TIME IS EXCEEDED

• THE SAME APPLIES WHEN REBOOTING THE STORAGE SERVER

Page 22: A Deep Dive into ASM Redundancy in Exadata

Storage Server Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

IF THE SERVER IS BACK BEFORE DISK_REPAIR_TIME IS EXCEEDED, DISKS WILL BE SYNCED – NO REBALANCE

Page 23: A Deep Dive into ASM Redundancy in Exadata

Storage Server Failures

[Diagram: one storage server marked FAILED; Storage Server 1, Storage Server 2, Storage Server 3]

IF DISK_REPAIR_TIME IS EXCEEDED, ASM WILL REBALANCE DATA IF THERE IS ENOUGH SPACE

Page 24: A Deep Dive into ASM Redundancy in Exadata

Storage Server Failures

[Diagram: Storage Server 1, Storage Server 2, Storage Server 3]

WHEN THE STORAGE SERVER COMES BACK, THERE WILL BE A SECOND REBALANCE
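The storage server case can be summarised as a short list of ASM actions per outage. The sketch below assumes there is enough free space for the first rebalance and treats an outage of exactly DISK_REPAIR_TIME as expired; the function name and action labels are invented for this illustration.

```python
def storage_server_outage_actions(outage_minutes, repair_time_minutes):
    """List the ASM actions for one storage server (failgroup) outage.

    Assumes enough free space exists for the rebalance after the drop.
    """
    if outage_minutes < repair_time_minutes:
        # Server returned in time: disks are resynced, nothing is rebalanced.
        return ["resync"]
    return [
        "drop disks of the failgroup",
        "rebalance onto surviving failgroups",
        "second rebalance when the server returns",
    ]


# A 30-minute reboot versus a 6-hour outage, with the default 216 minutes.
print(storage_server_outage_actions(30, 216))
print(storage_server_outage_actions(360, 216))
```

The practical consequence is that a long outage costs two rebalances, whereas a short one costs only a resync.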

Page 25: A Deep Dive into ASM Redundancy in Exadata

Second Failure / Bad Chance

In Normal Redundancy:

What happens on a second failure depends first on when it occurs.

- If it occurs after the rebalance/sync has completed, the procedure is the same as for the first failure.

- If it occurs before the rebalance/sync has completed, the outcome depends on which disk fails.

  - If the first and second failed disks are not partner disks, a new rebalance takes place, if there’s enough space.

  - If the first and second failed disks are partner disks, data loss occurs.

• This is a small possibility but needs consideration.
• Partner disks are on different storage servers (failgroups).
• The first incident doesn’t have to be a failure; a storage server reboot has the same effect.

Exadata Database Machine : How to identify cell failgroups and Partner disks for a grid disk (Doc ID 1431697.1)
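The second-failure rules for normal redundancy can be written down as a small decision function. This is a sketch under stated assumptions: the partner map and the disk names below are hypothetical examples (real partner relationships have to be looked up as described in Doc ID 1431697.1), and the function names are invented for this illustration.

```python
def are_partners(partner_map, disk_a, disk_b):
    """True if either disk lists the other as a partner."""
    return disk_b in partner_map.get(disk_a, ()) or disk_a in partner_map.get(disk_b, ())


def second_failure_outcome(first_recovery_complete, partners, enough_free_space=True):
    """Outcome of a second failure in a normal redundancy diskgroup."""
    if first_recovery_complete:
        # Rebalance/sync finished: redundancy is restored, so treat it as a first failure.
        return "handled like a first failure"
    if partners:
        # Both copies of some extents may be gone.
        return "data loss"
    return "new rebalance" if enough_free_space else "no rebalance (not enough space)"


# Hypothetical partner map: disk name -> partner disks on other cells.
PARTNERS = {"DATA_CD_00_CELL01": {"DATA_CD_03_CELL02", "DATA_CD_07_CELL03"}}

hit = are_partners(PARTNERS, "DATA_CD_00_CELL01", "DATA_CD_03_CELL02")
print(second_failure_outcome(False, hit))
```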

Page 26: A Deep Dive into ASM Redundancy in Exadata

Second Failure / Bad Chance

In High Redundancy:

There are three copies of each extent.

So a second failure never causes data loss in High Redundancy.

Page 27: A Deep Dive into ASM Redundancy in Exadata

A New Feature

“MOUNT RESTRICTED FORCE FOR RECOVERY” feature

>= 11.2.0.4 BP16

>= 12.1.0.2 BP4

Applicable to NORMAL redundancy diskgroups only.

Potential use cases to which this procedure applies:

1. Exadata cell rolling upgrade/patching and a partner disk failure at the same time

2. Transient disk failure in a cell followed by a permanent partner disk failure before the first failed disk comes back online.

NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 “mount restricted force for recovery” feature - An Example

Page 28: A Deep Dive into ASM Redundancy in Exadata

Example Procedure

“MOUNT RESTRICTED FORCE FOR RECOVERY” example:

o Cell 1: CellCLI> alter cell shutdown services all;

o Cell 2: CellCLI> alter physicaldisk <disk> simulate failureType=failed;
  The database crashes.

o SQL> alter diskgroup datac1 mount restricted force for recovery;

o CellCLI> alter cell startup services all;

o SQL> alter diskgroup datac1 online disks in failgroup CELLFG1;

o Wait until the MODE_STATUS column in v$asm_disk for the disks being onlined changes from SYNCING to ONLINE.

o Do NOT execute the subsequent steps while the MODE_STATUS column shows SYNCING. Doing so will lead to data corruption.

o During the resync, because of the second disk failure, ARB0 will not be able to read some of the required extents (those on the failed second disk) and marks the missing extents with BADFDA7A.
  (arb0 trace file => WARNING: group 1, file 258, extent 100: filling extent with BADFDA7A during recovery)

o SQL> alter diskgroup datac1 dismount;
  SQL> alter diskgroup datac1 mount;

o Start the database and perform RMAN block media recovery.
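The "wait until MODE_STATUS is ONLINE" step is the one most easily rushed, so a scripted gate can help. The sketch below is only an outline: `fetch_mode_status` is a hypothetical callable that you would implement yourself (for example by querying v$asm_disk for the disks being onlined) and that returns a mapping of disk name to MODE_STATUS; the polling interval and timeout are arbitrary assumptions.

```python
import time


def wait_until_online(fetch_mode_status, poll_seconds=30, timeout_seconds=3600,
                      sleep=time.sleep):
    """Block until every monitored disk reports ONLINE; return seconds waited.

    fetch_mode_status is a hypothetical callable returning {disk_name: mode_status}.
    """
    waited = 0
    while True:
        statuses = fetch_mode_status()
        if not statuses:
            raise ValueError("no disks returned; refusing to continue")
        pending = sorted(d for d, s in statuses.items() if s != "ONLINE")
        if not pending:
            return waited  # safe to continue with dismount/mount
        if waited >= timeout_seconds:
            raise TimeoutError("still not ONLINE: " + ", ".join(pending))
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with canned statuses instead of a real query.
canned = iter([{"disk_1": "SYNCING"}, {"disk_1": "ONLINE"}])
print(wait_until_online(lambda: next(canned), poll_seconds=1, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.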

Page 29: A Deep Dive into ASM Redundancy in Exadata

What kind of Usable Space?

In an Exadata ASM diskgroup, we can distinguish the following disk space measures:

Total Raw Size (TRS)
Used Raw Size (URS)
Free Raw Size (FRS)

Total Allocatable Size (TAS) = TRS / Redundancy Factor
Used Allocatable Size (UAS) = URS / Redundancy Factor
Free Allocatable Size (FAS) = FRS / Redundancy Factor

Size Needed for Disk Failure Coverage (SNDFC) = Largest Disk (or 2 Disks for High Redundancy)
Size Needed for Cell Failure Coverage (SNCFC) = Largest Cell (or 2 Cells for High Redundancy)

Total Disk Failure Safe Allocatable Size = (TRS - SNDFC) / Redundancy Factor
Total Cell Failure Safe Allocatable Size = (TRS - SNCFC) / Redundancy Factor
Free Disk Failure Safe Allocatable Size = (FRS - SNDFC) / Redundancy Factor
Free Cell Failure Safe Allocatable Size = (FRS - SNCFC) / Redundancy Factor

Page 30: A Deep Dive into ASM Redundancy in Exadata

Calculations for Normal Redundancy

Measure                                          Normal Redundancy
Total Raw Size (TRS)                             360
Used Raw Size (URS)                              120
Free Raw Size (FRS)                              240
Total Allocatable Size (TAS)                     TRS / 2 = 180
Used Allocatable Size (UAS)                      URS / 2 = 60
Free Allocatable Size (FAS)                      FRS / 2 = 120
Size Needed for Disk Failure Coverage (SNDFC)    10
Size Needed for Cell Failure Coverage (SNCFC)    120
Total Disk Failure Safe Allocatable Size         (TRS - SNDFC) / 2 = 175
Total Cell Failure Safe Allocatable Size         (TRS - SNCFC) / 2 = 120
Free Disk Failure Safe Allocatable Size          (FRS - SNDFC) / 2 = 115
Free Cell Failure Safe Allocatable Size          (FRS - SNCFC) / 2 = 60

Page 31: A Deep Dive into ASM Redundancy in Exadata

Calculations for High Redundancy

Measure                                          Normal Redundancy          High Redundancy
Total Raw Size (TRS)                             360                        360
Used Raw Size (URS)                              120                        120
Free Raw Size (FRS)                              240                        240
Total Allocatable Size (TAS)                     TRS / 2 = 180              TRS / 3 = 120
Used Allocatable Size (UAS)                      URS / 2 = 60               URS / 3 = 40
Free Allocatable Size (FAS)                      FRS / 2 = 120              FRS / 3 = 80
Size Needed for Disk Failure Coverage (SNDFC)    10                         20
Size Needed for Cell Failure Coverage (SNCFC)    120                        240
Total Disk Failure Safe Allocatable Size         (TRS - SNDFC) / 2 = 175    (TRS - SNDFC) / 3 = 113.3
Total Cell Failure Safe Allocatable Size         (TRS - SNCFC) / 2 = 120    N/A for Quarter Rack
Free Disk Failure Safe Allocatable Size          (FRS - SNDFC) / 2 = 115    (FRS - SNDFC) / 3 = 73.3
Free Cell Failure Safe Allocatable Size          (FRS - SNCFC) / 2 = 60     N/A for Quarter Rack
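The calculations for both redundancy levels follow the same formulas and can be reproduced with a few lines of code. The sketch below uses the slide's own figures (TRS 360, URS 120, largest disk 10, largest cell 120). The function name, the dictionary keys and the `cell_coverage_applicable` flag are assumptions made for this illustration; the flag simply lets the caller mark the cell-failure values as not applicable, as the slide does for high redundancy on a quarter rack.

```python
def usable_space(trs, urs, largest_disk, largest_cell, redundancy_factor,
                 cell_coverage_applicable=True):
    """Compute the space measures for redundancy factor 2 (normal) or 3 (high)."""
    frs = trs - urs
    # Normal redundancy reserves room for 1 failure, high redundancy for 2.
    failures_covered = redundancy_factor - 1
    sndfc = largest_disk * failures_covered
    sncfc = largest_cell * failures_covered

    def cell_safe(raw):
        if not cell_coverage_applicable:
            return None  # reported as N/A
        return (raw - sncfc) / redundancy_factor

    return {
        "TAS": trs / redundancy_factor,
        "UAS": urs / redundancy_factor,
        "FAS": frs / redundancy_factor,
        "SNDFC": sndfc,
        "SNCFC": sncfc,
        "total_disk_safe": (trs - sndfc) / redundancy_factor,
        "total_cell_safe": cell_safe(trs),
        "free_disk_safe": (frs - sndfc) / redundancy_factor,
        "free_cell_safe": cell_safe(frs),
    }


normal = usable_space(360, 120, 10, 120, redundancy_factor=2)
high = usable_space(360, 120, 10, 120, redundancy_factor=3,
                    cell_coverage_applicable=False)
print(normal["free_disk_safe"], normal["free_cell_safe"])
print(round(high["free_disk_safe"], 1), high["free_cell_safe"])
```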

Page 32: A Deep Dive into ASM Redundancy in Exadata

What we have in ASMCMD

ASMCMD> lsdg
State    Type    Rebal  Sector  Block  AU       Total_MB  Free_MB   Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N      512     4096   4194304  27942912  16708892  9314304          3697294         0              N             DATAC1/
MOUNTED  NORMAL  N      512     4096   4194304  1038240   1036984   346080           345452          0              Y             DBFS_DG/
MOUNTED  NORMAL  N      512     4096   4194304  11973312  7966060   3991104          1987478         0              N             RECOC1/

Total_MB          Total Raw Size (TRS)
Free_MB           Free Raw Size (FRS)
Req_mir_free_MB   ≥11.2.0.4.9 & ≥12.1.0.2: Size Needed for Disk Failure Coverage (SNDFC)
                  <11.2.0.4.9 & <12.1.0.2: Size Needed for Cell Failure Coverage (SNCFC)
Usable_file_MB    ≥11.2.0.4.9 & ≥12.1.0.2: Free Disk Failure Safe Allocatable Size
                  <11.2.0.4.9 & <12.1.0.2: Free Cell Failure Safe Allocatable Size
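The relationship between the lsdg columns can be checked against the sample output on this slide. The sketch below parses the three rows and recomputes Usable_file_MB as (Free_MB - Req_mir_free_MB) / redundancy factor. The parser is a minimal illustration that assumes whitespace-separated columns exactly as shown here; the function names are invented, and other lsdg versions may print additional columns.

```python
LSDG_OUTPUT = """\
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 4096 4194304 27942912 16708892 9314304 3697294 0 N DATAC1/
MOUNTED NORMAL N 512 4096 4194304 1038240 1036984 346080 345452 0 Y DBFS_DG/
MOUNTED NORMAL N 512 4096 4194304 11973312 7966060 3991104 1987478 0 N RECOC1/
"""

REDUNDANCY_FACTOR = {"NORMAL": 2, "HIGH": 3}


def parse_lsdg(text):
    """Turn lsdg text into one dict per diskgroup, keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def expected_usable_file_mb(row):
    """(Free_MB - Req_mir_free_MB) / redundancy factor, in whole MB."""
    free_after_reserve = int(row["Free_MB"]) - int(row["Req_mir_free_MB"])
    return free_after_reserve // REDUNDANCY_FACTOR[row["Type"]]


for dg in parse_lsdg(LSDG_OUTPUT):
    matches = expected_usable_file_mb(dg) == int(dg["Usable_file_MB"])
    print(dg["Name"], expected_usable_file_mb(dg), matches)
```

For all three diskgroups the recomputed value matches the reported Usable_file_MB. In this sample, Req_mir_free_MB also equals Total_MB / 3 for each diskgroup, which would be consistent with the cell failure coverage interpretation on three equally sized cells.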

Page 33: A Deep Dive into ASM Redundancy in Exadata

References

Oracle Exadata Database Machine Maintenance Guide

Automatic Storage Management Administrator's Guide

NOTE:1484274.1 - Auto disk management feature in Exadata

NOTE:443835.1 - ASM Fast Mirror Resync - Example To Simulate Transient Disk Failure And Restore Disk

NOTE:1431697.1 - Exadata Database Machine : How to identify cell failgroups and Partner disks for a grid disk

NOTE:1968642.1 - Recover from diskgroup failure using the 12.1.0.2 “mount restricted force for recovery” feature - An Example

NOTE:1386147.1 - How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)

NOTE:1339373.1 - Operational Steps for Recovery after Losing a Disk Group in an Exadata Environment

NOTE:1551288.1 - Understanding ASM Capacity and Reservation of Free Space in Exadata

NOTE:1319567.1 - ASM Usable Space Calculations in Exadata Environment along with cell failure considerations