Storage Diagnostics and Troubleshooting Participant Guide

Global Education Services LSI Corporation

3rd edition (July 2008)

Table of Contents

Terms and Conditions

Storage Systems Diagnostics and Troubleshooting Course Outline

Module 1: Storage System Support Data Overview
  All Support Data Capture
  Major Event Log (MEL) Overview

Module 2: Storage System Analysis
  State Capture Data File
  Accessing the Controller Shell
  Logging In To the Controller Shell (06.xx)
  Logging In To the Controller Shell (07.xx)
  Controller Analysis
  Additional Output
  Knowledge Check
  Additional Commands
  Debug Queue
  Knowledge Check
  Modifying Controller States
  Diagnostic Data Capture (DDC)
  Knowledge Check

Module 3: Configuration Overview and Analysis
  Configuration Overview and Analysis
  Knowledge Check
  Drive and Volume State Management
  Volume Mappings Information
  Knowledge Check
  Portable Volume Groups in 07.xx
  RAID 6 Volumes in 07.xx
  Troubleshooting Multiple Drive Failures
  Offline Volume Groups
  Clearing the Configuration
  Recovering Lost Volumes
  Knowledge Check

Module 4: Fibre Channel Overview and Analysis
  Fibre Channel
  Fibre Channel Arbitrated Loop (FC-AL)
  Fibre Channel Arbitrated Loop (FC-AL) – The LIP
  Knowledge Check
  Drive Side Architecture Overview
  Knowledge Check
  Destination Driver Events
  Read Link Status (RLS) and Switch-on-a-Chip (SOC)
  What is SOC or SBOD?
  Field Case
  Drive Channel State Management
  SAS Backend

Appendix A: SANtricity Managed Storage Systems
  6998 /6994 /6091 (Front)
  6998 /6994 /6091 (Back)
  3992 (Back)
  3994 (Back)
  4600 16-Drive Enclosure (Back)
  4600 16-Drive Enclosure (Front)

Appendix B: Simplicity Managed Storage Systems
  1333
  1532
  1932
  SAS Drive Tray (Front)
  SAS Expansion Tray (Back)

Appendix C – State, Status, Flags (06.xx)
Appendix D – Chapter 2 – MEL Data Format
Appendix E – Chapter 30 – Data Field Types
Appendix F – Chapter 31 – RPC Function Numbers
Appendix G – Chapter 32 – SYMbol Return Codes
Appendix H – Chapter 5 – Host Sense Data
Appendix I – Chapter 11 – Sense Codes

Terms and Conditions

Agreement

This Educational Services and Products Terms and Conditions (“Agreement”) is between LSI Corporation (“LSI”), a Delaware corporation, doing business in AL, AZ, CA, CO, CT, DE, FL, GA, KS, IL, MA, MD, MN, NC, NH, NJ, NY, OH, OR, PA, SC, UT, TX, VA and WA as LSI Corporation, with a place of business at 1621 Barber Lane, Milpitas, California 95035 and you, the Student. By signing this Agreement, or clicking on the “Accept” button as appropriate, Student accepts all of the terms and conditions set forth below. LSI reserves the right to change or modify the terms and conditions of this Agreement at any time.

Course materials

The course materials are derived from end-user publications and engineering data related to LSI’s Engenio Storage Group (“ESG”) and reflect the latest information available at the time of printing, but will not include modifications if they occurred after the date of publication. In all cases, if there is a discrepancy between this information and official publications issued by LSI, LSI’s official publications shall take precedence. LSI assumes no obligation to correct any errors contained herein or to advise Student of such errors, and assumes no liability for the accuracy or correctness of the course materials provided to Student. LSI makes no commitment to update the course materials and LSI reserves the right to change the course materials, including any terms and conditions, from time to time at its sole discretion. LSI reserves the right to seek all available remedies for any illegal misuse of the course materials by Student.

Certification

Student acknowledges that purchasing or participating in an LSI course does not imply certification with respect to any LSI certification program. To obtain certification, Student must successfully complete all required elements in an applicable LSI certification program. LSI may update or change certification requirements at any time without notice.

Ownership

LSI and its affiliates retain all right, title and interest in and to the course materials, including all copyrights therein. LSI grants Student permission to use the course materials for personal, educational purposes only. The resale, reproduction, or distribution of the course materials, and the creation of derivative works based on the course materials, is prohibited without the prior express written permission of LSI. Nothing in this Agreement shall be construed as an assignment of any patents, copyrights, trademarks, or trade secret information or other intellectual property rights.

Testing

While Student is participating in a course, LSI may test Student's understanding of the subject matter. Furthermore, LSI may record the Student's participation in a course with videotape or other recording means. Student agrees that LSI is the owner of all such test results and recordings, and may use such test results and recordings subject to LSI's privacy policy.

Software license

All software utilized or distributed as course materials, or an element thereof, is licensed pursuant to the license agreement accompanying the software.

Indemnification

Student agrees to indemnify, defend and hold LSI, and all its officers, directors, agents, employees and affiliates, harmless from and against any and all third party claims for loss, damage, liability, and expense (including reasonable attorney's fees and costs) arising out of content submitted by Student, Student's use of course materials (except as expressly outlined herein), or Student's violations of any rights of another.

Disclaimer of warranties

THE COURSE MATERIALS (INCLUDING ANY SOFTWARE) ARE PROVIDED ON AN “AS IS” AND “AS AVAILABLE” BASIS, WITHOUT WARRANTY OF ANY KIND. LSI DOES NOT WARRANT THAT THE COURSE MATERIALS: WILL MEET STUDENT'S REQUIREMENTS; WILL BE UNINTERRUPTED, TIMELY, SECURE, OR ERROR-FREE; OR WILL PRODUCE RESULTS THAT ARE RELIABLE. LSI EXPRESSLY DISCLAIMS ALL WARRANTIES, WHETHER EXPRESS, IMPLIED OR STATUTORY, ORAL OR WRITTEN, WITH RESPECT TO THE COURSE MATERIALS, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE SAME. LSI EXPRESSLY DISCLAIMS ANY WARRANTY WITH RESPECT TO ANY TITLE OR NONINFRINGEMENT OF ANY THIRD-PARTY INTELLECTUAL PROPERTY RIGHTS, OR AS TO THE ABSENCE OF COMPETING CLAIMS, OR AS TO INTERFERENCE WITH STUDENT’S QUIET ENJOYMENT.

Limitation of liability

STUDENT AGREES THAT LSI SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES, INCLUDING BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, DATA OR OTHER SUCH LOSSES, ARISING OUT OF THE USE OR INABILITY TO USE THE COURSE MATERIALS, EVEN IF LSI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. LSI'S LIABILITY FOR DAMAGES TO STUDENT FOR ANY CAUSE WHATSOEVER, REGARDLESS OF THE FORM OF ANY CLAIM OR ACTION, SHALL NOT EXCEED THE AGGREGATE FEES PAID BY STUDENT FOR THE USE OF THE COURSE MATERIALS INVOLVED IN THE CLAIM.

Miscellaneous

Student agrees to not export or re-export the course materials without the appropriate United States and foreign government licenses, and shall otherwise comply with all applicable export laws. In the event that course materials in the form of software is acquired by or on behalf of a unit or agency of the United States government (the “Agency”), the Agency agrees that such software is comprised of “commercial computer software” and “commercial computer software documentation” as such terms are used in 48 C.F.R. 12.212 (Sept. 1995) and is provided to the Agency for evaluation or licensing (A) by or on behalf of civilian agencies, consistent with the policy set forth in 48 C.F.R. 12.212; or (B) by or on behalf of units of the Department of Defense, consistent with the policies set forth in 48 C.F.R. 227.7202-1 (June 1995) and 227.7203-3 (June 1995). This Agreement shall be governed by and construed in accordance with the laws of the State of California, without regard to its choice of law or conflict of law provisions. In the event of any conflict between foreign laws, rules and regulations and those of the United States, the laws, rules and regulations of the United States shall govern. In any action or proceeding to enforce the rights under this Agreement, the prevailing party shall be entitled to recover reasonable costs and attorneys' fees. In the event that any provision of this Agreement shall, in whole or in part, be determined to be invalid, unenforceable or void for any reason, such determination shall affect only the portion of such provision determined to be invalid, unenforceable or void, and shall not affect the remainder of such provision or any other provision of this Agreement. This Agreement constitutes the entire agreement between LSI and Student relating to the course materials and supersedes any prior agreements, whether written or oral, between the parties.

Trademark acknowledgments

Engenio, the Engenio design, HotScale™, SANtricity, and SANshare™ are trademarks or registered trademarks of LSI Corporation. All other brand and product names may be trademarks of their respective companies.

Copyright notice

© 2006, 2007, 2008 LSI Corporation. All rights reserved.

Agreement accepted by Student (Date):

Agreement not accepted by Student (Date):

Storage Systems Diagnostics and Troubleshooting Course Outline

Course Description:

Storage Systems Diagnostics and Troubleshooting is an advanced course that presents the technical aspects of diagnosing and troubleshooting LSI-based storage systems through advanced data analysis and in-depth troubleshooting. The basic objective of this course is to equip the participants with the essential concepts associated with troubleshooting and repairing LSI-based storage systems using SANtricity™ Storage Management software, analysis of support data, or controller shell commands. The information contained in the course is derived from internal engineering publications and is confidential to LSI Corporation. It reflects the latest information available at the time of printing but may not include modifications if they occurred after the date of publication.

Prerequisites:

Ideally, the successful student will have completed both the Installation and Configuration and the Support and Maintenance courses offered by Global Education Services at LSI Corporation.

However, an equivalent knowledge of storage management, installation, basic maintenance and problem determination with LSI-based storage systems can be substituted.

Students should have at least 6 months field exposure with LSI storage products and technologies in a support function.

Audience:

This course is designed for customer support personnel responsible for diagnosing and troubleshooting LSI storage systems through the use of support data analysis and controller shell access. The course is designed for individuals employed as Tier 3 support of LSI-based storage systems.

It is assumed that the student has in-depth experience and knowledge with Fibre Channel Storage Area Network (SAN) technologies including RAID, Fibre Channel topology, hardware components, installation, and configuration.

Course Length:

Approximately 4 days in length with 60% lecture and 40% hands-on lab.

Course Objectives

Upon completion of this course, the participant will be able to:

• Recognize the underlying behavior of LSI-based storage systems
• Analyze a storage system for failures through the analysis of support data
• Successfully analyze backend Fibre Channel errors
• Successfully interpret configuration errors

Course Modules

1. Storage System Support Data Analysis
2. Storage System Level Overview
3. Configuration Overview and Analysis
4. IO Driver and Drive Side Error Reporting and Analysis

Module 1: Storage System Support Data Overview

Upon completion, the participant should be able to do the following:

• Describe the purpose of the files that are included within the All Support Data Capture
• Analyze the Major Event Log at a high level in order to diagnose an event

Lab
• Gather the support data file
• Analyze a MEL event
• Diagram the events in a MEL that lead to an error

Module 2: Storage System Level Overview

Upon completion, the participant should be able to do the following:

• Log into the controller shell
• Identify and modify the controller states
• Recognize the battery function within the controllers
• Describe the network functionality
• List developer functions available within the controller shell commands

Lab
• Log into the controller shell

• Modify controller states

Module 3: Configuration Overview and Analysis

Upon completion, the participant should be able to do the following:

• Describe the difference between the legacy configuration structures and the new 07.xx firmware configuration database

• Analyze an array’s configuration from shell output and recognize any errors in the configuration

LAB

• Fix configuration errors on a live system

Module 4: IO Driver and Drive Side Error Reporting and Analysis

Upon completion, the participant should be able to do the following:

• Describe how Fibre Channel topology works
• Determine how Fibre Channel topology relates to the different protocols that LSI uses in its storage array products
• Analyze backend errors for problem determination and isolation

LAB

• Analyze backend data case studies

Module 1: Storage System Support Data Overview

Upon completion, the participant should be able to do the following:

• Describe the purpose of the files that are included within the All Support Data Capture

• Analyze the Major Event Log at a high level in order to diagnose an event

All Support Data Capture

• ZIP archive of useful debugging files
• Some files are for development use only and are not readable by support
• Typically the first item requested for new problem analysis

• Benefits
  – Provides a point-in-time snapshot of system status.
  – Contains all logs needed for a ‘first look’ at system failures.
  – Easy customer interface through the GUI.
  – Non-disruptive

• Drawbacks
  – Requires GUI accessibility.
  – Can take some time to gather on a large system.

All Support Data Capture

All Support Data Capture Files - 06.xx.xx.xx

• driveDiagnosticData.bin – Drive log information contained in a binary format.

• majorEventLog.txt – Major Event Log

• NVSRAMdata.txt – NVSRAM settings from both controllers

• objectBundle – Binary format file containing java object properties

• performanceStatistics.csv – Current performance statistics by volume

• persistentReservations.txt – Volumes with persistent reservations will be noted here

• readLinkStatus.csv – RLS diagnostic information in comma separated value format

• recoveryGuruProcedures.html – Recovery Guru procedures for all failures on the system

• recoveryProfile.csv – Log of all changes made to the configuration

• socStatistics.csv – SOC diagnostic information in comma separated value format

• stateCaptureData.dmp/txt – Informational shell commands run on both controllers

• storageArrayConfiguration.cfg – Saved configuration for use in the GUI script engine

• storageArrayProfile.txt – Storage array profile

• unreadableSectors.txt – Unreadable sectors will be noted here, noting the volume and drive LBA

All Support Data Capture Files - 07.xx.xx.xx

• Contains all the same files as the 06.xx.xx.xx releases but adds 3 new files:
  – Connections.txt – Lists the physical connections between expansion trays
  – ExpansionTrayLog.txt – ESM event log for each ESM in the expansion trays
  – featureBundle.txt – Lists all premium features and their status on the system

• Most useful files for first-look system analysis and troubleshooting:
  – stateCaptureData.dmp/txt
  – majorEventLog.txt
  – storageArrayProfile.txt
  – socStatistics.csv
  – readLinkStatus.csv
  – recoveryGuruProcedures.html

Major Event Log (MEL) Overview

Major Event Log Facts

• Array controllers log events and state transitions to an 8192 event circular buffer.
• Log is written to DACSTOR region of drives.
  – Log is permanent
  – Survives:
    • Power cycles
    • Controller swaps
• SANtricity can display log, sort by parameters and save to file.
• Only critical errors send SNMP traps and Email alerts

A Details Window from a MEL log (06.xx)

General Raw Data Categories (06.xx)

General Raw Data Categories (07.xx)

Byte Swapping

• Remember: when byte swapping, swap all of the bytes in the field

• NOTE: Do not swap the nibbles
  – e.g. the value is not “00 00 00 00 00 00 01 fa” (see the sketch below)
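A minimal sketch in Python of the rule above; the 8-byte raw value is hypothetical and is chosen so that the incorrect nibble swap produces the “…01 fa” value called out in the note:

    # Byte swapping a MEL raw-data field: reverse whole bytes, never the nibbles.
    # The raw bytes below are hypothetical, for illustration only.
    raw = ["af", "10", "00", "00", "00", "00", "00", "00"]    # bytes as they appear in the raw data

    byte_swapped = list(reversed(raw))                        # correct: swap whole bytes
    print(byte_swapped, hex(int("".join(byte_swapped), 16)))  # ends in '10', 'af' -> 0x10af

    nibble_swapped = [b[::-1] for b in reversed(raw)]         # WRONG: also swaps the hex digits in each byte
    print(nibble_swapped)                                     # ends in '01', 'fa' -> NOT the correct value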

Comparison of the Locations of the Summary Information and Raw Data (06.xx)

Quick View of the Locations Raw Data Fields (06.xx)

MELH – Signature
MEL version – 2 means 5.x code or 06.x code
Event Description – Includes: Event Group, Component, Internal Flags, Log Group & Priority
I/O Origin – refer to the MEL spec for the event type
Reporting Controller – 0=A, 1=B
Valid? – 0=Not valid, 1=Valid data
O1 – Number of Optional Data Fields
O2 – Total length of all of the Optional Data Fields in hex
F1 – Length of this optional data field
F2 – Data field type (a type with the 0x8000 bit set is a continuation of the previous optional data field; for example, 0x810d would be read as a continuation of the previous data field type 0x010d – see the example after this list)
F3 – The “cc” means drive side channel and the following value refers to the channel number and is 1 relative.
Sense Data – Vendor specific depending on the component type.
N/U – Not Used
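To make the continuation rule for F2 concrete, here is a minimal Python sketch that folds continuation entries back into the field they continue. The (type, data) pairs are hypothetical; the sketch assumes only what is stated above, namely that a type with the 0x8000 bit set continues the previous optional data field:

    # Merge MEL optional data fields; a field type with the 0x8000 bit set is a
    # continuation of the previous field (types and data below are hypothetical).
    fields = [
        (0x010d, bytes.fromhex("0102")),   # first optional data field
        (0x810d, bytes.fromhex("0304")),   # 0x8000 bit set: continuation of the field above
        (0x0107, bytes.fromhex("aa")),     # a new, unrelated field
    ]

    merged = []
    for ftype, data in fields:
        if (ftype & 0x8000) and merged:
            prev_type, prev_data = merged[-1]
            merged[-1] = (prev_type, prev_data + data)   # append continuation data to the previous field
        else:
            merged.append((ftype, data))

    for ftype, data in merged:
        print(hex(ftype), data.hex())       # -> 0x10d 01020304, 0x107 aa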

Comparison of the Locations of the Summary Information and Raw Data (07.xx)

Quick View of the Locations Raw Data Fields (07.xx)

Event Description – includes: Event Group, Component, Internal Flags, Log Group & Priority

Location – Decode based on the component type

Valid? – 0=Not valid, 1=Valid data

1. I/O Origin
2. Reserved
3. Controller reported by (0=A, 1=B)
4. Number of optional data fields present
5. Total length of optional data
6. Single optional field length
7. Data field type; data field types that begin with 0x8000 are a continuation of the previous data field of the same type

Sense Data – vendor specific depending on the component type.

MEL Summary Information

• Date/Time: Time of the event adjusted to the management station local clock

• Sequence number: Order that the event was written to the MEL

• Event type: Event code, check MEL Specification for a list of all event types

• Event category: Category of the event (Internal, Error, Command)

• Priority: Either informational or critical

• Description: Description of the event type

• Event specific codes: Information related to the event (if available)

• Component type: Component the event is associated with

• Component location: Physical location of the component the event is associated with

• Logged by: Controller which logged the event

Event Specific Codes

• Skey/ASC/ASCQ
  – Defined in Chapter 11 (06.xx) / Chapter 12 (07.xx) of the Software Interface Spec
  – AEN Posted events – Event 3101
  – Drive returned check condition events – Event 100a

• Return status/RPC function/null
  – Defined in Chapters 31 & 32 of the MEL Spec (06.16)
  – Controller return status/function call for requested operation events – Event 5023

Controller Return States

• Return status and RPC function call as defined in the MEL Specification

Event Specific Codes

• Return Status

0x01 = RETCODE_OK

• RPC Function Call

0x07 = createVolume_1()
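As an illustration, event-specific codes like the ones above can be decoded with a small lookup table. Only the two values shown on this page are included in the Python sketch below; a real decoder would use the full tables in Chapters 31 and 32 of the MEL Specification:

    # Decode the event-specific codes of a 5023-style event (tiny excerpt of the tables).
    RETURN_STATUS = {0x01: "RETCODE_OK"}
    RPC_FUNCTION  = {0x07: "createVolume_1()"}

    def decode_event_specific(return_status, rpc_function):
        return (RETURN_STATUS.get(return_status, "unknown - see MEL Spec Chapter 32"),
                RPC_FUNCTION.get(rpc_function, "unknown - see MEL Spec Chapter 31"))

    print(decode_event_specific(0x01, 0x07))    # ('RETCODE_OK', 'createVolume_1()')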

Event Specific Codes

• SenseKey /ASC /ASCQ

6/3f/80 = Drive no longer usable (The controller set the drive state to “Failed – Write Failure”)

AEN Posted for recently logged event (06.xx)

• Byte 14 = 0x7d (FRU)

• Bytes 26 & 27 = 0x02 & 0x05 (FRU Qualifiers)

• Values decoded using the Software Interface Specification Chapter 5 (6.xx)

• FRU Qualifiers are decoded depending on what the FRU value is

Sense Data (SIS Chapter 5)

• Byte 14 FRU = 0x7d – FRU is Drive Group (Devnum = 0x60000d)

• Byte 26 = 0x02

– Tray ID = 2

• Byte 27 = 0x05 – Slot = 5

AEN posted for recently logged event (06.xx)

• Byte 14 = 0x06 (FRU)

• Bytes 26 & 27 = 0xd5 & 0x69 (FRU Qualifiers)

• Values decoded using the Software Interface Specification Chapter 5 (6.xx)

• FRU Qualifiers are decoded depending on the FRU code

Sense Data (SIS Chapter 5)

• SenseKey / ASC / ASCQ

6/3f/c7 = Non Media Component Failure

• Byte 14 FRU = 0x06 – FRU is Subsystem Group

• Byte 26 = 0xd5 (1101 0101b)

  – Tray ID (lower bits) = 101 0101b = 0x55 = tray 85

• Byte 27 = 0x69 (011 01001b)

  – Device State (upper bits) = 0x3 = Missing
  – Device Type Identifier (lower bits) = 0x09 = Nonvolatile Cache
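A small Python sketch of this FRU-qualifier decode; the bit positions are inferred from the worked example above rather than quoted from the Software Interface Specification, so treat them as illustrative:

    # Decode the FRU qualifier bytes for FRU = 0x06 (Subsystem Group).
    # Bit positions are inferred from the 0xd5/0x69 example above, not from the SIS.
    DEVICE_STATE = {0x3: "Missing"}
    DEVICE_TYPE  = {0x09: "Nonvolatile Cache"}

    def decode_subsystem_fru_qualifiers(byte26, byte27):
        tray_id   = byte26 & 0x7F            # 0xd5 -> 0x55 = tray 85
        dev_state = (byte27 >> 5) & 0x07     # 0x69 -> 0x3
        dev_type  = byte27 & 0x1F            # 0x69 -> 0x09
        return tray_id, DEVICE_STATE.get(dev_state, hex(dev_state)), DEVICE_TYPE.get(dev_type, hex(dev_type))

    print(decode_subsystem_fru_qualifiers(0xd5, 0x69))    # (85, 'Missing', 'Nonvolatile Cache')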

Automatic Volume Transfer

• IO Origin field

  – 0x00 = Normal AVT
  – 0x01 = Forced AVT

• LUN field
  – Number of volumes being transferred
  – Will be 0x00 if it is a forced volume transfer

Mode Select Page 2C

• IOP ID field
  – Contains the Host Number that issued the Mode Select (referenced in the tditnall command output)

• Optional data is defined in the Software Interface Specification, section 6.15 (or 5.15)

Module 2: Storage System Analysis

Upon completion, the participant should be able to do the following:

• Log into the controller shell
• Identify and modify the controller states
• Recognize the battery function within the controllers
• Describe the network functionality
• List developer functions available within the controller shell commands

State Capture Data File

• Series of controller shell commands run against both controllers
• Different firmware levels run different sets of commands

• Some information still needs to be gathered manually

Amethyst/Chromium (06.16.xx,06.19.xx/06.23.xx)

The following commands are collected in the state capture for the Amethyst and Chromium releases:

  moduleList, spmShowMaps, fcAll
  arrayPrintSummary, spmShow, socShow
  cfgUnitList, getObjectGraph_MT, showEnclosures
  vdShow, ccmStateAnalyze, netCfgShow
  cfgUnitList i, showSdStatus
  cfgUnit, ionShow 99, dqprint
  ghsList, showEnclosuresPage81, printBatteryAge
  cfgPhyList, fcDump, dqlist

Chromium 2 State Capture Additions (06.60.xx.xx)

The release of Chromium 2 (06.60.xx.xx) introduced the following additional commands to the state capture dump:

  tditnall, luall, fcHosts 3
  iditnall, ionShow 12, svlShow
  fcnShow, excLogShow, getObjectGraph_MT 99*
  chall, ccmStateAnalyze 99**

* getObjectGraph_MT 99 replaced the individual getObjectGraph_MT calls used in previous releases

** ccmStateAnalyze 99 replaces the ccmStateAnalyze used in previous releases

Crystal (07.10.xx.xx)

The following commands are collected in the state capture for the Crystal release:

  evfShowOwnership, luall, hwLogShow
  rdacMgrShow, ionShow, spmShowMaps
  vdmShowDriveTrays, fcDump, spmShow
  vdmDrmShowHSDrives, fcAll 10, fcHosts
  evfShowVol, showSdStatus, getObjectGraph_MT
  vdmShowVGInfo, ionShow 99, ccmShowState
  bmgrShow, discreteLineTableShow, netCfgShow
  bidShow, ssmShowTree, inetstatShow
  tditnall, ssmDumpEncl, dqprint
  iditnall, socShow, dqlist
  fcnShow, showEnclosuresPage81, taskInfoAll

Accessing the Controller Shell

• Accessed via the RS-232 port on the communication module
• Default settings are 38,400 baud, 8-N-1, no flow control
• 06.xx firmware controllers allow access to the controller shell over the network via rlogin
• 07.xx firmware controllers allow access to the controller shell over the network via telnet
• Always capture your shell session using your terminal’s capturing functionality
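A minimal sketch, in Python with the pyserial package, of opening the serial console at the settings above and logging everything to a capture file; the device path and file name are placeholders that depend on the management station:

    # Open the controller serial console (38,400 baud, 8-N-1) and log the session.
    # Requires the pyserial package; '/dev/ttyS0' and the log name are placeholders.
    import serial

    port = serial.Serial("/dev/ttyS0", baudrate=38400,
                         bytesize=serial.EIGHTBITS, parity=serial.PARITY_NONE,
                         stopbits=serial.STOPBITS_ONE, timeout=1)

    with open("shell_capture.log", "ab") as capture:
        while True:                    # press Ctrl+C to stop capturing
            data = port.read(1024)     # whatever the controller prints
            if data:
                capture.write(data)    # keep a permanent record of the shell session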

Logging In To the Controller Shell (06.xx)

• If logging in serially, get a command prompt by sending a Break signal, followed by the Esc key when prompted.
  – Using rlogin you may be prompted for a login name; use “root”

• Enter the password when prompted: Infiniti

• Command prompt is a ‘right arrow’ ( -> )
• The shell allows the user to access controller firmware commands & routines directly

Logging In To the Controller Shell (07.xx)

• If logging in serially, get command prompt by sending Break signal, followed by Esc key when prompted.

– Otherwise shell access can be gained via the telnet protocol.

• You will be prompted for a login name; use “shellUsr”

• Enter the password when prompted: wy3oo&w4

• Command prompt is a ‘right arrow’ ( -> )

• The shell allows user to access controller firmware commands & routines directly.

Controller Analysis

• bidShow 255 (07.xx)

• Driver level information, similar to bmgrShow but for development use

getObjectGraph_MT / getObjectGraph_MT 99

• Prior to Chromium 2 (06.60.xx.xx), and in Crystal (07.xx) the getObjectGraph_MT command was used several times to collect the following:

• getObjectGraph_MT 1 – Controller Information
• getObjectGraph_MT 4 – Drive Information
• getObjectGraph_MT 8 – Component Status

• As of Chromium 2 (06.60.xx.xx) the state capture utilizes getObjectGraph_MT 99 which collects the entire object graph including controller, drive, component, and volume/configuration data.

• The object graph is actually used by the Storage Manager software to provide the visual representation of the current array status.

• The output of getObjectGraph_MT can be used to determine individual component status.

The downside of using the getObjectGraph_MT output is that it is somewhat complicated and cryptic; however, it can be very valuable in determining problems with the information being reported to the customer via Storage Manager.

Additional Output

Knowledge Check

Analyze the storageArrayProfile.txt file to find the following information:

Controller Firmware version:

Board ID:

Network IP Address Controller A:

Controller B:

Volume Ownership (by SSID) Controller A:

Controller B:

ESM Firmware Version:

Find the same information in the StateCaptureData.txt file. List what command was referenced to find the information.

Command Referenced

06.xx 07.xx

Controller Firmware version:

Board ID:

Network IP Address:

Volume Ownership (by SSID):

ESM Firmware Version:

Additional Commands

Debug Queue

• Used to log pertinent information about various firmware functions.
• Each core asset team can write to the debug queue.
• There is no standard for data written to the debug queue; each core asset team writes the information it feels is needed for debug.
• The debug queue output is becoming increasingly important for problem determination and root cause analysis.
• Because so much data is being written to the debug queue, it is important to gather it as soon as possible after the initial failure.
• Because there is no standard for the data written to the debug queue, it is necessary for multiple development teams to work in conjunction to analyze the debug queue.
• This makes it difficult to interpret from a support standpoint without development involvement.

Debug Queue Rules

• First check ‘dqlist’ to verify which trace contains events during the time of failure

• It is possible that there may not be a debug queue trace file that contains the timeline of the failure; in this case, no information can be gained

• First data capture is a must with the debug queue, as information is logged very quickly

• Even though a trace may be available for a certain timeframe, it is not a guarantee that further information can be gained about a failure event

Summary

• Look at the first / last timestamps and remember that they’re in GMT.

• Don’t just type ‘dqprint’ unless you actually want to flush and print the ‘trace’ trace file (the one we’re currently writing new debug queue data to). Only typing ‘dqprint’ can actually make you lose the useful data if you’re not paying attention.

• Keep in mind that the debug queue wasn’t designed for you to read, only for you to collect and someone in development to read.

• Remember, even LSI developers, when looking at debug queue traces, need to go back to the core asset team that actually wrote the code that printed specific debug queue data, in order to decode it.

Knowledge Check

What command would you run to gather the following information:

Detailed process listing:

Available controller memory:

Lock status:

There is no need to capture controller shell login sessions.
True    False

The Debug Queue should only be printed at development request.
True    False

The Debug Module is needed for access to all controller shell commands.
True    False

Modifying Controller States

• Controller states can be modified via the GUI to place a controller offline, in service mode, online, or to reset a controller

• These same functions can be achieved from the controller shell if GUI access is not available

• Commands that end in _MT use the SYMbol layer and require that the network be enabled, but do not require that the controller actually be on the network. The controller must also have completed Start Of Day

• The _MT commands are valid for both 06.xx and 07.xx firmware

• The legacy (06.xx and lower) commands are referenced in the ‘Troubleshooting and Technical Reference Guide Volume 1’ on page 27

• To transfer all volumes from the alternate controller and place the alternate controller in service mode:
  -> setControllerServiceMode_MT 1
  -> cmgrSetAltToServiceMode (07.xx only)

• While the controller is in service mode it is still powered on and is available for shell access. However, it is not available for host I/O, similar to a ‘passive’ mode.

• To transfer all volumes from the alternate controller and place the alternate controller offline:
  -> setControllerToFailed_MT 1
  -> cmgrSetAltToFailed (07.xx only)

• While the controller is offline it is powered off and is unavailable for shell access. It is not available for host I/O.

• To place the alternate controller back online from either an offline state or from in service mode:
  -> setControllerToOptimal_MT 1
  -> cmgrSetAltToOptimal (07.xx only)

• This will place the alternate controller back online and active; however, it will not automatically redistribute the volumes to the preferred controller

• In order to reset a controller:
  – Soft reset controller – Reboot
  – Reset controller with full POST
      sysReboot
      resetController_MT 0
  – Reset the alternate controller (06.xx)
      isp rdacMgrAltCtlReset
  – Reset the alternate controller
      altCtlReset 2
      resetController_MT 1

Diagnostic Data Capture (DDC)

Brief History

• Multiple ancient IO events in the field

• Need for better diagnostic capability

• Common infrastructure which can be used for many such events

What is DDC (Diagnostic Data Capture)?

• A mechanism to capture sufficient diagnostic information about the controller/array state at the time of an unusual event, and store the diagnostic data for later retrieval/transfer to LSI Development for further analysis

• Introduced in Yuma 1.2 (06.12.16.00)

• Part of Agate (06.15.23.00)

• All future releases

Unusual events triggering DDC (as of 07.xx)

• Ancient IO

• Master abort due to a bad address accessed by the Fibre Channel chip, resulting in a PCI error

• Destination device number registry corruption

• EDC Error returned by the disk drives

• Quiescence failure of volumes owned by the alternate controller

DDC Trigger

• MEL event gets logged whenever DDC logs are available in the system

• A system-wide Needs Attention condition is created for successful DDC capture

• Batteries
  – DDC logs are enabled if the system has batteries which are sufficiently charged
  – DDC logs triggered by ancient IO MAY survive without batteries, as ancient IO does not cause a hard reboot.

• No new DDC trigger if all of the following are true:
  – New event is of the same type as the previous
  – New trigger happens within 10 minutes of the previous trigger
  – Previous DDC logs have not been retrieved (DDC NA is set)

Persistency of DDC Information

• DDC info is persistent across a power cycle and controller reboot, provided the following is true:
  – System contains batteries which are sufficiently charged

DDC Logs format

• Binary

• Must be sent to LSI development to be analyzed

DDC CLI commands

• Commands to retrieve the DDC information

  – save storageArray diagnosticData file="<filename>.zip";

• Command to clear the DDC NA
  – reset storageArray diagnosticData;
  – The CLI calls this command internally when retrieval is successful
  – This can be called without any retrieval (just to clear the NA)
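For example, the same script commands can be issued from a management station through SMcli; the controller addresses and file name below are placeholders:

    SMcli 192.168.128.101 192.168.128.102 -c "save storageArray diagnosticData file=\"ddc.zip\";"
    SMcli 192.168.128.101 192.168.128.102 -c "reset storageArray diagnosticData;"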

DDC MEL Events

• MEL_EV_DDC_AVAILABLE

– Event # 6900 – Diagnostic data is available – Critical

• MEL_EV_DDC_RETRIEVE_STARTED
  – Event # 6901 – Diagnostic data retrieval operation started – Informational

• MEL_EV_DDC_RETRIEVE_COMPLETED
  – Event # 6902 – Diagnostic data retrieval operation completed – Informational

• MEL_EV_DDC_NEEDS_ATTENTION_CLEARED
  – Event # 6903 – Diagnostic data Needs Attention status cleared – Informational

Knowledge Check

1) A controller can only be placed offline via the controller shell interface.

True False

2) A controller in service mode is available for I/O.

True False

3) An offline controller is not available for shell access.

True False

4) DDC is to be collected and interpreted by support personnel.

True False

Module 3: Configuration Overview and Analysis

Upon completion, the participant should be able to do the following:

• Describe the difference between the legacy configuration structures and the new 07.xx firmware configuration database

• Analyze an array’s configuration from shell output and recognize any errors in the configuration

Configuration Overview and Analysis

• In 06.xx firmware, the storage array configuration was maintained as data structures resident in controller memory with pointers to related data structures

• The data structures were written to DACstore with physical references (devnums) instead of memory pointer references

• Drawbacks of this design are that the physical references used in DACstore (devnums) could change, which could cause a configuration error when the controllers are reading the configuration information from DACstore

• As of 07.xx the storage array configuration has been changed to a database design
• The benefits are as follows:
  – A single configuration database that is stored on every drive in a storage array
  – Configuration changes are made in a transactional manner – i.e. updates are either made in their entirety or not at all
  – Provides support for > 2TB volumes, increased partitions, increased host ports
  – Unlimited Global Hot Spares
  – More drives per volume group
  – Pieces can be failed on a drive as opposed to the entire drive

Configuration Overview and Analysis

What does this mean to support?

• Drive States and Volume States have changed slightly
• Shell commands have changed

– cfgPhyList, cfgUnitList, cfgSetDevOper, cfgFailDrive, etc

Configuration Overview and Analysis (06.xx)

• How is the configuration of an 06.xx storage array maintained?

• Each component of the configuration is maintained via data structures

  – Piece Structure
  – Drive Structure
  – Volume Structure

• Each structure contains a reference pointer to associated structures as well as information directly related to its component

• Pieces
  – Pieces are simply the slice of a disk that one volume is utilizing; there could be multiple pieces on a drive, but a piece can only reference one drive

• Piece Structures
  – Piece structures maintain the following configuration data
    • A pointer to the volume structure
    • A pointer to the drive structure
    • Devnum of the drive that the piece resides on
    • Spared devnum if a global hot spare has taken over
    • The piece’s state

• Drive Structures
  – Drive structures maintain the following configuration data
    • The drive’s devnum and tray/slot information
    • Blocksize, Capacity, Data area start and end
    • The drive’s state and status
    • The drive’s flags
    • The number of volumes resident on the drive (assuming it is assigned)
    • Pointers to all pieces that are resident on the drive (assuming it is assigned)

• Volume Structures
  – Volume structures maintain the following information
    • SSID number
    • RAID level
    • Capacity
    • Segment size
    • Volume state
    • Volume label
    • Current owner
    • Pointer to the first piece
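As an illustration of how these 06.xx structures point at one another, here is a minimal Python sketch; the field names are paraphrased from the lists above and are not the firmware's actual structure definitions:

    # Sketch of the 06.xx in-memory configuration structures (names paraphrased).
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Volume:
        ssid: int
        raid_level: int
        capacity_blocks: int
        state: str
        first_piece: Optional["Piece"] = None    # pointer to the first piece

    @dataclass
    class Drive:
        devnum: int                               # physical reference persisted in DACstore
        tray: int
        slot: int
        state: str
        pieces: List["Piece"] = field(default_factory=list)   # pointers to resident pieces

    @dataclass
    class Piece:
        volume: Volume                            # pointer to the volume structure
        drive: Drive                              # pointer to the drive structure
        devnum: int                               # devnum of the drive the piece resides on
        state: str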

06.xx configuration layout

Configuration Overview and Analysis (07.xx)

• How is the configuration of an 07.xx storage array maintained?

• Each component of the configuration is maintained via ‘records’ in the configuration database
  – Piece Records
  – Drive Records
  – RAID Volume Records
  – Volume Group Records

• Each record maintains a reference to its parent record and its own specific state info

• The “Virtual Disk Manager” (VDM) uses this information and facilitates the configuration and I/O behaviors of each volume group
  – VDM is the core module that consists of the drive manager, the piece manager, the volume manager, the volume group manager, and the exclusive operations manager

• Pieces

– Pieces may also be referenced as ‘Ordinals’. Just remember that piece == ordinal and ordinal == piece

• Piece Records
  – Piece records maintain the following configuration data
    • A reference to the RAID Volume Record
    • Update Timestamp of the piece record
    • The persisted ordinal (what piece number, in stripe order, is this record in the RAID Volume)
    • The piece’s state
  – Note that there is no reference to a drive record
  – The update timestamp is set when the piece is failed
  – The parent record for a piece is the RAID Volume record it belongs to

• Drive Records
  – Drive records maintain the following configuration data
    • The physical drive’s WWN
    • Blocksize, Capacity, Data area start and end
    • The drive’s accessibility, role, and availability states (more on this later)
    • The drive’s physical slot and enclosure WWN reference
    • The WWN of the volume group the drive belongs to (assuming it is assigned)
    • The drive’s ordinal in the volume group (its piece number)
    • Reasons for why a drive is marked incompatible, non-redundant, or marked as non-critical fault
    • Failure Reason
    • Offline Reason
  – Note that there is no reference to the piece record itself, only the ordinal value
  – The parent record for an assigned drive is the Volume Group record

• RAID Volume Records
  – RAID Volume records maintain the following configuration data
    • SSID
    • RAID level
    • Current path
    • Preferred path
    • Piece length
    • Offset
    • Volume state
    • Volume label
    • Segment size
  – Volume Records only refer back to their parent volume group record via the WWN of the volume group

• Volume Group Records
  – Volume Group records simply maintain the following
    • The WWN of the Volume Group
    • The Volume Group Label
    • The RAID Level
    • The current state of the Volume Group
    • The Volume Group sequence number
  – Note that the Volume Group record does not reference anything but itself
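For contrast with the 06.xx sketch above, a minimal Python sketch of the 07.xx record relationships; again the field names are paraphrased from the lists above, not taken from the firmware:

    # Sketch of the 07.xx configuration database records. Records reference their
    # parent by WWN or ordinal rather than by memory pointer (names paraphrased).
    from dataclasses import dataclass

    @dataclass
    class VolumeGroupRecord:
        wwn: str                      # internally set WWN; never changes once created
        label: str
        raid_level: int
        state: str
        sequence_number: int          # references nothing but itself

    @dataclass
    class RaidVolumeRecord:
        ssid: int
        raid_level: int
        state: str
        volume_group_wwn: str         # parent: the volume group it belongs to

    @dataclass
    class DriveRecord:
        drive_wwn: str                # physical drive WWN (a hard-set value)
        slot: int
        enclosure_wwn: str
        volume_group_wwn: str         # parent when assigned
        ordinal: int                  # the drive's piece number within the volume group

    @dataclass
    class PieceRecord:
        raid_volume_ref: str          # reference to the parent RAID Volume record (paraphrased)
        ordinal: int                  # persisted piece number, in stripe order
        state: str
        update_timestamp: int         # set when the piece is failed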

07.xx configuration layout

Configuration Overview and Analysis

• There are several advantages that may not be immediately obvious
  – The 06.xx configuration manager used devnums (which could change) and arbitrary memory locations (which change on every reboot)
  – 07.xx configuration uses hard-set values such as physical device WWNs, and internally set WWN values for RAID Volumes and Volume Groups, which will not change once created

• The configuration database is maintained on all drives in the storage array

• Provides for a more robust and reliable means of handling failure scenarios

Knowledge Check

1) Does 06.xx config use data structures or database records to maintain the configuration?

2) 07.xx config database is stored on every drive.
True    False

3) Shell commands to analyze the config did not change between 06.xx and 07.xx.
True    False

4) What are the 3 data structures used for 06.xx config?

5) What are the 4 database records used for 07.xx config?

Drive and Volume State Management


Volume State Management

Beginning with Crystal there are different classifications for volume group states

• Complete – All drives in a group are present

• Partially Complete – Drives are missing however redundancy is available to allow I/O operations to continue

• Incomplete – Drives are missing and there is not enough redundancy available to allow I/O operations to continue

• Missing – All drives in a volume group are inaccessible

• Exported – Volume group and associated volumes are offline as a result of a user initiated export (used in preparation for a drive migration)


Hot Spare Behavior

• Only valid for non-RAID 0 volumes and volume groups

• Not valid if any volumes in the volume group are dead

• A hot spare can spare for a failed drive or NotPresent drive that has failed pieces

• If an InUse hot spare drive fails and that failure causes any volumes in the volume group to transition to the failed state, then the failed InUse hot spare will remain integrated in the VG to provide the best chance of recovery

• If none of the volumes in the volume group are in the failed state, then the failed InUse hot spare is de-integrated from the volume group making it a “failed standby” hot spare and another optimal standby hot spare will be integrated

• If the failure occurred due to reconstruction (read error), then the InUse hot spare drive won’t be failed, but it will be de-integrated from the volume group. Integration will not be retried with another standby hot spare drive. Because this “read error” information is not persisted or held in memory, integration will be retried if the controller is ever rebooted or if an event occurs that would start integration.

• When copyback completes, the InUse hot spare drive is de-integrated from its group and is transitioned to a Standby Optimal hot spare drive.

• New hot spare features (07.xx)
   – An integrated hot spare can be made the permanent member of the volume group it is sparing in via a user action in SANtricity Storage Manager


Volume Mappings Information


Knowledge Check

1) For 07.xx list all of the possible:

Drive accessibility states:

Drive role states:

Drive availability states:

2) What command(s) would you reference in order to get a quick look at all volume states?
   06.xx:                      07.xx:

3) What command(s) would you reference in order to get a quick look at all drive states?
   06.xx:                      07.xx:


Portable Volume Groups in 07.xx

• Previously, drive migrations were performed via a system of checking NVSRAM bits, marking volume groups offline, removing drives, and finally carefully re-inserting drives into the receiving system one at a time and waiting for the group to be merged and brought online.

• This procedure is now gone and has been replaced by portable volume group functionality.

• Portable volume group functionality provides a means of safely removing and moving an entire drive group from one storage system to another

• Uses the model of “Exporting” and “Importing” the configuration on the associated disks

• “Exporting” a volume group performs the following

   o Volumes are removed from the current configuration and configuration database synchronization ceases

   o The Volume Group is placed in the “Export” state and the drives are marked offline and spun down

   o Drive references are removed once all drives in the “Exported” volume group are physically removed from the donor system

• Drives can now be moved to the receiving system

   o Once all drives are inserted into the receiving system, the volume group does not immediately come online

   o The user must specify that the configuration of the new disks be “Imported” to the current system configuration

   o Once “Imported”, the configuration data on the migrated group and the existing configuration on the receiving system are synchronized and the volume group is brought online


RAID 6 Volumes in 07.xx

• First we should get the “Marketing” stuff out of the way

   o RAID 6 is provided as a premium feature
   o RAID 6 will only be supported on the Winterpark (399x) platform due to controller hardware requirements

• XBB-2 (Which will release with Emerald 7.3x) will support RAID 6

   o RAID 6 Volume Groups can be migrated to systems that do not have RAID 6 enabled via a feature key, but only if the controller hardware supports RAID 6

      • The volume group that is migrated will continue to function, however a needs attention condition will be generated because the premium features will not be within limits

The Technical Bits

• LSI’s RAID 6 implementation is of a P+Q design
   o P is for parity, just like we’ve always had for RAID 5, and can be used to reconstruct data
   o Q is for the differential polynomial calculation which, when used with Gaussian elimination techniques, can also be used to reconstruct data
   o It’s probably easier to think of the “Q” as CRC data

• A RAID 6 Volume Group can survive up to two drive failures and maintain access to user data

• Minimum number of drives for a RAID 6 Volume Group is five drives, with a maximum of 30

• There is some additional capacity overhead due to the need to store both P and Q data (i.e. the capacity of two disks instead of one like in RAID 5) – see the worked example after this list

• Recovery from RAID 6 failures only requires slight modification of RAID 5 recovery procedures
   o Revive up to the third drive to fail
   o Reconstruct the first AND second drive to fail

• Reconstructions on RAID 6 volume groups will take about twice as long as a normal RAID 5 reconstruction
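As a rough illustration of the capacity overhead mentioned above (the drive count and size below are hypothetical, chosen only for easy arithmetic):

   A volume group of 10 drives, each with 300 GB of usable capacity
   RAID 5 usable capacity = (10 – 1) x 300 GB = 2700 GB   (one drive’s worth of capacity holds parity)
   RAID 6 usable capacity = (10 – 2) x 300 GB = 2400 GB   (P and Q together consume two drives’ worth of capacity)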


Troubleshooting Multiple Drive Failures

• When addressing a multiple drive failure, there are several key pieces of information that need to be determined prior to performing any state modifications.

• RAID Level
   o Is it a RAID 6?
      – RAID 6 volume group failures occur after 3 drives have failed in the volume group
   o Is it a RAID 3/5 or RAID 1?
      – RAID 5 volume group failures occur after two drives have failed in a volume group.
   o RAID 1 volume group failures occur when enough drives fail to cause an incomplete mirror.
      – This could be as few as two drives or half the drives + 1.
   o RAID 0 volume groups are dead upon the first drive failure

• Despite the drive failures, is each individual volume group configuration complete?
   – i.e. Are all drives accounted for, regardless of failed or optimal?

• How many drives have failed, and to which volume group does each drive belong?

• In what order did the drives fail in each individual volume group?

• Are there any global hot spares?
   o Are any of the hot spares in use?
   o Are there any hot spares not in use, and if so, are they in an optimal condition?

• Are there any backend errors that led to the initial drive failures?
   o This is the most common cause of multiple drive failures; all backend issues must be fixed or isolated before continuing any further

Multiple Drive Failures – Why RAID Level is Important

• RAID 6 Volume Groups
   o RAID 6 volume groups can survive 2 drive failures due to the P+Q redundancy model; after the third drive failure the volume group is marked as failed
   o Up until the third drive failure, data in the stripe is consistent across the drives


• RAID 5 and RAID 3 Volume Groups
   o After the second drive failure the volume group and associated volumes are marked as failed; no I/Os have been accepted since the second drive failed
   o Up until the second drive failure, data in the stripe is consistent across the drives

• RAID 1 Volume Groups
   o RAID 1 volume groups can survive multiple drive failures as long as one side of the mirror is still optimal
   o RAID 1 volume groups can be failed after only two drives fail if both the data drive and the mirror drive fail
   o Until the mirror becomes incomplete the RAID 1 pairs will function normally

• RAID 0
   o As there is no redundancy these arrays cannot generally be recovered. However, the drives can be revived and checked – no guarantees can be made that the data will be recovered.

Multiple Drive Failures – Configuration Considerations

• Although there are several mechanisms to ensure configuration integrity there are failure scenarios that may result in configuration corruption

• If the failed volume group’s configuration is incomplete, reviving and reconstructing drives could permanently corrupt user data

• If any of the drives have an ‘offline’ status (06.xx), reviving drives could revert them to an unassigned state

• How can this be avoided?
   o Check to see if the customer has an old profile that shows the appropriate configuration for the failed volume group(s)
   o If the volume group configuration appears to be incomplete, corrupted, or if there is any doubt – escalate immediately


Multiple Drive Failures – How Many Drives?

• Assuming the volume group configuration is complete and all drives are accounted for, you need to determine how many drives are failed

• Make a list of the failed drives in each failed volume group

• Using the output of ionShow 12, determine whether or not these drives are in an open state (a sketch of this check follows below)

o If the drives are in a closed state they will be inaccessible and attempts to spin up, revive, or reconstruct will likely fail
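A minimal sketch of this check from the controller shell (the devnums you look for come from your list of failed drives):

   -> ionShow 12
      (scan the output for each failed drive and confirm it is reported in an open state;
       drives reported closed will be inaccessible, and spin-up, revive, or reconstruct attempts will likely fail)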

Multiple Drive Failures – What’s the failure order?

• Failure order is important for RAID 6, RAID 3/5, and RAID 1 volume group failures.

• Determining the failure order is just as important as determining the status of the failed volume group’s configuration

• Failure order should be determined from multiple data points
   o The Major Event Log (MEL)
   o Timestamp information from the drive’s DACstore (06.xx)
   o Timestamp information from the failed piece (07.xx)

• Oftentimes, failures occur close together and will show up either at the same timestamp or within seconds of each other in the MEL

Multiple Drive Failures – What’s the failure order? (06.xx)

• In order to obtain information from DACstore the drive must be spun up

   isp cfgPrepareDrive,0x<phydev>

   Note: this is the only command that uses the “phydev” address, not the devnum address

• This command will spin the drive up, but not place it back in service. It will still be listed as failed by the controller. However, since it is spun up, it will service direct disk reads of the DACstore region necessary for the following commands.
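A minimal controller shell sketch (the phydev value shown is hypothetical – use the address reported for the failed drive in question):

   -> isp cfgPrepareDrive,0x10018
      (the drive spins up but remains listed as failed; its DACstore timestamps can now be read)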


I’ve got my failure order, what’s next?

• Using the information on the previous slides, you should now have determined the failure order of the drives.

• Special considerations need to be made depending on the RAID level of the failed volume group

   o For RAID 6 volume groups, the most important piece of information is the first two drives that failed

   o For RAID 5 volume groups, the most important piece of information is the first drive that failed

   o For RAID 1 volume groups, the most important piece of information is the first drive that failed causing the mirror to break.

• Before making any modifications to the failed drives, any unused global hot spares should be failed to prevent them sparing for drives unnecessarily.

   o To fail the hot spares
      – Determine which unused hot spares are to be failed
      – From the GUI
         • Select the drive
         • From the Advanced menu select Recovery >> Fail Drive
      – From the controller shell
         • Determine the devnums of the hot spares that are to be failed
         • Using the devnum enter
            – isp cfgFailDrive,0x<devnum> (06.xx)
            – setDriveToFailed_MT 0x<devnum> (06.xx & 07.xx)
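For example, on a 07.xx system a single unused standby hot spare could be failed from the shell as follows (the devnum is hypothetical – substitute the value you identified above):

   -> setDriveToFailed_MT 0x2000c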


Reviving Drives

• Begin with the last drive that failed and revive drives until the volume group becomes degraded

• From the GUI
   o Select the last drive to fail and from the Advanced menu select Recovery >> Revive >> Drive
   o Check to see if the volume group is degraded; if not, move on to the next drive (Last -> First) and revive it. Repeat this step until the volume group is degraded
   o The volume group and associated volumes should now be in a degraded state.

• From the controller shell (a sketch of this loop follows this list)

   o Using the devnum of the drive perform the following
      • isp cfgSetDevOper,0x<devnum> (06.xx)
      • setDriveToOptimal_MT 0x<devnum> (06.xx & 07.xx)

   o Check to see if the volume group is degraded; if not, move on to the next drive (Last -> First) and revive it. Repeat this step until the volume group is degraded

   o The volume group and associated volumes should now be in a degraded state

• Mount volumes in read-only (if possible) and verify data
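A minimal sketch of the revive loop on a 07.xx controller shell (both devnums are hypothetical – start with the LAST drive to fail and work backwards):

   -> setDriveToOptimal_MT 0x20011
      (check the volume group state; if it is still failed, revive the next-most-recent drive to fail)
   -> setDriveToOptimal_MT 0x2000f
      (stop as soon as the volume group reports degraded)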


Cleanup

• If data checks out, reconstruct the remaining failed drives; replace drives as warranted

   – From the GUI
      • Select the drive
      • From the Advanced menu select Recovery >> Reconstruct Drive

   – From the controller shell
      • Using the devnum of the drive perform the following
         – isp cfgReplaceDrive,0x<devnum> (06.xx)
         – startDriveReconstruction_MT 0x<devnum> (06.xx & 07.xx)

• Once reconstructions have begun, the previously failed hot spares can be revived (a combined shell sketch follows this list)

   – From the GUI
      • Select the last drive to fail
      • From the Advanced menu select Recovery >> Revive >> Drive

   – From the controller shell
      • Using the devnum of the drive perform the following
         – isp cfgSetDevOper,0x<devnum> (06.xx)
         – setDriveToOptimal_MT 0x<devnum> (06.xx & 07.xx)
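Putting the cleanup steps together, a hedged 07.xx shell sketch might look like this (both devnums are hypothetical):

   -> startDriveReconstruction_MT 0x2000d
      (reconstruct a remaining failed drive in the now-degraded volume group)
   -> setDriveToOptimal_MT 0x2000c
      (revive the standby hot spare that was deliberately failed earlier)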


Multiple Drive Failures – A Few Final Notes

• If there is any doubt about the failure order or the array configuration, or you are simply not confident – find a senior team member to consult with prior to taking any action.
   – Beyond this you can ALWAYS escalate

• You are dealing with a customer’s data; be mindful of this at all times.

   – Think about what you are doing; establish a plan based on high level facts
   – Take your time
   – Write down the information as you review the data
   – If something doesn’t look right, ask a co-worker or escalate

• RAID 0 Volume Groups

   – Revive the drives, check the data.
   – There is no guarantee that data will be recovered, and depending on the nature of the drive failure the array may not stay optimal long enough to use the data.

• If there are multiple drive failures, there is a chance that a backend problem is at fault
   – DO NOT PULL AND RESEAT DRIVES
   – Every attempt should be made to resolve any backend issues prior to changing drive states.
   – Get the failure order information, address the backend issue, spin up drives and restore access.


Offline Volume Groups

Offline Volume Groups (06.xx)

• As a protection mechanism in 06.xx configuration manager, if all members (drives) of a volume group are not present during start of day, the controller will mark the associated volume group offline until all members are available

• This behavior can cause situations where a volume group is left in an offline status with all drives present, or with one drive listed as out of service


Offline Volume Groups (06.xx)

• IMPORTANT: If a group is offline, it is unavailable for configuration changes. That means that if any drives in the associated volume group are failed and revived, they will not be configured into the volume group, but will transition to an unassigned state instead

• In order to bring a volume group online through the controller shell with no pieces out of service, or only one piece out of service

   – isp cfgMarkNonOptimalDriveGroupOnline,<SSID>

      • Where ‘SSID’ is any volume in the group; this only needs to be run once against any volume in the group
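A minimal sketch from the 06.xx controller shell (the SSID value 2 is hypothetical – use the SSID of any volume in the offline group):

   -> isp cfgMarkNonOptimalDriveGroupOnline,2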

Offline Volume Groups (07.xx)

• Because 07.xx firmware does not implement this functionality, it is not expected that this will be a concern for 07.xx systems

• Volume Groups that do not have all members (drives) present during start of day will transition to their appropriate state
   – Partially Complete – Degraded
   – Incomplete – Dead
   – Missing

• Even though the group is listed as degraded or dead, it is possible that all volumes will still be in an optimal state since no pieces are marked as out of service


Clearing the Configuration

• In extreme situations it may be necessary to clear the configuration from the system

• This can be accomplished by either clearing the configuration information from the appropriate region in DACstore or by completely wiping DACstore from the drives and rebuilding it during start of day

• The configuration can be reset via the GUI

   – Advanced >> Recovery >> Reset >> Configuration (06.xx)
   – Advanced >> Recovery >> Clear Configuration >> Storage Array (07.xx)

• To wipe the configuration information

   – sysWipe
      • This command must be run on both controllers.
      • For 06.xx systems, the controllers must be rebooted once the command has completed.
      • As of 07.xx the controllers will reboot automatically once the command has completed

• To wipe DACstore from all drives

   – sysWipeZero 1 (06.xx)
   – dsmWipeAll (07.xx)

• After either of these commands, the controllers must be rebooted in order to write new DACstore to all the drives

• To wipe DACstore from a single drive

   – isp cfgWipe1,0x<devnum> (06.xx)
      • Either the controllers must be rebooted in order to write new DACstore to the drive, or it must be (re)inserted into a system

   – dsmWipe 0x<devnum>,<writeNewDacstore> (07.xx)


• Where <writeNewDacstore> is either a 0 to not write new DACstore until start of day or the drive is (re)inserted into a system, or a 1 to write new clean DACstore once it has been cleared
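For example, wiping DACstore from one drive on a 07.xx system and writing clean DACstore back immediately might look like the following (the devnum is hypothetical):

   -> dsmWipe 0x20015,1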

• There are times where the Feature Enable Identifier key becomes corrupt. In order to clear it and generate a new Feature Enable Identifier, use the following command.

   • safeSysWipe (06.xx and 07.xx)

   • For 07.xx systems, you must also remove the safe header from the database
      • dbmRemoveSubRecordType 18 (07.xx)

Note: This is a very dangerous command as it wipes out a record in the database – make sure you type “18” and not another number

• Once this has been completed on both controllers, they will need to both be rebooted in order to generate a new ID.

• All premium feature keys will need to be regenerated with the new ID and reapplied.
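A hedged sketch of the sequence on one 07.xx controller (repeat on the alternate controller, then reboot both to generate the new identifier):

   -> safeSysWipe
   -> dbmRemoveSubRecordType 18
      (type the record number carefully – 18 and nothing else)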

Recovering Lost Volumes

• There are times that volumes are lost and need to be recovered, either due to a configuration problem with the storage array, or because the customer simply deleted the wrong volume

• Multiple pieces of information must be known about the missing volume in order to ensure data recovery
   – Drives and Piece Order of the drives in the missing volume group
   – Capacity of each volume in the volume group
   – Disk offset where each volume starts
   – Segment Size of the volumes
   – RAID level of the group
   – Last known state of the drives

• This information can be obtained from historical capture-all-support-data files relatively easily

• Finding Drive and Piece order
   – Old Profile in the ‘Volume Group’ section
   – vdShow or cfgUnit output in the stateCaptureData.dmp file (06.xx)
   – evfShowVol output in the stateCaptureData.txt file (07.xx)


• Finding Capacity, Offset, RAID level, and Segment size

   – vdShow or cfgUnit output in the stateCaptureData.dmp file (06.xx)
   – evfShowVol output in the stateCaptureData.txt file (07.xx)

• The last known state of the drives is a special case: where a drive was previously failed in a volume prior to the deletion of the volume, it must be failed again after the recreation of the volume in order to maintain consistent data/parity

• SMcli command to recreate a volume without initializing data on the volume

– recover volume (drive=(trayID,slotID) | drives=(trayID1,slotID1 ... trayIDn,slotIDn) | volumeGroup=volumeGroupNumber) userLabel="volumeName" capacity=volumeCapacity offset=offsetValue raidLevel=(0 | 1 | 3 | 5 | 6) segmentSize=segmentSizeValue [owner=(a | b) cacheReadPrefetch=(TRUE | FALSE)]
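As a hedged illustration of how a filled-in invocation might look (every value below – tray/slot IDs, label, capacity, offset, RAID level, segment size, and owner – is hypothetical and must come from the old profile or state capture data for the lost volume):

   recover volume drives=(1,1 1,2 1,3 1,4 1,5) userLabel="DB_Vol_3" capacity=107374182400 offset=0 raidLevel=5 segmentSize=128 owner=a cacheReadPrefetch=TRUE;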

• This command is discussed in the EMW help in further detail

– Help >> Contents >> Command Reference Table of Contents >> Commands Listed by Function >> Volume Commands >> Recover RAID Volume

• When specifying the capacity, specify it in bytes for a better chance of data recovery; if it is entered in gigabytes, there could be some rounding discrepancies in the outcome

• A lost volume can be created using this method as many times as necessary until the data is recovered as long as there are no writes that take place to the volume when it is recreated improperly

• NEVER use this method to create a brand new volume that contains no data. Doing so will cause data corruption upon degradation, since the volume was never initialized during creation.

• If creating volumes using the GUI instead of the ‘recover volume’ CLI command, steps must first be taken in the controller shell in order to prevent initialization

• There is a flag in the controller shell that defines whether or not to initialize the data region of the drives upon new volume creations

– writeZerosFlag


Recovering Lost Volumes – Setup

(Note: in the following examples, the original slides use red for what to type, black for output, and blue for pressing the <enter> key)

-> writeZerosFlag
value = 0 = 0x0
-> writeZerosFlag=1
-> writeZerosFlag
value = 1 = 0x1
-> VKI_EDIT_OPTIONS
EDIT APPLICATION SCRIPTS (disabled)
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
i <enter>
Enter statement to insert (exit insert mode with newline only):
writeZerosFlag=1 <enter>
EDIT APPLICATION SCRIPTS (disabled)
1) writeZerosFlag=1
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
+ <enter>
EDIT APPLICATION SCRIPTS (enabled)
1) writeZerosFlag=1
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
q <enter>
Commit changes to NVSRAM (y/n) y <enter>
value = 12589824 = 0xc01b00
->


Recovering Lost Volumes

• A lost volume can be created using this method as many times as necessary until the data is recovered, as long as there are no writes that take place to the volume when it is recreated improperly

• NEVER use this method to create a brand new volume that contains no data. Doing so will cause data corruption upon degradation, since the volume was never initialized during creation

• Always verify, once the volume has been recreated, that the system has been cleaned up from all changes made during the volume recreation process

Recovering Lost Volumes – Cleanup

-> writeZerosFlag
value = 1 = 0x1
-> writeZerosFlag=0
-> writeZerosFlag
value = 0 = 0x0
-> VKI_EDIT_OPTIONS
EDIT APPLICATION SCRIPTS (enabled)
1) writeZerosFlag=1
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
c <enter>
Clear all options? (y/n) y <enter>
EDIT APPLICATION SCRIPTS (enabled)
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
- <enter>
EDIT APPLICATION SCRIPTS (disabled)
Enter ‘I’ to insert statement; ‘D’ to delete statement; ‘C’ to clear all options; + to enable debug options; ‘Q’ to quit
q <enter>
Commit changes to NVSRAM (y/n) y <enter>
value = 12589824 = 0xc01b00
->


Recovering Lost Volumes – IMPORTANT

• IMPORTANT: do not attempt to recover lost volumes without development help.

Since this deals with customer data, it is a very sensitive matter


Knowledge Check

1) 06.xx – List the process required to determine the drive failure order for a volume group.

2) 07.xx – List the process required to determine the drive failure order for a volume group.

3) Clearing the configuration is a normal troubleshooting technique that will be used frequently.
   True    False

4) Recovering a lost volume is a simple process that should be done without needing to take much into consideration.
   True    False


Module 4: Fibre Channel Overview and Analysis

Upon completion of this module, you should be able to do the following:

• Describe how fibre channel topology works

• Determine how fibre channel topology relates to the different protocols that LSI uses in its storage array products

• Analyze backend errors for problem determination and isolation


Fibre Channel

• Fibre Channel is a transport protocol
   – Used with upper layer protocols such as SCSI, IP, and ATM

• Provides a maximum of 127 ports in an FC-AL environment
   – This is the limiting factor in the number of expansion drive trays that can be used on a loop pair

Fibre Channel Arbitrated Loop (FC-AL)

• Devices are connected in a ‘one way’ loop or ring topology
   – Can either be physically connected in a ring fashion or using a hub

• Bandwidth is shared among all devices on the loop

• Arbitration is required for one port (the ‘initiator’) to communicate with another (the ‘target’)


Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Prior to beginning I/O operations on any drive channel, a Loop Initialization (LIP) must occur.
   – This must be done to address devices (ports) on the channel with an ALPA (Arbitrated Loop Physical Address) and build the loop positional map

• A 128-bit (four word) map is passed around the loop by the loop master (the controller)
   – Each offset in the map corresponds to an ALPA and has a state of either 0 for unclaimed or 1 for claimed

• There are two steps in the LIP that we will skip

   – LISM – Loop Initialization Select Master
      • The “Loop Master” is determined
      • The “Loop Master” assumes the lowest ALPA (0x01)
      • The “A” controller is always the loop master (under optimal conditions)

   – LIFA – Loop Initialization Fabric Address
      • Fabric Assigned addresses are determined
      • Occurs on HOST side connections

• The three steps we will be looking at are
   – LIPA – Loop Initialization Previous Address
   – LIHA – Loop Initialization Hard Address
   – LISA – Loop Initialization Soft Address

• The LIP process is the same regardless of drive trays attached (JBOD & SBOD)


Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LIPA Phase

   – The Loop Master sends the loop map out and designates it as the LIPA phase in the header of the frame
   – The loop map is passed from device to device in order
   – If a device’s port was previously logged in to the loop it will attempt to assume its previous address by setting the appropriate offset in the map to ‘1’
   – If a device was not previously addressed it will pass the frame on to the next device in the loop

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LIHA Phase
   – Once the LIPA phase is complete, the loop master will send the loop map out again, however specifying this as the LIHA phase in the header of the frame
   – The loop map is once again passed from device to device in the loop
   – Each device will check its hard address against the loop map
   – If the offset of the loop map that corresponds to the device’s hard address is available (set to 0), it will set that bit to 1, assuming the corresponding ALPA, and pass the loop map on to the next device
   – If the hard address is not available it will pass the loop map on and await the LISA stage of initialization
   – Devices that assumed an ALPA in the LIPA phase will simply pass the map on to the next device


Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• How are hard addresses determined?
   – Hard Addresses are determined by the ‘ones’ digit of the drive tray ID and the slot position of the device in the drive tray
   – Controllers are set via hardware to always assume the same hard IDs to ensure that they assume the lower two ALPA addresses in the loop map (0x01 for “A” and 0x02 for “B”)

• What is the benefit?
   – By using hard addressing on devices a LIP can be completed quickly and non-disruptively
   – LIPs can occur for a variety of reasons – loss of communication/synchronization, new devices joining the loop (hot adding drives and ESMs)
   – I/Os that were in progress when the LIP occurred can be recovered quickly without the need for lengthy timeouts and retries

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LISA Phase
   – Once the LIHA phase has completed, the loop master will send the loop map out again, now designating it as the LISA phase in the frame header
   – Devices that had not assumed an ALPA on the loop map in the LIPA and LIHA phases of initialization will now take the first available ALPA in the loop map
      • If no ALPA is available the device will be ‘non-participating’ and will not be addressable on the loop
   – When the LISA phase is received again by the loop master, it will check the frame header for a specific value that indicates that LISA has completed


Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Once LISA has completed, the loop master will distribute the loop map again and each device will enter its hex ALPA in the order that it is received
   – This is referred to as the LIRP (Loop Initialization Report Position) phase

• The loop master will distribute the completed loop map to all devices to inform them of their relative position in the loop to the loop master
   – This is referred to as the LILP (Loop Initialization Loop Position) phase

• The loop master ends the LIP by transmitting a CLS (Close) frame to all devices on the loop, placing them in monitor mode

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Hard Address Contention
   – Hard address contention occurs when a device is unable to assume the ALPA that corresponds to its hard address, and can be caused by
      • The ‘ones’ digit of the tray ID not being unique among the drive trays on a given loop
      • A hardware problem that results in the device reading the incorrect hard address, or the device simply reporting the wrong address during the LIP
   – Hard address contention will result in devices taking soft addresses during the LIP

• ALPA Map Corruption
   – A bad device on the loop will corrupt the ALPA map, resulting in devices not assuming the correct address or not participating in the loop

• The net of these conditions is that LIPs become a disruptive process that can have adverse effects on the operation of the loop


Fibre Channel Arbitrated Loop (FC-AL) – Communication

• Each port has what is referred to as a Loop Port State Machine (LPSM) that is used to define the behavior when it requires access or use of the loop

• While the loop is idle, the LPSM will be in MONITOR mode and transmitting IDLE frames

• In order for one device to communicate with another, arbitration must be performed
   – An ARB frame will be passed along the loop from the initiating device to the target device
   – If the ARB frame is received and contains the ALPA of the initiating device, it will transition from MONITOR to ARB_WON
   – An OPN (Open) frame will be sent to the device that it wishes to open communication with
   – Data is transferred between the two devices
   – CLS (Close) is sent and the device ports return to the MONITOR state


Knowledge Check

1) The Fibre Channel protocol does not have very much overhead for login and communication.

True False

2) Soft addressing should not cause a problem in an optimal system.

   True    False

3) List all the LIP phases:


Drive Side Architecture Overview


SCSI Architecture Model Terminology

• nexus: A relationship between two SCSI devices, and the SCSI initiator port and SCSI target port objects within those SCSI devices.

• I_T nexus: A nexus between a SCSI initiator port and a SCSI target port.

• logical unit: A SCSI target device object, containing a device server and task manager, that implements a device model and manages tasks to process commands sent by an application client.


Role column

FCdr – Fibre Channel drive
SATAdr – SATA drive
SASdr – SAS drive

ORP columns indicate the overall state of the lu for disk device types (normally should be “+++”).

O = Operation – the state of the ITN currently chosen
   +) chosen itn is not degraded
   d) chosen itn is degraded

R = Redundancy – the state of the redundant ITN
   +) alternate itn is up
   d) alternate itn is degraded
   -) alternate itn is down
   x) there is no alternate itn

P = Performance – Are we using the preferred path?
   +) chosen itn is preferred


   -) chosen itn is not preferred
    ) no itn preferences

The Channels column indicates the state of the itn on that channel for its lu.
   *) up & chosen
   +) up & not chosen
   D) degraded & chosen
   D) degraded & not chosen
   -) down
   x) not present


Fibre Channel Overview and Analysis

• In order to reset the backend statistics that are displayed by the previous commands

o iopPerfMonRestart

• This must be done on both controllers

• Also flushes debug queue
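A minimal sketch (the command takes no arguments; it simply has to be issued in each controller’s shell):

   -> iopPerfMonRestart
      (repeat the same command on the alternate controller; note that the debug queue is flushed as a side effect)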


Knowledge Check

1) What command will show drive path information?

2) What command will show what hosts are logged in?


Destination Driver Events


Destination Driver Events (Error Codes)

• Target detected errors:
   status-sk/asc/ascq = use SCSI definitions
   (status=ff means unused, sk=00 means unused)

• Hid detected errors:
   02-0b/00/00   IO timeout
   ff-00/01/00   ITN fail timeout (ITN has been disconnected for too long)
   ff-00/02/00   device fail timeout (all ITNs to the device have been disconnected for too long)
   ff-00/03/00   cmd breakup error


Destination Driver Events (Error Codes)

• Lite detected errors:
   02-0b/xx/xx   where xx = XCB_STAT code from the table below

#define XCB_STAT_GEN_ERROR 0x01

#define XCB_STAT_BAD_ALPA 0x02

#define XCB_STAT_OVERFLOW 0x03

#define XCB_STAT_COUNT 0x04

#define XCB_STAT_LINK_FAILURE 0x05

#define XCB_STAT_LOGOUT 0x06

#define XCB_STAT_OXR_ERROR 0x07

#define XCB_STAT_ABTS_SENDER 0x08

#define XCB_STAT_ABTS_RECEIVER 0x09

#define XCB_STAT_OP_HALTED 0x0a

#define XCB_STAT_DATA_MISMATCH 0x0b

#define XCB_STAT_KILL_IO 0x0c

#define XCB_STAT_BAD_SCSI 0x0d

#define XCB_STAT_MISROUTED 0x0e

#define XCB_STAT_ABTS_REPLY_TIMEOUT 0x0f

#define XCB_STAT_REPLY_TIMEOUT 0x10

#define XCB_STAT_FCP_RSP_ERROR 0x11

#define XCB_STAT_LS_RJT 0x12

#define XCB_STAT_FCP_CHECK_COND 0x13

#define XCB_STAT_FCP_SCSI_STAT 0x14

#define XCB_STAT_FCP_RSP_CODE 0x15

#define XCB_STAT_FCP_SCSICON 0x16

#define XCB_STAT_FCP_RESV_CONFLICT 0x17

#define XCB_STAT_FCP_DEVICE_BUSY 0x18

#define XCB_STAT_FCP_QUEUE_FULL 0x19

#define XCB_STAT_FCP_ACA_ACTIVE 0x1a

#define XCB_STAT_MEMORY_ERR 0x1b

#define XCB_STAT_ILLEGAL_REQUEST 0x1c

#define XCB_STAT_MIRROR_CHANNEL_BUSY 0x1d

#define XCB_STAT_FCP_INV_LUN 0x1e

#define XCB_STAT_FCP_DL_MISMATCH 0x1f

#define XCB_STAT_EDC_ERROR 0x20

#define XCB_STAT_EDC_BLOCK_SIZE_ERROR 0x21

#define XCB_STAT_EDC_ORDER_ERROR 0x22

#define XCB_STAT_EDC_REL_OFFSET_ERROR 0x23

#define XCB_STAT_EDC_UDT_FLUSH_ERROR 0x24

#define XCB_STAT_FCP_IOS 0x25

#define XCB_STAT_FCP_IOS_DUP 0x26


Read Link Status (RLS) and Switch-on-a-Chip (SOC)

• Each port on each device maintains a Link Error Status Block (LESB) which tracks the following errors

   – Invalid Transmission Words
   – Loss of Signal
   – Loss of Synchronization
   – Invalid CRCs
   – Link Failures
   – Primitive Sequence Errors

• Read Link Status (RLS) is a link service that collects the LESB from each device

• Transmission Words

   – Formed by 4 Transmission Characters
   – Two types:
      • Data Word – Dxx.y, Dxx.y, Dxx.y, Dxx.y
      • Special Function Word such as an Ordered Set – Kxx.y, Dxx.y, Dxx.y, Dxx.y
   – Ordered Set consists of Frame Delimiter, Primitive Signal, and Primitive Sequence

• A Transmission Word is Invalid when one of the following conditions is detected:

   – At least one Invalid Transmission Character is within the Transmission Word
   – Any valid Special Character is at the second, third, or fourth character position of a Transmission Word
   – A defined Ordered Set is received with Improper Beginning Running Disparity


RLS Diagnostics

• Analyze RLS Counts:

   – Look for a “step” or “spike” in error counts
   – Identify the first device (in Loop Map Order) that detects a high number of Link Errors
      • Link Error Severity Order: LF > LOS > ITW
   – Get the location of the first device
   – Get the location of its upstream device


RLS Diagnostics Example

Example:

• Drive [0,9] has high error counts in ITW, LF, and LOS

• Upstream device is Drive [0,8]

• Drive [0,8] and Drive [0,9] are in same tray

• Most likely bad component: Drive [0,8]

Important Note:

• Logs need to be interpreted, not merely read

• The data is representative of errors seen by the devices on the loop
• There is no standard for error counting
• Different devices may count errors at different rates
• RLS counts are still valid in SOC environments
• They are not valid, however, for SATA trays


What is SOC or SBOD?

• Switch-On-a-Chip (SOC)
• Switched Bunch Of Disks (SBOD)

Features:
• Crossbar switch (Loop-Switch)
• Supported in FC-AL topologies
• Per-device monitoring

SOC Components
• Controllers
   – 6091 Controller
   – 399x Controller
• Drive Trays
   – 2Gb SBOD ESM (2610)
   – 4Gb ESM (4600 – Wrigley)

SBOD vs JBOD


What is the SES?

SCSI Enclosure Services

• The SOC provides monitoring and control for the SES
• The SES is the device that consumes the ALPA

• The brains of the ESM


SOC Statistics

• In order to clear the drive side SOC statistics

clearSocErrorStatistics_MT

• In order to clear the controller side SOC statistics

socShow 1


Determining SFP Ports

• 2Gb SBOD drive enclosure ports go from left to right

• 4Gb SBOD drive enclosure ports start from the center and go to the outside (Wrigley-Husker)

• On all production models, ports are labeled on the drive trays

Port State (PS)

• Inserted – The standard state when a device is present

• Loopback – a connection when Tx is connected to Rx

• Unknown – non-deterministic state

• Various forms of bypassed state exist.
   – Most commonly seen:
      • Byp_TXFlt is expected when a drive is not inserted
      • Byp_NoFru is expected when an SFP is not present
   – Other misc.:
      • Bypassed, Byp_LIPF8, Byp_TmOut, Byp_RxLOS, Byp_Sync, Byp_LIPIso, Byp_LTBI, Byp_Manu, Byp_Redn, Byp_Snoop, Byp_CRC, Byp_OS

Port State (PS) meanings

• Bypassed – Generic bypass condition (indication that port was never in use)

• Byp_TXFlt – Bypassed due to transmission fault
• Byp_NoFru – No FRU installed
• Byp_LIPF8 – Bypass on LIP (F8,F8) or No Comma
• Byp_TmOut – Bypassed due to timeout
• Byp_RxLOS – Bypassed due to receiver Loss Of Signal (LOSG)
• Byp_Sync – Bypassed due to Loss Of Synchronization (LOS)
• Byp_LIPIso – Bypass – LIP isolation port
• Byp_LTBI – Loop Test Before Insert testing process
• Byp_Manu – General catch-all for a forced bypass state
• Byp_Redn – Redundant port connection
• Byp_CRC – Bypassed due to CRC errors
• Byp_OS – Bypassed due to Ordered Set errors
• Byp_Snoop


Port Insertion Count (PIC)

• Port insertion count – The number of times the device has been inserted into this port.

• The value is incremented each time a port successfully transitions from the bypassed state to inserted state.

• Range: 0–255 (2^8)

Loop State (LS)

• The condition of the loop between the SOC and the component
• Possible States:
   – Up = Expected state when a device is present
   – Down = Expected state when no device is present
   – Transition states as the loop is coming up (listed in order):
      • Down -> Init -> Open -> Actv -> Up

Loop Up Count (LUC)

• The total instances that the loop has been identified as having changed from Down to Up during the SOC polling intervals.

   – Note: This implies that a loop can go down and up multiple times in one SOC polling cycle and only be detected once.
   – The polling cycle is presently 30 ms
   – Range: 0–255 (2^8)

CRC Error Count (CRCEC)

• Number of CRC (Cyclic Redundancy Check) errors that are detected in frames.

• A single invalid word in a frame will increment the CRC counter

• Range: 0 – 4,294,967,294 (2^32)


Relative Frequency Drive Error Avg. (relFrq count / RFDEA)

• SBODs are connected to multiple devices.

• This leads to the SBOD being in multiple clock domains

• Over time, clocks tend to drift. SBODs employ a clock-check feature comparing the relative frequency of all attached devices to the clock connected to the SBOD.

• If one transmitter is transmitting at the slow end of the tolerance range and its partner at the fast end, then the two clocks are in specification but will have extreme difficulty communicating

• Range: 0 – 4,294,967,294 (2^32)

Loop Cycle Count (loopCy / LCC)

• The loop cycle is the detection of a Loop transition.
   – Unlike Loop Up Count, the Loop Cycle count does not require the loop to transition to the up state.

• The Loop Cycle Count is more useful in understanding overhead of the FC protocol.

• Until Loop Up goes to 1 no data has been transmitted.

• Loop Cycle allows for an understanding that an attempt is being made to bring up the loop.

– Does not mean the loop has come up

• Range: 0 – 4,294,967,294 (2^32)

• Possible States:
   – Same as Loop States (LS)
      • Up, Down, and the transition states as the loop is coming up


Ordered Set Error Count (OSErr / OSEC)

• Number of Ordered Sets that are received with an encoding error.
• Ordered Sets include Idle, ARB, LIP, SOF, EOF, etc.
• Range: 0 – 4,294,967,294 (2^32)

Port Connection Held Off Count (hldOff / PCHOC)

• Port connections held off count
• The number of instances a device has attempted to connect to a specific port and received busy.
• Range: 0 – 4,294,967,294 (2^32)

Port Utilization – Traffic Utilization (PUP)

• The percentage of bandwidth detected over a 240 ms period of time.

Other values

• Sample Time:
   – Time in seconds in which that sample was taken

General Rules of Thumb for Analysis

• It requires more energy to transmit (Tx) than receive (Rx)

• In some instances it is not possible to isolate the specific problematic component.

   – The recommended replacement order is the following:
      1. SFP
      2. Cable
      3. ESM
      4. Controller


Analysis of RLS/ SOC

• RLS is an error reporting mechanism that reports errors as seen by the devices on the array.

• SOC counters are controlled by the SOC chip

• SOC is an error reporting mechanism that monitors communication between two devices.

• SOC data does not render RLS information obsolete

• RLS & SOC need to be interpreted not merely read

• Different devices may not count errors at the same rate

• Different devices may have different expected thresholds

• Know the topology/ cabling of the storage array

• When starting analysis always capture both RLS and SOC

• Do not always expect the first capture of the RLS/SOC data to pinpoint the problematic device.


Analysis of SOC

• Errors are generally not propagated through the loop in a SOC environment.
   – What is recorded is the communication statistics between two devices.

• The exceptions to the rule:
   – loopUp Count
   – CRC Error Count
   – OS Error Count

• Focus emphasis on the following parameters:
   – Insertion count
   – Loop up count
   – Loop cycle count
   – CRC error count
   – OS error count

• The component connected to the port with the highest errors in the aforementioned stats is the most likely candidate for a bad component

Known Limitations

• Non-optimal configurations – i.e. improper cabling

• SOC in hub mode


Field Case

• Multiple backend issues reported in MEL
• readLinkStatus.csv
• RLS stats show drive trays 1 & 2 are on channels 1 & 3 (all counts zero)

Field Case (cont)

• socStatistics.csv (Amethyst 2 release)
   – SOC stats show the problem (M = Million)
      • Focusing on Drive Tray 1 ESM-A, the user can see that the SES (the brains of the ESM) is bypassed and the loop state is down.
      • Recommendation was to replace ESM-A.
      • The drive tray can continue to operate after it is up without the SES.


Drive Channel State Management

This feature provides a mechanism for identifying drive-side channels where device paths (IT nexus) are experiencing channel related I/O problems. This mechanism’s goal is twofold:

1) It aims to provide ample notice to an administrator that some form of problem exists among the components that are present on the channel

2) It attempts to eliminate, or at least reduce, I/O on drive channels that are experiencing those problems.

• There are two states for a drive channel – OPTIMAL and DEGRADED

• A drive channel will be marked degraded by the controller when a predetermined threshold has been met for channel errors
   – Timeout errors
   – Controller detected errors: Misrouted FC Frames and Bad ALPA errors, for example
   – Drive detected errors: SCSI Parity Errors, for example
   – Link Down errors

• When a drive channel is marked degraded, a critical event will be logged to the MEL and a Needs Attention condition will be set in Storage Manager

What a degraded drive channel means

• When a controller marks a drive-side channel DEGRADED, that channel will be avoided to the greatest extent possible when scheduling drive I/O operations.

– To be more precise, the controller will always select an OPTIMAL channel over a DEGRADED channel when scheduling a drive I/O operation.

– However, if both paths to a given drive are associated with DEGRADED channels, the controller will arbitrarily choose one of the two.

• This point further reinforces the importance of directing administrative attention to a DEGRADED channel so that it can be repaired and returned to the OPTIMAL state before other potential path problems arise.
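
The scheduling rule described above can be summarized in a few lines of illustrative pseudologic. This is not controller firmware; the channel names and the random tie-break are assumptions made purely for the example.

    import random

    def pick_channel(channels):
        # channels: mapping of channel name -> state for the two paths to a drive,
        # e.g. {"channel 1": "OPTIMAL", "channel 3": "DEGRADED"}
        optimal = [c for c, state in channels.items() if state == "OPTIMAL"]
        if optimal:
            return optimal[0]                  # always prefer an OPTIMAL path
        return random.choice(list(channels))   # both DEGRADED: pick arbitrarily

    print(pick_channel({"channel 1": "OPTIMAL", "channel 3": "DEGRADED"}))   # channel 1
    print(pick_channel({"channel 1": "DEGRADED", "channel 3": "DEGRADED"}))  # arbitrary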

Page 162: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 162

• A drive channel that is marked degraded will remain degraded through a reboot, as the surviving controller will direct the rebooting controller to mark the path degraded

– If there is no alternate controller, the drive channel will be marked OPTIMAL again

• The drive channel will not automatically transition back to an OPTIMAL state (with the exception of the above situation) unless directed by the user via the Storage Manager software

Page 163: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 163

SAS Backend

Page 164: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 164

SAS Backend Overview and Analysis

• Statistics collected from PHYs
– A SAS wide port consists of multiple PHYs, each with independent error counters

• Statistics collected from PHYs on:
– SAS Expanders
– SAS Disks
– SAS I/O Protocol ASICs

• PHYs that do not maintain counters

– Reported as “N/A” or similar in the user interface
– This includes SATA disks

• PHY counters do not wrap (per standard)

– Maximum value of 4,294,967,295 (2^32 - 1)
– Must be manually reset

• Counters defined in SAS 1.1 Standard

– Invalid DWORDs
– Running Disparity Errors
– Loss of DWORD Synchronization

• After dword synchronization has been achieved, the PHY monitors the dwords that are received for invalid dwords. When an invalid dword is detected, it requires two valid dwords to nullify its effect. When four invalid dwords are detected without nullification, dword synchronization is considered lost.

– PHY Reset Problems

• Additional information returned
– Elapsed time since the PHY logs were last cleared
– Negotiated physical link rate for the PHY
– Hardware maximum physical link rate for the PHY

Page 165: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 165

SAS Error counts

• IDWC – Invalid Dword Count
– A dword that is not a data dword or a primitive (i.e., in the character context, a dword that contains an invalid character, a control character in other than the first character position, a control character other than K28.3 or K28.5 in the first character position, or one or more characters with a running disparity error). This can mark the beginning of a loss of dword synchronization: after the fourth non-nullified invalid dword, dword synchronization is lost.

• RDEC – Running Disparity Error Count

– Cumulative encoded signal imbalance between the one and zero signal states. Any dword with one or more running disparity errors is considered an invalid dword.

• LDWSC – Loss of Dword synch Count

– Incremented when dword synchronization is lost, i.e., after the fourth non-nullified invalid dword (see the sketch after this list).

• RPC – Phy Reset Problem Count

– Number of times a phy reset problem occurred. When a phy or link is reset, it will run through its reset sequence (OOB, Speed Negotiation, Multiplexing, Identification).
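
The loss-of-synchronization rule behind IDWC and LDWSC can be modeled with a small counter: each invalid dword increments it, two subsequent valid dwords nullify one invalid dword, and the fourth non-nullified invalid dword costs synchronization. The sketch below is an illustration of that rule as described in this section, not the exact SAS 1.1 state machine.

    def track_sync(dwords, sync_threshold=4):
        # dwords: iterable of booleans, True = valid dword, False = invalid dword.
        # Returns (invalid_dword_count, loss_of_sync_count) in the spirit of IDWC/LDWSC.
        idwc = ldwsc = 0
        pending = valid_run = 0       # non-nullified invalid dwords / consecutive valid dwords
        for valid in dwords:
            if valid:
                valid_run += 1
                if valid_run == 2 and pending:   # two valid dwords nullify one invalid dword
                    pending -= 1
                    valid_run = 0
            else:
                idwc += 1
                pending += 1
                valid_run = 0
                if pending == sync_threshold:    # fourth non-nullified invalid dword
                    ldwsc += 1
                    pending = 0                  # resynchronization restarts the count
        return idwc, ldwsc

    # four invalid dwords in a row lose sync once; interleaved valid pairs prevent it
    print(track_sync([False, False, False, False]))            # (4, 1)
    print(track_sync([False, True, True, False, True, True]))  # (2, 0)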

SAS Backend Overview and Analysis

• SAS error logs are gathered as part of the Capture all Support Data bundle – sasPhyErrorLogs.csv

• Not available through the GUI; only via the CLI or the support bundle.

• CLI command to collect SAS PHY Error Statistics

– save storageArray SASPHYCounts file=“<file>”;

• CLI command to reset SAS PHY Error Statistics
– reset storageArray SASPHYCounts;

• Shell commands to collect SAS PHY Error Statistics

– sasShowPhyErrStats 0 • List phys with errors

– sasShowPhyErrStats 1 • List all phys

– getSasErrorStatistics_MT

• Shell commands to reset SAS PHY Error Statistics

– sasClearPhyErrStats – clearSasErrorStatistics_MT

Page 171: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 171

SAS Backend Overview and Analysis

• Remember that SAS error statistics are gathered per PHY

• If a PHY has a high error count, look at the device that the PHY is directly attached to
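
A minimal sketch of that check is shown below: it reads the sasPhyErrorLogs.csv file from the support bundle and lists PHYs whose summed error counters are non-zero. The column names used here ("Device", "PHY", and the counter abbreviations) are assumptions; check the header row of your own capture, which varies by release.

    import csv

    COUNTERS = ("IDWC", "RDEC", "LDWSC", "RPC")   # error counters described earlier

    def noisy_phys(path, threshold=0):
        # Return (errors, device, phy) tuples for PHYs whose summed counters exceed threshold.
        suspects = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                errors = sum(int(row.get(c, "0") or 0) for c in COUNTERS)
                if errors > threshold:
                    suspects.append((errors, row.get("Device"), row.get("PHY")))
        return sorted(suspects, key=lambda t: t[0], reverse=True)

    for errors, device, phy in noisy_phys("sasPhyErrorLogs.csv"):
        # The component cabled directly to this PHY is the first replacement candidate.
        print(errors, device, phy)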

Page 172: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 172

Left Blank Intentionally

Page 173: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 173

Appendix A: SANtricity Managed Storage Systems

• Fully-featured midrange storage designed for wide-ranging open systems environments

• Compute-intensive applications, consolidation, tiered storage
• Fully-featured management software designed to provide administrators with extensive configuration flexibility

• FC and IB connectivity with support for FC/SATA drives

Attribute comparison: 6998 | 6994 and 6498

Overview
• 6998 | 6994: Flagship system targeted at enterprises with compute-intensive applications and large consolidations
• 6498: Targeted at HPC environments utilizing InfiniBand for Linux server-clustering interconnect

Key features
• 6998 | 6994: Disk performance, SANtricity robustness, dedicated data cache, 4 Gb/s interfaces, switched-loop backend, FC | SATA intermixing
• 6498: Native IB interfaces, switched-loop backend, FC | SATA intermixing, SANtricity robustness

Host interfaces Eight 4 Gb/s FC Four 10 Gb/s IB

Drive interfaces Eight 4 Gb/s FC Eight 4 Gb/s FC

Drives 224 FC or SATA 224 FC or SATA

Data cache 4, 8, 16 GB (dedicated) 2 GB (dedicated)

Cache IOPS 575,000 | 375,000 IOPS ---

Disk IOPS 86,000 | 62,000 IOPS ---

Disk MB/s 1,600 | 1,280 MB/s 1,280 MB/s

Page 174: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 174

6998 /6994 /6091 (Front)

6998 /6994 /6091 (Back)

Page 175: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 175

Attribute 3994 | 3992

Overview Fully-featured systems targeted at midrange environments requiring high-end functionality and performance value

Key features

• Performance value

• SANtricity robustness

• FC | SATA intermixing
• 4 Gb/s interfaces

• Switched-loop backend

Host interfaces Eight | Four 4 Gb/s FC

Drive interfaces Four 4 Gb/s FC

Drives 112 FC or SATA

Data cache 4 GB | 2 GB (shared)

Cache IOPS 120,000 IOPS

Disk IOPS 44,000 | 28,000 IOPS

Disk MB/s 990 | 740 MB/s

3992 (Back)

Page 176: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 176

3994 (Back)

4600 16-Drive Enclosure (Back)

4600 16-Drive Enclosure (Front)

Page 177: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 177

Left Blank Intentionally

Page 178: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 178

Appendix B: Simplicity Managed Storage Systems

• Affordable and reliable storage designed for SMB, departmental and remote-site customers

• Intuitive, task-oriented management software designed for sites with limited IT resources that need to be self-sufficient

• FC and SAS connectivity with support for SAS/SATA drives (SATA drive support mid-2007)

Attribute 1333 | 1331

Overview Shared DAS targeted at SMB and entry-level environments requiring ease of use and reliability. Entry-point storage for Microsoft Cluster Server

Key features

• Shared DAS

• High availability/reliability

• SAS host interfaces
• Robust, intuitive Simplicity software

• Snapshot / Volume Copy

Host interfaces Six | Two 3 Gb/s “wide” SAS

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 91,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 900 MB/s

1333

Page 179: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 179

Attribute 1532

Overview

iSCSI connectivity provides integration into low-cost IP networks
– Pervasive and well-understood interface technology
– Simple to implement and manage with intuitive, easy-to-use storage software

Key features
• Cost-effective and reliable
• iSCSI host connectivity
• Attach to redundant IP switches

Host interfaces Four 1Gb/s iSCSI

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 64,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 320 MB/s

1532

Page 180: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 180

Attribute 1932

Overview

Ideal for departments or remote offices that need to integrate inexpensive storage into existing FC networks. Also appealing to smaller organizations planning initial SANs.

Key features

• High availability/reliability

• Robust, intuitive Simplicity software
• 4 Gb/s host interfaces

• Snapshot / Volume Copy

Host interfaces Four 4 Gb/s FC

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 114,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 900 MB/s

1932

Page 181: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 181

SAS Drive Tray (Front)

SAS Expansion Tray (Back)

Page 182: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 182

Left Blank Intentionally

Page 183: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 183

Appendix C – State, Status, Flags (06.xx)

Drive State, Status, Flags
From pp. 15–16, Troubleshooting and Technical Reference Guide – Volume 1

Drive State Values

0 Optimal

1 Non-existent drive

2 Unassigned, w/DACstore

3 Failed

4 Replaced

5 Removed – optimal pg2A = 0

6 Removed – replaced pg2A = 4

7 Removed – Failed pg2A = 3

8 Unassigned, no DACstore

Drive State Values

0x0000 Optimal

0x0001 Unknown Channel

0x0002 Unknown Drive SCSI ID

0x0003 Unknown Channel and Drive SCSI ID

0x0080 Format in progress

0x0081 Reconstruction in progress

0x0082 Copy-back in progress

0x0083 Reconstruction initiated but no GHS is integrated

0x0090 Mismatched controller serial number

0x0091 Wrong vendor – lock out

0x0092 Unassigned drive locked out

0x00A0 Format failed

0x00A1 Write failed

0x00A2 Start of Day failed

0x00A3 User failed via Mode Select

0x00A4 Reconstruction failed

0x00A5 Drive failed at Read Capacity

0x00A6 Drive failed for internal reason

0x00B0 No information available

0x00B1 Wrong sector size

0x00B2 Wrong capacity

0x00B3 Incorrect Mode parameters

0x00B4 Wrong controller serial number

0x00B5 Channel Mismatch

0x00B6 Drive Id mismatch

0x00B7 DACstore inconsistent

0x00B8 Drive needs to have a 2MB DACstore

0x00C0 Wrong drive replaced

0x00C1 Drive not found

0x00C2 Drive offline, internal reasons

Page 184: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 184

Drive State (d_flags)

0x00000100 Drive is locked for diagnostics

0x00000200 Drive contains config. sundry

0x00000400 Drive is marked deleted by Raid Mgr.

0x00000800 Defined drive without drive

0x00001000 Drive is spinning or accessible

0x00002000 Drive contains a format or accessible

0x00004000 Drive is designated as HOT SPARE

0x00008000 Drive has been removed

0x00010000 Drive has an ADP93 DACstore

0x00020000 DACstore update failed

0x00040000 Sub-volume consistency checked during SOD

0x00080000 Drive is part of a foreign rank (cold added).

0x00100000 Change vdunit number

0x00200000 Expanded DACstore parameters

0x00400000 Reconfiguration performed in reverse VOLUME order

0x00800000 Copy operation is active (not queued).

Page 185: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 185

Volume State, Status, Flags
From pp. 17–18, Troubleshooting and Technical Reference Guide – Volume 1

VOLUME State (vd_state)
These flags are bit values, and the following flags are valid:

0x0000 optimal

0x0001 degraded

0x0002 reconstructing

0x0003 formatting

0x0004 dead

0x0005 quiescent

0x0006 non-existent

0x0007 dead, awaiting format

0x0008 not spun up yet

0x0009 unconfigured

0x000a LUN is in process of ADP93 upgrade

0x000b Optimal state and reconfig

0x000c Degraded state and reconfig

0x000d Dead state and reconfig

VOLUME Status (vd_status)
These flags are bit values, and the following flags are valid:

0x0000 No sub-state/status available

0x0020 Parity scan in progress

0x0022 Copy operation in progress

0x0023 Restore operation in progress

0x0025 Host parity scan in progress

0x0044 Format in progress on virtual disk

0x0045 Replaced wrong drive

0x0046 Deferred error

Page 186: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 186

VOLUME Flags (vd_flags)
These flags are bit values, and the following flags are valid:

0x00000001 Configured

0x00000002 Open

0x00000004 On-Line

0x00000008 Not Suspended

0x00000010 Resources available

0x00000020 Degraded

0x00000040 Spare piece - VOLUME has Global Hot Spare drive in use

0x00000080 RAID 1 ping-pong state

0x00000100 RAID 5 left asymmetric mapping

0x00000200 Write-back caching enabled

0x00000400 Read caching enabled

0x00000800 Suspension in progress while switching Global Hot Spare drive

0x00001000 Quiescence has been aborted or stopped

0x00010000 Prefetch enabled

0x00020000 Prefetch multiplier enabled

0x00040000 IAF not yet started, don't restart yet

0x00100000 Data scrubbing is enabled on this unit

0x00200000 Parity check is enabled on this unit

0x00400000 Reconstruction read failed

0x01000000 Reconstruction in progress

0x02000000 Data initialization in progress

0x04000000 Reconfiguration in progress

0x08000000 Global Hot Spare copy-back in progress

0x90000000 VOLUME halted; awaiting graceful termination of any reconstruction, verify, or copy-back
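
Because vd_flags is a bit-significant field, a captured hex value decodes into the list of set flags by ANDing it against the values in the table above. A minimal sketch (only a few of the flag bits from the table are included; extend the dictionary as needed):

    VD_FLAGS = {
        0x00000001: "Configured",
        0x00000004: "On-Line",
        0x00000020: "Degraded",
        0x00000200: "Write-back caching enabled",
        0x00000400: "Read caching enabled",
        0x01000000: "Reconstruction in progress",
    }

    def decode_vd_flags(value):
        # Return the name of every flag bit set in the captured vd_flags value.
        return [name for bit, name in VD_FLAGS.items() if value & bit]

    print(decode_vd_flags(0x00000627))   # hypothetical captured value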

Page 187: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 187

From p. 27, Troubleshooting and Technical Reference Guide – Volume 1

3.2.5 Controller/RDAC Modify Commands

3.2.5.01 isp rdacMgrSetModeActivePassive
This command sets the controller (that you are talking to) to active mode and the alternate controller mode to passive.

WARNING: This command does not modify the controller cache setup, only the controller states. Updating the cache setup may be accomplished by issuing the following command:

isp ccmEventNotify,0x0f

3.2.5.02 isp rdacMgrSetModeDualActive

This command sets both array controller modes to dual active.

WARNING: This command does not modify the controller cache setup, only the controller states. Updating the cache setup may be accomplished by issuing the following command:

isp ccmEventNotify,0x0f

3.2.5.03 isp rdacMgrAltCtlFail
This command will fail the alternate controller and take ownership of its volumes.

NOTE: In order to fail a controller, it may be necessary to set the controller to a passive state first.

3.2.5.04 isp rdacMgrAltCtlResetRelease

Will release the alternate controller if it is being held in reset or failed.

Page 188: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 188

Left Blank Intentionally

Page 189: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 189

Appendix D – Chapter 2 – MEL Data Format
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential

Chapter 2: MEL Data Format

The event viewer formats and displays the most meaningful fields of major event log entries from the controller. The data displayed for individual events varies with the event type and is described in the Events Description section. The raw data contains the entire major event data structure retrieved from the controller subsystem. The event viewer displays the raw data as a character string. Fields that occupy multiple bytes may appear to be byte swapped depending on the host system. Fields that may appear as byte swapped are noted in the table below.

2.1. Overview of the Major Event Log Fields

Table 2-1: MEL Data Fields

2.1.1. Constant Data Field Format, No Version Number
Note: If the log entry field does not have a version number, the format will be as shown below.
Table 2-2: Constant Data Field Format, No Version Number

2.1.2. Constant Data Field Format, Version 1
If the log entry field contains version 1, the format will be as shown below.
Table 2-3: Constant Data Field Format, Version 1

2.2. Detail of Constant Data Fields

2.2.1. Signature (Bytes 0-3) Field Details
The Signature field is used internally by the controller. The current value is 'MELH'.

2.2.2. Version (Bytes 4 -7) Field Details When the Version field is present, the value should be 1 or 2, depending on the format of the MEL entry.

2.2.3. Sequence Number (Bytes 8 - 15) Field Details The Sequence Number field is a 64-bit incrementing value starting from the time the system log was created or last initialized. Resetting the log does not affect this value.

2.2.4. Event Number (Bytes 16 - 19) Field Details The Event Number is a 4 byte encoded value that includes bits for drive and controller inclusion, event priority, and the event value. The Event Number field is encoded as follows

Table 2-4: Event Number (Bytes 16 - 19) Encoding

Page 193: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 193

2.2.4.1. Event Number - Internal Flags Field Details The Internal Flags are used internally within the controller firmware for events that require unique handling. The host application ignores these values. Table 2-5: Internal Flags Field Values

2.2.4.2. Event Number - Log Group Field Details The Log Group field indicates what kind of event is being logged. All events are logged in the system log. The values for the Log Group Field are described as follows: Table 2-6: Log Group Field Values

2.2.4.3. Event Number - Priority Field Details The Priority field is defined as follows: Table 2-7: Priority Field Values

Page 194: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 194

2.2.4.4. Event Number - Event Group Field Details The Event Group field is defined as follows: Table 2-8: Event Group Field Values

2.2.4.5. Event Number - Component Type Field Details The Component Type Field Values are defined as follows:

Page 196: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 196

2.2.5. Timestamp (Bytes 20 - 23) Field Details The Timestamp field is a 4 byte value that corresponds to the real time clock on the controller. The real time clock is set (via the boot menu) at the time of manufacture. It is incremented every second and started relative to January 1, 1970.
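
Since the Timestamp field counts seconds from January 1, 1970, it converts to a readable date the same way a Unix epoch value does. A minimal sketch (the raw value shown is hypothetical):

    from datetime import datetime, timezone

    def mel_timestamp(seconds):
        # MEL timestamps count seconds since January 1, 1970 (controller real-time clock).
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    print(mel_timestamp(0x48750A3C))   # hypothetical raw value from a MEL entry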

2.2.6. Location Information (Bytes 24 - 27 ) Field Details The Location Information field indicates the Channel/Drive or Tray/Slot information for the event. Logging of data for this field is optional and is zero when not specified.

2.2.7. IOP ID (Bytes 28-31) Field Details The IOP ID is used by MEL to associate multiple log entries with a single event or I/O. The IOP ID is guaranteed to be unique for each I/O. A valid IOP ID may not be available for certain MEL entries and some events use this field to log other information. The event descriptions will indicate if the IOP ID is being used for unique log information. Logging of data for this field is optional and is zero when not specified.

2.2.8. I/O Origin (Bytes 32-33) Field Details The I/O Origin field specifies where the I/O or action originated that caused the event. It uses one of the Error Event Logger defined origin codes. A valid I/O Origin may not be available for certain MEL entries, and some events use this field to log other information. The event descriptions will indicate if the I/O Origin is being used for unique log information. Logging of data for this field is optional and is zero when not specified. When decoding MEL events, additional FRU information can be found in the Software Interface Specification. Table 2-9: I/O Origin Field Values

2.2.9. LUN/Volume Number (Bytes 36 - 39) Field Details The LUN/Volume Number field specifies the LUN or volume associated with the event being logged. Logging of data for this field is optional and is zero when not specified.

Page 197: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 197

2.2.10. Controller Number (Bytes 40-43) Field Details The Controller Number field specifies the controller associated with the event being logged. Table 2-10: Controller Number (Bytes 40-43) Field Values

Logging of data for this field is optional and is zero when not specified.

2.2.11. Category Number (Bytes 44 - 47) Field Details This field identifies the category of the log entry. This field is identical to the event group field encoded in the event number. Table 2-11: Event Group Field Values

2.2.12. Component Type (Bytes 48 - 51) Field Details Identifies the component type associated with the log entry. This is identical to the Component Group list encoded in the event number

Table 2-12: Component Type Field Details

2.2.13. Component Location Field Details The first entry in this field identifies the component based on the Component Type field listed above. The definition of the remaining bytes is dependent on the Component Type

Table 2-13: Component Type Location Values

2.2.14. Location Valid (Bytes 120-123) Field Details This field contains a value of 1 if the component location field contains valid data. If the component location data is not valid or cannot be determined the value is 0.

2.2.15. Number of Optional Fields Present (Byte 124) Field Details The Number of Optional Fields Present specifies the number (if any) of additional data fields that follow. If this field is zero then there is no additional data for this log entry.

Page 202: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 202

2.2.16. Optional Field Data Field Details The format for the individual optional data fields follows: Table 2-14: Optional Field Data Format

2.2.17. Data Length (Byte 128) Field Details The length in bytes of the optional data field data (including the Data Field Type)

2.2.18. Data Field Type (Bytes 130-131) Field Details See Data Field Types on page 14 for the definitions of the various optional data fields.

2.2.19. Data (Byte 132) Field Details Optional field data associated with the Data Field Type. This data may appear as byte swapped when using the event viewer.

Page 203: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 203

Appendix E – Chapter 30 – Data Field Types
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential

Chapter 30: Data Field Types
This table describes the data field types.

Table 30-1: Data Field Types

Page 215: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 215

Appendix F – Chapter 31 – RPC Function Numbers
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential

Chapter 31: RPC Function Numbers
The following table lists the SYMbol remote procedure call function numbers:

Table 31-1: SYMbol RPC Functions

Page 229: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 229

Appendix G – Chapter 32 – SYMbol Return Codes
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential

Chapter 32: SYMbol Return Codes
This section provides a description of each of the SYMbol return codes.

Return Codes

Page 261: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 261

Appendix H – Chapter 5 - Host Sense Data

Software Interface Specification 349-1062130 - Rev. A1 (Chromium 1 & 2) LSI Logic Confidential

Chapter 5: Host Sense Data

5.1. Request Sense Data Format
Sense data returned by the Request Sense command is in one of two formats: Fixed format or Descriptor format. The format is based on the value of the D_SENSE bit (byte 2, bit 2) in the Control Mode Page. When this bit is set to 0, sense data is returned using the Fixed format. When the bit is set to 1, sense data is returned using the Descriptor format. This parameter defaults to 1b for volumes >= 2 TB in size and to 0b for volumes < 2 TB in size. This setting is persisted on a logical unit basis. See “6.11. Control Mode Page (Page A)” on page 6-232. The first byte of all sense data contains the response code field, which indicates the error type and the format of the sense data. If the response code is 0x70 or 0x71, the sense data format is Fixed. See “5.1.1. Request Sense Data - Fixed Format” on page 5-189. If the response code is 0x72 or 0x73, the sense data format is Descriptor. See “5.1.2. Request Sense Data - Descriptor Format” on page 5-205. For more information on sense data response codes, see SPC-3, SCSI Primary Commands.

5.1.1. Request Sense Data - Fixed Format
The table below outlines the Fixed format for Request Sense data. Information about individual bytes is defined in the paragraphs following the table.

Table 5.1: Request Sense Data Format

Page 262: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 262

Page 263: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 263

5. 1. 1. 1. Incorrect Length Indicator (ILI) - Byte 2 This bit is used to inform the host system that the requested non-zero byte transfer length for a Read or Write Long command does not exactly match the available data length. The information field in the sense data will be set to the difference (residue) of the requested length minus the actual length in bytes. Negative values will be indicated by two's complement notation. Since the controller does not support Read or Write Long, this bit is always zero.

5. 1. 1. 2. Sense Key - Byte 2 Possible sense keys returned are shown in the following table:

Table 5.2: Sense Key - Byte 2

Page 264: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 264

5. 1. 1. 3. Information Bytes - Bytes 3-6 This field is implemented as defined in the SCSI standard for direct access devices. The information could be any one of the following types of information:
• The unsigned logical block address indicating the location of the error being reported.
• The first invalid logical block address if the sense key indicates an illegal request.

5. 1. 1. 4. Additional Sense Length - Byte 7 This value will indicate the number of additional sense bytes to follow. Some errors cannot return valid data in all of the defined fields. For these errors, invalid fields will be zero-filled unless specified in the SCSI-2 standard as containing 0xFF if invalid. The value in this field will be 152 (0x98) in most cases. However, there are situations when only the standard sense data will be returned. For these sense blocks, the additional sense length is 10 (0x0A).

5. 1. 1. 5. Command Specific Information – Bytes 8-11 This field is only valid for sense data returned after an unsuccessful Reassign Blocks command. The logical block address of the first defect descriptor not reassigned will be returned in this field. These bytes will be 0xFFFFFFFF if information about the first defect descriptor not reassigned is not available or if all the defects have been reassigned. The command-specific field will always be zero-filled for sense data returned for commands other than Reassign Blocks.

5. 1. 1. 6. Additional Sense Codes - Bytes 12-13 See the information on supported sense codes and qualifiers in See “11.2.Additional Sense Codes and Qualifiers” on page 11-329. for details on the information returned in these fields.
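
The byte offsets described so far (response code in byte 0, ILI and sense key in byte 2, information bytes 3-6, additional sense length in byte 7, ASC/ASCQ in bytes 12-13, FRU code in byte 14) can be pulled out of a raw fixed-format sense buffer with a few lines of code. A minimal sketch, assuming the sense data is available as a Python bytes object; the example buffer at the bottom is hypothetical.

    def decode_fixed_sense(sense):
        # Fixed-format sense data only (response codes 0x70/0x71).
        response_code = sense[0] & 0x7F
        if response_code not in (0x70, 0x71):
            raise ValueError("not fixed-format sense data")
        return {
            "response_code": hex(response_code),
            "ili": bool(sense[2] & 0x20),
            "sense_key": hex(sense[2] & 0x0F),
            "information": int.from_bytes(sense[3:7], "big"),   # e.g. failing LBA
            "additional_sense_length": sense[7],
            "asc": hex(sense[12]),
            "ascq": hex(sense[13]),
            "fru_code": hex(sense[14]),
        }

    # hypothetical example: Medium Error, LBA 0x1234, ASC/ASCQ 0x11/0x00, FRU 0x23
    print(decode_fixed_sense(bytes([0x70, 0, 0x03, 0, 0, 0x12, 0x34, 0x98,
                                    0, 0, 0, 0, 0x11, 0x00, 0x23, 0, 0, 0])))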

5. 1. 1. 7. Field Replaceable Unit Code - Byte 14 A non-zero value in this byte identifies a field replaceable unit that has failed or a group of field replaceable modules that includes one or more failed devices. For some Additional Sense Codes, the FRU code must

Page 265: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 265

be used to determine where the error occurred. As an example, the Additional Sense Code for SCSI bus parity error is returned for a parity error detected on either the host bus or one of the drive buses. In this case, the FRU field must be evaluated to determine if the error occurred on the host channel or a drive channel. Because of the large number of replaceable units possible in an array, a single byte is not sufficient to report a unique identifier for each individual field replaceable unit. To provide meaningful information that will decrease field troubleshooting and problem resolution time, FRUs have been grouped. The defined FRU groups are listed below.

5.1.1.7.1. Host Channel Group (0x01) A FRU group consisting of the host SCSI bus, its SCSI interface chip, and all initiators and other targets connected to the bus.

5.1.1.7.2. Controller Drive Interface Group (0x02) A FRU group consisting of the SCSI interface chips on the controller which connect to the drive buses.

5.1.1.7.3. Controller Buffer Group (0x03) A FRU group consisting of the controller logic used to implement the on-board data buffer.

5.1.1.7.4. Controller Array ASIC Group (0x04) A FRU group consisting of the ASICs on the controller associated with the array functions.

Page 266: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 266

5.1.1.7.5. Controller Other Group (0x05) A FRU group consisting of all controller related hardware not associated with another group.

5.1.1.7.6. Subsystem Group (0x06) A FRU group consisting of subsystem components that are monitored by the array controller, such as power supplies, fans, thermal sensors, and AC power monitors. Additional information about the specific failure within this FRU group can be obtained from the additional FRU bytes field of the array sense.

5.1.1.7.7. Subsystem Configuration Group (0x07) A FRU group consisting of subsystem components that are configurable by the user, on which the array controller will display information (such as faults).

5.1.1.7.8. Sub-enclosure Group (0x08) A FRU group consisting of the attached enclosure devices. This group includes the power supplies, environmental monitor, and other subsystem components in the sub-enclosure.

5.1.1.7.9. Redundant Controller Group (0x09) A FRU group consisting of the attached redundant controllers.

5.1.1.7.10. Drive Group (0x10 - 0xFF) A FRU group consisting of a drive (embedded controller, drive electronics, and Head Disk Assembly), its power supply, and the SCSI cable that connects it to the controller; or supporting sub-enclosure environmental electronics. For SCSI drive-side arrays, the FRU code designates the channel ID in the most significant nibble and the SCSI ID of the drive in the least significant nibble. For Fibre Channel drive-side arrays, the FRU code contains an internal representation of the drive’s channel and id. This representation may change and does not reflect the physical location of the drive. The sense data additional FRU fields will contain the physical drive tray and slot numbers. NOTE: Channel ID 0 is not used because a failure of drive ID 0 on this channel would cause an FRU code of 0x00, which the SCSI-2 standard defines as no specific unit has been identified to have failed or that the data is not available.
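
For SCSI drive-side arrays, the drive-group FRU code therefore splits apart with two nibble operations. (As the text notes, Fibre Channel drive-side arrays use an internal representation instead, so this split is only meaningful for parallel SCSI backends.) A minimal sketch:

    def split_drive_fru(fru_code):
        # Valid for SCSI drive-side arrays with Drive Group FRU codes 0x10 - 0xFF.
        if not 0x10 <= fru_code <= 0xFF:
            raise ValueError("not a Drive Group FRU code")
        channel = (fru_code >> 4) & 0x0F   # most significant nibble
        scsi_id = fru_code & 0x0F          # least significant nibble
        return channel, scsi_id

    print(split_drive_fru(0x23))   # channel 2, SCSI ID 3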

5. 1. 1. 8. Sense Key Specific Bytes - Bytes 15-17 This field is valid for a sense key of Illegal Request when the sense-key specific valid (SKSV) bit is on. The sense-key specific field will contain the data defined below. In this release of the software, the field pointer is only supported if the error is in the CDB

Page 267: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 267

• C/D = 1 indicates the illegal parameter is in the CDB.
• C/D = 0 indicates that the illegal parameter is in the parameters sent during a Data Out phase.
• BPV = 0 indicates that the value in the Bit Pointer field is not valid.
• BPV = 1 indicates that the Bit Pointer field specifies which bit of the byte designated by the Field Pointer field is in error. When a multiple-bit error exists, the Bit Pointer field will point to the most significant (left-most) bit of the field.
The Field Pointer field indicates which byte of the CDB or the parameter was in error. Bytes are numbered from zero. When a multiple-byte field is in error, the pointer will point to the most significant byte.

5. 1. 1. 9. Recovery Actions - Bytes 18-19 This is a bit-significant field that indicates the recovery actions performed by the array controller.

5. 1. 1. 10. Total Number Of Errors - Byte 20 This field contains a count of the total number of errors encountered during execution of the command. The ASC and ASCQ for the last two errors encountered are in the ASC/ASCQ stack field.

6 Downed LUN
5 Failed drive

5. 1. 1. 11. Total Retry Count - Byte 21

Page 268: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 268

The total retry count is for all errors seen during execution of a single CDB set.

5. 1. 1. 12. ASC/ASCQ Stack - Bytes 22-25 These fields store information when multiple errors are encountered during execution of a command. The ASC/ASCQ pairs are presented in order of most recent to least recent error detected.

5. 1. 1. 13. Additional FRU Information - Bytes 26-33 These bytes provide additional information about the field replaceable unit identified in byte 14. The first two bytes are qualifier bytes that provide details about the FRU in byte 14. Byte 28 is an additional FRU code which identifies a second field replaceable unit. The value in byte 28 can be interpreted using the description for byte 14. Bytes 29 and 30 provide qualifiers for byte 28, just as bytes 26 and 27 provide qualifiers for byte 14. The table below shows the layout of this field. Following the table is a description of the FRU group code qualifiers. If an FRU group code qualifier is not listed below, this indicates that bytes 26 and 27 are not used in this release

5.1.1.13.1. FRU Group Qualifiers for the Host Channel Group (Code 0x01) FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which host channel is reporting the failed component. The least significant byte provides the device type and state of the device being reported

Page 269: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 269

5.1.1.13.2. Mini-hub Port Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the Mini-Hub port is irrelevant, port 0 is specified.

5.1.1.13.3. Controller Number Controller Number indicates which controller the host interface is connected to.

5.1.1.13.4. Host Channel LSB Format The least significant byte provides the device type and state of the device being reported.

Host Channel Number indicates which channel of the specified controller. Values 1 through 4 are valid.

5.1.1.13.4.1. Host Channel Device State Host Channel Device State is defined as:

Page 270: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 270

5.1.1.13.4.2. Host Channel Device Type Identifier The Host Channel Device Type Identifier is defined as:

5.1.1.13.5. FRU Group Qualifiers For Controller Drive Interface Group (Code 0x02) FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which drive channel is reporting the failed component. The least significant byte provides the device type and state of the device being reported.

5.1.1.13.5.1. Drive Channel MSB Format:

* = Reserved for parallel SCSI

5.1.1.13.5.2. Mini-Hub Port The Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the Mini-Hub port is irrelevant, port 0 is specified.

Page 271: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 271

5.1.1.13.5.3. Drive Channel Number Drive Channel Number indicates which channel. Values 1 through 6 are valid.

5.1.1.13.5.4. Drive Channel LSB Format Drive Channel LSB Format (Not used on parallel SCSI)

5.1.1.13.5.41. Drive Interface Channel Device State Drive Interface Channel Device State is defined as:

5.1.1.13.5.42. Host Channel Device Type Identifier Host Channel Device Type Identifier is defined as

5.1.1.13.6. FRU Group Qualifiers For The Subsystem Group (Code 0x06) FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which primary component fault line is reporting the failed component. The information returned depends on the configuration set up by the user. For more information, see OLBS 349-1059780, External NVSRAM Specification for Software Release 7.10. The least significant byte provides the device type and state of the device being reported. The format for the least significant byte is the same as Byte 27 of the FRU Group Qualifier for the Sub-Enclosure Group (0x08).

Page 272: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 272

5.1.1.13.7. FRU Group Qualifiers For The Sub-Enclosure Group (Code 0x08) FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which enclosure identifier is reporting the failed component. The least significant byte provides the device type and state of the device being reported. Statuses are reported such that the first enclosure for each channel is reported, followed by the second enclosure for each channel.

5.1.1.13.7.1. Sub-Enclosure MSB Format:

5.1.1.13.7.11. Tray Identifier Enable (TIE) Bit When the Tray Identifier Enable (TIE) bit is set to 01b, the Sub-Enclosure Identifier field provides the tray identifier for the sub-enclosure being described.

5.1.1.13.7.12. Sub-Enclosure Identifier When set to 00b, the Sub-Enclosure Identifier is defined as

5.1.1.13.7.2. Sub-Enclosure LSB Format

5.1.1.13.7.21. Sub-Enclosure Device State

Page 273: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 273

The Sub-Enclosure Device State is defined as:

5.1.1.13.7.22. Sub-Enclosure Device Type Identifier

The Sub-Enclosure Device Type Identifier is defined as

5.1.1.13.8. FRU Group Qualifiers For The Redundant Controller Group (Code 0x09)

Page 274: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 274

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which tray contains the failed controller. The least significant byte indicates the failed controller within the tray.

5.1.1.13.8.1. Redundant Controller MSB Format:

5.1.1.13.8.2. Redundant Controller LSB Format:

5.1.1.13.8.21. Controller Number Field The Controller Number field is defined as:

5.1.1.13.9. FRU Group Qualifiers For The Drive Group (Code 0x10 – 0xFF) FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates the tray number of the affected drive. The least significant byte indicates the drive’s physical slot within the drive tray indicated in byte 26.

5.1.1.13.9.1. Drive Group MSB Format:

Page 275: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 275

5.1.1.13.9.2. Drive Group LSB Format:

5. 1. 1. 14. Error Specific Information - Bytes 34-36 This field provides information read from the array controller VLSI chips and other sources. It is intended primarily for development testing, and the contents are not specified.

5. 1. 1. 15. Error Detection Point - Bytes 37-40 The error detection point field will indicate where in the software the error was detected. It is intended primarily for development testing, and the contents are not specified.

5. 1. 1. 16. Original CDB - Bytes 41-50 This field contains the original Command Descriptor Block received from the host.

5. 1. 1. 17. Reserved - Byte 51

5. 1. 1. 18. Host Descriptor - Bytes 52-53 This bit-position field provides information about the host. Definitions are given below.

Page 276: Storage_Diagnostics_and_Troubleshooting_Guide

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 276

5. 1. 1. 19. Controller Serial Number - Bytes 54-69 This sixteen-byte field contains the manufacturing identification of the array hardware. Bytes of this field are identical to the information returned by the Unit Serial Number page in the Inquiry Vital Product Data.

5. 1. 1. 20. Array Software Revision - Bytes 70-73 The Array Application Software Revision Level matches that returned by an Inquiry command.

5. 1. 1. 21. LUN Number - Byte 75 The LUN number field is the logical unit number in the Identify message received from the host after selection.

5. 1. 1. 22. LUN Status - Byte 76 This field indicates the status of the LUN. Its contents are defined in the logical array page description in the Mode Parameters section of this specification, except for the value of 0xFF, which is unique to this field. A value of 0xFF returned in this byte indicates the LUN is undefined or is currently unavailable (reported at Start of Day before the LUN state is known).

5. 1. 1. 23. Host ID - Bytes 77-78 The host ID is the SCSI ID of the host that selected the array controller for execution of this command.

5.1.1.24. Drive Software Revision - Bytes 79-82 This field contains the software revision level of the drive involved in the error if the error was a drive error and the controller was able to retrieve the information.

5.1.1.25. Drive Product ID - Bytes 83-98 This field identifies the Product ID of the drive involved in the error if the error was a drive error and the controller was able to determine this information. This information is obtained from the drive Inquiry command.
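The sketch below (illustrative only, not from the specification) gathers the LUN, host, and drive identification fields described above into a single dictionary. It assumes zero-based byte offsets; the big-endian ordering of the two host ID bytes is an assumption.

def decode_lun_and_drive_info(sense: bytes) -> dict:
    lun_status = sense[76]
    return {
        "lun_number": sense[75],
        # 0xFF means the LUN is undefined or not yet available at Start of Day.
        "lun_status": "undefined/unavailable" if lun_status == 0xFF else lun_status,
        "host_id": (sense[77] << 8) | sense[78],          # byte order is an assumption
        "drive_sw_revision": sense[79:83].decode("ascii", errors="replace").strip(),
        "drive_product_id": sense[83:99].decode("ascii", errors="replace").strip(),
    }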

5.1.1.26. Array Power-up Status - Bytes 99-100 In this release of the software, these bytes are always set to zero.

5.1.1.27. RAID Level - Byte 101


This byte indicates the configured RAID level for the logical unit returning the sense data. The values that can be returned are 0, 1, 3, 5, or 255. A value of 255 indicates that the LUN RAID level is undefined.
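A one-line helper (illustrative, assuming zero-based offsets) can turn this byte into a readable label:

def raid_level(sense: bytes) -> str:
    value = sense[101]                    # 0, 1, 3, 5, or 255
    return "undefined" if value == 255 else f"RAID {value}"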

5.1.1.28. Drive Sense Identifier - Bytes 102-103 These bytes identify the source of the sense block returned in the next field. Byte 102 identifies the channel and ID of the drive. Refer to the FRU group codes for physical drive ID assignments. Byte 103 is reserved for identification of a drive logical unit in future implementations and it is always set to zero in this release.

5.1.1.29. Drive Sense Data - Bytes 104-135 For drive detected errors, these fields contain the data returned by the drive in response to the Request Sense command from the array controller. If multiple drive errors occur during the transfer, the sense data from the last error will be returned.
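The following sketch (not part of the specification) pulls the reporting drive's channel/ID and the embedded 32-byte drive sense block out of the array sense data. The sense key, ASC, and ASCQ offsets within that block assume the drive returned standard fixed-format SCSI sense data; if the drive uses a different sense format, those offsets do not apply.

def embedded_drive_sense(sense: bytes) -> dict:
    channel_id = sense[102]              # byte 102: channel and ID of the reporting drive
    drive_sense = sense[104:136]         # bytes 104-135: sense data from the drive itself
    return {
        "drive_channel_id": channel_id,
        "sense_key": drive_sense[2] & 0x0F,   # fixed-format SCSI sense assumed
        "asc": drive_sense[12],
        "ascq": drive_sense[13],
    }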

5.1.1.30. Sequence Number - Bytes 136-139 This field contains the controller’s internal sequence number for the IO request.

5.1.1.31. Date and Time Stamp - Bytes 140-155 The 16 ASCII characters in this field consist of three spaces followed by the month, day, year, hour, minute, and second at which the error occurred, in the format MMDDYY/HHMMSS.
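A minimal sketch for these two fields follows, assuming zero-based byte offsets; the big-endian byte order of the sequence number is an assumption, since the specification text above does not state it.

from datetime import datetime

def sequence_number(sense: bytes) -> int:
    # Bytes 136-139: controller's internal sequence number for the IO request.
    return int.from_bytes(sense[136:140], "big")   # byte order is an assumption

def error_timestamp(sense: bytes) -> datetime:
    # Bytes 140-155: three leading spaces followed by MMDDYY/HHMMSS, e.g. "   070408/153012".
    stamp = sense[140:156].decode("ascii").strip()
    return datetime.strptime(stamp, "%m%d%y/%H%M%S")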

5.1.1.32. Reserved - Bytes 156-159


Appendix I – Chapter 11 – Sense Codes

Chapter 11: Sense Codes

11.1. Sense Keys


11.2. Additional Sense Codes and Qualifiers This section lists the Additional Sense Code (ASC) and Additional Sense Code Qualifier (ASCQ) values returned by the array controller in the sense data. SCSI-2 defined codes are used when possible. Array-specific error codes are used when necessary and are assigned SCSI-2 vendor-unique codes 0x80-0xFF. More detailed sense key information may be obtained from the array controller command descriptions or the SCSI-2 standard. Codes defined by SCSI-2 and the array vendor-specific codes are shown below. The most probable sense keys (listed below for reference) returned for each error are also listed in the table. A sense key enclosed in parentheses in the table indicates that the sense key is determined by the value in byte 0x0A. See Section .
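In practice, the ASC/ASCQ tables in this chapter can be turned into a simple lookup, as in the sketch below. The two table entries shown are illustrative placeholders only (0x29/0x00 is a standard SCSI-2 code; the 0x80/0x00 entry is hypothetical); populate the dictionary from the chapter's listings.

ASC_ASCQ_DESCRIPTIONS = {
    # (asc, ascq): description -- fill in from the tables in this chapter.
    (0x29, 0x00): "Power on, reset, or bus device reset occurred (SCSI-2 defined)",
    (0x80, 0x00): "Vendor-unique array error (0x80-0xFF range)",   # hypothetical entry
}

def describe_asc_ascq(asc: int, ascq: int) -> str:
    return ASC_ASCQ_DESCRIPTIONS.get((asc, ascq), f"Unlisted code {asc:02X}/{ascq:02X}")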
