nas bmc-rlm

RDS Technical Training - All about the N-series RLM and BMC Modules. Norman Bogard, Americas N series ATS; Steve Lawler, NetApp Technical Marketing Engineer

Upload: lucien-van-remoortere

Post on 27-Oct-2014



DESCRIPTION

NAS BMC - RLM OPTIONS

TRANSCRIPT

Page 1: NAS BMC-RLM

RDS Technical Training - All about the N-series RLM and BMC Modules

Norman Bogard, Americas N series ATS

Steve Lawler, NetApp Technical Marketing Engineer

Page 2: NAS BMC-RLM

© 2009 NetApp. All rights reserved.

Agenda

What are the RLM and BMC modules?
Differences between the RLM and the BMC
Configuration
Communications mechanisms
Inter-controller dialogue
Hardware-assisted takeover

Page 3: NAS BMC-RLM


What are the RLM and BMC modules?

Enables remote management of storage system irrespective of controller state
– Thorough, flexible, simple to operate
– Make appliances more robust

Reduce total cost of ownership

– Centralized administration

– Allow easier deployment and administration of appliances in remote locations

Enterprise customers expect remote platform management

Page 4: NAS BMC-RLM


Differences between RLM and BMC

BMC (Baseboard Management Controller):
– Incorporated on N3000 series controllers

RLM (Remote LAN Module):
– Incorporated on N5000/6000/7000 series controllers

Functionally equivalent
User access process varies slightly

Page 5: NAS BMC-RLM


Benefits and Solution Design

Robust remote platform management solution:
– Remote console access
– TCP/IP over Ethernet
– SSH for secure connectivity

Hardware integration on current controllers
No additional hardware or connectivity required
Leverages existing data center infrastructure

Eliminates separate remote support infrastructure

Page 6: NAS BMC-RLM


Topology – Enterprise Data Center

[Figure: topology diagram. Labels: Support, CS Tools and DB, Firewall (x2), SSL, The Internet, Gateway, Private Mgmt LAN, Customer LAN, Customer Data Center, RLM, BMC, SSH, Operations Manager, Remote CLI/Console access.]

Page 7: NAS BMC-RLM


Features

Secure network interface
Console pass-through
Remote power cycle
Down filer notification
Remote diagnosis of failures
Remote reset
Remotely initiate coredumps
Capture console logs
Access to HW event logs
ZAPI interface for DFM/RAM
SNMP
Remote GDB
Platform independent SW extensibility

Page 8: NAS BMC-RLM


Remote Platform Management

Remote Platform Management “built-into” the Appliance

– Remote power control
– Remote console access
  Data ONTAP CLI, firmware and system diagnostics
  Secure network interface (SSH)
– Call home - down filer notifications
– Initiate core-dump (CPU NMI Interrupt)
– Access to system logs from a down appliance
  Non-volatile HW system event logs
  Captured console logs
  Software events

Page 9: NAS BMC-RLM


External LAN Interface

TCP/IP connection over a physical layer
– 10/100Mb Ethernet
– Dedicated LAN port for RLM/BMC
– Allows management LAN for physical security

Secure connection to clients
– SSH protocol
– UserIDs, passwords, keys etc. managed through Data ONTAP
– Logging and Auditing

Multiple services
– Appliance console redirection
– RLM/BMC CLI
– GDB over Ethernet
– ZAPI
– Alerts – SMTP, SNMP

Page 10: NAS BMC-RLM


Data ONTAP and Controller Integration

Management through Data ONTAP
– Install and Configuration
– Firmware Update
– Provides direct access to hardware
– Works even when controller is off, hung or inoperative

Customer Interface using SSH
– Multiple ports and SSH services
  Appliance console redirection
  GDB connection to appliance
  RLM CLI

Extensible, field upgradeable SW architecture
– Integration with NetApp support model

Page 11: NAS BMC-RLM


Configuration

What information is needed?
– Decide if DHCP or static addressing will be used
  DHCP
  – Tie MAC address in DHCP server
    MAC address from FRU
    MAC address from toaster> rlm status
  Static IP address
  – IP address
  – Netmask of network
  – Gateway (GW) of network
– Mailhost address
  SMTP (email server) used by RLM to send ASUPs
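The three static-addressing values have to agree with one another: the gateway must sit inside the subnet defined by the IP address and netmask. The following is a minimal Python sketch of that sanity check, using the address values from the rlm status example in this deck. The function name gateway_is_reachable is illustrative only and is not part of Data ONTAP or the RLM.

```python
import ipaddress


def gateway_is_reachable(ip: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway lies in the subnet defined by ip/netmask."""
    # strict=False allows a host address to be passed instead of the network address.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    # Values taken from the rlm status example in this deck.
    print(gateway_is_reachable("172.22.136.64", "255.255.224.0", "172.22.128.1"))
```

Running the check before entering the values in rlm setup can catch a mistyped netmask or gateway early.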

Page 12: NAS BMC-RLM


Configuring the RLM using Data ONTAP

There are 3 ways to configure an RLM using Data ONTAP:
– Initial appliance setup
  Zeros appliance’s file system and sets up appliance including the RLM
– toaster> setup
  Reconfigures appliance and RLM without zeroing file system
– toaster> rlm setup
  Just configures the RLM

Page 13: NAS BMC-RLM


RLM – Testing Autosupport

To test RLM’s Autosupport
– toaster> rlm test autosupport

Provided that AutoSupport has been properly configured you should soon receive RLM’s ASUP message

Page 14: NAS BMC-RLM


RLM – Updating RLM Firmware

RLM on NOW site
– http://now.netapp.com/NOW/download/tools/rlm_fw/
– Latest firmware and instructions on NOW site
– Changes to update instructions posted on NOW site as relevant

RLM firmware can be updated in 2 ways
– Data ONTAP CLI
– RLM CLI

Page 15: NAS BMC-RLM


RLM Firmware Update using ONTAP

Use software command to get RLM firmware (RLM_FW.zip)
– toaster> software install http://webserver/path/RLM_FW.zip -f

Update the RLM
– toaster> rlm update

Page 16: NAS BMC-RLM


RLM Firmware Update from RLM CLI

Install RLM firmware image (RLM_FW.tar.gz)
– RLM toaster> rlm update http://webserver_ip_address/path/RLM_FW.tar.gz

webserver_ip_address is the IP address of the web server on a network accessible to your appliance

Page 17: NAS BMC-RLM


How To Connect To The Module?

Must access CLI securely
Why? The network or Internet is between customer and filer
SSH Only: Telnet not supported
– Telnet disabled by default in Data ONTAP 8
Users in group ‘Administrators’ allowed access to RLM
For security, logging in as “root” not allowed at RLM
Login as user ‘naroot’ on the RLM when using root credentials (password)
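The login rules above can be captured in a small helper that builds the SSH command line. This is a sketch only: the function name rlm_ssh_argv is hypothetical, the host address is borrowed from the rlm status example in this deck, and it assumes a standard OpenSSH client named ssh is on the PATH.

```python
from typing import List


def rlm_ssh_argv(host: str, user: str = "naroot") -> List[str]:
    """Build the argv for an interactive SSH login to the RLM."""
    if user == "root":
        # Per the slide, logging in as "root" is not allowed at the RLM.
        raise ValueError('log in as "naroot" instead of "root"')
    return ["ssh", f"{user}@{host}"]


if __name__ == "__main__":
    print(" ".join(rlm_ssh_argv("172.22.136.64")))
```

The returned list can be passed to subprocess.call to open the session; the password prompt is then handled by the SSH client itself.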

Page 18: NAS BMC-RLM


RLM Commands

RLM toaster> ?
– date
– exit
– events
– help
– priv
– rlm
– system
– version

RLM toaster> system
– system console - connect to the system console
– system core - dump the system core and reset
– system log - print system console logs
– system power - commands controlling system power
– system reset - reset the system using the selected firmware

RLM toaster> rlm
– rlm reboot - reboot the RLM
– rlm sensors - print RLM environmental sensors status
– rlm status - print RLM status
– rlm update - update RLM firmware

Page 19: NAS BMC-RLM


BMC Commands

– help - Display a list of BMC commands.
– reboot - The reboot command forces the BMC to reboot itself and perform a self-test. If your console connection is through the BMC it will be dropped.
– setup - Interactively configure the BMC local-area network (LAN) settings.
– status - Display the current status of the BMC.
– test autosupport - Test the BMC autosupport by commanding the BMC to send a test autosupport to all autosupport email addresses in the option lists autosupport.to, autosupport.noteto, and autosupport.support.to.

Page 20: NAS BMC-RLM


Troubleshooting

Scenario #1 - System Down / Hung / Reboot Loop

RLM will have sent ASUP. Console logs:

RLM toaster> system log
*************************************************** Log Starts ***************************************************
Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13

CPU= AMD Opteron(tm) Processor 852 X 4
Testing RAM.
512MB RAM tested
32768MB RAM installed
Fixed Disk 0: SMART ATA Flash Disk
New event log messages, please check the event log
ERROR
0251: System CMOS checksum bad - Default configuration used

Boot Loader version 1.0X5
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2005 Network Appliance Inc.

CPU Type: AMD Opteron(tm) Processor 852
BIOS POST Failure(s) detected. Abort AUTOBOOT

Page 21: NAS BMC-RLM


Troubleshooting

Scenario #2 - Obtain Console Access

RLM provides remote controller console login

RLM toaster> system console
Type Ctrl-D to exit.

LOADER> version

Variable Name        Value
-------------------- --------------------------------------------------
BIOS_VERSION         1.0X13
LOADER_VERSION       1.0X5

LOADER> boot_ontap
Loader:elf64 Filesys:fat Dev:ide0.0 File:X86_64/kernel/primary.krn Options:(null)
Loading: 0x200000/40125488 0x2844430/42433840 0x50bc160/1929773 0x529338d/3 Entry at 0x00202008
Starting program at 0x00202008

[...]
toaster> sysconfig -v
[...]

Page 22: NAS BMC-RLM


Troubleshooting

Scenario #3 - Power Cycle

On the RLM console

RLM toaster> system power cycle
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

toaster>

Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13

CPU= AMD Opteron(tm) Processor 852 X 4
Testing RAM.
512MB RAM tested
32768MB RAM installed
Fixed Disk 0: SMART ATA Flash Disk

Page 23: NAS BMC-RLM


Troubleshooting

Scenario #4 - Corrupt Motherboard Firmware

On the RLM console

RLM toaster> system reset backup
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

LOADER> update_flash
** DO NOT TURN OFF YOUR MACHINE UNTIL THE FLASH UPDATE COMPLETES!! **
Programming... [accidentally power off the machine here]

Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13

Page 24: NAS BMC-RLM


Troubleshooting

Scenario #5 - Remotely Generate a Core

On the RLM console

RLM toaster> system core
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

toaster>

PANIC: RLM NMI .. dumping core! in process idle_thread2 on release

NetApp Release mainN_051023_2300 on Wed Oct 26 18:37:38 GMT 2005

version: NetApp Release mainN_051023_2300: Mon Oct 24 04:28:07 PDT 2005

cc flags: 2

DUMPCORE: START

Dumping to disks: 0d.55 0d.48

....................................

Page 25: NAS BMC-RLM


Troubleshooting

Scenario #6 - System Down with HW/FW Problem

events command

RLM toaster> events all
Record 1: [...]
Record 89: Wed Oct 26 19:46:39 2005 [Agent Event.normal]: FIFO 0x4042 – Agent Excelsior, PCIE_RESET deasserted.
Record 90: Wed Oct 26 19:46:39 2005 [Agent Event.normal]: FIFO 0x4043 – Agent Excelsior, FC_RESET deasserted.
Record 91: Wed Oct 26 19:46:57 2005 [Excelsior BIOS.warning]: POST error 0x0051: ERR_CMOS_CHECKSUM

system sensors command

RLM toaster> priv set advanced
RLM toaster*> system sensors
Sensor Sensor     Sensor Current
ID     Name       State  Value
====== ========   ====== =======
0x001  POW1_FAIL  good   D
0x002  POW2_FAIL  good   D
0x003  P0_THRMTRP BAD    A
0x004  P1_THRMTRP good   D
0x005  P2_THRMTRP good   D
...
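When the sensor table is long, the failing entries can be picked out programmatically. The following Python sketch parses captured system sensors output and lists every sensor whose state is not "good". It assumes the four-column row layout shown in the sample above, and the function name bad_sensors is illustrative.

```python
SAMPLE = """\
0x001 POW1_FAIL good D
0x002 POW2_FAIL good D
0x003 P0_THRMTRP BAD A
0x004 P1_THRMTRP good D
0x005 P2_THRMTRP good D
"""


def bad_sensors(text: str) -> list:
    """Return the names of sensors whose state column is not 'good'."""
    failing = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four columns and start with a hex sensor ID;
        # header and separator lines are skipped by this test.
        if len(fields) == 4 and fields[0].startswith("0x"):
            _sensor_id, name, state, _value = fields
            if state.lower() != "good":
                failing.append(name)
    return failing


if __name__ == "__main__":
    print(bad_sensors(SAMPLE))
```

For the sample rows this reports P0_THRMTRP, the thermal trip sensor flagged BAD in the slide.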

Page 26: NAS BMC-RLM


Summary: Using the RLM/BMC CLI for troubleshooting

If you need controller console access
– RLM toaster> system console

If you need controller console log
– RLM toaster> system log

If controller is hanging / unresponsive
– RLM toaster> system core
– RLM toaster> system reset
– RLM toaster> system power cycle

If FW can’t boot
– RLM toaster> priv set diag
– RLM toaster*> system debug_port
– RLM toaster> system reset backup

Find out why controller is misbehaving
– RLM toaster> events all
– RLM toaster> priv set advanced
– RLM toaster*> system sensors
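The summary above is effectively a lookup from symptom to RLM commands, which could be kept as a small table in a runbook script. In this sketch the symptom keys are labels chosen here for illustration; only the command strings come from the slide.

```python
# Symptom labels are illustrative; the commands are those listed in the summary.
TROUBLESHOOTING = {
    "console access": ["system console"],
    "console log": ["system log"],
    "hung or unresponsive": ["system core", "system reset", "system power cycle"],
    "firmware cannot boot": ["priv set diag", "system debug_port", "system reset backup"],
    "find the cause": ["events all", "priv set advanced", "system sensors"],
}


def commands_for(symptom: str) -> list:
    """Return the RLM CLI commands suggested for a symptom, or an empty list."""
    return TROUBLESHOOTING.get(symptom, [])


if __name__ == "__main__":
    for command in commands_for("hung or unresponsive"):
        print("RLM toaster>", command)
```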

Page 27: NAS BMC-RLM


RLM Status

RLM status can be obtained in two ways
– From Data ONTAP console:
  toaster> rlm status
  Just shows rlm information
  toaster> sysconfig
  Show appliance and rlm status.

Page 28: NAS BMC-RLM


Example - RLM status

Output from rlm status

toaster> rlm status
Remote LAN Module Status: Online
Part Number: 110-00030
Revision: B0
Serial Number: 304926
Firmware Version: 1.2
Mgmt MAC Address: 00:A0:98:01:9A:86
Using DHCP: no
IP Address: 172.22.136.64
Netmask: 255.255.224.0
Gateway: 172.22.128.1
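Because the status output is a list of "name: value" lines, it is straightforward to turn into a dictionary for scripting, for example to read the MAC address needed for a DHCP reservation. The following Python sketch assumes the output format shown above; parse_rlm_status is an illustrative name, not an existing tool.

```python
STATUS_OUTPUT = """\
Remote LAN Module Status: Online
Part Number: 110-00030
Revision: B0
Serial Number: 304926
Firmware Version: 1.2
Mgmt MAC Address: 00:A0:98:01:9A:86
Using DHCP: no
IP Address: 172.22.136.64
Netmask: 255.255.224.0
Gateway: 172.22.128.1
"""


def parse_rlm_status(text: str) -> dict:
    """Parse 'name: value' lines from rlm status output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so the colons inside the MAC address survive.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    status = parse_rlm_status(STATUS_OUTPUT)
    print(status["Mgmt MAC Address"], status["IP Address"])
```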

Page 29: NAS BMC-RLM


RLM status (via sysconfig)

sysconfig will not show RLM IP address information unless options.autosupport.content == complete

Site specific information for RLM in sysconfig keeps in line with current Autosupport policies.

Page 30: NAS BMC-RLM


RLM - System Console Access (Redirection)

RLM toaster>

RLM toaster> system console

Type Ctrl-D to exit.

Password:

Thu Nov 10 06:11:45 GMT [rlm_console_login_m:info]: root logged in from RLM

toaster*>

(Ctrl-D)

RLM toaster>

Page 31: NAS BMC-RLM


EMS Error Messages for RLM errors

Data ONTAP generates EMS messages for RLM errors
– Hourly status monitoring of RLM fails
– Mailhost not setup correctly for AutoSupport
– Network Configuration of RLM failed
– Firmware Update errors
– Heartbeat from RLM
  Stopped
  Resumed
  Booted from backup
– Data ONTAP – RLM communication errors
– Errors sending userid/password information to RLM

Error Messages and Troubleshooting Guide
– Describes RLM EMS Error messages
– Provides corrective actions

Page 32: NAS BMC-RLM


RLM generated ‘Down-Controller’ ASUPs

RLM continuously monitors the System Health
– Firmware POST Errors
– Boot failures
– Heartbeat from Data ONTAP
– Data ONTAP abnormal reboots
– Watchdog resets
– Hardware errors
– User initiated reboots/power-cycles/NMI

When the system goes down or fails to boot
– RLM generates ‘Down-Controller’ AutoSupport email

Page 33: NAS BMC-RLM


Remote Support Diagnostics Tool

[Figure: remote support diagram. Labels: IBM/NetApp Support, Remote Support, Data Repository, Firewall (x2), HTTPS, Internet, Customer.]

RLM v3.0
Secured access model
Nondisruptive upgrade
Functional even when appliance is down
Appliance down notification
Optimized CORE handling
Remote data collection
Trigger AutoSupport on-demand

Page 34: NAS BMC-RLM


Hardware Assisted Takeover

Slow Node Failure Detection: Results in Long Takeover Time

HA systems take over partner workload after failure detection

Legacy failure detection is a slow process
– Partners in a cluster use heartbeat mechanism to determine failover
  This heartbeat over the IB link is SW driven
  Partner waits up to 15 seconds to avoid premature takeovers due to Data ONTAP scheduling issues
  – cf.takeover.detection.seconds

Page 35: NAS BMC-RLM


Hardware Assisted Takeover

Key Features

Predictable failure detection time
Platform independent configuration
Secure alerting mechanism
– Prevent replay attacks
Native diagnosability
– Continuous runtime diagnosis
Customer-initiated test mechanisms
Leverage existing infrastructure
– Based on standard SNMP v1 Traps

Page 36: NAS BMC-RLM


Hardware Assisted Takeover

Using Out-of-Band Hardware Alerting Mechanism

Out-of-band hardware-based failure detection
– Predictable failure detection time for a class of failures
  Detection time reduced from 15 seconds to less than 3 seconds (RLM detection and reporting takes ~20ms)

Leverageable across product portfolio
– Based on standard SNMP Traps

Does not replace the existing HA mechanism
– Optimization for hardware-assist detected failures

Page 37: NAS BMC-RLM


Hardware Assisted Takeover: Separate Storage Controllers

[Figure: diagram of Controller 1 and Controller 2. Labels: Data ONTAP, hwassist, Enet, Gig-E, Network, InfiniBand Interconnect.]

Page 38: NAS BMC-RLM


Hardware Assisted Takeover Benefits

First introduced with release of Data ONTAP 7.3

Speeds takeover in the event of:
– Abnormal system reboot (aka panic)
– System reset due to watchdog timeout
– System power off, power cycle, or reset of the partner
– System POST error during boot
– Complete loss of power to the partner
– Environmental shutdown conditions

Takeover not expedited when:
– Operator-initiated halt of the partner - already at minimum latency via cluster interconnect
– 'Busy-Hung' of partner, where it continues to service its watchdog

Page 39: NAS BMC-RLM


Hardware Assisted Takeover: Data ONTAP Commands and Options

The following customer-visible options are supported
1. options cf.hw_assist.enable
   To enable/disable hwassist on partner
2. options cf.hw_assist.partner.address
   To configure partner IP address on which alerts will be sent by RLM.
3. options cf.hw_assist.partner.port
   To configure partner UDP port on which alerts will be sent by RLM.

The following hidden options are supported
1. options cf.hw_assist.health_check_interval
   Interval in secs to send periodic keep alive alerts.
2. options cf.hw_assist.retry_count
   Number of times each hardware assist alert is sent.
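As an illustration of how the three customer-visible options fit together, the following Python sketch generates the option lines for one controller. The option names come from the slide; the "on" value for the enable option, the partner address and the port number are assumptions made for this example, and the function name is hypothetical. Check the Data ONTAP documentation for the values that apply to your release.

```python
import ipaddress


def hw_assist_option_lines(partner_ip: str, partner_port: int) -> list:
    """Return Data ONTAP option lines that configure hwassist (illustrative)."""
    ipaddress.ip_address(partner_ip)  # raises ValueError on a malformed address
    if not 1 <= partner_port <= 65535:
        raise ValueError("UDP port must be between 1 and 65535")
    return [
        "options cf.hw_assist.enable on",  # "on" is an assumed value syntax
        f"options cf.hw_assist.partner.address {partner_ip}",
        f"options cf.hw_assist.partner.port {partner_port}",
    ]


if __name__ == "__main__":
    # Placeholder address and port, chosen only for the example.
    for line in hw_assist_option_lines("172.22.136.65", 4444):
        print("toaster>", line)
```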

Page 40: NAS BMC-RLM


Hardware Assisted Takeover: Data ONTAP Commands and Options

The following commands are supported in advanced mode
1. cf hw_assist status
   Command is used to get latest status of hwassist feature. If hwassist is active it will print port and IP address on which hwassist is listening for traps. If hwassist is inactive it will print the reason with a possible solution.
2. cf hw_assist test
   Command is used to test send/recv path of hwassist alerts between clustered filers.
3. cf hw_assist stats
   Command will print detailed information of all hwassist alerts received by the filer.
4. cf hw_assist stats clear
   Command will clear information of all hwassist alerts received by the filer.

The following commands are supported in test mode
5. cf test_hw_assist get ss
   Command will print current shared secret for local as well as partner node.
6. cf test_hw_assist update ss
   Command will update shared secret for local node.

Page 41: NAS BMC-RLM


Hardware Assisted Takeover

Single Management Port; Integrated HWAssist

[Figure: diagram of two controllers sharing a backplane. Labels per controller: Data ONTAP, Agent, SIO, RLM, Enet, 10/100 Enet, Gig-E, Switch, RJ-45.]

Page 42: NAS BMC-RLM


Hardware Assisted Takeover Alerting Mechanism using SNMP Traps

UDP message formatted as SNMP trap
Multiple trap messages based on configuration settings
Resumes working in the event of reboot of the hwassist during the uptime of the filer
Backward compatible with Data ONTAP kernels that do not support this feature
Extensible data format for future improvements
On an N6xxx system, the IP address specified in the cf.hw_assist.partner.address option should specify the partner's e0M interface. (The e0M interface is dedicated to Data ONTAP management activities.)

Page 43: NAS BMC-RLM


Hardware Assisted Takeover

Basic Design – Flow of Events

RLM/BMC on Failed Controller (Downfiler)
– Detects a failure event at its monitored controller
– Triggers an alert to partner controller (Data ONTAP)
– Alert message identifies cause of failure
– Alert message sent via UDP (in SNMP Trap format)

CFO software on partner controller
– Receives RLM/BMC alert (UDP packet in SNMP Trap format)
– Applies policy to received alert
– Initiates takeover if warranted

Estimated failure detection time savings
– RLM/BMC: ~20ms to detect event and send alert
– CFO: <1 to 3 seconds to process RLM/BMC alert
– Detection time reduced by >10 sec for RLM/BMC detected failures
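The ">10 sec" figure follows from the numbers quoted in this deck. The short Python sketch below works through the arithmetic, taking the worst case on both sides: the full 15-second legacy heartbeat wait, against ~20 ms of RLM/BMC detection plus the upper end of the CFO processing range. The constant and function names are illustrative.

```python
LEGACY_DETECTION_S = 15.0      # upper bound of the legacy heartbeat wait
RLM_DETECT_AND_SEND_S = 0.020  # ~20 ms for the RLM/BMC to detect and alert
CFO_PROCESSING_S = 3.0         # upper end of the "<1 to 3 seconds" range


def worst_case_saving() -> float:
    """Seconds saved when hwassist replaces a full legacy detection wait."""
    hw_assisted = RLM_DETECT_AND_SEND_S + CFO_PROCESSING_S
    return LEGACY_DETECTION_S - hw_assisted


if __name__ == "__main__":
    print(f"saving: {worst_case_saving():.2f} s")
```

Even with CFO processing at its slowest, the saving is about 11.98 seconds, consistent with the slide's claim of more than 10 seconds.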

Page 44: NAS BMC-RLM


Example of SNMP v1 Trap:

2006-06-23 11:02:16 or-196-rlm.lab.netapp.com [172.22.136.196] (via 172.22.136.196) TRAP, SNMP v1, community public
iso.3.6.1.4.1.789 Enterprise Specific Trap (536) Uptime: 0:00:01.90
iso.3.6.1.4.1.789.1.1.12.0 = STRING: "Remote Management Event: type=system_down, severity=notice, event=power_cycle_via_rlm, ss=ABCDE56789, system_id=0118044518"
iso.3.6.1.4.1.789.1.1.9.0 = STRING: "12345678

Where:
– iso.3.6.1.4.1.789: the NetApp enterprise OID
– iso.3.6.1.4.1.789.1.1.12.0: OID used for the variable field that will contain the trap-specific info
– iso.3.6.1.4.1.789.1.1.9.0: OID with product serial
– type: type of event, i.e. system_down, system_up, keep_alive, test
– severity: one of alert, warning, notice, normal, info, debug
– event: post_error, abnormal_reboot, l2_watchdog_timeout etc.
– ss: shared secret key (all 0's if there is no key; there is no key for periodic and test types)
– system_id: system id of the system from which the trap is sent
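The trap-specific string is a comma-separated list of key=value pairs behind a fixed prefix, so it can be decoded with a few lines of code. The following Python sketch parses the payload from the example trap; parse_event is an illustrative name, and the sketch does not reflect how Data ONTAP itself decodes the alert.

```python
PAYLOAD = (
    "Remote Management Event: type=system_down, severity=notice, "
    "event=power_cycle_via_rlm, ss=ABCDE56789, system_id=0118044518"
)


def parse_event(payload: str) -> dict:
    """Split the trap-specific string into its key=value fields."""
    # Drop the "Remote Management Event" prefix, keep the comma-separated body.
    _prefix, _sep, body = payload.partition(": ")
    fields = {}
    for pair in body.split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


if __name__ == "__main__":
    event = parse_event(PAYLOAD)
    print(event["type"], event["event"], event["system_id"])
```

Values are kept as strings so that the leading zero in system_id is preserved.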

Page 45: NAS BMC-RLM


Hardware Assisted Takeover

Types of Failures Detected

Loss of power
Level 2 Watchdog Timer Reset
System POST Failures
– Firmware POST fatal errors
– Boot media corruption
Operator Initiated system down events
– Power cycle, power down or reset
Boot Timeout
Abnormal Reboots including Panics
Data ONTAP RLM heartbeat timeouts

Page 46: NAS BMC-RLM


Thank You!