Post on 15-Dec-2015


Page 1: CSM support for Blue Gene/P CSM 1.7.0 Line item 0XR Skills Transfer Materials by Marty Fullam fullam@us.ibm.com

CSM support for Blue Gene/P

CSM 1.7.0

Line item 0XR

Skills Transfer Materials

by Marty Fullam fullam@us.ibm.com

What's a Blue Gene?

It's IBM's flagship supercomputer offering. Its official name is the "IBM System Blue Gene Solution". See http://www-03.ibm.com/servers/deepcomputing/bluegene.html

It looks like this:

[Diagram labels: Front End Nodes; Service Node; File Servers; DB2; 1 Gigabit Ethernet (two links); I/O Nodes (1024); Compute Nodes (65,536); Blue Gene/L or Blue Gene/P.]

The Service Node is the administration focal point of a Blue Gene. Among other things, it maintains a DB2 database of configuration, RAS, and environmental data. The Blue Gene system administrator hangs out here.

The Front End Nodes are used for compiling and submitting Blue Gene jobs. End users hang out here.

The File Servers serve files to the other systems.

The I/O and Compute Nodes (aka the Blue Gene core) run the user jobs.

All systems are POWER systems. The Service Node, Front End Nodes, and File Servers run SLES. The I/O and Compute Nodes run custom operating systems.


CSM / Blue Gene Topology 1

Just install a CSM management server on your Blue Gene Service Node, and then add the CSM Blue Gene support. Define no CSM nodes.

Notice that there is no CSM cluster per se, just a management server. And even though the I/O and Compute Nodes of the Blue Gene core are not managed by CSM, you will still be able to monitor them.

This topology is new this release! We call it Stand-alone CSM Blue Gene monitoring support.

[Diagram: Blue Gene with CSM to monitor the Blue Gene DB. Labels: Blue Gene core (I/O and Compute Nodes); Blue Gene Service Node (CSM management server); Blue Gene Front End Nodes; Blue Gene File Servers.]


CSM / Blue Gene Topology 2

Just add CSM to the Blue Gene systems you have. Pick any system to be the management server (though it's probably most typical to use the Service Node).

Define your Service Node, Front End Nodes, and File Servers as CSM nodes.

Notice that the I/O and Compute Nodes of the Blue Gene core are not managed by CSM (mainly because they are not general-purpose Linux systems, and don't need to be burdened with CSM and RSCT software). However, as you will see, you will still be able to monitor them.

We call this topology Full CSM plus Blue Gene monitoring support.

[Diagram: Blue Gene with CSM to monitor the Blue Gene DB, and to manage the Service Node, Front End Nodes, and File Servers. Labels: CSM Cluster; Blue Gene Service Node (CSM management server, CSM managed node); Blue Gene Front End Nodes (CSM managed nodes); Blue Gene File Servers (CSM managed nodes); Blue Gene core (I/O and Compute Nodes).]

CSM / Blue Gene Topology 3

Blue Gene as part of a larger CSM cluster

Here, the management server is a system outside of the Blue Gene solution. And while the Blue Gene Service Node, Front End Nodes, and File Servers are configured as managed nodes in the CSM cluster, they are not the only managed nodes; there can be lots of others completely unrelated to the Blue Gene.

This topology, like Topology 2, is Full CSM plus Blue Gene monitoring support.

[Diagram labels: CSM Cluster; CSM management server; IBM eServer Blue Gene Solution; Blue Gene Service Node (CSM managed node); Blue Gene Front End Nodes (CSM managed nodes); Blue Gene File Servers (CSM managed nodes); Blue Gene core (I/O and Compute Nodes); three other CSM managed nodes.]

CSM support for Blue Gene

If you use Full CSM plus Blue Gene monitoring support (Topology 2 or 3 in the previous charts), use existing CSM and RSCT function to help manage the Blue Gene Service Node, Front End Nodes, and File Servers. After all, they're just SLES POWER systems. Nothing new here. Use any or all CSM function.

For Stand-alone CSM Blue Gene monitoring support (Topology 1) or Full CSM plus Blue Gene monitoring support (Topology 2 or 3), also use the optional rpm, csm.bluegene, which gives the system administrator the ability to monitor, in effect, the Blue Gene core using standard CSM monitoring capabilities (ERRM conditions and responses). Strictly speaking, what we provide is the ability to monitor the Blue Gene DB2 database, where the Service Node is continually writing RAS, configuration, and environmental data about the Blue Gene core.

csm.bluegene package

An optional part of CSM (only customers with a Blue Gene would care about it!).

Used on the CSM management server (AIX, or Linux i386 or ppc64) and, in the Full CSM plus Blue Gene monitoring support case, on the Blue Gene Service Node / CSM managed node (SLES ppc64) too, so it is present in the csm-aix-1.7.x.x and csm-linux-1.7.x.x tarballs and on the CDs.

If you want to use it, manual install is required on the management server (installp or geninstall on AIX, rpm -i on Linux), followed by these additional setup steps...

Stand-alone CSM Blue Gene monitoring support case:

Run bgsetupmon on the management server.

Full CSM plus Blue Gene monitoring support case:

On a non-SLES ppc64 management server, use the copycsmpkgs -n service_node command to copy the CSM SLES ppc64 packages from the CSM for Linux ppc64 CD (or from the expanded tarball) to the /csminstall directory. (This is not necessary on a SLES ppc64 management server because installms will have already copied the packages to the /csminstall directory.)

The Blue Gene Service Node must be configured as a CSM managed node (whether or not it is also the CSM management server), and it must have the autoupdate package installed.

The IBM.ManagedNode "Properties" attribute of the Blue Gene Service Node must include "BlueGeneNodeType|:|ServiceNode".

Then you must run bgsetupms on the management server, followed by installnode -n service_node or updatenode -n service_node. (During this install or update, SMS installs csm.bluegene on the Service Node.)
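The ordering of the setup steps for the two cases can be summarised in a short Python sketch. This is only an illustration, not part of CSM: the function and parameter names are hypothetical, the host name bgsn01 is made up, and the rule "installnode for a new Service Node, updatenode for an existing one" is an assumption about when each command applies. Only the command names and the -n flag come from the steps above. The prerequisites (managed node definition, autoupdate package, Properties attribute) are not modelled.

```python
def setup_commands(standalone, service_node="service_node",
                   ms_is_sles_ppc64=True, fresh_install=True):
    """Return the csm.bluegene setup commands, in order, as argv lists.

    Parameter names are illustrative; only the command names and the
    -n flag are taken from the slide.
    """
    if standalone:
        # Stand-alone CSM Blue Gene monitoring support: a single command.
        return [["bgsetupmon"]]

    commands = []
    if not ms_is_sles_ppc64:
        # Only needed when installms has not already copied the packages.
        commands.append(["copycsmpkgs", "-n", service_node])
    commands.append(["bgsetupms"])
    # Assumption: install a new Service Node, update an existing one.
    node_command = "installnode" if fresh_install else "updatenode"
    commands.append([node_command, "-n", service_node])
    return commands


# Example: Full CSM case, management server that is not SLES ppc64.
for argv in setup_commands(standalone=False, service_node="bgsn01",
                           ms_is_sles_ppc64=False):
    print(" ".join(argv))
```

Returning argv lists rather than running anything keeps the sketch safe to execute on a machine without CSM installed.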

What's in the csm.bluegene package? (1 of 2)

It contains management server-specific files and Service Node-specific files (even though all files get installed on both types of systems).

Management server files:

/opt/csm/bin/bgsetupmon - the end-user command used to set up Stand-alone CSM Blue Gene monitoring support on the management server.

/opt/csm/bin/bgsetupms - the end-user command used to set up Full CSM plus Blue Gene monitoring support on the management server.

/opt/csm/install/resources/bluegene.ms/IBM.Nodegroup/BlueGeneServiceNodes.pm et al - a set of predefined nodegroups created when bgsetupms calls mkresources.

/opt/csm/install/resources/bluegene.ms/IBM.Condition/*l - a set of predefined ERRM conditions created when bgsetupmon or bgsetupms calls mkresources.

/opt/csm/csmbin/bgsetupsn - a post-install customization script that sets up Blue Gene support on the Service Node (in the Full CSM plus Blue Gene monitoring support case only). It runs on the Service Node (via a mount of the management server's /csminstall directory) when installnode -n service_node or updatenode -n service_node is run. It gets called by csmfirstboot or updatenode.client, respectively. (Note: /opt/csm/csmbin is the installed location, but bgsetupms copies the script to /csminstall/csm/scripts and then creates a couple of symbolic links named 500CSM_bgsetupsn.BlueGeneServiceNodes in /csminstall/csm/scripts/update and /csminstall/csm/scripts/installpostreboot. It is one of these symbolic links that is used.)

What's in the csm.bluegene package? (2 of 2)

Service Node files:

/opt/csm/bin/bgmksensor - the end-user command used to create Blue Gene-specific sensors to monitor the Service Node's DB2 database for events of interest. (Keep in mind, though, that we do ship a number of predefined Blue Gene sensors, and they may be sufficient for all the monitoring the user cares to do. So this command is not necessarily used.)

/opt/csm/install/resources/bluegene.sn/IBM.Sensor/* - a set of predefined sensors (created when bgsetupmon or bgsetupsn calls mkresources).

/opt/csm/csmbin/bgmanage_trigger - an internally used command called by Blue Gene sensors to create or drop DB2 triggers and sequences as necessary.

/opt/csm/csmbin/bgrun_dbcmds - an internally used command called by bgmksensor and bgmanage_trigger to run db2 commands.

/opt/csm/lib/bgrefresh_sensor.so - an internally used shared library called by the DB2 stored procedure that bgrun_dbcmds creates. It uses RMC's runact-api to call Blue Gene sensors' SetValues() routine.

/opt/csm/pm/BlueGeneUtils.pm - a set of utilities used by the various scripts.

bgsetupmon

/opt/csm/bin/bgsetupmon is run on the CSM management server in the Stand-alone CSM Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.

bgsetupms

/opt/csm/bin/bgsetupms is run on the CSM management server in the Full CSM plus Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.

bgmksensor

/opt/csm/bin/bgmksensor is run on the Blue Gene Service Node, if used at all. It is used to create custom IBM.Sensor resources used in monitoring the Blue Gene database. It is a higher level command than SensorRM’s mksensor command. A comparison of their usage statements highlights how different the commands are:

mksensor [-n host] [-i seconds] [-c n] [-e 0 | 1 | 2] [-u user-ID] [-h] [-v | -V] sensor_name ["]sensor_command["]

bgmksensor -t table -o {d | i | u} [-w column[,...]] [-x "event_expression"] [-p column[,...]] [-T table] [-O {d | i | u}] [-W column[,...]] [-X "rearm_expression"] [-P column[,...]] [-h] [-v | -V] sensor_name

Think of bgmksensor as a wrapper to mksensor; both define an IBM.Sensor resource, but bgmksensor does so much more. In fact, bgmksensor hard-codes most sensor options and is more concerned with providing options related to the Blue Gene DB2 tables, operations, columns, and values that you want to monitor.
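To make the bgmksensor usage statement more concrete, here is a minimal Python sketch that assembles an argument list from its required flags (-t, -o) and the optional -x event expression. Several things are assumptions rather than documented facts: that d, i and u stand for delete, insert and update; that an event expression can be written as STATUS = 'M' (borrowed from the TBGLNode trigger in the monitoring flow charts); and the sensor name MyNodeErr, which is made up. The helper itself is hypothetical and only builds the list, it does not run the command.

```python
# Assumed meaning of the -o values: delete, insert, update.
VALID_OPERATIONS = ("d", "i", "u")


def bgmksensor_argv(sensor_name, table, operation, event_expression=None):
    """Build a bgmksensor argument list from the flags in the usage statement."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {VALID_OPERATIONS}")
    argv = ["bgmksensor", "-t", table, "-o", operation]
    if event_expression is not None:
        # The expression syntax shown in the example below is an assumption.
        argv += ["-x", event_expression]
    argv.append(sensor_name)  # sensor_name is the final positional argument
    return argv


# Hypothetical sensor watching inserts into TBGLNode with STATUS = 'M'.
example = bgmksensor_argv("MyNodeErr", "TBGLNode", "i", "STATUS = 'M'")
```

On a Service Node the resulting list could be handed to subprocess.run; building it as a list avoids shell quoting problems with the embedded quotes in the expression.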


Monitoring Overview

The Blue Gene Service Node routinely writes to its DB2 database all types of RAS, configuration, and environmental data related to the Blue Gene core (the I/O and Compute Nodes, the midplanes, the various interconnects, power supplies, fans, etc.). And this happens whether or not CSM is in the picture.

The CSM support for Blue Gene gives you a way to ‘watch’ the database for inserts, updates, and deletes that you deem important, and generate RMC events for them.

The resulting RMC events will drive the ERRM responses you specify.

The charts that follow show a monitoring flow example. Step through them to see what’s involved, and what happens when...

Monitoring Flow Example (1 of 7)

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server.]

Monitoring Flow Example (2 of 7)

Recording of Blue Gene core events in DB2 occurs continually, and occurs whether or not CSM is present.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server. Flow steps: 1. node error; 2. insert.]

Monitoring Flow Example (3 of 7)

When CSM and csm.bluegene are installed, there are various predefined Sensors, Conditions, Responses, and commands available.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server; BGNodeErr Sensor; BGNodeErr Condition (upon BGNodeErr Sensor change, is SD.Uint32 > 0?); "E-mail root anytime" Response; bgmanage_trigger; bgrefresh_sensor.so shared library.]

Monitoring Flow Example (4 of 7)

When you start monitoring a Blue Gene-related Condition with startcondresp, a number of things are created in DB2 (via bgmanage_trigger, the Command specified in the Sensor): a Trigger or two, a Sequence, and a couple of Procedures.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server; BGNodeErr Sensor; BGNodeErr Condition; "E-mail root anytime" Response; bgmanage_trigger; bgrefresh_sensor.so shared library; BGNodeErrCSMe DB2 Trigger (upon new row in TBGLNode, is STATUS = 'M'?); BGNodeErr_CSM DB2 Sequence; BGP_COMMON DB2 Procedure; BGP_COMMON_EXT DB2 Procedure.]

Monitoring Flow Example (5 of 7)

When the Blue Gene software updates a table in the database, DB2 evaluates the Triggers associated with that table.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server; BGNodeErr Sensor; BGNodeErr Condition; "E-mail root anytime" Response; bgrefresh_sensor.so shared library; BGNodeErrCSMe DB2 Trigger; BGNodeErr_CSM DB2 Sequence; BGP_COMMON DB2 Procedure; BGP_COMMON_EXT DB2 Procedure. Flow steps: 1. node error; 2. insert; 3. evaluate.]

Monitoring Flow Example (6 of 7)

In this case, the BGNodeErrCSMe Trigger evaluates 'true' and it does its thing: get the next sequence number and call BGP_COMMON. And eventually SetValues() is called to write the new sequence number into the BGNodeErr Sensor's SD.Uint32.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server; BGNodeErr Sensor; BGNodeErr Condition; "E-mail root anytime" Response; bgrefresh_sensor.so shared library; BGNodeErrCSMe DB2 Trigger (new row in TBGLNode and STATUS = 'M'); BGNodeErr_CSM DB2 Sequence; BGP_COMMON DB2 Procedure; BGP_COMMON_EXT DB2 Procedure. Flow steps: 4. get next; 5. call; 6. call; 7. call; 8. SetValues().]

Monitoring Flow Example (7 of 7)

At this point, it's business as usual for RMC. Since the BGNodeErr Sensor's SD.Uint32 > 0, the BGNodeErr Condition is satisfied and the Response occurs.

[Diagram labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server; BGNodeErr Sensor; BGNodeErr Condition; "E-mail root anytime" Response; bgrefresh_sensor.so shared library; BGNodeErrCSMe DB2 Trigger; BGNodeErr_CSM DB2 Sequence; BGP_COMMON DB2 Procedure; BGP_COMMON_EXT DB2 Procedure. Flow steps: 9. evaluate; 10. do response.]
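The ten steps of the flow example can be imitated in a few lines of Python. This is a toy model under stated assumptions, not how CSM is implemented: the Sensor class, the make_event_trigger helper, the LOCATION column, and the choice to start the sequence at 1 are all hypothetical. Only the names BGNodeErr, TBGLNode's STATUS = 'M' test, SetValues(), and the "SD.Uint32 > 0" check come from the charts.

```python
import itertools


class Sensor:
    """Stand-in for an IBM.Sensor resource; only the Uint32 value is modelled."""

    def __init__(self, name):
        self.name = name
        self.uint32 = 0

    def set_values(self, value):
        self.uint32 = value


def make_event_trigger(sensor, row_matches, respond):
    """Build a callable that plays the role of the <sensor>CSMe Trigger."""
    sequence = itertools.count(1)  # plays the <sensor>_CSM Sequence (assumed start)

    def trigger(new_row):
        if not row_matches(new_row):       # step 3: evaluate
            return
        sensor.set_values(next(sequence))  # steps 4-8: get next ... SetValues()
        if sensor.uint32 > 0:              # step 9: Condition evaluates
            respond(sensor, new_row)       # step 10: do response

    return trigger


mailbox = []  # stands in for the "E-mail root anytime" Response
sensor = Sensor("BGNodeErr")
trigger = make_event_trigger(
    sensor,
    row_matches=lambda row: row["STATUS"] == "M",
    respond=lambda s, row: mailbox.append(f"{s.name} event #{s.uint32}: {row}"),
)

trigger({"LOCATION": "node-a", "STATUS": "A"})  # ignored: STATUS is not 'M'
trigger({"LOCATION": "node-b", "STATUS": "M"})  # forwarded to the Response
trigger({"LOCATION": "node-c", "STATUS": "M"})  # forwarded again, new number
```

Because every matching row gets a fresh sequence number, the sensor value changes on each event, which is what lets the Condition fire repeatedly.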

Monitoring Details (1 of 3)

When you use bgmksensor to define a Blue Gene-related sensor, we temporarily create in the Blue Gene DB2 database the constructs required for monitoring. (By ‘constructs’ we mean the DB2 Triggers, Sequences, and Stored Procedures.) We do this to expose any errors. If we waited until you actually tried to use the defined sensor in a real monitoring situation, it would be harder to expose the errors. If DB2 gags on any of the constructs, bgmksensor reports the error(s) and creates no sensor. Whether successful or not, it deletes all DB2 constructs it created (they were temps, remember?).

When you use startcondresp to start monitoring a Blue Gene-related condition, the Command stored in the associated sensor gets run. The Command is /opt/csm/csmbin/bgmanage_trigger, and it creates the same Blue Gene DB2 database constructs that bgmksensor had created, but this time they’re not temporary. They stay defined until monitoring is stopped with stopcondresp.
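The "create temporarily, report errors, always clean up" behaviour described for bgmksensor maps naturally onto a try/finally pattern. The sketch below is an assumption-laden illustration: FakeDB2 is an in-memory stand-in invented here, and define_sensor is a hypothetical helper, not the actual bgmksensor logic.

```python
class FakeDB2:
    """In-memory stand-in for the Blue Gene DB2 database (hypothetical)."""

    def __init__(self, reject=()):
        self.constructs = set()
        self.reject = set(reject)  # construct names this fake DB2 refuses

    def create(self, name):
        if name in self.reject:
            raise RuntimeError(f"DB2 rejected {name}")
        self.constructs.add(name)

    def drop(self, name):
        self.constructs.discard(name)


def define_sensor(db, sensor_name, construct_names, sensors):
    """Trial-create the constructs; register the sensor only if all succeed."""
    created = []
    try:
        for name in construct_names:
            db.create(name)
            created.append(name)
    except RuntimeError as err:
        print(f"error: {err}; sensor {sensor_name} not created")
        return False
    else:
        sensors.add(sensor_name)
        return True
    finally:
        # Whether successful or not, the temporary constructs are removed.
        for name in created:
            db.drop(name)


db = FakeDB2(reject={"BadCSMe"})
sensors = set()
ok = define_sensor(db, "Good", ["GoodCSMe", "Good_CSM"], sensors)
failed = define_sensor(db, "Bad", ["Bad_CSM", "BadCSMe"], sensors)
```

In this model the permanent constructs would be created later by a separate step, corresponding to bgmanage_trigger running when monitoring starts.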

Monitoring Details (2 of 3)

For a given Blue Gene-related sensor, we create the following DB2 constructs:

A Trigger to watch for bgmksensor’s -x event expression. If the sensor name is BGFanTempHi, the event Trigger name is BGFanTempHiCSMe.

A Trigger to watch for bgmksensor’s -X rearm expression, if specified. If the sensor name is BGFanTempHi, the rearm Trigger name is BGFanTempHiCSMr.

A Sequence to give us a unique new number for each event or rearm forwarded from a Trigger to BGP_COMMON. If the sensor name is BGFanTempHi, the Sequence name is BGFanTempHi_CSM.

Stored Procedures named BGP_COMMON and BGP_COMMON_EXT, if we don’t already have them. (Unlike the Triggers and Sequences, these are not created on a per sensor basis; there are just the two, and they serve all sensors created.)
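The naming rules above are mechanical, so they can be captured in a small Python helper. The function name and the shape of the returned dictionary are hypothetical; the suffixes CSMe, CSMr and _CSM and the two procedure names are taken from the list above.

```python
SHARED_PROCEDURES = ("BGP_COMMON", "BGP_COMMON_EXT")  # one pair serves all sensors


def db2_construct_names(sensor_name, has_rearm=False):
    """Return the DB2 construct names associated with a Blue Gene sensor."""
    names = {
        "event_trigger": f"{sensor_name}CSMe",
        "sequence": f"{sensor_name}_CSM",
        "procedures": SHARED_PROCEDURES,
    }
    if has_rearm:
        # The rearm Trigger exists only when a -X rearm expression was given.
        names["rearm_trigger"] = f"{sensor_name}CSMr"
    return names


fan_names = db2_construct_names("BGFanTempHi", has_rearm=True)
```

A helper like this is handy when hunting for leftover constructs in the DB2 catalog, since every per-sensor name can be derived from the sensor name alone.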

Monitoring Details (3 of 3)

To provide support for Blue Gene rearm monitoring, we needed (and got) a new feature in RSCT. Normally, for a Condition that has a rearm expression, RMC ‘toggles’ between evaluating the Condition’s event expression, and its rearm expression. And when a Condition corresponds to a single resource, this makes perfect sense. However, in the world of Blue Gene monitoring, a Condition corresponds to a set of resources. So we can’t have RMC toggling; we must do the toggling down at the DB2 Trigger level because it is there where we’re able to distinguish one eventing or rearming resource from another. The bottom line is that when we’re monitoring a Blue Gene DB2 table for events and rearms, the Condition used must be the non-toggling type.

Because we assume the event/rearm toggling responsibilities, we introduced a DB2 table named TCSMEvents to keep track of when it’s proper to forward an event up to RMC, and when to forward a rearm. So be aware that we create this table, and that the DB2 Triggers we create manipulate its contents. TCSMEvents has two columns: sensor and origin. The latter uniquely identifies the event/rearm origin. If a sensor is in TCSMEvents, an event was forwarded last; otherwise a rearm was forwarded last, or no event or rearm was observed yet. TCSMEvents is created when the first Blue Gene monitoring is started. It is dropped when the csm.bluegene rpm is removed from the Service Node.
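One way to picture the per-origin toggling is the following Python sketch, in which a set of (sensor, origin) pairs stands in for the TCSMEvents table. The exact trigger logic is not spelled out above, so the rule used here is an assumption: a row is added when an event is forwarded and removed when a rearm is forwarded. The function names and the origin values fan-1 and fan-2 are hypothetical.

```python
def forward_event(tcsm_events, sensor, origin):
    """Return True if an event for (sensor, origin) should go up to RMC."""
    key = (sensor, origin)
    if key in tcsm_events:
        return False  # an event was forwarded last; suppress the repeat
    tcsm_events.add(key)
    return True


def forward_rearm(tcsm_events, sensor, origin):
    """Return True if a rearm for (sensor, origin) should go up to RMC."""
    key = (sensor, origin)
    if key not in tcsm_events:
        return False  # no outstanding event for this origin; nothing to rearm
    tcsm_events.remove(key)
    return True


tcsm_events = set()  # rows of (sensor, origin)
decisions = [
    forward_event(tcsm_events, "BGFanTempHi", "fan-1"),  # first event for fan-1
    forward_event(tcsm_events, "BGFanTempHi", "fan-1"),  # repeat for fan-1
    forward_event(tcsm_events, "BGFanTempHi", "fan-2"),  # fan-2 is independent
    forward_rearm(tcsm_events, "BGFanTempHi", "fan-1"),  # fan-1 recovers
    forward_rearm(tcsm_events, "BGFanTempHi", "fan-1"),  # repeat rearm
]
```

Keying the state on the origin is what a single toggling Condition cannot do: fan-2 can still raise an event while fan-1 is waiting for its rearm.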

Debugging

1. bgsetupmon, bgsetupms and bgmksensor have a -v flag for verbose output.

2. bgrefresh_sensor.so will write some debug info to a file named /var/log/csm/csm_bg.log on the Service Node if you do the following prior to starting monitoring:

Create /var/log/csm/csm_bg.log with 666 permissions.

Temporarily modify /opt/csm/pm/BlueGeneUtils.pm in CreateTrigger where you see:

    CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '$debug');

Change it to:

    CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '1');

3. Some helpful DB2 commands that the bglsysdb userid can run on the Service Node:

    db2 { stop | start } database manager
    db2 connect to bgdb0

    db2 "select trigname from syscat.triggers where trigname like '%CSM%'"
    db2 "drop trigger bglsysdb.xxxCSMe"
    db2 "drop trigger bglsysdb.xxxCSMr"

    db2 "select seqname from syscat.sequences where seqname like '%CSM%'"
    db2 "drop sequence bglsysdb.xxx_CSM"

    db2 "select procname from syscat.procedures where procname like '%BGP%'"
    db2 "drop procedure bglsysdb.bgp_common"
    db2 "drop procedure bglsysdb.bgp_common_ext"

    db2 disconnect bgdb0

What’s changed, what’s new this release?

1. We’ve added Blue Gene/P support!

2. We’ve added Stand-alone CSM Blue Gene monitoring support (and bgsetupmon to set this up).

3. Note that in the case of Stand-alone CSM Blue Gene monitoring support, our predefined Blue Gene-related ERRM Conditions are created on the CSM management server / Service Node, and their Management Scope is set to ‘l’ (for ‘local’). Customers who create their own Blue Gene-related ERRM Conditions in the Stand-alone case must do the same!
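For customers defining their own Condition in the Stand-alone case, the local scope would be requested with mkcondition's -m flag. The Python sketch below only assembles such a command line. It assumes the standard RSCT mkcondition options (-r resource class, -e event expression, -s selection string, -m scope) as best recalled here, so check the mkcondition man page before relying on it. The helper and the names MyNodeErr and MyNodeErrCondition are hypothetical; the event expression is the one shown in the monitoring flow charts.

```python
VALID_SCOPES = ("l", "m", "p")  # local, management domain, peer domain


def mkcondition_argv(condition_name, sensor_name, scope="l"):
    """Build a mkcondition argument list for a Blue Gene sensor Condition."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}")
    return [
        "mkcondition",
        "-r", "IBM.Sensor",                  # Blue Gene sensors are IBM.Sensor resources
        "-e", "SD.Uint32 > 0",               # event expression from the flow charts
        "-s", f'Name == "{sensor_name}"',    # select one sensor by name
        "-m", scope,                         # 'l' for the Stand-alone case
        condition_name,
    ]


local_condition = mkcondition_argv("MyNodeErrCondition", "MyNodeErr")
```

Defaulting the scope to 'l' mirrors the Stand-alone requirement; a Full CSM setup would presumably pass a different scope.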

References

CSM Blue Gene/P Support - Component Design: /project/design/doc/clusters_6B/csm/bluegene/CSM-BGP-CompDes.pdf

CSM 1.7.0 Planning and Installation Guide: see section “CSM support for Blue Gene”

CSM 1.7.0 Administration Guide: see section “Using CSM with the IBM System Blue Gene Solution”

CSM 1.7.0 Command and Technical Reference: bgsetupmon, bgsetupms and bgmksensor