CSM support for Blue Gene/P
CSM 1.7.0
Line item 0XR
Skills Transfer Materials
by Marty Fullam
What's a Blue Gene? It's IBM's flagship supercomputer offering. Its official name is the "IBM
System Blue Gene Solution". See http://www-03.ibm.com/servers/deepcomputing/bluegene.html
It looks like this:
The Service Node is the administration focal point of a Blue Gene. Among other things, it maintains a DB2 database of configuration, RAS, and environmental data. The Blue Gene system administrator hangs out here.
The Front End Nodes are used for compiling & submitting Blue Gene jobs. End users hang out here.
The File Servers serve files to the other systems.
The I/O and Compute Nodes (aka the Blue Gene core) run the user jobs.
All systems are POWER systems. The Service Node, Front End Nodes, and File Servers run SLES. The I/O and Compute Nodes run custom operating systems.
[Figure labels: Front End Nodes; Service Node (DB2); File Servers; 1 Gigabit Ethernet; Blue Gene/L or Blue Gene/P core with I/O Nodes (1024) and Compute Nodes (65,536)]
CSM / Blue Gene Topology 1
Just install a CSM management server on your Blue Gene Service Node, and then add the CSM Blue Gene support. Define no CSM nodes.
Notice that there is no CSM cluster per se, just a management server. And even though the I/O and Compute Nodes of the Blue Gene core are not managed by CSM, you will still be able to monitor them.
This topology is new this release! We call it Stand-alone CSM Blue Gene monitoring support.
Blue Gene with CSM to monitor the Blue Gene DB
[Figure labels: Blue Gene core (I/O and Compute Nodes); Blue Gene Service Node (CSM management server); Blue Gene Front End Nodes; Blue Gene File Servers]
CSM / Blue Gene Topology 2
Just add CSM to the Blue Gene systems you have. Pick any system to be the management server (though it's probably most typical to use the Service Node).
Define your Service Node, Front End Nodes, and File Servers as CSM nodes. Notice that the I/O and Compute Nodes of the Blue Gene core are not managed by CSM
(mainly because they are not general-purpose Linux systems, and don't need to be burdened with CSM and RSCT software). However, as you will see, you will still be able to monitor them.
We call this topology Full CSM plus Blue Gene monitoring support.
Blue Gene with CSM to monitor the Blue Gene DB, and to manage the Service Node, Front End Nodes, and File Servers
[Figure labels: CSM Cluster; Blue Gene core (I/O and Compute Nodes); Blue Gene Service Node (CSM management server, CSM managed node); Blue Gene Front End Nodes (CSM managed nodes); Blue Gene File Servers (CSM managed nodes)]
CSM / Blue Gene Topology 3
Blue Gene as part of a larger CSM cluster
Here, the management server is a system outside of the Blue Gene solution. And while the Blue Gene Service Node, Front End Nodes, and File Servers are
configured as managed nodes in the CSM cluster, they are not the only managed nodes; there can be lots of others completely unrelated to the Blue Gene.
This topology, like Topology 2, is Full CSM plus Blue Gene monitoring support.
[Figure labels: CSM Cluster; CSM management server; Other CSM managed nodes (three shown); IBM eServer Blue Gene Solution; Blue Gene core (I/O and Compute Nodes); Blue Gene Service Node (CSM managed node); Blue Gene Front End Nodes (CSM managed nodes); Blue Gene File Servers (CSM managed nodes)]
CSM support for Blue Gene
If you use Full CSM plus Blue Gene monitoring support (Topology 2 or 3 in previous charts), use existing CSM and RSCT function to help manage the Blue Gene Service Node, Front End Nodes, and File Servers. After all, they're just SLES POWER systems. Nothing new here. Use any or all CSM function.
For Stand-alone CSM Blue Gene monitoring support (Topology 1) or Full CSM plus Blue Gene monitoring support (Topology 2 or 3), also use the optional rpm, csm.bluegene, which gives the system administrator the ability to monitor, effectively, the Blue Gene core using standard CSM monitoring capabilities (ERRM conditions and responses). Actually, what we provide is the ability to monitor the Blue Gene DB2 database where the Service Node is continually writing RAS, configuration, and environmental data about the Blue Gene core.
csm.bluegene package
• An optional part of CSM (only customers with a Blue Gene would care about it!).
• Used on the CSM management server (AIX, or Linux i386 or ppc64) and, in the Full CSM plus Blue Gene monitoring support case, on the Blue Gene Service Node / CSM managed node (SLES ppc64) too, so it is present in the csm-aix-1.7.x.x and csm-linux-1.7.x.x tarballs and on the CDs.
• If you want to use it, manual install is required on the management server (installp or geninstall on AIX, rpm -i on Linux), followed by these additional setup steps...
Stand-alone CSM Blue Gene monitoring support case:
• Run bgsetupmon on the management server.
Full CSM plus Blue Gene monitoring support case:
• On a non-SLES ppc64 management server, use the copycsmpkgs -n service_node command to copy the CSM SLES ppc64 packages from the CSM for Linux ppc64 CD (or from the expanded tarball) to the /csminstall directory. (This is not necessary on a SLES ppc64 management server because installms will have already copied the packages to the /csminstall directory.)
• The Blue Gene Service Node must be configured as a CSM managed node (whether or not it is also the CSM management server), and it must have the autoupdate package installed.
• The IBM.ManagedNode "Properties" attribute of the Blue Gene Service Node must include "BlueGeneNodeType|:|ServiceNode".
• Then you must run bgsetupms on the management server, followed by installnode -n service_node or updatenode -n service_node. (During this install or update, SMS installs csm.bluegene on the Service Node.)
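The setup steps for the two cases can be summarized as a small decision routine. This is only a sketch: the function name and its parameters are made up for illustration, and the rule used to pick installnode versus updatenode (reinstalling the node versus updating an already installed one) is an assumption, since the slides simply say "installnode or updatenode". The copy command is spelled copycsmpkgs here.

```python
def setup_commands(mode, ms_is_sles_ppc64=True, reinstall_node=False,
                   service_node="service_node"):
    """Return the post-install commands to run on the management server.

    mode is "standalone" (Topology 1) or "full" (Topology 2 or 3).
    reinstall_node is a hypothetical switch: the slides allow either
    installnode or updatenode without saying when to prefer which.
    """
    if mode == "standalone":
        return ["bgsetupmon"]
    if mode != "full":
        raise ValueError(f"unknown mode: {mode!r}")

    commands = []
    if not ms_is_sles_ppc64:
        # Only needed when installms has not already copied the
        # SLES ppc64 packages into /csminstall.
        commands.append(f"copycsmpkgs -n {service_node}")
    commands.append("bgsetupms")
    node_cmd = "installnode" if reinstall_node else "updatenode"
    commands.append(f"{node_cmd} -n {service_node}")
    return commands


if __name__ == "__main__":
    for cmd in setup_commands("full", ms_is_sles_ppc64=False):
        print(cmd)
```

The routine only lists commands; it does not check the other prerequisites (Service Node defined as a managed node, autoupdate installed, the Properties attribute set).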
What's in the csm.bluegene package? (1 of 2)
It contains management server-specific files and Service Node-specific files (even though all files get installed on both types of systems).
Management server files:
• /opt/csm/bin/bgsetupmon - the end-user command used to set up Stand-alone CSM Blue Gene monitoring support on the management server.
• /opt/csm/bin/bgsetupms - the end-user command used to set up Full CSM plus Blue Gene monitoring support on the management server.
• /opt/csm/install/resources/bluegene.ms/IBM.Nodegroup/BlueGeneServiceNodes.pm et al - a set of predefined nodegroups created when bgsetupms calls mkresources.
• /opt/csm/install/resources/bluegene.ms/IBM.Condition/*l - a set of predefined ERRM conditions created when bgsetupmon or bgsetupms calls mkresources.
• /opt/csm/csmbin/bgsetupsn - a post-install customization script that sets up Blue Gene support on the Service Node (in Full CSM plus Blue Gene monitoring support case only). It runs on the Service Node (via a mount of the management server's /csminstall directory) when installnode -n service_node or updatenode -n service_node is run. It gets called by csmfirstboot or updatenode.client, respectively. (Note: /opt/csm/csmbin is the installed location, but bgsetupms copies the script to /csminstall/csm/scripts and then creates a couple of symbolic links named 500CSM_bgsetupsn.BlueGeneServiceNodes in /csminstall/csm/scripts/update and /csminstall/csm/scripts/installpostreboot. It is one of these symbolic links that is used.)
What's in the csm.bluegene package? (2 of 2)
Service Node files:
• /opt/csm/bin/bgmksensor - the end-user command used to create Blue Gene-specific sensors to monitor the Service Node's DB2 database for events of interest. (Keep in mind, though, that we do ship a number of predefined Blue Gene sensors, and they may be sufficient for all the monitoring the user cares to do. So this command is not necessarily used.)
• /opt/csm/install/resources/bluegene.sn/IBM.Sensor/* - a set of predefined sensors (created when bgsetupmon or bgsetupsn calls mkresources).
• /opt/csm/csmbin/bgmanage_trigger - an internally used command called by Blue Gene sensors to create or drop DB2 triggers and sequences as necessary.
• /opt/csm/csmbin/bgrun_dbcmds - an internally used command called by bgmksensor and bgmanage_trigger to run db2 commands.
• /opt/csm/lib/bgrefresh_sensor.so - an internally used shared library called by the DB2 stored procedure that bgrun_dbcmds creates. It uses RMC's runact-api to call Blue Gene sensors' SetValues() routine.
• /opt/csm/pm/BlueGeneUtils.pm - a set of utilities used by the various scripts.
bgsetupmon
/opt/csm/bin/bgsetupmon is run on the CSM management server in the Stand-alone CSM Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.
bgsetupms
/opt/csm/bin/bgsetupms is run on the CSM management server in the Full CSM plus Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.
bgmksensor
/opt/csm/bin/bgmksensor is run on the Blue Gene Service Node, if used at all. It is used to create custom IBM.Sensor resources used in monitoring the Blue Gene database. It is a higher level command than SensorRM’s mksensor command. A comparison of their usage statements highlights how different the commands are:
mksensor [-n host] [-i seconds] [-c n] [-e 0 | 1 | 2] [-u user-ID] [-h] [-v | -V] sensor_name ["]sensor_command["]
bgmksensor -t table -o {d | i | u} [-w column[,...]] [-x "event_expression"] [-p column[,...]] [-T table] [-O {d | i | u}] [-W column[,...]] [-X "rearm_expression"] [-P column[,...]] [-h] [-v | -V] sensor_name
Think of bgmksensor as a wrapper to mksensor; both define an IBM.Sensor resource, but bgmksensor does so much more. In fact, bgmksensor hard-codes most sensor options and is more concerned with providing options related to the Blue Gene DB2 tables, operations, columns, and values that you want to monitor.
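To make the bgmksensor usage statement a little more concrete, here is a sketch that assembles an argument list from a few of its flags. The flags (-t, -o, -x, -T, -O, -X) come from the usage statement above. Everything else is an assumption for illustration: the helper name, the reading of d/i/u as delete/insert/update, the sensor name MyNodeErr, and especially the expression text, whose real syntax is defined in the Command and Technical Reference and not in these charts.

```python
import shlex

# Assumed meaning of the -o / -O letters: delete, insert, update.
VALID_OPS = {"d", "i", "u"}


def bgmksensor_argv(sensor_name, table, op, event_expr=None,
                    rearm_table=None, rearm_op=None, rearm_expr=None):
    """Build a bgmksensor argument list (hypothetical helper)."""
    if op not in VALID_OPS:
        raise ValueError(f"-o must be one of d, i, u; got {op!r}")
    argv = ["bgmksensor", "-t", table, "-o", op]
    if event_expr is not None:
        argv += ["-x", event_expr]
    if rearm_table is not None:
        argv += ["-T", rearm_table]
    if rearm_op is not None:
        if rearm_op not in VALID_OPS:
            raise ValueError(f"-O must be one of d, i, u; got {rearm_op!r}")
        argv += ["-O", rearm_op]
    if rearm_expr is not None:
        argv += ["-X", rearm_expr]
    argv.append(sensor_name)
    return argv


if __name__ == "__main__":
    # Table and status value borrowed from the monitoring flow charts;
    # the expression format itself is a guess.
    argv = bgmksensor_argv("MyNodeErr", "TBGLNode", "i",
                           event_expr="STATUS = 'M'")
    print(shlex.join(argv))
```

Building a list and quoting it with shlex.join avoids hand-quoting the expression, which contains both spaces and single quotes.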
Monitoring Overview
The Blue Gene Service Node routinely writes to its DB2 database all types of RAS, configuration, and environmental data related to the Blue Gene core (the I/O and Compute Nodes, the midplanes, the various interconnects, power supplies, fans, etc.). And this happens whether or not CSM is in the picture.
The CSM support for Blue Gene gives you a way to ‘watch’ the database for inserts, updates, and deletes that you deem important, and generate RMC events for them.
The resulting RMC events will drive the ERRM responses you specify.
The charts that follow show a monitoring flow example. Step through them to see what’s involved, and what happens when...
Monitoring Flow Example (1 of 7)
[Figure labels: Blue Gene core; Blue Gene software; DB2 database; Service Node; Management Server]
Monitoring Flow Example (2 of 7)
[Figure steps: 1. node error; 2. insert]
Recording of Blue Gene core events in DB2 occurs continually, and occurs whether or not CSM is present.
Monitoring Flow Example (3 of 7)
[Figure adds: BGNodeErr Sensor; BGNodeErr Condition (upon BGNodeErr Sensor change, is SD.Uint32 > 0?); “E-mail root anytime” Response; bgmanage_trigger; bgrefresh_sensor.so shared library]
When CSM and csm.bluegene are installed, there are various predefined Sensors, Conditions, Responses, and commands available.
Monitoring Flow Example (4 of 7)
[Figure adds: BGNodeErrCSMe DB2 Trigger (upon new row in TBGLNode, is STATUS = 'M'?); BGNodeErr_CSM DB2 Sequence; BGP_COMMON DB2 Procedure; BGP_COMMON_EXT DB2 Procedure]
When you start monitoring a Blue Gene-related Condition with startcondresp, a number of things are created in DB2 (via bgmanage_trigger, the Command specified in the Sensor): a Trigger or two, a Sequence, and a couple of Procedures.
Monitoring Flow Example (5 of 7)
[Figure steps: 1. node error; 2. insert; 3. evaluate]
When the Blue Gene software updates a table in the database, DB2 evaluates the Triggers associated with that table.
Monitoring Flow Example (6 of 7)
[Figure steps: 4. get next; 5. call; 6. call; 7. call; 8. SetValues(). Trigger shown: BGNodeErrCSMe DB2 Trigger (new row in TBGLNode & STATUS = 'M')]
In this case, the BGNodeErrCSMe Trigger evaluates ‘true’ and it does its thing: get the next sequence number from the BGNodeErr_CSM Sequence and call BGP_COMMON. And eventually SetValues() is called to write the new sequence number into BGNodeErr Sensor’s SD.Uint32.
Monitoring Flow Example (7 of 7)
[Figure steps: 9. evaluate; 10. do response]
At this point, it’s business as usual for RMC. Since BGNodeErr Sensor’s SD.Uint32 > 0, BGNodeErr Condition is satisfied and the Response occurs.
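The ten steps of the flow example can be mimicked in a few lines. This is a toy model only: the class and function names are invented, the DB2 Trigger, Sequence, and Stored Procedures are collapsed into plain Python, and the Response is just a string appended to a list. The table column (STATUS), the value 'M', and the "SD.Uint32 > 0" test are the ones shown on the charts.

```python
class Sequence:
    """Stand-in for the BGNodeErr_CSM DB2 Sequence."""

    def __init__(self):
        self._value = 0

    def next_value(self):
        self._value += 1
        return self._value


class Sensor:
    """Stand-in for the BGNodeErr Sensor and its SD.Uint32 field."""

    def __init__(self, name):
        self.name = name
        self.uint32 = 0

    def set_values(self, uint32):
        self.uint32 = uint32


def handle_insert(row, sensor, sequence, responses):
    """Model steps 3-10 for one row inserted into the watched table."""
    # Step 3: the Trigger evaluates its expression against the new row.
    if row.get("STATUS") != "M":
        return False
    # Step 4: get the next sequence number.
    seq_value = sequence.next_value()
    # Steps 5-8: BGP_COMMON, BGP_COMMON_EXT and bgrefresh_sensor.so
    # end in a SetValues() call on the Sensor.
    sensor.set_values(seq_value)
    # Steps 9-10: the Condition is evaluated on the Sensor change.
    if sensor.uint32 > 0:
        responses.append(f"E-mail root anytime: {sensor.name} event {seq_value}")
    return True


if __name__ == "__main__":
    sensor, sequence, responses = Sensor("BGNodeErr"), Sequence(), []
    handle_insert({"STATUS": "M"}, sensor, sequence, responses)
    print(responses)
```

Because every forwarded event writes a fresh, non-zero sequence number, the Sensor value changes each time and the Condition is re-evaluated for each event.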
Monitoring Details (1 of 3)
When you use bgmksensor to define a Blue Gene-related sensor, we temporarily create in the Blue Gene DB2 database the constructs required for monitoring. (By ‘constructs’ we mean the DB2 Triggers, Sequences, and Stored Procedures.) We do this to expose any errors. If we waited until you actually tried to use the defined sensor in a real monitoring situation, it would be harder to expose the errors. If DB2 gags on any of the constructs, bgmksensor reports the error(s) and creates no sensor. Whether successful or not, it deletes all DB2 constructs it created (they were temps, remember?).
When you use startcondresp to start monitoring a Blue Gene-related condition, the Command stored in the associated sensor gets run. The Command is /opt/csm/csmbin/bgmanage_trigger, and it creates the same Blue Gene DB2 database constructs that bgmksensor had created, but this time they’re not temporary. They stay defined until monitoring is stopped with stopcondresp.
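The difference between the temporary constructs of bgmksensor and the lasting ones of startcondresp can be sketched against a fake database. Nothing here talks to DB2: FakeDb2 and the three functions are invented for illustration, and the construct names are passed in by the caller.

```python
class FakeDb2:
    """In-memory stand-in for the Blue Gene DB2 database."""

    def __init__(self, reject=()):
        self.objects = set()
        self.reject = set(reject)  # names DB2 would refuse to create

    def create(self, name):
        if name in self.reject:
            raise RuntimeError(f"DB2 rejected {name}")
        self.objects.add(name)

    def drop(self, name):
        self.objects.discard(name)


def define_sensor(db, sensor, construct_names, sensors):
    """bgmksensor-like: trial-create the constructs, always clean up."""
    try:
        for name in construct_names:
            db.create(name)
    except RuntimeError as err:
        return str(err)          # error reported, no sensor defined
    else:
        sensors.add(sensor)
        return None
    finally:
        for name in construct_names:
            db.drop(name)        # temporary either way


def start_monitoring(db, construct_names):
    """startcondresp-like: constructs stay until monitoring stops."""
    for name in construct_names:
        db.create(name)


def stop_monitoring(db, construct_names):
    """stopcondresp-like: constructs are dropped."""
    for name in construct_names:
        db.drop(name)
```

The point of the trial run is visible in define_sensor: a failure surfaces at definition time, and the database is left untouched whether or not the sensor was created.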
Monitoring Details (2 of 3)
For a given Blue Gene-related sensor, we create the following DB2 constructs:
• A Trigger to watch for bgmksensor’s -x event expression. If the sensor name is BGFanTempHi, the event Trigger name is BGFanTempHiCSMe.
• A Trigger to watch for bgmksensor’s -X rearm expression, if specified. If the sensor name is BGFanTempHi, the rearm Trigger name is BGFanTempHiCSMr.
• A Sequence to give us a unique new number for each event or rearm forwarded from a Trigger to BGP_COMMON. If the sensor name is BGFanTempHi, the Sequence name is BGFanTempHi_CSM.
• Stored Procedures named BGP_COMMON and BGP_COMMON_EXT, if we don’t already have them. (Unlike the Triggers and Sequences, these are not created on a per sensor basis; there are just the two, and they serve all sensors created.)
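The naming rules above are mechanical enough to write down. The helper below is hypothetical (it is not part of csm.bluegene); it just applies the suffixes given on this chart.

```python
def db2_construct_names(sensor_name, has_rearm=False):
    """Return the DB2 construct names used for one Blue Gene sensor."""
    names = {
        "event_trigger": f"{sensor_name}CSMe",
        "sequence": f"{sensor_name}_CSM",
        # Shared by all sensors, created only if not already present.
        "procedures": ["BGP_COMMON", "BGP_COMMON_EXT"],
    }
    if has_rearm:
        # Only exists when a -X rearm expression was given.
        names["rearm_trigger"] = f"{sensor_name}CSMr"
    return names


if __name__ == "__main__":
    for kind, name in db2_construct_names("BGFanTempHi", has_rearm=True).items():
        print(kind, name)
```

Knowing the names ahead of time is handy when looking for leftover Triggers or Sequences in the DB2 catalog.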
Monitoring Details (3 of 3)
To provide support for Blue Gene rearm monitoring, we needed (and got) a new feature in RSCT. Normally, for a Condition that has a rearm expression, RMC ‘toggles’ between evaluating the Condition’s event expression, and its rearm expression. And when a Condition corresponds to a single resource, this makes perfect sense. However, in the world of Blue Gene monitoring, a Condition corresponds to a set of resources. So we can’t have RMC toggling; we must do the toggling down at the DB2 Trigger level because it is there where we’re able to distinguish one eventing or rearming resource from another. The bottom line is that when we’re monitoring a Blue Gene DB2 table for events and rearms, the Condition used must be the non-toggling type.
Because we assume the event/rearm toggling responsibilities, we introduced a DB2 table named TCSMEvents to keep track of when it’s proper to forward an event up to RMC, and when to forward a rearm. So be aware that we create this table, and that the DB2 Triggers we create manipulate its contents. TCSMEvents has two columns: sensor and origin. The latter uniquely identifies the event/rearm origin. If a sensor is in TCSMEvents, an event was forwarded last; otherwise a rearm was forwarded last, or no event or rearm was observed yet. TCSMEvents is created when the first Blue Gene monitoring is started. It is dropped when the csm.bluegene rpm is removed from the Service Node.
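Here is one way to picture the per-origin toggling that TCSMEvents supports. This is a model, not the actual Trigger logic: the class is invented, the table is a Python set of (sensor, origin) rows, and the rule that a repeated event or a rearm without a prior event is simply not forwarded is our reading of "toggling", not something the charts spell out.

```python
class EventToggle:
    """Toy model of event/rearm toggling keyed by (sensor, origin)."""

    def __init__(self):
        # Stand-in for TCSMEvents: a row present means an event
        # was the last thing forwarded for that sensor and origin.
        self.tcsm_events = set()

    def on_event_match(self, sensor, origin):
        """Event expression matched; forward only if not already in event state."""
        key = (sensor, origin)
        if key in self.tcsm_events:
            return None               # assumed: duplicate event is held back
        self.tcsm_events.add(key)
        return "event"

    def on_rearm_match(self, sensor, origin):
        """Rearm expression matched; forward only if an event was forwarded last."""
        key = (sensor, origin)
        if key not in self.tcsm_events:
            return None               # assumed: nothing to rearm
        self.tcsm_events.remove(key)
        return "rearm"


if __name__ == "__main__":
    toggle = EventToggle()
    print(toggle.on_event_match("BGFanTempHi", "fan-A"))   # hypothetical origins
    print(toggle.on_event_match("BGFanTempHi", "fan-B"))
    print(toggle.on_rearm_match("BGFanTempHi", "fan-A"))
```

Keeping state per origin is what lets one fan rearm while another is still in the event state, which a single toggling Condition in RMC could not express.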
Debugging
1. bgsetupmon, bgsetupms and bgmksensor have a -v flag for verbose output.
2. bgrefresh_sensor.so will write some debug info to a file named /var/log/csm/csm_bg.log on the Service Node if you do the following prior to starting monitoring:
   • Create /var/log/csm/csm_bg.log with 666 permissions.
   • Temporarily modify /opt/csm/pm/BlueGeneUtils.pm in CreateTrigger where you see:
       CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '$debug');
     Change it to:
       CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '1');
3. Some helpful DB2 commands that the bglsysdb userid can run on the Service Node:
   db2 { stop | start } database manager
   db2 connect to bgdb0
   db2 "select trigname from syscat.triggers where trigname like '%CSM%'"
   db2 "drop trigger bglsysdb.xxxCSMe"
   db2 "drop trigger bglsysdb.xxxCSMr"
   db2 "select seqname from syscat.sequences where seqname like '%CSM%'"
   db2 "drop sequence bglsysdb.xxx_CSM"
   db2 "select procname from syscat.procedures where procname like '%BGP%'"
   db2 "drop procedure bglsysdb.common_bgp"
   db2 "drop procedure bglsysdb.common_bgp_ext"
   db2 disconnect bgdb0
What’s changed, what’s new this release?
1. We’ve added Blue Gene/P support!
2. We’ve added Stand-alone CSM Blue Gene monitoring support (and bgsetupmon to set this up).
3. Note that in the case of Stand-alone CSM Blue Gene monitoring support, our predefined Blue Gene-related ERRM Conditions are created on the CSM management server / Service Node, and their Management Scope is set to ‘l’ (for ‘local’). Customers who create their own Blue Gene-related ERRM Conditions in the Stand-alone case must do the same!
References
CSM Blue Gene/P Support - Component Design:
• /project/design/doc/clusters_6B/csm/bluegene/CSM-BGP-CompDes.pdf
CSM 1.7.0 Planning and Installation Guide:
• See section “CSM support for Blue Gene”
CSM 1.7.0 Administration Guide:
• See section “Using CSM with the IBM System Blue Gene Solution”
CSM 1.7.0 Command and Technical Reference:
• bgsetupmon, bgsetupms and bgmksensor