Lenovo Intelligent Computing Orchestration (LiCO) 5.1.0 Administrator Manual Date: May 3, 2018 Version: v1.0

Contents

Administrator Manual

1. Introduction to LiCO

1.1. Terminology

1.2. Premises and assumptions

1.3. Operating environment

2. Instructions on use

2.1. Administrator Home Page

2.1.1. Login to the Administrator Interface

2.1.2. Switch Role

2.1.3. Change Account Password

2.1.4. View Cluster Status

2.1.5. View Cluster Alert Messages

2.2. User Management

2.2.1. User Group (Using LDAP)

2.2.1.1. Create a User Group

2.2.1.2. Delete a User Group

2.2.2. Users (Using LDAP)

2.2.2.1. Create a User

2.2.2.2. Edit a User Account

2.2.2.3. Change User Password

2.2.2.4. Delete a User Account

2.2.2.5. Suspend a User

2.2.2.6. Browse User Details

2.2.3. User Group (Not Using LDAP)

2.2.4. Users (Not Using LDAP)

2.2.4.1. Import a User

2.2.4.2. Edit a User

2.2.4.3. Change User Password

2.2.4.4. Delete a User

2.2.4.5. Suspend a User

2.2.4.6. Browse User Details

2.2.5. Billing Group

2.2.5.1. Create a Billing Group

2.2.5.2. Edit a Billing Group

2.2.5.3. Credit / Debit an Account

2.2.5.4. Delete a Billing Group

2.2.6. Troubleshooting and Solutions

2.3. Monitor

2.3.1. List View

2.3.2. Physical View

2.3.3. Group View

2.3.4. GPU View

2.3.5. Jobs

2.3.6. Alerts

2.3.7. Operation

2.4. Reports

2.4.1. Job Reports

2.4.2. Alert Reports

2.4.3. Action Reports

2.5. Admin

2.5.1. VNC

2.5.1.1. Manage on Web

2.5.1.2. Manage Using Command Lines

2.5.2. Operation Logs

2.5.3. System Logs

2.6. Settings

2.6.1. Alert Policy

2.6.2. Notification Group

2.6.3. Notification Setting

2.6.4. Scripts

2.7. Operator Functions

2.8. User Functions

3. HPC Cluster Management

3.1. View HPC Cluster Details

3.2. Remote Management of HPC Cluster Hardware

3.2.1. Interface Management

3.2.2. Command Line Management

3.2.3. xCAT Management

3.3. Parallel Commands

3.4. Job Scheduling Commands

3.5. Queue Commands

3.6. Job Management

4. Important Information

4.1. Restarting LiCO

4.2. MPI Program Installation Location

4.3. Absolute User Directory Path

4.4. Resolving a Failed Job Submission

4.5. Resolving a Failed User or Group Creation

4.6. Manage Users and Groups Using Command Lines

4.7. Batch Deletion of Jobs in the Database

4.8. Failure to View or Delete a VNC

4.9. Data Sources for GPU Monitoring

Preface

Thank you for choosing Lenovo Intelligent Computing Orchestration (LiCO). LiCO aims to provide you with a high-performance computing and artificial intelligence platform that is easy to use and rich in features. This document is designed for users who possess a basic knowledge of high-performance computing and server clusters, and who have some understanding of parallel development, job scheduling, and artificial intelligence (AI).

1. Introduction to LiCO

Lenovo Intelligent Computing Orchestration (LiCO) is Lenovo's all-in-one solution for High Performance Computing (HPC) and Artificial Intelligence (AI) Training clusters. It provides cluster management, monitoring, job scheduling, user management, account management and file management. LiCO enables users to centrally allocate the resources in a supercomputing cluster and supports simultaneous HPC and AI jobs. With the widespread application of artificial intelligence, HPC and big data, LiCO is being used by increasing numbers of government bodies, universities, and organizations working in fields such as meteorology, geology, oil and petrochemicals, the military and life sciences.

LiCO uses a web-based architecture and allows users to easily control and manage clusters through a web portal. It has the following features:

1. Cluster management and monitoring: LiCO provides a physical view of the server room and detailed monitoring data for each node including CPU, memory, disk, temperature, system load and network use (TCP/IP). Each node can be grouped logically, making centralized planning and management easy.

2. Job management and monitoring: Users can directly view and manage the status and results of jobs. Various HPC workload schedulers and a wide range of job types are supported (including AI jobs such as TensorFlow and Caffe).

3. User management and billing: LiCO manages both local and domain users through the same interface. It supports user resource usage billing (top-ups and chargebacks), and offers the ability to set billing groups and fees.

4. Alarms and notifications: Users can set alarm policies and receive email and text notifications.

5. Reports: Various reports can be generated, including cluster reports, alarm reports, job reports, and billing group job reports.

6. Customization: A range of customizations are available, such as job template customization, report customization, and 3D server visualizations.

Users can also log in to the login node through the web portal’s built-in SSH terminal and execute commands.

1.1. Terminology

Computer cluster: a general reference to a collection of server resources including management nodes, login nodes, and computing nodes.

Job: a series of commands in sequence intended to accomplish a particular task.

Job status: the status of a job in the scheduling system, such as waiting, in queue, on hold, running, suspended, or completed.

Node status: the status of a node, such as idle, occupied, busy, or off.

Job scheduling system: the distributed program in control of receiving, distributing, executing and registering jobs, also referred to as the operation scheduler or simply scheduler.

Management node: the server in a cluster running management programs such as job scheduling, cluster management and user billing.

Login node: the server in a cluster to which users can log in via Linux and conduct operations.

Computing node: the server in a cluster for executing jobs.

User group: a set of users for which the system has defined an access control policy, so that all users in the same user group have access to the same set of cluster resources.

Billing group: a group of cluster users that are to be billed under one account, also referred to as a billing account. A billing account can be made up of a single user or multiple users.

1.2. Premises and assumptions

The descriptions in this manual are based on a situation in which Slurm is the job scheduler. LiCO currently supports three kinds of schedulers: Slurm, Torque and LSF. The commands for Slurm in this manual are not applicable to Torque or LSF; if those schedulers are being used, please refer to the corresponding documentation.

1.3. Operating environment

Cluster server:

Lenovo ThinkSystem series servers.

Operating systems supported by the cluster servers:

CentOS / Red Hat 7.4

SLES 12.3

Client requirements:

Hardware: CPU of 2.0 GHz or above, memory of 1 GB or above.

Browser: Chrome (v62.0 and above) or Firefox (v56.0 and above) recommended.

Display resolution: 1280 x 800 or above

2. Instructions on use

2.1. Administrator Home Page

2.1.1. Login to the Administrator Interface

Open a browser and enter the IP address for the cluster’s login node, such as https://10.220.112.21 (the client must have direct access to the cluster login node). You will see the LiCO login page, as shown:

Figure 1. Login

A user can assume one of three roles: administrator, operator, or ordinary user. Administrators can view the entire computer cluster and the information of all users. Operators can only view resources to which they have access, as well as their own information. Ordinary users can execute jobs and run operations such as job monitoring.

After entering the correct administrator username and password, click “Login” to open the administrator home page, as shown:

Figure 2. User home page

The left navigation bar has the following functions:

HPC.com: Cluster name. When the mouse hovers over it, the current scheduling and file service state is shown. To edit the cluster name, refer to the LiCO Installation Guide.

Home: The current page, showing basic cluster information.

User Manage: User management page. The administrator can perform basic operations on users and user groups, and on billing accounts and rates.

Monitor: Monitor the HPC cluster.

Reports: Export reports in Excel, PDF, or HTML format based on job, alert, or action type.

Admin: Check the cluster VNC and system log information.

Settings: Configure alerts for the HPC cluster and manage notification groups and notification connections.

There are two shortcut icons in the upper right corner of the interface.

Alert icon: Shows the number of unconfirmed alerts in the current cluster. Click it to open the alert details page or to turn alert sounds on or off.

User icon: Click it to check the current user’s information, change the current user’s password, log out, or switch between user roles.

2.1.2. Switch Role

With the highest permission level in the system, an administrator can switch to the role of an operator or user and be redirected to the corresponding home page.

Click on the user icon in the upper right corner. In the menu that pops up, click to switch to the operator or user interface, as shown:

Figure 3. Switch role

The left navigation bar on the operator home page features only Home, Monitor, and Reports, as shown:

Figure 4. Operator main page

The left navigation bar on the user home page features Home, Job Submission, Job List, Training Models, Expert Mode, and Manage functionalities, as shown:

Figure 5. User main page

2.1.3. Change Account Password

Click on the user icon in the upper right corner. The current username appears in the information that pops up. Next, click on the password icon to change the password for the current account, as shown. After entering the required information, click “Submit” to change the password.

Figure 6. Password change dialog

2.1.4. View Cluster Status

On the administrator home page, click on the navigation bar toggle icon to expand or collapse the navigation bar. The home page shows the basic status of the entire cluster, as shown:

Figure 7. Cluster overview

The following information can be found on the home page:

CPU: Utilization rate for the CPUs in a server cluster, including the CPU cores used and the total CPU cores in a cluster.

GPU: Utilization rate for the GPUs in a server cluster, including the GPU kernels used and the total GPU kernels in a cluster.

Memory: Utilization rate for the memory in a server cluster, including the memory used and the total memory in a cluster.

Storage: Utilization rate for the storage space in a server cluster, including the storage space used and the total storage space in a cluster.

Network: Network throughput of the server cluster, including read and write speeds.

Node: Shows the number of computers turned on or off in the computer cluster.

Node Status: Shows the usage status of nodes on the computer cluster, including busy, in use, idle, or off. The primary basis for determining node usage is average process load per minute.

Job: Shows the names and running times of jobs that are running or waiting.

Job Status: Shows past information about the job, including the number of jobs running, waiting, and finished. An administrator can choose to display the number of jobs in all queues or the number of jobs in a certain queue. In terms of time, available display options include the last hour, the last day, the last seven days, and the last thirty days. In terms of job type, available display options include unfinished and finished jobs.

Message: Shows the most recent operations log for the web system.
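The CPU, GPU, memory, and storage entries above each pair an amount used with a total. As a rough illustration only, the sketch below assumes the utilization rate is the used amount divided by the total, expressed as a percentage; the manual does not state the exact formula LiCO applies, and the function name `utilization_percent` is hypothetical.

```python
def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage (assumed formula, not taken from LiCO)."""
    if total <= 0:
        # An empty or unreported resource is treated as 0% rather than raising.
        return 0.0
    return 100.0 * used / total


# Example: 48 CPU cores in use out of 192 in the cluster.
cpu_utilization = utilization_percent(48, 192)
```

With these example numbers, `cpu_utilization` is 25.0.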

When you place the cursor over the cluster name in the left navigation bar, the current scheduling and file service status is displayed, as shown in the figure below:

Figure 8. Cluster status

The health of the scheduler and parallel file system is indicated by the following color-coded system:

Scheduler: Green means that the scheduler is working normally and red means that the scheduler is not working normally.

Parallel File System: Green means that the parallel file system is working normally and red means that the parallel file system is not working normally.

2.1.5. View Cluster Alert Messages

Once an alert rule has been added for the cluster and an alert is triggered, the alert icon in the upper right corner of the interface shows a red number. The number corresponds to the current number of unconfirmed alerts. Click the alert icon to view all alert information or to go to the alert monitoring interface.

You can also turn alert sounds on and off from the same icon. When alert sounds are turned on, every new alert triggers a sound, as shown:

Figure 9. Alert messages

2.2. User Management

There are three user constructs: user group, user, and billing group (or billing account).

User Group: A group of users on the HPC cluster with similar queue access permissions.

Users: List of users in the HPC cluster.

A user has the following attributes:

Username: Account name.

Role: Administrator, operator, or user. Administrators can view the status of an entire cluster. Operators can only view their own queues and job statuses.

First Name: The first name of the user.

Last Name: The last name of the user.

Billing Group: The billing group to which a user belongs.

User Group: The user group to which a user belongs.

Last Login Time: The user’s most recent login time.

Email: The user’s email address.

Password: The user’s password.

Billing Group (Billing Account): The billing account, which can be used by one or multiple users. When members of a billing group run applications in a cluster, the balance in the billing account is debited according to the number of CPU cores used and the time taken to run the applications.

A billing group has the following attributes:

Name: The name of the billing group.

Billing Rate: The fee per unit of computing time. If the rate is 1, then any member of the billing group using 1 CPU core for 1 hour would be charged 1 US dollar.

Used Time: The amount of computing time used by members’ applications: number of CPU cores x time (hours).

Spent Amount: The amount of money spent by members of the billing group. As billing rates are subject to change, the spent amount may not equal the used time multiplied by the current billing rate.

Balance: The amount remaining in the billing group’s account.

Description: A description of the billing group.
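The relationship between Billing Rate, Used Time, Spent Amount, and Balance can be illustrated with a small worked example. This is a sketch under stated assumptions, not LiCO's actual accounting code: it assumes each job is charged as cores x hours x the rate in effect at the time, and that no funds are added or withdrawn in between. The names `UsageRecord`, `charge`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class UsageRecord:
    cores: int    # CPU cores used by the job
    hours: float  # hours the job ran
    rate: float   # billing rate in effect when the job was charged (assumed)


def charge(record: UsageRecord) -> float:
    """Charge for one job: cores x hours x rate."""
    return record.cores * record.hours * record.rate


def summarize(records: Iterable[UsageRecord], initial_amount: float) -> Tuple[float, float, float]:
    """Return (used_time, spent_amount, balance) for a billing group."""
    records = list(records)
    used_time = sum(r.cores * r.hours for r in records)  # core-hours
    spent = sum(charge(r) for r in records)
    # Assumes no credits or debits other than job charges.
    return used_time, spent, initial_amount - spent


# Two jobs charged at different rates because the rate changed in between.
records = [UsageRecord(cores=4, hours=2.0, rate=1.0),
           UsageRecord(cores=8, hours=1.5, rate=2.0)]
used_time, spent, balance = summarize(records, initial_amount=100.0)
```

Here the used time is 20 core-hours and the spent amount is 32, leaving a balance of 68. Multiplying the used time by the current rate of 2 would give 40, which shows why the spent amount can differ from used time multiplied by the current billing rate.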

2.2.1. User Group (Using LDAP)

Click on the Manage menu on the left navigation bar, then click on “User Group” to enter the user group interface.

Figure 10. User group management

2.2.1.1. Create a User Group

During system initialization, a user group (with the default name “default_os_group”) is created by the system, but it is recommended that the administrator create a new user group for use on the system.

Click the “Create” button and the following dialog will appear:

Figure 11. Create user group

Enter a name that is not already in use in the system, then click the “Confirm” button.

2.2.1.2. Delete a User Group

The administrator can delete a user group that has been created.

Click on the delete icon in the action list of the user group’s record, and the following dialog will appear.

Figure 12. Delete user group

Click the “Submit” button and the user group will be deleted from the system.

2.2.2. Users (Using LDAP)

Click on the Manage menu on the left navigation bar, then click on “User” to enter the user interface.

Figure 13. User management

2.2.2.1. Create a User

During system initialization, an administrator account (with the default name “hpcadmin”) is created.

Click the “Create New” button and the “Create New User” page will appear, as in the image below.

Figure 14. Create user dialog

The following information is needed to create a user:

Username: Enter the account login name (Required).

Note: The username may contain only lowercase letters, numbers, underscores, and minus signs, and must start with a letter.

Role: Choose the user role (Administrator, Operator, or Ordinary User).

First Name: Enter the user’s first name.

Last Name: Enter the user’s last name.

Billing Group: Choose the billing group to which the user belongs, selected from the billing group list (Required).

Email: Enter the user’s email address.

User Group: Choose the user group to which the user belongs, selected from the user group list (Required).

Password: Enter the user’s password (Required).

Note: The password should be at least 10 characters long and include at least one uppercase letter, one lowercase letter, one special symbol, and one number.

Confirm Password: Enter the password a second time (Required).
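The username and password rules in the notes above can be checked before submitting the form. The sketch below is only an illustration of those stated rules, not LiCO's own validation logic. It assumes that "special symbol" means any ASCII punctuation character, and the function names are hypothetical.

```python
import re
import string

# Lowercase letters, digits, underscores, and minus signs; must start with a letter.
USERNAME_RE = re.compile(r"[a-z][a-z0-9_-]*")


def is_valid_username(name: str) -> bool:
    """Check the username rule from the note above."""
    return USERNAME_RE.fullmatch(name) is not None


def is_valid_password(password: str) -> bool:
    """Check length and character classes; 'special symbol' is assumed to be ASCII punctuation."""
    return (
        len(password) >= 10
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in string.punctuation for c in password)
    )
```

For example, `is_valid_username("hpc_user-01")` is accepted while `is_valid_username("1user")` is rejected, because the name starts with a digit.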

After filling out the information as shown in the figure, click the “Submit” button and the system will create the user account. Once the account has been created, the user can log in.

2.2.2.2. Edit a User Account

An administrator can change user information such as role, user group, billing group, and email address.

Click the “Edit” button in a user’s action record, and the following dialog box will appear.

Figure 15. Edit user dialog

After changing the user’s information, click the “Submit” button to complete editing.

2.2.2.3. Change User Password

An administrator can change passwords for operators or ordinary users, but not those of other administrators.

Click on the password icon for the user, and the following dialog box will appear.

Figure 16. Change user password dialog

After entering and confirming the new password, click the “Submit” button to change the user’s password.

2.2.2.4. Delete a User Account

An administrator can delete existing users. Click on the delete icon for the user, and the following dialog box will appear.

Figure 17. Delete user dialog

Click the “Submit” button and the system will delete the user account.

2.2.2.5. Suspend a User

An administrator can suspend the accounts of operators or ordinary users, but not those of other administrators.

Click on the on/off button for a user’s suspension status, and the following dialog box will appear.

Figure 18. Suspend user dialog

After entering the suspension time, click the “Submit” button to suspend the user.

2.2.2.6. Browse User Details

An administrator can browse user details. Click on the details icon for a user to open the interface below.

Figure 19. Browse user details

2.2.3. User Group (Not Using LDAP)

Click on the Manage menu on the left navigation bar, then click on “User Group” to enter the user group interface. An administrator cannot create or delete a user group.

Figure 20. User group management

2.2.4. Users (Not Using LDAP)

Click on the Manage menu on the left navigation bar, then click on “User” to enter the user interface.

Figure 21. User management

2.2.4.1. Import a User

Click on the “Import” button to open the Import User interface, as shown below.

Figure 22. Import user

Provide the following information for the user to be imported:

Username: Choose the account name to be imported (Required).

Role: Choose the user’s role (Administrator, Operator, or Ordinary User).

First Name: Enter the user’s first name.

Last Name: Enter the user’s last name.

Billing Group: Choose the billing group to which the user belongs, selected from the billing group list (Required).

Email: Enter the user’s email address.

After filling out the information as shown in the figure, click the “Submit” button and the system will import the user account. Once the account has been imported, the user can log in.

2.2.4.2. Edit a User

An administrator can change user information such as role, user group, billing group, and email address.

Click the “Edit” button in a user’s action record, and the following dialog box will appear.

Figure 23. Edit user

After changing the user information, click the “Submit” button to complete editing.

2.2.4.3. Change User Password

An administrator can change passwords for operators or ordinary users, but not those of other administrators.

Click on the password icon for the user, and the following dialog box will appear.

Figure 24. Change user password

After entering and confirming the new password, click the “Submit” button to change the user’s password.

2.2.4.4. Delete a User

An administrator can delete existing users. Click on the delete icon for the user, and the following dialog box will appear.

Figure 25. Delete user

Click the “Submit” button and the system will delete the user.

2.2.4.5. Suspend a User

An administrator can suspend the accounts of operators or ordinary users, but not those of other administrators.

Click on the on/off button for a user’s suspension status, and the following dialog box will appear.

Figure 26. Suspend user

After entering the suspension time, click the “Submit” button to suspend the user.

2.2.4.6. Browse User Details

An administrator can browse user details. Click on the details icon for a user to open the interface below.

Figure 27. Browse user details

2.2.5. Billing Group

Billing group management provides easy, consolidated handling of billing groups, which can be created, edited, credited, debited, and deleted.

During system initialization, a default billing group by the name of “default_bill_group” is created. It is recommended that the administrator create a new billing group as needed.

2.2.5.1. Create a Billing Group

Click on the User menu on the left navigation bar, then click on “Billing Group” to enter the page.

Figure 28. Billing group management

Click the “Create New” button and the following dialog box will appear:

Figure 29. Create billing group

The following information is needed to create a billing group:

Name: Billing group name (Cannot be duplicated).

Billing Rate: The fee per unit of computing time. If the rate is 1, then any member of the billing group using 1 CPU core for 1 hour would be charged 1 US dollar.

Initial Amount: The amount in the account when the billing group was created.

Description: A description of the billing group.

Click the “Submit” button and wait until the billing group is created.

2.2.5.2. Edit a Billing Group

Click on the “Edit” button for the billing group to be edited. As shown in the dialog box below, you can change the name, billing rate, and description for the billing group.

Figure 30. Edit billing group

2.2.5.3. Credit / Debit an Account

Click on the account operation icon for the billing group and the following dialog box will appear:

Figure 31. Account operation dialog

“Name” is the name of the billing group to be used. “Account Balance” is the current balance in the account. In the Action drop-down box, select “Add Funds” or “Withdraw Funds.” The amount should be entered in the “Amount” field. Click the “Submit” button to credit or debit the account.

Figure 32. Debit account

2.2.5.4. Delete a Billing Group

An administrator can delete a billing group no longer in use.

Click on the delete icon for the billing group record to open the following dialog box:

Figure 33. Delete billing group

2.2.6. Troubleshooting and Solutions

In the following circumstances, some user actions may fail:

1. Network problems exist on the server nodes of a cluster.

2. User groups or user accounts with identical names have been created in the operating system on the server nodes of a cluster.

3. There are inconsistencies in user group or user account information in the operating system on the server nodes of a cluster.

4. Slurm is not running properly.

Solutions:

1. Make sure the network connection is working and that the root account can be accessed without a password across all server nodes.

2. Delete all failed user groups, billing groups, and user accounts and re-create these accounts.

3. Contact Lenovo after-sales service for technical support.
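One way to verify the first solution is to test non-interactive SSH login from the management node to each server node. The sketch below is an assumption about how such a check could be scripted, not a procedure from this manual. It relies on the OpenSSH client options `BatchMode=yes` (fail instead of prompting for a password) and `ConnectTimeout`; the node names in the usage note are hypothetical. It only checks one direction, from the machine it runs on to each node.

```python
import subprocess
from typing import List


def ssh_check_command(node: str, user: str = "root", timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to a password prompt
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable nodes
        f"{user}@{node}",
        "true",                               # harmless remote command
    ]


def can_login_without_password(node: str) -> bool:
    """Return True if the ssh check exits with status 0."""
    try:
        result = subprocess.run(ssh_check_command(node), capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # ssh is missing or the connection hung: treat as a failed check.
        return False
    return result.returncode == 0
```

A typical use would be to loop over the cluster's node names, for example `[n for n in ["c1", "c2"] if not can_login_without_password(n)]`, and investigate any node that is returned.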

2.3. Monitor

Click on the Monitor menu on the left navigation bar and the following sub-options and functions will appear:

List View: Shows detailed information on all nodes in a cluster, and allows the user to perform corresponding actions on nodes in that cluster.

Physical View: Shows detailed node information based on the physical locations of all machines in the cluster.

Group View: Shows detailed information on group nodes based on the functions of all nodes in the cluster.

GPU View: Shows monitoring information for every GPU, organized by the node groups in the cluster.

Jobs: Shows the running status of jobs currently on the cluster.

Alerts: Check the details of triggered alerts and manage the status of these alerts.

Operation: Shows a log of all actions performed through the web management interface.

2.3.1. List View

In the List View menu, information for all nodes in a cluster is displayed in a list as shown in the following figure, including information such as:

Figure 34. List View

Host Name: The host name for the node.

Status: Idle, in use, busy, off.

Power: On, off

Group Type: Compute, head, login, I/O, and other user-defined nodes.

BMC IP: The IP address of the head module BMC.

OS IP: The IP address for the node.

Hardware: The number of CPU cores on every node / the total memory on every node / the total storage on every node / the number of GPUs on every node (if there is no GPU, then the GPU section will not appear)

Group: The group to which the node currently belongs.

Actions: Power On/Off, console (serial console), SSH

Click on the “ ” icon or, after choosing a node, click on the “On” or “Off” buttons in the upper left corner, which will turn the selected node on or off.

Figure 35. Power dialog

Click on the “ ” icon in the BMC IP list or, after choosing a node, click on the “Console” button in the upper left corner, which will open the control panel for the selected node.

Figure 36. Console

Click on the “ ” icon in the OS IP list or, after choosing a node, click on the “SSH” button in the upper left corner, which will connect to the selected node using an SSH connection.

Figure 37. SSH

Click on the host name of a node in the host name list to view detailed information about the current node, as shown:

Figure 38. Node details

Status: Idle, busy, off.

Alert Level: Critical, serious, alert, information.

Actions: On/off, iConsole connection, SSH connection

Hardware: CPU, GPU, memory, storage.

Monitor: Shows past monitoring for the current node (including load, network, CPU, temperature, memory, energy consumption, and hard drive).

GPU: Shows real-time monitoring indicators for every GPU on the current node, as well as historical monitoring for GPUs (including GPU usage rate, memory, and temperature):

Figure 39. Node GPU details

Every GPU is represented by a box at the top of the interface. Orange markings near the top of the columns indicate that the GPU is in use. The lower part of the interface shows monitoring details for the chosen GPU. Switch between GPUs by clicking on them.

Alert: Shows alert details related to the current node, including ID, name, grade, status, and time.

Job: Shows jobs that are running on the current node (including ID, job name, scheduler ID, submitting user, queue, and start time).

Information: Includes node type, BMC IP, OS IP, group, and model number.

2.3.2. Physical View

The Physical View menu shows server room information, including room name and location, number of nodes, and total power consumption, and presents a graphic view of the number of racks and the locations of nodes, as shown:

Figure 40. Physical view

By clicking on a rack, the user can view detailed information about the rack as shown, including:

Rack name

Rack location (relative to the server room)

Total number of nodes on a rack

Total power consumption of a rack

Figure 41. Rack view

By clicking on a node in a rack, the user can view detailed information on the selected node.

By clicking on the icons above a rack, the user can switch between displays of temperature, power consumption, CPU/load, memory utilization ratios, hard drive utilization ratios, network throughput, and jobs.

2.3.3. Group View

In the Group View menu, the information for all nodes in a cluster will be sorted by logical grouping. Click on the “Select Group” drop-down box in the top left corner in Group View and select a group to be displayed, as shown:

Figure 42. Group view

Group View offers the following monitoring types:

List: A list of all nodes in this group. Functionality is similar to 2.3.1 List View

Trends: Shows the trend diagram for the group, including load, CPU, memory, hard drive, network, energy consumption, temperature, and job use.

Popular: Shows a heat diagram including load, CPU, memory, hard drive, network, energy consumption, temperature, and job use for all nodes in the group.

The use of Group View for various list types is similar to that of List View.

Group Trend Charts: Shows historical trends in various monitored indicators for a group.

Figure 43. Group trend charts

Group Popular Charts: Shows monitored heat indicators for all nodes within a group.

Figure 44. Group popular charts

2.3.4. GPU View

The GPU View menu shows GPU information for the nodes in a group based on the logical grouping of all nodes in a cluster. Click on the group option in the upper left corner of GPU View, as shown:

Figure 45. GPU View

This interface presents real-time GPU data for the group in graphic form and allows the user to switch between GPU usage rates, memory, and temperature. Every frame in the image represents a node, with the name of the node written in the upper right corner of the frame. Each column inside a frame represents a GPU, and the blue portion of the column represents the monitored value. An orange section at the top of the column means that the GPU is in use. Using the slider in the upper right portion of the interface, the user can adjust the colors of the columns to filter and highlight GPUs in a given numerical range. Check the Color Reversal box to the right of the slider to switch the colors that denote values inside and outside the stated range.

2.3.5. Jobs

The Jobs menu shows job information and status, as well as jobs running in the current cluster, as shown:

Figure 46. Job monitoring

The jobs in the list can be filtered by the changing criteria at the top of the list, which include:

Queue: Filter by the queue in which the job runs.

Submit User: Filter based on the user submitting the job.

Status: Select for running, waiting, or finished.

2.3.6. Alerts

The Alert menu shows alert information for all triggered alert rules. The displayed information includes:

ID: The alert ID corresponding to the alert rule.

Name: The alert name corresponding to the alert rule.

Grade: Critical, serious, alert, or information.

Status: Unconfirmed, confirmed, or resolved.

Time: The time the alert was triggered.

Node: The monitored node corresponding to the alert rule. For a GPU alert, the GPU index is appended to the node name (e.g., node1:gpu0).

Notes: Make notes on the alert.

Action: Confirm, resolve, or delete.

Alert events are classified into current events and all events. Current events include only unconfirmed events, while all events also include confirmed events.

Alert event information includes:

Serial Number: Unique ID for the alert event.

Alert Name: Name corresponding to the alert strategy.

Alert Grade: Grade of the corresponding alert strategy.

Status: The current status of the alert: Unconfirmed, confirmed, or resolved.

Alert Time: The time at which the alert took place.

Alert Node: The name of the node in which the alert occurred.

Notes: The administrator’s description of this alert.

Figure 47. Alert monitoring

Alert information can be filtered by selecting criteria at the top of the page, and multiple choices can be made for status and grade. Alert information can be filtered by time, such as last day, last three days, last week, and last month, or by time criteria manually set with start and end dates.

The user can act on a selected alert by clicking the appropriate button in the action list, or by selecting an alert and then clicking “Confirm,” “Resolve,” or “Delete.” The user can also select “Act on All” to perform the same action on all alert messages. Actions are defined as follows:

Confirm: Applicable to unconfirmed alerts. After confirmation, a reminder for the alert will not be shown in the upper right corner of the home page, and after action is taken, the status will be changed to “Confirmed.”

Resolve: Applicable to unconfirmed and confirmed alerts. After the administrator has handled the alert, this action can be taken and the status will be changed to “Resolved.”

Delete: Applicable to unconfirmed, confirmed, and resolved alerts. After deletion, the alert will not be shown on the list.

2.3.7. Operation

The Operation menu records the actions by all users for all targets in the system, as shown. The following items are displayed:

Operator: Operator account to which the action information belongs.

Module: The module of the action, such as user or job.

Action: Specific commands for the action, such as creation or deletion.

Target: The target of the action, such as a user or node.

Time: The time at which the target action occurred.

Figure 48. Operation monitoring

Information is displayed according to the filtering criteria at the top of the page. Select an operator from the “Operator” drop-down list to view that operator’s actions. The “Target/Action” drop-down list allows the user to filter action information by targets and actions. Action information can also be filtered by time, such as last day, last three days, last week, and last month, or by time criteria manually set with a start and end date. The following target/action information is recorded in action monitoring:

User: Create, edit, delete

Job: Create, re-run, cancel, delete

Node: On, off

Alerts: Confirm, resolve, delete, notes

Strategy: Create, edit, delete

Billing Group: Create, edit, delete

Billing Account: Credit, debit

2.4. Reports

Reports include job, alert, and action reports:

Job Reports: Data such as job statistics and details, user statistics and details, and billing group statistics and details.

Alert Reports: Data such as alert statistics and details.

Action Reports: Run status, connected user, user login status, and user storage usage statistics for the node.

2.4.1. Job Reports

The Job Reports menu allows administrators to obtain reports on Jobs. The report filters include:

Job Type: Filter by job, user, or billing group.

Filter User/Billing Group: Filter by a specific user or billing group.

Time: Supports pre-defined and self-defined time periods of no longer than one year.

Figure 49. Job reports

The preview function includes:

Job Report Preview: Supports bar graphs and tables.

Figure 50. Preview with chart

Figure 51. Preview with table

User Report Preview: Supports pie charts, bar graphs, and tables.

Details: Pie charts and bar graphs are the default, but users can also show data in table form. Click on the right side of the pie chart to refresh current user/billing group job data.

Figure 52. User report preview

Billing group Report Preview: Supports pie charts, bar graphs, and tables.

Figure 53. Billing group report preview

The report exporting function includes:

Content: Supports the export of statistics and detailed data.

Report format: Supports Excel, PDF, and HTML.

Figure 54. Export report

2.4.2. Alert Reports

The Alert Reports menu allows administrators to obtain reports on alerts. The report filters include:

Date: Supports pre-defined and self-defined time periods of no longer than one year.

Figure 55. Alert reports

Click on “Preview” to directly preview alert data, which can be shown as a pie chart, bar graph, or table.

Figure 56. Alert report preview

Exporting the Report:

Statistic Type: Supports the export of statistics and detailed data.

Alert level: All, critical, serious, alert, or information

Filter Nodes: Filter the selected nodes.

Report Format: Supports Excel, PDF, and HTML.

Set the filters and the report format, then click “Submit” to export the report.

Figure 57. Export alert report

2.4.3. Action Reports

The Action Reports menu allows administrators to obtain reports on actions. The report filters include:

Data: Data on CPU, memory, and networks.

Filter Nodes: Filter the selected nodes.

Figure 58. Action report

The preview function includes:

Graph: Filtered data can be previewed in graph form.

Figure 59. Action report preview

The report exporting function includes:

Report Format: Supports Excel, PDF, and HTML.

Figure 60. Export action report

2.5. Admin

After logging in, the administrator can click “Admin” on the home page’s left navigation bar to access “VNC”, “Operation Logs”, or “System Logs”.

2.5.1. VNC

The VNC menu shows VNC session information for compute nodes in the cluster and allows users to open the VNC.

Running certain jobs requires VNC support. The following is an example of a VNC job file. When running the job, create a VNC session first, and delete the created VNC session when the job is finished.

cat Job.pbs

#!/bin/bash
#PBS -N short
#PBS -q batch
#PBS -j oe
#PBS -l nodes=2:ppn=4
cd /share/users_root/user1
# Record the job ID, start time, and host name in a per-job log file
echo "current job id is $PBS_JOBID" >> /share/users_root/user1/$PBS_JOBID.log
echo "job start time is $(date)" >> /share/users_root/user1/$PBS_JOBID.log
hostname >> /share/users_root/user1/$PBS_JOBID.log
# Create a VNC session and extract its session number from the vncserver output
session=$(vncserver 2>&1)
sessionid=$(echo "$session" | grep "^New" | awk -F ":" '{print $3}')
echo "vncsession $sessionid is created" >> /share/users_root/user1/$PBS_JOBID.log
# Run the program on the display provided by the new VNC session
export DISPLAY=:$sessionid.0
./prog
# Delete the VNC session when the job is finished
vncserver -kill :$sessionid
echo "job end time is $(date)" >> /share/users_root/user1/$PBS_JOBID.log
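The session number extraction used in the job file can be isolated into a small helper so that it can be checked on its own. The following is a minimal sketch, not part of LiCO: the function name vnc_session_id is hypothetical, and it assumes that vncserver prints a line of the form New 'host:N (user)' desktop is host:N, which is the format the grep and awk commands in the job file rely on.

```shell
# Hypothetical helper: read vncserver output on stdin and print the session number.
# Assumes a line such as: New 'node1:2 (user1)' desktop is node1:2
vnc_session_id() {
    _id=$(grep '^New' | awk -F ':' '{print $3}' | head -n 1)
    case $_id in
        '' | *[!0-9]*) return 1 ;;   # nothing found, or not a plain number
    esac
    printf '%s\n' "$_id"
}
```

In a job file it could be called as sessionid=$(vncserver 2>&1 | vnc_session_id), with the job exiting early if the function returns a non-zero status, so that DISPLAY is never set to an empty session number.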

VNC sessions can be managed using the LiCO interface and command lines.

2.5.1.1. Manage on Web

The VNC interface shows all VNC sessions in real time, including the creator, node, port number, process ID, and index of the VNC session. Select a VNC session and click “Open” in the actions column to view (If the session is locked, only the password of the VNC session creator can be used to log in.)

Figure 61. View VNC Sessions

A user should only have one VNC session per node. However, too many VNC sessions may accumulate if VNC sessions are not deleted at the end of a job. Testing has shown that a user may have more than 20 VNC sessions on one node, but the user may not be allowed to create a new VNC session, so unnecessary VNC sessions should be deleted.

When it is necessary to delete a VNC session, the user must click “Delete” in the corresponding actions column and then click “Confirm and Submit” in the dialog that pops up, as shown:

Figure 62. Delete VNC Session

2.5.1.2. Manage Using Command Lines

In a cluster node, the current user can create a session on the VNC server.

When creating a session, switch to a LiCO user via the command lines and enter the directory: /home/lico_5.x/cluster_monitor_project.

Next, start lico-vnc-slave using the command lines below on the node that runs the VNC server.

# service lico-vnc-slave start

Based on the circumstances, it may be necessary to change the URL parameters in the file /opt/lico/vnc-slave/etc/lico-vnc-slave.ini, changing the IP to the cluster head node IP. Otherwise, the page cannot obtain VNC information.

At a node on the cluster, the current user can only view VNC sessions he/she has created, using vncserver -list.

At a node on the cluster, the current user can only use vncserver -kill to delete VNC sessions he/she has created.

At a node on the cluster, view all VNC sessions on the node using the command ps -ef|grep Xvnc, and then delete a VNC session by killing its process. Please use kill rather than kill -9 when deleting.

The result of an action performed with the above command lines is shown on the LiCO page after a short delay. Sessions deleted by the user via command lines will disappear from the LiCO page after about 30 seconds. Sessions the user has newly created via command lines will show on the LiCO page after about 30 seconds.
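The ps -ef|grep Xvnc step can be narrowed so that only the process IDs of real Xvnc processes are listed and the grep process itself is left out. The following is a sketch under stated assumptions: the function name xvnc_pids is hypothetical, and it assumes the usual ps -ef layout in which the second column is the PID and the eighth column is the command.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs of Xvnc processes.
# Assumes the usual column order: UID PID PPID C STIME TTY TIME CMD
xvnc_pids() {
    awk '$8 ~ /(^|\/)Xvnc$/ { print $2 }'
}
```

For example, ps -ef | xvnc_pids prints one PID per line. Each PID can then be passed to kill, which sends the default termination signal, rather than to kill -9.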

2.5.2. Operation Logs

After logging in as an administrator, the user must click “Manage” in the left navigation bar, then select “Operation Logs” to enter the log, as shown:

Figure 63. Operation logs

Enter the dates required for the logs to be exported and click “Download.”

2.5.3. System Logs

After logging in as an administrator, the user must click “Manage” in the left navigation bar, then select “System Logs” to enter the log, as shown:

Figure 64. System logs

2.6. Settings

Settings allow the user to manage alert rules, alert notification groups, alert notification connections, and alert triggering scripts. After logging in as an administrator, click “Settings” in the left navigation bar, then select the desired sub-items.

2.6.1. Alert Policy

The Alert Policy menu allows administrators to view the alert policy for the current cluster, and add, update, or delete the alert rules, as shown.

Figure 65. Alert policy

Click “Add” in the upper left corner of the page to add an alert policy, as shown. Alert policy must include the following information:

Alert name: Self-defined alert name.

Monitor: LiCO provides alert monitoring for the CPU usage rate, temperature, GPU usage rate, GPU temperature, network status, storage usage rate, energy consumption, and hardware problems.

Condition: Set an alert trigger that is larger than, smaller than, or equal to a threshold value.

Duration: For some monitored items, set the duration of the triggering condition. The default is 60 seconds.

Risk Level: Self-defined risk level, including critical, serious, alert, and information.

Notification Group: Notify one or more groups of users after an alert policy is triggered.

Monitoring Node: Fill in the name of the node or nodes to be monitored. If left blank, the default is to monitor all nodes.

Configure Script: Choose a script to run automatically after an alert is triggered.

Notice: Turn on WeChat notifications and sound notifications.

Status: Immediately start this alert strategy.

Figure 66. Create alert policy

After filling in the alert rule, click “Submit” to save the alert rule. The user may then view the alert policy that has been added.

Click “Edit” in the actions column to edit an alert policy.

Figure 67. Edit alert policy

Click “Delete” in the actions column to delete an alert policy.

Figure 68. Delete alert policy

2.6.2. Notification Group

The Notification Group menu allows administrators to work with notification groups. Notification groups are user groups that are notified when an alert is triggered. Notification groups can be added, updated, or deleted.

Figure 69. Notification groups

Click “Create” in the upper left corner to add a notification group, as shown in the figure.

Figure 70. Create notification group

Enter the group name, email addresses, and mobile numbers for the notification group, and then click “Submit” to create the notification group. The newly-created notification group will appear on the list. Edit a notification group by clicking “Edit” in the actions column, as shown in the figure.

Figure 71. Edit notification group

Delete a notification group by clicking “Delete” in the actions column, as shown in the figure.

Figure 72. Delete notification group

2.6.3. Notification Setting

The Notification Setting menu allows administrators to manage the email, SMS, and WeChat settings for the external alert API on the notification connection page, as shown.

Figure 73. Notification setting

Configure the email server as follows:

SSL: Null, SSL, or TLS.

SMTP ID

SMTP Password

SMTP Address

SMTP Port

Recipient Email

SMS settings must include the sending port, modem type, and daily SMS limit.

View the QR code for the official alert notification WeChat account on WeChat.

Turn the alert API on or off by clicking the “ON” or “OFF” buttons in the upper right corner. All changes to these settings will only be saved after clicking “Confirm” at the bottom.

Test alert notification connections by clicking “Test” at the bottom of the Settings page.

2.6.4. Scripts

The Scripts menu allows administrators to manage scripts for creating alerts. The scripts displayed on the script management page have self-defined alert rules. The information displayed includes the script name, file size, and upload time, as shown:

Figure 74. Scripts

For security reasons, the page does not support uploading, updating, or deleting scripts. These actions should be performed using the backend platform, and these scripts are placed in the /var/lib/lico/scripts directory.

2.7. Operator Functions

Click the user image in the upper right corner of the Administrator home page to switch to the Operator interface. The Operator home page is shown. The Operator home page displays the same information as the Administrator home page.

Figure 75. Operator home page

The Operator home page has the same Monitor and Report functions as the Administrator home page.

2.8. User Functions

Please refer to the LiCO User Manual.

3. HPC Cluster Management

Most HPC functions may be completed using the interface. However, because HPC cluster management is complicated, some more complex actions require command lines or other tools.

3.1. View HPC Cluster Details

Click the Monitor icon on the navigation bar and select List View to see the status of every computer in the cluster, as shown in the figure below.

Figure 76. List view

The displayed information includes:

Host name: The cluster host name

Status: Idle, in use, busy, off.

Power: On, off

Type: Compute, head, login, I/O, and other user-defined nodes.

BMC IP: IP address of the head module XCC.

OS IP: IP address for the node.

Hardware: The number of CPU cores on every node / the total memory on every node / the total storage on every node / the number of GPUs on every node.

Custom Group: The group to which the node currently belongs.

3.2. Remote Management of HPC Cluster Hardware

3.2.1. Interface Management

Under Monitor, open List View and click on the “BMC IP” link in the node list.

Click the link to open a Lenovo XCC management module interface and perform remote hardware management, including a remote on/off switch, a remote console, and hardware configuration.

After entering the username/password (Factory Default: USERID/PASSW0RD), open the XCC management interface, as shown in the figure below.

See the XCC User Manual for details.

http://sysmgt.lenovofiles.com/help/index.jsp

3.2.2. Command Line Management

Choose the required node, click the “ ” icon or click on “Console” after checking the required node to open the control panel for the selected node.

3.2.3. xCAT Management

Log in to the head node and perform remote management using xCAT commands.

Remote on/off command rpower:

rpower <noderange> [on|onstandby|off|suspend|reset|stat|state|boot]

Restart nodes c01, c02, c03, and d08:

# rpower c[01-03],d08 reset

View the on/off status of nodes c01, c02, c03, and d08:

# rpower c[01-03],d08 state

Set Boot Order Command rsetboot:

rsetboot <noderange> [net|hd|cd|floppy|def|stat]

Set nodes c01, c02, c03, and d08 to boot from the network:

# rsetboot c[01-03],d08 net

Remote view of the node hardware device asset information command rinv:

Figure 77. rinv

Remote view of the node hardware device log information command reventlog:

Figure 78. reventlog

Find more command techniques using the link below: http://sourceforge.net/p/xcat/wiki/XCAT_Commands/

3.3. Parallel Commands

After logging in to the head node over SSH, the following parallel commands can be used to perform actions on nodes in a cluster in batches:

Parallel command psh. Usage: psh <node names> <shell command>

Node names can be given as a range expression or separated by commas. For example, to run the ls command on nodes c01, c02, c03, and d08:

# psh c[01-03],d08 ls

Parallel file copying with pscp. Usage: pscp <source file> <node names>:<target directory>

Node names can be given as a range expression or separated by commas. For example, to copy the file data.txt from the local computer to the /opt directory on nodes c01, c02, c03, and d08:

# pscp data.txt c[01-03],d08:/opt
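To make the node name notation concrete, the following sketch expands an expression such as c[01-03],d08 into individual node names. It is an illustration only and is not an xCAT tool: the function name expand_noderange is hypothetical, and it handles only comma-separated items with at most one numeric [start-end] range each, which is a small subset of what xCAT accepts.

```shell
# Hypothetical helper: expand a simplified node range into one node name per line.
# Supports comma-separated items with at most one [start-end] numeric range each.
expand_noderange() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
    {
        if (match($0, /\[[0-9]+-[0-9]+\]/)) {
            prefix = substr($0, 1, RSTART - 1)
            suffix = substr($0, RSTART + RLENGTH)
            body = substr($0, RSTART + 1, RLENGTH - 2)
            split(body, bounds, "-")
            # Keep the zero padding of the range start, e.g. 01 -> width 2
            fmt = "%s%0" length(bounds[1]) "d%s\n"
            for (i = bounds[1] + 0; i <= bounds[2] + 0; i++)
                printf fmt, prefix, i, suffix
        } else if ($0 != "") {
            print $0
        }
    }'
}
```

Calling expand_noderange 'c[01-03],d08' prints c01, c02, c03, and d08 on separate lines. Note that the expression must not contain spaces, because the shell would otherwise split it into separate arguments.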

3.4. Job Scheduling Commands

LiCO supports lifecycle actions such as uploading files, or submitting, cancelling, re-running, and deleting jobs. See the LiCO User Manual.

The administrator may use command lines to perform more complicated scheduling management.

3.5. Queue Commands

Queue management includes viewing, creating, and modifying the queue. In queue management, the current user needs to log into the head node and utilize Slurm scheduler command lines.

SSH Login for the Head Node:

-- View the queue:

View the existing queue with Slurm commands

[root@mgt /]# sinfo

-- Create a queue:

1. Modify the Slurm configuration file /etc/slurm/slurm.conf, and add the following content:

PartitionName=test Nodes=headnode,computenode1 Default=YES MaxTime=INFINITE State=UP

2. Restart Slurm-related services:

On the head node:

[root@mgt /]# service slurmctld restart

On the compute node:

[root@mgt /]# service slurmd restart

3. Restart LiCO to sync the queue to the interface:

[root@mgt home]# service lico restart

After completing the above steps, the newly-created queue can be viewed on the interface.

-- Modify the queue:

The queue may be modified by changing the configuration file /etc/slurm/slurm.conf. The steps are the same as those for queue creation. View the queue parameters via scontrol show partition.

For more queue management commands, please see: http://slurm.schedmd.com/

3.6. Job Management

Job management can be performed on the LiCO interface. An administrator can view and act on a job by giving commands to the scheduler.

SSH Login to Head Node:

-- View job status

[root@mgt /]# squeue -a
JOBID PARTITION NAME     USER     ST TIME NODES NODELIST(REASON)
428   compute   zhangtes testuser R  5:19 1     testcomputenode01
429   compute   zhangtes testuser R  4:49 1     testcomputenode01
430   compute   mnist-pa ls-test  R  4:37 1     testcomputenode01

-- View detailed job status

[root@mgt /]# scontrol show jobs

-- Use jobid to view the detailed status of a certain job

[root@mgt /]# scontrol show jobs 428

-- Use jobid to cancel a job running or in the queue

[root@mgt /]# scancel 428

For more job management commands, please see: http://slurm.schedmd.com/

Note: If a job is submitted through Slurm command lines, it will not begin billing on the LiCO system.
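When the job list is long, the squeue output can be summarized per user. The following is a minimal sketch, assuming the default squeue column layout shown in this section, in which the fourth column is USER; the function name jobs_per_user is hypothetical.

```shell
# Hypothetical helper: read `squeue -a` output on stdin and print "user count" pairs.
# Assumes the default layout: JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
jobs_per_user() {
    awk 'NR > 1 && NF >= 4 { count[$4]++ }
         END { for (u in count) print u, count[u] }' | sort
}
```

It could be used on the head node as squeue -a | jobs_per_user. The first line of input is treated as the header and skipped.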

4. Important Information

4.1. Restarting LiCO

If LiCO malfunctions, please restart LiCO to resolve the problem.

Stop Service:

[root@mgt lico_5.x]# service lico stop

Start Service:

[root@mgt lico_5.x]# service lico start

View LiCO Status:

[root@mgt lico_5.x]# service lico status

When LiCO starts normally, the screen will appear as follows:

[root@mgt lico_5.x]# service lico status
lico.service - lenovo hpc project
   Loaded: loaded (/usr/lib/systemd/system/lico.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-11-02 16:48:05 CST; 3h 59min ago
 Main PID: 381046 (lico)

4.2. MPI Program Installation Location

MPI software types such as MPICH, OpenMPI, and MVAPICH are installed in the cluster in the following locations:

/usr/local/mpich

/usr/local/mvapich

/usr/local/openmpi

Only one type of MPI software can be used at a time. After the cluster is established, MVAPICH is the default, and the MVAPICH bin is added to the system PATH.

If it is necessary to switch to another MPI software, remove the MVAPICH bin from the PATH, and then add the MPI software to the PATH or designate the MPI software to be used in the job file. For example, to directly designate MPICH in the job file, run the program as follows:

#!/bin/bash
#PBS -N test
#PBS -q batch
#PBS -j oe
#PBS -m abe
#PBS -M [email protected]
#PBS -l nodes=2:ppn=1
cd /share/users_root/hpcadmin
/usr/local/mpich/bin/mpiexec ./prog
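The alternative approach, switching the PATH instead of naming the MPI binary in the job file, can be sketched as follows. This is an illustration under stated assumptions: the function name replace_path_entry is hypothetical, and it assumes that the MVAPICH binaries are in /usr/local/mvapich/bin, by analogy with the /usr/local/mpich/bin path used in the job file.

```shell
# Hypothetical helper: print a PATH-style string with one entry dropped
# and another entry placed at the front.
# $1 = current PATH value, $2 = entry to remove, $3 = entry to put first
replace_path_entry() {
    printf '%s\n' "$1" | tr ':' '\n' | awk -v drop="$2" -v add="$3" '
        BEGIN { out = add }
        $0 != drop && $0 != add && $0 != "" { out = out ":" $0 }
        END { print out }'
}
```

For example, PATH=$(replace_path_entry "$PATH" /usr/local/mvapich/bin /usr/local/mpich/bin) followed by export PATH would make the MPICH tools the first match for the current shell only. The change is not persistent across logins.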

4.3. Absolute User Directory Path

The user root directory is “Myfolder,” which can be found in the File Manager after the user logs in. The directory specified by user_rootdir in /opt/lico/core/etc/lico.ini contains a folder with the user’s name. This folder is the user root directory.

For example, the path for user_rootdir is /share/users_root. The user root directory for hpcadmin is “Myfolder” on the web page. The corresponding absolute path is /share/users_root/hpcadmin. The /share/users_root/hpcadmin is the user’s home directory.

Therefore, the final directory structure for hpcadmin is as follows:

/share/users_root/hpcadmin

4.4. Resolving a Failed Job Submission

The failure to submit a job on the LiCO interface may be caused by a poorly-configured Slurm scheduler. To check the cause of the failure, try the following suggestions:

-- Use SSH to log into the head node and re-submit the job using command lines: cd to the current user directory, find the job file, submit the job with sbatch jobfile.slurm, then check which error message is returned. Resource limits may have been exceeded. For example, the job needs 100 cores, but there are only 80 in the cluster.

-- Run Slurm command sinfo on the head node and view the compute node status and resource status for the cluster.

If sinfo returns no results, no nodes have been added to the scheduler. Open /etc/slurm/slurm.conf and add a compute node using the following format:

NodeName=nodename CPUs=cores State=node status

Following the addition, restarting the Slurm service at the head node may be required for the addition to take effect.

[root@mgt lico_5.x]# service slurmctld restart

If sinfo shows that some nodes are down, check whether Slurm services have been started on the down nodes.

[root@mgt lico_5.x]# service slurmd status

-- Run Slurm command “scontrol show partition” on the head node to view the queue settings.
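The check for down nodes can be sketched as a filter over the sinfo output. The example assumes the default sinfo column layout (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST), and the function name down_nodes is hypothetical.

```shell
# Hypothetical helper: read default `sinfo` output on stdin and print the
# node lists whose state starts with "down" (for example "down" or "down*").
# Assumes the default layout: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
down_nodes() {
    awk 'NR > 1 && $5 ~ /^down/ { print $6 }'
}
```

Running sinfo | down_nodes on the head node prints nothing when no partition reports down nodes. Any node list that is printed indicates where the status of the slurmd service should be checked.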

4.5. Resolving a Failed User or Group Creation

To resolve a failed user or group creation attempt, try the following suggestions:

-- Check whether the LDAP services on the head node are running normally. If service sssd status reports a malfunction, restart these services.

-- Check LDAP, see if the value returned is correct: ldapsearch -x -b "dc=hpc,dc=com"

-- If the shared directory on which the user’s home directory is located is NFS, check if the NFS mountpoint is mounted with vers=3. A correct mounting should appear as follows: mount -t nfs -o vers=3 nfsserverip:/sharedir /mountpoint.

-- Check if users or user groups in the cluster have the same names.

Note: Create, edit, or delete users or groups only after the LDAP switch has been enabled in /opt/lico/core/etc/lico.ini and the LDAP information has been configured.
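The NFS version check from the list above can be sketched as follows. The example assumes a Linux node, where /proc/mounts lists the device, mount point, file system type, and a comma-separated option string that includes vers= for NFS mounts. The function name nfs_mount_version and the mount point /share are hypothetical.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin and print the
# NFS version option of the given mount point.
# $1 = mount point, e.g. /share
nfs_mount_version() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^vers=/) print substr(opts[i], 6)
        }'
}
```

For example, nfs_mount_version /share < /proc/mounts is expected to print 3 for a directory mounted as recommended above. Empty output means that the mount point was not found or is not an NFS mount.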

4.6. Manage Users and Groups Using Command Lines

In addition to managing users and groups on a web page, LiCO provides command lines for managing users and groups in the cluster. The examples that follow will demonstrate how to use LiCO to manage users and groups.

1. Creating a Billing Group

[root@mgt lico_5.x]# lico billgroup_add newbillgroup 1 1000000

This creates a new billing group with the name newbillgroup, a rate of 1, and an initial balance of 1,000,000.

2. Creating a User Group

[root@mgt lico_5.x]# lico osgroup_add newosgroup -ba cn=admin,dc=hpc,dc=com -bp openldap

This creates a new user group with the name newosgroup. The -bp parameter supplies the LDAP password.

3. Creating a Cluster User

[root@mgt lico_5.x]# lico user_add newuser newbillgroup -ba cn=admin,dc=hpc,dc=com -bp openldap -g newosgroup -p Passw0rd@123 -r admin

This creates a new user with the name newuser, the billing group newbillgroup, the user group newosgroup, the password Passw0rd@123, and the role of administrator.

Parameters for Creating a User:

Required Parameters:

• username is the name of the user to be created.

• bill_group specifies the billing group for the user. This group name must already exist.

• -p password specifies the password for the user.

• -g os_group sets the user group name for a user. This group name must already exist.

• -ba BACKEND_ADMIN is the LDAP administrator username.

• -bp BACKEND_PASSWORD is the LDAP administrator password.

Optional Parameters:

• -r {user,operator,admin} specifies the user role.

• -e example@lico is the user’s email address.

• --first-name FIRST_NAME is the user’s first name.

• --last-name LAST_NAME is the user’s last name.
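The three creation steps depend on each other: the billing group and the user group must exist before the user is added. The sketch below chains them in that order using only the arguments shown in the examples above. The function name provision_user and the variables LICO, LDAP_ADMIN, and LDAP_PASS are hypothetical names chosen for illustration. By default the script is a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set LICO=lico in the environment to run the real commands.
LICO="${LICO:-echo lico}"

# provision_user USER BILLGROUP OSGROUP PASSWORD ROLE
# Hypothetical wrapper; LDAP_ADMIN and LDAP_PASS must be set by the caller.
provision_user() {
    user=$1 billgroup=$2 osgroup=$3 password=$4 role=$5
    # Rate 1 and initial balance 1000000 are the values from the example above.
    $LICO billgroup_add "$billgroup" 1 1000000 &&
    $LICO osgroup_add "$osgroup" -ba "$LDAP_ADMIN" -bp "$LDAP_PASS" &&
    $LICO user_add "$user" "$billgroup" -ba "$LDAP_ADMIN" -bp "$LDAP_PASS" \
        -g "$osgroup" -p "$password" -r "$role"
}

LDAP_ADMIN='cn=admin,dc=hpc,dc=com'
LDAP_PASS='openldap'
provision_user newuser newbillgroup newosgroup 'Passw0rd@123' admin
```

Because the steps are joined with &&, a failure when creating either group stops the script before the user is added.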

LiCO user management actions include the following:

• billgroup_add

• billgroup_list

• billgroup_remove

• osgroup_add

• osgroup_list

• osgroup_remove

• user_add

• user_auth

• user_import

• user_init

• user_list

• user_remove

The usage of a specific action can be viewed with the -h option, for example: lico user_auth -h
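To review the usage of all listed actions in one pass, a loop such as the following could be used. This is a sketch under an assumption: the manual only shows -h for user_auth, so support for -h on the other actions is assumed here, not confirmed. The function name show_usage and the LICO variable are hypothetical, and the script is a dry run by default.

```shell
#!/bin/sh
# Dry run by default: prints each command instead of running it.
# Set LICO=lico in the environment to call the real tool.
LICO="${LICO:-echo lico}"

# Hypothetical helper: request help for every action listed in this section.
# Assumption: each action accepts -h in the same way as user_auth.
show_usage() {
    for action in billgroup_add billgroup_list billgroup_remove \
                  osgroup_add osgroup_list osgroup_remove \
                  user_add user_auth user_import user_init \
                  user_list user_remove; do
        $LICO "$action" -h
    done
}

show_usage
```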

Note: If it is necessary to delete a user group or billing group, first delete all users in the group, then delete the user group or billing group.

4.7. Batch Deletion of Jobs in the Database

After LiCO has been running for a long period, jobs will accumulate. Jobs can be deleted through the Manage interface, but to delete jobs in batches, you can delete them from the database directly. LiCO uses PostgreSQL as its database. The database is on the head node, and the database name is postgres. The username is postgres and the password is 123456. The corresponding job table is webconsole_job. You can use a visualization tool, or command lines similar to those below, to delete unnecessary jobs from the database.

> psql -h 127.0.0.1 -U postgres -d postgres
postgres=# \c postgres
postgres=# select * from webconsole_job
postgres-# \g
postgres=# delete from webconsole_job where id < 3
postgres-# \g
postgres=# \q
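The same deletion can be run non-interactively. The sketch below builds the DELETE statement for a given job ID threshold and rejects anything that is not a plain integer, so that a mistyped argument cannot alter the statement. The function name delete_jobs_sql is hypothetical. The table name and the id column come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print a DELETE statement that removes every job
# whose id is below the given threshold.
delete_jobs_sql() {
    case $1 in
        ''|*[!0-9]*)
            echo "job id must be a non-negative integer" >&2
            return 1 ;;
    esac
    printf 'DELETE FROM webconsole_job WHERE id < %s;\n' "$1"
}

# Print the statement for review before running it against the database.
delete_jobs_sql 3
```

After reviewing the statement, it can be passed to psql with the connection options from the example above: psql -h 127.0.0.1 -U postgres -d postgres -c "$(delete_jobs_sql 3)"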

4.8. Failure to View or Delete a VNC

If you fail to view a VNC session on the LiCO page, restart LiCO.

If you fail to delete a VNC session on the LiCO page, log in to the node that hosts the VNC session, find the process ID of the session to be deleted with the command ps -ef|grep Xvnc, and terminate that process. Use kill rather than kill -9 when terminating the session. Information on the LiCO page will update after about 30 seconds.
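Picking the process ID out of the ps -ef listing can be scripted. The following is a minimal sketch with a hypothetical function name, xvnc_pids. It assumes the standard ps -ef column layout (PID in the second column, command starting in the eighth) and that the display number, such as :1, is the first argument of the Xvnc command.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef" output on stdin and print the PID of
# each Xvnc process. With an argument such as ":1", print only the process
# serving that display. The "grep Xvnc" line itself is never matched,
# because its command field is "grep".
xvnc_pids() {
    awk -v disp="${1:-}" \
        '$8 ~ /(^|\/)Xvnc$/ && (disp == "" || $9 == disp) { print $2 }'
}

# Typical use on the VNC session node (plain kill sends SIGTERM, not -9):
#   ps -ef | xvnc_pids :1      # show the PID for display :1
#   kill <PID>                 # then terminate that process
```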

4.9. Data Sources for GPU Monitoring

LiCO can only monitor GPUs produced by NVIDIA. Monitoring data (including GPU usage rates, memory, temperature, and usage status) is obtained through the official NVIDIA API.

If you need to check GPU monitoring data on the node’s operating system, run nvidia-smi on the command line.
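For a compact view of the same kinds of values, nvidia-smi can print selected fields as CSV. The sketch below formats that output one GPU per line. The function name format_gpu_stats is hypothetical, and the queried fields are chosen to match the data named above; which fields LiCO itself reads is not stated in this manual.

```shell
#!/bin/sh
# Hypothetical helper: turn CSV lines of the form
#   index, utilization, memory used, memory total, temperature
# into one readable line per GPU.
format_gpu_stats() {
    awk -F', *' '{
        printf "GPU %s: util %s%%, mem %s/%s MiB, temp %s C\n", $1, $2, $3, $4, $5
    }'
}

# Query the local GPUs only when nvidia-smi is installed on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
        --format=csv,noheader,nounits | format_gpu_stats
fi
```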