
KaaS User Guide
version beta

Contents

Copyright notice
Preface
  Intended audience
  Documentation history
Create and manage a KaaS child cluster
  Create and manage a baremetal-based KaaS child cluster
    Create a child cluster
    Add a bare metal host
    Add a machine
    Add a Ceph cluster
    Delete a child cluster
  Create and manage an OpenStack-based KaaS child cluster
    Create a child cluster
    Add a machine
    Delete a child cluster
  Create and manage an AWS-based KaaS child cluster
    Create a child cluster
    Add a machine
    Delete a child cluster
  Change a cluster configuration
    Update a child cluster
    Delete a machine
Manage a KaaS management cluster
Connect to a KaaS cluster
Manage IAM
  IAM CLI
    Configure IAM CLI
    Available IAM CLI commands
  Role list
Manage StackLight
  Access StackLight web UIs

Mirantis Kubernetes-as-a-Service User Guide version beta

©2020, Mirantis Inc. Page i

    View Grafana dashboards
    View Kibana dashboards
  Available StackLight alerts
    Alertmanager
      AlertmanagerFailedReload
      AlertmanagerMembersInconsistent
      AlertmanagerNotificationFailureWarning
      AlertmanagerAlertsInvalidWarning
    Calico
      CalicoDataplaneFailuresHigh
      CalicoDataplaneAddressMsgBatchSizeHigh
      CalicoDatapaneIfaceMsgBatchSizeHigh
      CalicoIPsetErrorsHigh
      CalicoIptablesSaveErrorsHigh
      CalicoIptablesRestoreErrorsHigh
    Ceph
      CephClusterHealthMinor
      CephClusterHealthCritical
      CephMonQuorumAtRisk
      CephOsdDownMinor
      CephOSDDiskNotResponding
      CephOSDDiskUnavailable
      CephClusterNearFull
      CephClusterCriticallyFull
      CephOsdPgNumTooHighWarning
      CephOsdPgNumTooHighCritical
      CephMonHighNumberOfLeaderChanges
      CephNodeDown
      CephDataRecoveryTakingTooLong
      CephPGRepairTakingTooLong
      CephOSDVersionMismatch
      CephMonVersionMismatch
    Elasticsearch



      ElasticHeapUsageTooHigh
      ElasticHeapUsageWarning
      ElasticClusterRed
      ElasticClusterYellow
      NumberOfRelocationShards
      NumberOfInitializingShards
      NumberOfUnassignedShards
      NumberOfPendingTasks
      ElasticNoNewDocuments
    etcd
      etcdInsufficientMembers
      etcdNoLeader
      etcdHighNumberOfLeaderChanges
      etcdGRPCRequestsSlow
      etcdMemberCommunicationSlow
      etcdHighNumberOfFailedProposals
      etcdHighFsyncDurations
      etcdHighCommitDurations
    General alerts
      TargetDown
      NodeDown
      Watchdog
    General node alerts
      SystemCpuFullWarning
      SystemLoadTooHighWarning
      SystemLoadTooHighCritical
      SystemDiskFullWarning
      SystemDiskFullMajor
      SystemMemoryFullWarning
      SystemMemoryFullMajor
      SystemDiskInodesFullWarning
      SystemDiskInodesFullMajor
      SystemDiskErrorsTooHigh



    Ironic
      IronicMetricsMissing
      IronicApiOutage
    Kubernetes applications
      KubePodCrashLooping
      KubePodNotReady
      KubeDeploymentGenerationMismatch
      KubeDeploymentReplicasMismatch
      KubeStatefulSetReplicasMismatch
      KubeStatefulSetGenerationMismatch
      KubeStatefulSetUpdateNotRolledOut
      KubeDaemonSetRolloutStuck
      KubeDaemonSetNotScheduled
      KubeDaemonSetMisScheduled
      KubeCronJobRunning
      KubeJobCompletion
      KubeJobFailed
    Kubernetes resources
      KubeCPUOvercommitPods
      KubeMemOvercommitPods
      KubeCPUOvercommitNamespaces
      KubeMemOvercommitNamespaces
      KubeQuotaExceeded
      CPUThrottlingHigh
    Kubernetes storage
      KubePersistentVolumeUsageCritical
      KubePersistentVolumeFullInFourDays
      KubePersistentVolumeErrors
    Kubernetes system
      KubeNodeNotReady
      KubeVersionMismatch
      KubeClientErrors
      KubeletTooManyPods



      KubeAPILatencyHighWarning
      KubeAPILatencyHighCritical
      KubeAPIErrorsHighCritical
      KubeAPIErrorsHighWarning
      KubeAPIResourceErrorsHighCritical
      KubeAPIResourceErrorsHighWarning
      KubeClientCertificateExpirationInSevenDays
      KubeClientCertificateExpirationInOneDay
      ContainerScrapeError
    MongoDB
      MongodbCursorsOpenTooMany
      MongodbCursorTimeouts
      MongodbConnectionsTooMany
      MongodbMemoryUsageWarning
    Netchecker
      NetCheckerAgentErrors
      NetCheckerReportsMissing
      NetCheckerTCPServerDelay
      NetCheckerDNSSlow
    NGINX
      NginxServiceDown
      NginxDroppedIncomingConnections
    Node network
      SystemRxPacketsErrorTooHigh
      SystemTxPacketsErrorTooHigh
      SystemRxPacketsDroppedTooHigh
      SystemTxPacketsDroppedTooHigh
      NodeNetworkInterfaceFlapping
    Node time
      ClockSkewDetected
    Prometheus
      PrometheusConfigReloadFailed
      PrometheusNotificationQueueRunningFull



      PrometheusErrorSendingAlertsWarning
      PrometheusErrorSendingAlertsCritical
      PrometheusNotConnectedToAlertmanagers
      PrometheusTSDBReloadsFailing
      PrometheusTSDBCompactionsFailing
      PrometheusTSDBWALCorruptions
      PrometheusNotIngestingSamples
      PrometheusTargetScrapesDuplicate
      PrometheusRuleEvaluationsFailed
    Salesforce notifier
      SfNotifierDown
      SfNotifierAuthFailure
    SMART disks
      SystemSMARTDiskUDMACrcErrorsTooHigh
      SystemSMARTDiskHealthStatus
      SystemSMARTDiskReadErrorRate
      SystemSMARTDiskSeekErrorRate
      SystemSMARTDiskTemperatureHigh
      SystemSMARTDiskReallocatedSectorsCount
      SystemSMARTDiskCurrentPendingSectors
      SystemSMARTDiskReportedUncorrectableErrors
      SystemSMARTDiskOfflineUncorrectableSectors
      SystemSMARTDiskEndToEndError
    SSL certificates
      SSLCertExpirationWarning
      SSLCertExpirationCritical
    Telemeter
      TelemeterClientAuthenticationFailed
      TelemeterClientFederationFailed
  Disable workload monitoring



Copyright notice

2020 Mirantis, Inc. All rights reserved.

This product is protected by U.S. and international copyright and intellectual property laws. No part of this publication may be reproduced in any written, electronic, recording, or photocopying form without written permission of Mirantis, Inc.

Mirantis, Inc. reserves the right to modify the content of this document at any time without prior notice. Functionality described in the document may not be available at the moment. The document contains the latest information at the time of publication.

Mirantis, Inc. and the Mirantis Logo are trademarks of Mirantis, Inc. and/or its affiliates in the United States and other countries. Third party trademarks, service marks, and names mentioned in this document are the properties of their respective owners.



Preface

This documentation provides information on how to use Mirantis products to deploy cloud environments. The information is for reference purposes and is subject to change.

Intended audience

This documentation assumes that the reader is familiar with network and cloud concepts and is intended for the following users:

• Infrastructure Operator

  • Is a member of the IT operations team
  • Has working knowledge of Linux, virtualization, Kubernetes API and CLI, and OpenStack to support the application development team
  • Accesses Mirantis KaaS and Kubernetes through a local machine or web UI
  • Provides verified artifacts through a central repository to the Tenant DevOps engineers

• Tenant DevOps engineer

  • Is a member of the application development team and reports to line-of-business (LOB)
  • Has working knowledge of Linux, virtualization, Kubernetes API and CLI to support application owners
  • Accesses Mirantis KaaS and Kubernetes through a local machine or web UI
  • Consumes artifacts from a central repository approved by the Infrastructure Operator

Documentation history

The documentation set refers to Mirantis KaaS beta as the latest released beta version of the product. For details about the release dates of the KaaS beta minor releases, refer to KaaS releases (https://docs.mirantis.com/kaas/beta/kaas-release-notes/kaas-releases.html).

Create and manage a KaaS child cluster

Note
This tutorial applies only to the KaaS web UI users with the writer or operator access role assigned by the Infrastructure Operator.

After you deploy the KaaS management cluster, you can start creating the KaaS child clusters that will be based on the same cloud provider type that you have for the KaaS management cluster: OpenStack, AWS, or bare metal.

The deployment procedure is performed using the KaaS web UI and comprises the following steps:

1. Create an initial cluster configuration depending on the provider type.
2. For a baremetal-based child cluster, create and configure bare metal hosts with corresponding labels for machines such as worker, control plane, or storage.
3. Add the required number of machines with the corresponding configuration to the child cluster.
4. For a baremetal-based child cluster, add a Ceph cluster.

Create and manage a baremetal-based KaaS child cluster

After bootstrapping your baremetal-based KaaS management cluster as described in KaaS Deployment Guide: Deploy a baremetal-based management cluster (https://docs.mirantis.com/kaas/beta/kaas-deployment-guide/deploy-bm-mgmt.html), you can start creating the baremetal-based KaaS child clusters using the KaaS web UI.

Create a child cluster

This section instructs you on how to configure and deploy a Mirantis KaaS child cluster that is based on the baremetal-based Mirantis KaaS management cluster through the KaaS web UI.

To create a Mirantis KaaS child cluster on bare metal:

1. Log in to the KaaS web UI with the operator or writer permissions.
2. Select the required namespace.
3. On the SSH keys page, click Add key (the + icon) to upload the public SSH key that will be used for the SSH access to VMs.
4. In the Clusters block, click Create cluster (the + icon).
5. Configure the new cluster in the Create new cluster wizard that opens:

   1. Define general and Kubernetes parameters:

      Create new cluster: General, Provider, and Kubernetes

      Section: General settings

      • Name - The cluster name.
      • Provider - Select Baremetal.
      • Region - From the drop-down list, select Baremetal.
      • Release version - The Mirantis KaaS version.

      Section: Kubernetes applications

      • Istio - Select to enable Istio service mesh for application owners.

        Caution!
        Istio is deprecated since the Cluster releases 3.1.0 and 2.2.0 and will be removed in future releases.

      • Kubernetes Dashboard - Select to enable the Kubernetes Dashboard to manage applications that run on a Kubernetes cluster as well as troubleshoot them using the web UI.
      • SSH keys - From the drop-down list, select the SSH key name that you have previously added for SSH access to the OpenStack VMs.

      Section: Provider

      • LB host IP - The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster. This IP address must be from the same subnet as used for DHCP in Metal³.
      • LB address range - The range of IP addresses that can be assigned to load balancers for Kubernetes Services by MetalLB.

      Section: Kubernetes

      • Node CIDR - The Kubernetes worker nodes CIDR block. For example, 10.10.10.0/24.
      • Services CIDR blocks - The Kubernetes Services CIDR blocks. For example, 10.233.0.0/18.
      • Pods CIDR blocks - The Kubernetes pods CIDR blocks. For example, 10.233.64.0/18.
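The address parameters above depend on each other, so it can help to check them before filling out the wizard. The following is a minimal Python sketch, not a KaaS tool: the function name check_network_plan, the DHCP subnet value, and the LB host IP are hypothetical, and the non-overlap check reflects general Kubernetes networking practice rather than a rule stated in this guide.

```python
import ipaddress
from itertools import combinations


def check_network_plan(lb_host_ip, dhcp_subnet, cidr_blocks):
    """Return a list of problems; an empty list means the plan looks consistent."""
    problems = []
    subnet = ipaddress.ip_network(dhcp_subnet)
    # The guide requires the LB host IP to be from the DHCP subnet used in Metal³.
    if ipaddress.ip_address(lb_host_ip) not in subnet:
        problems.append(f"LB host IP {lb_host_ip} is outside the DHCP subnet {dhcp_subnet}")
    # Assumption: node, Services, and pods CIDR blocks should not overlap.
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidr_blocks.items()}
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} {net_a} overlaps {name_b} {net_b}")
    return problems


# Example values taken from the parameter descriptions above.
example = {
    "Node CIDR": "10.10.10.0/24",
    "Services CIDR blocks": "10.233.0.0/18",
    "Pods CIDR blocks": "10.233.64.0/18",
}
print(check_network_plan("10.10.10.5", "10.10.10.0/24", example))
```

An empty list means that no problem was found for the values passed in. The check only compares addresses and does not contact the cluster.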

   2. Optional, recommended. Enable and configure StackLight:

      StackLight configuration

      Section: StackLight

      • Enabled - Select to enable StackLight monitoring.

        Note
        You can also enable, disable, or configure StackLight parameters after deploying a KaaS child cluster. For details, see Change a cluster configuration.

      • Multiserver mode - Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see KaaS Reference Architecture: StackLight deployment architecture (https://docs.mirantis.com/kaas/beta/kaas-ref-arch/components-stack/monitoring/deployment-arch.html).
      • Elasticsearch retention time - The Elasticsearch logs retention period in Logstash.
      • Elasticsearch persistent volume claim size - The Elasticsearch persistent volume claim size.
      • Prometheus retention time - The Prometheus database retention period.
      • Prometheus retention size - The Prometheus database retention size.
      • Prometheus persistent volume claim size - The Prometheus persistent volume claim size.
      • Enable Watchdog alert - Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.


      • Custom alerts - Specify alerting rules for new custom alerts or upload a YAML file in the following example format:

          - alert: HighErrorRate
            expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
            for: 10m
            labels:
              severity: page
            annotations:
              summary: High request latency

        For details, see Official Prometheus documentation: Alerting rules (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#defining-alerting-rules). For the list of the predefined StackLight alerts, see KaaS User Guide: Available StackLight alerts.

      Section: StackLight email alerts

      • Enabled - Select to enable the StackLight email alerts.
      • Send resolved - Select to enable notifications about resolved StackLight alerts.
      • Require TLS - Select to enable transmitting emails through TLS.
      • Email alerts configuration for StackLight - Fill out the following email alerts parameters as required:

        • To - the email address to send notifications to.
        • From - the sender address.
        • SmartHost - the SMTP host through which the emails are sent.
        • Authentication username - the SMTP user name.
        • Authentication password - the SMTP password.
        • Authentication identity - the SMTP identity.
        • Authentication secret - the SMTP secret.

      Section: StackLight Slack alerts

      • Enabled - Select to enable the StackLight Slack alerts.
      • Send resolved - Select to enable notifications about resolved StackLight alerts.
      • Slack alerts configuration for StackLight - Fill out the following Slack alerts parameters as required:

        • API URL - The Slack webhook URL.
        • Channel - The channel to send notifications to, for example, #channel-for-alerts.

6. Click Create.

Now, proceed to Add a bare metal host.
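Before you paste a custom alert into the Custom alerts field of the StackLight configuration, a quick structural check can catch missing fields. The sketch below is only an illustration: check_alert_rule is a hypothetical helper, the rule is written as a Python dictionary instead of being parsed from YAML, the allowed keys are limited to the fields used in the example rule of this guide, and the duration pattern is a simplified approximation of the Prometheus duration syntax.

```python
import re

# Fields used by the example rule in this guide; Prometheus may accept more.
ALLOWED_KEYS = {"alert", "expr", "for", "labels", "annotations"}
# Simplified duration pattern such as 10m, 30s, or 1h30m (an approximation).
DURATION = re.compile(r"^(\d+(ms|s|m|h|d|w|y))+$")


def check_alert_rule(rule):
    """Return a list of structural problems found in one alerting rule."""
    problems = []
    for key in ("alert", "expr"):
        if not rule.get(key):
            problems.append(f"missing required field: {key}")
    for key in sorted(set(rule) - ALLOWED_KEYS):
        problems.append(f"unexpected field: {key}")
    if "for" in rule and not DURATION.match(str(rule["for"])):
        problems.append(f"invalid duration in 'for': {rule['for']}")
    return problems


# The example rule from this guide, expressed as a dictionary.
rule = {
    "alert": "HighErrorRate",
    "expr": 'job:request_latency_seconds:mean5m{job="myjob"} > 0.5',
    "for": "10m",
    "labels": {"severity": "page"},
    "annotations": {"summary": "High request latency"},
}
print(check_alert_rule(rule))
```

The check does not evaluate the PromQL expression itself. It only confirms that the rule has a name, an expression, and a plausible duration.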


Add a bare metal host

Before you proceed with adding a bare metal host, verify that the physical network on the server has been configured correctly. See KaaS Reference Architecture: Network fabric (https://docs.mirantis.com/kaas/beta/kaas-ref-arch/kaas-reqs/bom-bm/bm-network-fabric.html) for details.

To add a bare metal host to a baremetal-based KaaS child cluster:

1. Log in to the KaaS web UI with the operator permissions.
2. Select the required namespace.
3. Add unique credentials for a new bare metal host:

   1. On the upper right side of the namespace page, click Credentials. The Credentials page opens.
   2. Click Add Credential (the + icon).
   3. Type in a credential name.
   4. Select the Baremetal credential type.
   5. Select the Baremetal credential region.
   6. Enter Username and Password.
   7. Click Create.

   Note
   Every bare metal host requires its own credentials.

4. On the upper right side of the namespace page, click Baremetal. The Baremetal page opens.
5. On the Baremetal page, click Add BM Host (the + icon).
6. Fill out the Add new BM host form as required:

   • Name
     Specify the name of the new bare metal host.

   • Credential
     Select the credentials that you created for the host in step 3.

   • Boot MAC Address
     Specify the MAC address of the PXE network interface.

   • Address
     Specify the URL to access the BMC. Should start with https://.

   • Label
     Assign the machine label to the new host that defines which type of machine may be deployed on this bare metal host. Only one label can be assigned to a host. The supported labels include:

     • Worker
       Assigned by default. The host with this label may be used to deploy the worker machine type. Assign this label to the bare metal hosts that have sufficient CPU and RAM resources, as described in KaaS Reference Architecture: Reference hardware configuration (https://docs.mirantis.com/kaas/beta/kaas-ref-arch/kaas-reqs/bom-bm/bm-hw-reqs.html).

     • Storage
       Assign this label to the bare metal hosts that have sufficient storage devices to match KaaS Reference Architecture: Reference hardware configuration. Hosts with this label will be used to deploy machines with the storage type that run Ceph OSDs.

     • Control plane
       Assign this label to the bare metal hosts that may be used to deploy machines with the control plane type. These hosts must match the CPU and RAM requirements from KaaS Reference Architecture: Reference hardware configuration.

7. Click Create.

While adding the bare metal host, Mirantis KaaS discovers and inspects the hardware of the bare metal host and adds it to BareMetalHost.spec for future reference.

Now, you can proceed to Add a machine.
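The constraints of the Add new BM host form described above (a MAC address for the PXE interface, a BMC URL that starts with https://, and exactly one of three labels) can be checked before you open the web UI. The following Python sketch is an illustration under stated assumptions: check_bm_host is a hypothetical helper, not a KaaS API, and the MAC address pattern accepts only the colon-separated notation.

```python
import re

# Colon-separated MAC address, for example 0c:c4:7a:aa:bb:cc (assumed notation).
MAC = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")
# Labels listed in this guide; Worker is assigned by default.
SUPPORTED_LABELS = {"Worker", "Storage", "Control plane"}


def check_bm_host(name, boot_mac, bmc_address, label="Worker"):
    """Return a list of problems with the values intended for the form."""
    problems = []
    if not name:
        problems.append("Name is empty")
    if not MAC.match(boot_mac):
        problems.append(f"Boot MAC Address is not a valid MAC address: {boot_mac}")
    if not bmc_address.startswith("https://"):
        problems.append("Address should start with https://")
    if label not in SUPPORTED_LABELS:
        problems.append(f"Label must be one of {sorted(SUPPORTED_LABELS)}")
    return problems


# Hypothetical host values used only for illustration.
print(check_bm_host("bm-host-1", "0c:c4:7a:aa:bb:cc", "https://192.0.2.10", "Storage"))
print(check_bm_host("bm-host-2", "0cc47aaabbcc", "http://192.0.2.11"))
```

The first call reports no problems, while the second one reports the MAC address notation and the http:// scheme.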

Add a machine

After you add a bare metal host to the child cluster as described in Add a bare metal host, you can create a Kubernetes machine in your cluster.

To add a Kubernetes machine to a baremetal-based KaaS child cluster:

1. Log in to the KaaS web UI with the operator or writer permissions.
2. Select the namespace where to add the machine.
3. In the Clusters block, click the required cluster name. The Machines page opens.
4. On the Machines page, click Create machine (the + icon).
5. Fill out the Create new machine form as required:

   • Count
     Specify the number of machines to add.

   • Control Plane
     Select Control Plane to create a Kubernetes control plane node. Otherwise, the Kubernetes worker node will be created. The recommended minimum number of machines is three for the control plane HA and two for the KaaS workloads.

   • Bare metal host label
     Assign the role to the new machine(s) to link the machine to a previously created bare metal host with the corresponding label. You can assign one role type per machine. The supported labels include:

     • Worker
       The default role for any node in a child cluster. Only the kubelet service is running on the machines of this type.

     • Control plane
       This node hosts the control plane services of the child cluster. For reliability reasons, KaaS does not permit running end user workloads on the control plane nodes or using them as storage nodes.

     • Storage
       This node is a worker node that also hosts Ceph OSD daemons and provides its disk resources to Ceph. KaaS permits end users to run workloads on storage nodes by default.

6. Click Create.

At this point, Mirantis KaaS adds the new machine object to the specified KaaS child cluster, and the Bare Metal Operator controller creates the relation to BareMetalHost with the labels matching the roles.

Provisioning of the newly created machine starts when the machine object is created and includes the following stages:

1. Creation of partitions on the local disks as required by the operating system and the Mirantis KaaS architecture.
2. Configuration of the network interfaces on the host as required by the operating system and the Mirantis KaaS architecture.
3. Installation and configuration of the KaaS LCM agent.

See also

• Add a Ceph cluster
• Connect to a KaaS cluster

Add a Ceph cluster

After you add machines to your new bare metal KaaS child cluster as described in Add a machine, you can create a Ceph cluster on top of this child cluster using the KaaS web UI.

The procedure below enables you to create a Ceph cluster with a minimum of three nodes that provides persistent volumes to the Kubernetes workloads in the KaaS child cluster.

To create a Ceph cluster in the KaaS child cluster:

1. Log in to the KaaS web UI with the operator or writer permissions.
2. Select the namespace.
3. In the Ceph block, click Create Ceph cluster (the + icon).
4. Configure the Ceph cluster in the Create new Ceph cluster wizard that opens:

   Create new Ceph cluster

   Section: General settings

   • Name - The Ceph cluster name.
   • Cluster - Select the name of the KaaS child cluster that will host the new Ceph cluster.

   Section: Machines / Machine #1-3

   • Select machine - Select the name of the Kubernetes machine that will host the corresponding Ceph node in the Ceph cluster.
   • Manager, Monitor - Select the required Ceph services to install on the Ceph node.
   • Devices - Select the disk that Ceph will use.

     Warning
     Do not select the device for system services, for example, sda.

5. To add more Ceph nodes to the new Ceph cluster, click + next to any Ceph Machine title in the Machines tab. Configure a Ceph node as required.

   Warning
   Do not add more than 3 Manager and/or Monitor services to the Ceph cluster.

6. After you add and configure all nodes in your Ceph cluster, click Create.
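The limits mentioned in the procedure above (a minimum of three nodes, no more than 3 Manager and/or Monitor services, and no system device such as sda) can be expressed as a short pre-flight check. This Python sketch is illustrative only: check_ceph_plan and the node dictionaries are hypothetical, and the service limit is interpreted here as a limit per service type, which is an assumption about the wording of the warning.

```python
def check_ceph_plan(nodes, system_devices=("sda",)):
    """Return a list of problems for a planned Ceph cluster layout."""
    problems = []
    if len(nodes) < 3:
        problems.append("the Ceph cluster needs at least three nodes")
    # Assumption: the limit of 3 applies to each service type separately.
    for service in ("manager", "monitor"):
        count = sum(1 for node in nodes if service in node["services"])
        if count > 3:
            problems.append(f"more than 3 {service} services selected: {count}")
    for node in nodes:
        if node["device"] in system_devices:
            problems.append(f"{node['machine']} uses the system device {node['device']}")
    return problems


# Hypothetical plan: three machines, each running both services on sdb.
plan = [
    {"machine": "storage-1", "services": {"manager", "monitor"}, "device": "sdb"},
    {"machine": "storage-2", "services": {"manager", "monitor"}, "device": "sdb"},
    {"machine": "storage-3", "services": {"manager", "monitor"}, "device": "sdb"},
]
print(check_ceph_plan(plan))
```

The list of system devices defaults to sda because that is the example given in the warning. Adjust it to match the actual disk layout of the hosts.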

Delete a child cluster

Deleting a baremetal-based KaaS child cluster does not require a preliminary deletion of the machines running on the cluster.

To delete a baremetal-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. Click the Delete cluster icon next to the name of the cluster you need to remove.
4. Verify the list of machines to be removed. Confirm the deletion.
5. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

   1. On the upper right side of the required namespace page, click Credentials.
   2. On the Credentials page, click the Delete credential action icon next to the name of the credentials to be deleted. Confirm the deletion.

   Warning
   You can delete credentials only after deleting the KaaS cluster they relate to.

Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs, and so on.

Create and manage an OpenStack-based KaaS child cluster

After bootstrapping your OpenStack-based KaaS management cluster as described in KaaS Deployment Guide: Deploy an OpenStack-based management cluster (https://docs.mirantis.com/kaas/beta/kaas-deployment-guide/deploy-os-mgmt.html), you can create the OpenStack-based KaaS child clusters using the KaaS web UI.

Create a child cluster

This section describes how to create an OpenStack-based KaaS child cluster using the KaaS web UI of the OpenStack-based KaaS management cluster.

To create an OpenStack-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. On the upper right side of the namespace page, click SSH keys. The SSH keys page opens.
4. On the SSH keys page, click Add key (the + icon) to upload the public SSH key that will be used for the OpenStack VMs creation.
5. On the upper right side of the namespace page, click Credentials. The Credentials page opens.
6. On the Credentials page, click Add credential (the + icon) to add your OpenStack credentials. You can either upload your OpenStack clouds.yaml configuration file or fill in the fields manually.
7. In the Clusters block, click Create cluster (the + icon) and fill out the form with the following parameters as required:

   1. Configure general settings and the Kubernetes parameters:

      KaaS child cluster configuration

      Section: General settings

      • Name - Cluster name.
      • Provider - Select OpenStack.
      • Provider credential - From the drop-down list, select the OpenStack credentials name that you created in the previous step.
      • Release version - The Mirantis KaaS version.

      Section: Kubernetes applications

      • Istio - Select to enable Istio service mesh for application owners.

        Caution!
        Istio is deprecated since the Cluster releases 3.1.0 and 2.2.0 and will be removed in future releases.

      • Kubernetes Dashboard - Select to enable the Kubernetes Dashboard to manage applications that run on a Kubernetes cluster as well as troubleshoot them using the web UI.
      • SSH keys - From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Section: Provider

      • External network - Type of the external network in the OpenStack cloud provider.
      • DNS name servers - Comma-separated list of the DNS host IP addresses for the OpenStack VMs configuration.

      Section: Kubernetes

      • Node CIDR - The Kubernetes nodes CIDR block. For example, 10.10.10.0/24.
      • Services CIDR blocks - The Kubernetes Services CIDR block. For example, 10.233.0.0/18.
      • Pods CIDR blocks - The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.






   2. Optional, recommended. Enable and configure StackLight. The StackLight parameters are the same as in the StackLight configuration table of the procedure for a baremetal-based KaaS child cluster.





8. Click Create.

   To view the deployment status, use the Status column in the Clusters tab. Once the Updating status disappears, the deployment is complete.

9. Proceed with Add a machine.

See also

Delete a child cluster

Add a machine

After you create a new OpenStack-based KaaS child cluster as described in Create a child cluster, proceed with adding machines to this cluster using the KaaS web UI.

You can also use the instruction below to scale up an existing KaaS child cluster.

To add a machine to an OpenStack-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click the required cluster name. The Machines page opens.
4. On the Machines page, click Create machine (the + icon).
5. Fill out the form with the following parameters as required:

   KaaS machine configuration

   • Count - Add the required number of machines to create. The recommended minimum number of machines is three for the control plane HA and two for the KaaS workloads. Select Control Plane for a machine with the control plane role. Otherwise, the machine will have the worker role.
   • Flavor - From the drop-down list, select the required hardware configuration for the machine. The list of available flavors corresponds to the one in your OpenStack environment. For the hardware requirements, see Mirantis KaaS Reference Architecture (https://docs.mirantis.com/kaas/beta/kaas-ref-arch/kaas-reqs/bom-os.html).
   • Image - From the drop-down list, select the cloud image with Ubuntu 18.04. If you do not have this image in the list, add it to your OpenStack environment using the Horizon web UI by downloading the image from the Ubuntu official website (http://cloud-images.ubuntu.com/bionic/).
   • Availability zone - From the drop-down list, select the availability zone from which the new machine will be launched.

6. Click Create.

   To view the deployment status, use the Status column on the Machines page. Once the status changes from Pending or Updating to Ready, the deployment is complete.

7. Repeat the steps above for the remaining number of machines.
8. Verify the status of the cluster nodes as described in Connect to a KaaS cluster.

Deleting a machine does not require preliminary actions. You can delete a machine using the Delete machine icon on the Machines page of the KaaS web UI. Deleting a machine automatically frees up the resources allocated for this machine.

Warning
The operational KaaS child cluster should contain a minimum of 3 Kubernetes control plane nodes and 2 Kubernetes worker nodes. To meet the etcd quorum and to prevent the deployment failure, scaling down of the control plane nodes is prohibited.
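The etcd quorum requirement behind this warning is simple arithmetic: a cluster of n members needs a majority of n // 2 + 1 members to stay available. The short Python sketch below only illustrates that formula. The function names are made up for this example and are not part of KaaS or etcd.

```python
def etcd_quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """Number of members that can fail while the quorum is still met."""
    return members - etcd_quorum(members)


for members in (1, 2, 3, 5):
    print(f"members={members} quorum={etcd_quorum(members)} "
          f"tolerated failures={tolerated_failures(members)}")
```

With three control plane nodes the quorum is two, so the cluster survives the loss of one node. With two nodes the quorum is also two, so no failure is tolerated, which is why scaling the control plane down from three is not allowed.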

Delete a child cluster

Deleting a KaaS child cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an OpenStack-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click the Delete cluster action icon next to the name of the cluster to be deleted.
4. Verify the list of machines to be removed. Confirm the deletion.

   Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

5. If Istio and Harbor were enabled, verify the OpenStack volumes. Since the Istio and Harbor storage is external, manually delete the corresponding resources using the OpenStack web UI or API.

   Caution!
   Deprecation notes
   Since the Cluster releases 3.1.0 and 2.2.0, Harbor support has been removed and Istio support has been deprecated. Istio will be removed in future Cluster releases.

6. If the cluster deletion hangs and the The cluster is being deleted message does not disappear for a while:

   1. In the upper right corner of the KaaS web UI, click the arrow next to your user name to open the drop-down menu.
   2. In the drop-down menu, click Download kubeconfig to download kubeconfig of your KaaS management cluster.
   3. Log in to any local machine with kubectl installed.
   4. Copy the downloaded kubeconfig to this machine.
   5. Run the following command:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <NAMESPACE_NAME> cluster <CHILD_CLUSTER_NAME>

   6. Edit the opened cluster object by removing the following lines:

      finalizers:
      - cluster.cluster.k8s.io

7. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

   1. On the upper right side of the required namespace page, click Credentials.
   2. On the Credentials page, click the Delete credential action icon next to the name of the credentials to be deleted. Confirm the deletion.
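The interactive edit in step 6 above can also be expressed as a non-interactive command. The following Python sketch only builds and prints such a command and does not run it. It relies on the standard kubectl patch subcommand with a merge patch, which is not the method described in this guide, and build_remove_finalizers_command as well as the sample argument values are hypothetical. Note that this patch clears all finalizers of the object, not only cluster.cluster.k8s.io.

```python
import json
import shlex


def build_remove_finalizers_command(kubeconfig_path, namespace, cluster_name):
    """Build a kubectl command that clears the finalizers of a cluster object."""
    # A JSON merge patch with a null value removes the whole finalizers field.
    patch = json.dumps({"metadata": {"finalizers": None}})
    return [
        "kubectl", "--kubeconfig", kubeconfig_path,
        "patch", "cluster", cluster_name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]


# Hypothetical values; replace them with your kubeconfig path, namespace, and cluster name.
command = build_remove_finalizers_command("./kubeconfig.yaml", "my-namespace", "my-child-cluster")
print(shlex.join(command))
```

Review the printed command before running it on a machine with kubectl installed. Use it only when the deletion hangs as described in step 6.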


Create and manage an AWS-based KaaS child cluster

After bootstrapping your AWS-based KaaS management cluster as described in KaaS Deployment Guide: Deploy an AWS-based management cluster (https://docs.mirantis.com/kaas/beta/kaas-deployment-guide/deploy-aws-mgmt.html), you can create the AWS-based KaaS child clusters using the KaaS web UI.

Create a child cluster

This section describes how to create an AWS-based KaaS child cluster using the KaaS web UI of the AWS-based KaaS management cluster.

To create an AWS-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. On the upper right side of the namespace page, click SSH keys. The SSH keys page opens.
4. On the SSH keys page, click Add key (the + icon) to upload the public SSH key that will be used for the AWS VMs creation.
5. On the upper right side of the namespace page, click Credentials. The Credentials page opens.
6. On the Credentials page, click Add credential (the + icon) and fill in the required fields to add your AWS credentials.
7. Return to the namespace page.
8. In the Clusters block, click Create cluster (the + icon) and fill out the form with the following parameters as required:

   1. Configure general settings and the Kubernetes parameters:

      KaaS child cluster configuration

      Section: General settings

      • Name - Cluster name.
      • Provider - Select AWS.
      • Provider credential - From the drop-down list, select the previously created AWS credentials name.
      • Release version - The Mirantis KaaS version.

      Section: Kubernetes applications

      • Istio - Select to enable Istio service mesh for application owners.

        Caution!
        Istio is deprecated since the Cluster releases 3.1.0 and 2.2.0 and will be removed in future releases.

      • Kubernetes Dashboard - Select to enable the Kubernetes Dashboard to manage applications that run on a Kubernetes cluster as well as troubleshoot them using the web UI.
      • SSH keys - From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Section: Provider

      • AWS region - Type in the AWS Region for the KaaS child cluster. For example, us-east-2.

      Section: Kubernetes

      • Services CIDR blocks - The Kubernetes Services CIDR block. For example, 10.233.0.0/18.
      • Pods CIDR blocks - The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.






   2. Optional, recommended. Enable and configure StackLight. The StackLight parameters are the same as in the StackLight configuration table of the procedure for a baremetal-based KaaS child cluster.

9. Click Create.

   To view the deployment status, use the Status column in the Clusters tab. Once the Updating status disappears, the deployment is complete.

10. Proceed with Add a machine.

See also

Delete a child cluster

Add a machine

After you create a new AWS-based KaaS child cluster as described in Create a child cluster, proceed with adding machines to this cluster using the KaaS web UI.

You can also use the instruction below to scale up an existing KaaS child cluster.

To add a machine to an AWS-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click the required cluster name. The Machines page opens.
4. On the Machines page, click Create machine (the + icon).

   KaaS machine configuration

   • Count - Add the required number of machines to create. The recommended minimum number of machines is three for the control plane HA and two for the KaaS workloads. Select Control Plane for a machine with the control plane role. Otherwise, the machine will have the worker role.
   • Instance type - Type in the AWS instance type, that is, c5d.2xlarge.
   • AMI ID - Type in the required AMI ID of Ubuntu 18.04. For example, ami-033a0960d9d83ead0.
   • Root device size - Select the required root device size, 40 by default.

5. Click Create.
6. Repeat the steps above for the remaining number of machines.

   To view the deployment status, use the Status column on the Machines page. Once the status changes from Pending or Updating to Ready, the deployment is complete.

7. Verify the status of the cluster nodes as described in Connect to a KaaS cluster.

Deleting a machine does not require preliminary actions. You can delete a machine using the Delete machine icon on the Machines page of the KaaS web UI. Deleting a machine automatically frees up the resources allocated for this machine.


Delete a child cluster

Deleting a KaaS child cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an AWS-based KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click the Delete cluster action icon next to the name of the cluster to be deleted.
4. Verify the list of machines to be removed. Confirm the deletion.

   Deleting a cluster automatically removes the Amazon Virtual Private Cloud (VPC) connected with this cluster and frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

5. If Istio and Harbor were enabled, verify the AWS volumes. Since the Istio and Harbor storage is external, manually delete the corresponding resources using the AWS API or AWS Management Console.

   Caution!

   Deprecation notes
   Since the Cluster release 3.1.0 and 2.2.0, Harbor support is removed and Istio support is deprecated. Istio will be removed in future Cluster releases.

6. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

   1. On the upper right side of the required namespace page, click Credentials.
   2. On the Credentials page, click the Delete credential action icon next to the name of the credentials to be deleted. Confirm the deletion.


Change a cluster configuration

After deploying a KaaS child cluster, you can change the configuration of the following cluster components using the KaaS web UI:

• Enable or disable Istio (Deprecated since KaaS release 1.4.0) and Kubernetes Dashboard
• Enable or disable StackLight and configure its parameters if enabled

To change a cluster configuration:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. On the right side of the required cluster block, click the gear icon.
4. In the Configure cluster window, select or deselect the required Kubernetes application. If StackLight is enabled, configure its parameters as required.
5. Click Update to apply the changes.

Update a child cluster

A KaaS management cluster automatically upgrades to a new available KaaS release version that supports new Cluster releases. Once done, a newer version of a Cluster release becomes available for KaaS child clusters and the Update button appears in the KaaS web UI.



Caution!

Mirantis highly recommends updating your clusters that are based on Kubernetes 1.16 to the latest supported Cluster release that is based on Kubernetes 1.17. Be aware that:

• Before the KaaS release 1.9.0, any Cluster release was supported at least by two KaaS releases.

• Starting from the KaaS release 1.9.0:

  • For the sake of development and the upcoming UCP-based Cluster release, one KaaS release supports only one Cluster release that is based on Kubernetes 1.16 and continues supporting two Cluster releases that are based on Kubernetes 1.17.

  • Only new deployments of the KaaS clusters based on Kubernetes 1.16 are supported.

  • An update from a previous KaaS release based on Kubernetes 1.16 is not supported anymore.

Caution!

Make sure to update the Cluster release version of your KaaS child cluster before the current Cluster release version becomes unsupported by a new KaaS release version. Otherwise, KaaS stops auto-upgrade and eventually Mirantis KaaS itself becomes unsupported.

This section describes how to update a KaaS child cluster of any provider type using the KaaS web UI.

To update a KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click Update where available.
4. In the Release update window, select the required Cluster release to update your child cluster to.

   The Description section contains the list of component versions to be installed with a new Cluster release. The release notes for each KaaS and Cluster release are available at KaaS Release Notes: KaaS releases and KaaS Release Notes: Cluster releases.

5. Click Update.

   To view the update status, verify the Updating status of the cluster in the Clusters block. Once the Updating status disappears, the update is complete.





https://docs.mirantis.com/kaas/beta/kaas-release-notes/cluster-releases.html

Delete a machine

This section instructs you on how to scale down an existing KaaS child cluster through the KaaS web UI.

Warning
A machine with the control plane node role cannot be deleted manually. A machine with such role is automatically deleted during the KaaS child cluster deletion.

To delete a machine from a KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. Click on the cluster name to open the list of machines running in it.
4. Click the Delete machine icon next to the machine you want to remove. Confirm the deletion.

Deleting a machine automatically frees up the resources allocated to this machine.




Manage a KaaS management cluster

The KaaS web UI enables you to perform the following operations with a KaaS management cluster:

• View the cluster details (such as cluster ID, creation date, nodes count, and so on) as well as obtain a list of the cluster endpoints including the StackLight components, depending on your deployment configuration.

  To view generic cluster details, in the Clusters block, click the Cluster info action icon next to the name of the required management cluster.

• Verify the current release version of the cluster including the list of installed components with their versions and the cluster release change log.

  To view a cluster release version details, in the Clusters block, click the version next to the name of the required management cluster.

  A management cluster upgrade to a newer version is performed automatically once a new KaaS version is released. For more details about the KaaS release upgrade mechanism, see: KaaS Reference Architecture: KaaS release controller.

  Warning
  Due to architecture limitations, a baremetal-based management cluster upgrade from the KaaS release 1.6.0 to 1.7.0 is not supported.

See also

• Connect to a KaaS cluster



https://docs.mirantis.com/kaas/beta/kaas-ref-arch/components-stack/kaas-release-controller.html

Connect to a KaaS cluster

After you deploy a KaaS management or child cluster, connect to the cluster to verify the availability and status of the nodes as described below.

This section also describes how to SSH to a node of a cluster where Bastion host is used for SSH access. For example, on the OpenStack-based management cluster or AWS-based management and child clusters.

To connect to a KaaS child cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the Clusters block, click the required cluster name. The Machines page opens.
4. Verify the status of the control plane nodes. Once the first control plane node is deployed and has the Ready status, the Download kubeconfig action icon for the cluster being deployed becomes active.
5. Click Download kubeconfig:

   1. Enter your user password.
   2. Not recommended. Select Offline token to generate an offline IAM token. Otherwise, for security reasons, the kubeconfig token expires every 30 minutes of the KaaS API idle time and you have to download kubeconfig again with a newly generated token.
   3. Click Download.

6. Verify the availability of the KaaS child cluster machines:

   1. Export the kubeconfig parameters to your local machine with access to kubectl. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml

   2. Obtain the list of available KaaS machines:

      kubectl get nodes -o wide

      The system response must contain the details of the nodes in the READY status.

To connect to a KaaS management cluster:

1. Log in to a local machine where your KaaS management cluster kubeconfig is located and where kubectl is installed.

   Note
   The KaaS management cluster kubeconfig is created during the last stage of the KaaS management cluster bootstrap.

2. Obtain the list of available KaaS management cluster machines:

   The system response must contain the details of the nodes in the READY status.

To SSH to a KaaS cluster node if Bastion is used:

1. Obtain kubeconfig of the KaaS management or child cluster as described in the procedures above.

2. Obtain the internal IP address of a node you require access to:

3. Obtain the Bastion public IP:

   kubectl get cluster -o jsonpath='{.status.providerStatus.bastion.publicIP}' \
   -n <namespace> <cluster_name>

4. Run the following command:

   ssh -i <private_key> ubuntu@<node_internal_ip> -o "proxycommand ssh -W %h:%p \
   -i <private_key> ubuntu@<bastion_public_ip>"

   Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster obtained in previous steps. The <private_key> for a KaaS management cluster is located at ~/.ssh/openstack_tmp. For a KaaS child cluster, this is the SSH key that you added in the KaaS web UI before the child cluster creation.
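The substitution in step 4 can be wrapped in a small helper so that the same key and addresses are used in both the outer and the proxied SSH call. The following is a minimal sketch, not part of KaaS: the function name bastion_ssh_command and its argument order are hypothetical, and the function only prints the command instead of running it. It spells the option as ProxyCommand, which OpenSSH treats the same as proxycommand because configuration keywords are case-insensitive.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): print the SSH command that
# reaches a cluster node through the Bastion host.
# Arguments: <private_key> <node_internal_ip> <bastion_public_ip>
bastion_ssh_command() {
    private_key=$1
    node_internal_ip=$2
    bastion_public_ip=$3
    # %%h and %%p become the literal %h:%p tokens that ssh expands itself.
    printf 'ssh -i %s ubuntu@%s -o "ProxyCommand ssh -W %%h:%%p -i %s ubuntu@%s"\n' \
        "$private_key" "$node_internal_ip" "$private_key" "$bastion_public_ip"
}
```

For example, calling bastion_ssh_command with a key path, the node internal IP, and the Bastion public IP prints a command that you can review before running it.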



Manage IAM

IAM CLI

IAM CLI is a user-facing command-line tool for managing scopes, roles, and grants. Using your personal credentials, you can perform different IAM operations through the iamctl tool. For example, you can verify the current status of the IAM service, request or revoke service tokens, verify your own grants within Mirantis KaaS as well as your token details.

Configure IAM CLI

The iamctl command-line interface uses the iamctl.yaml configuration file to interact with IAM.

To create the IAM CLI configuration file:

1. Log in to the KaaS management cluster.
2. Change the directory to one of the following:

   • $HOME/.iamctl
   • $HOME
   • $HOME/etc
   • /etc/iamctl

3. Create iamctl.yaml with the following exemplary parameters and values that correspond to your deployment:

   server: <IAM_API_ADDRESS>
   timeout: 60
   verbose: 99 # Verbosity level, from 0 to 99

   tls:
     enabled: true
     ca: <PATH_TO_CA_BUNDLE>

   auth:
     issuer: <IAM_REALM_IN_KEYCLOAK>
     ca: <PATH_TO_CA_BUNDLE>
     client_id: iam
     client_secret:

   The <IAM_REALM_IN_KEYCLOAK> value has the <keycloak-url>/auth/realms/<realm-name> format, where <realm-name> defaults to iam.
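To confirm that the configuration file ended up in one of the four supported directories, you can use a check such as the one below. This is a sketch under assumptions: the function name find_iamctl_config is hypothetical, and the order in which the directories are tried is chosen for illustration and may differ from the order that iamctl itself uses.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): print the path of the first
# iamctl.yaml found in the directories listed in the procedure.
# The search order is an assumption, not documented iamctl behavior.
find_iamctl_config() {
    for dir in "$HOME/.iamctl" "$HOME" "$HOME/etc" /etc/iamctl; do
        if [ -f "$dir/iamctl.yaml" ]; then
            printf '%s\n' "$dir/iamctl.yaml"
            return 0
        fi
    done
    # No configuration file in any of the known locations.
    return 1
}
```

The function returns a non-zero status if no file is found, so it can be used in a condition before running iamctl commands.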

Available IAM CLI commands

Using iamctl, you can perform different role-based access control operations in your Kubernetes cluster. For example:

• Grant or revoke access to a Kubernetes cluster to a specific user for troubleshooting
• Grant or revoke access to a KaaS namespace that contains several Kubernetes clusters
• Create or delete tokens for the KaaS services with a specific set of grants as well as identify when a service token was used the last time

The iamctl command-line interface contains the following set of commands:

• General commands
• Account information commands
• Scope commands
• Role commands
• Grant commands
• Service token commands
• User commands

The following tables describe the iamctl commands with their descriptions.

General commands

Usage
  Description

iamctl --help, iamctl help
  Output the list of available commands.

iamctl help <command>
  Output the description of a specific command.

Account information commands

Usage
  Description

iamctl account info
  Output detailed account information such as user email, user name, the details of their active and offline sessions, tokens statuses and expiration dates.

iamctl account login
  Log in the current user. The system prompts to enter your authentication credentials. After a successful login, your user token is added to the $HOME/.iamctl directory.

iamctl account logout
  Log out the current user. Once done, the user information is removed from $HOME/.iamctl.

Scope commands

Usage
  Description

iamctl scope list
  List the IAM scopes available for the current environment. Example output:

  +---------------+------------------+
  | NAME          | DESCRIPTION      |
  +---------------+------------------+
  | m:iam         | IAM scope        |
  | m:kaas        | KaaS scope       |
  | m:k8s:managed |                  |
  | m:k8s         | Kubernetes scope |
  | m:cloud       | Cloud scope      |
  +---------------+------------------+

iamctl scope list [prefix]
  Output the specified scope list. For example: iamctl scope list m:k8s.

Role commands

Usage
  Description

iamctl role list <scope>
  List the roles for the specified scope in IAM.

iamctl role show <scope> <role>
  Output the details of the specified scope role including the role name (admin, viewer, reader), its description, and an example of the grant command. For example: iamctl role show m:iam admin.

Grant commands

Usage
  Description

iamctl grant give [username] [scope] [role]
  Provide a user with a role in a scope. For example, the iamctl grant give jdoe m:iam admin command provides the IAM admin role in the m:iam scope to John Doe. For the list of supported IAM scopes and roles, see: Role list.

  Note
  To lock or disable a user, use LDAP or Google OAuth depending on the external provider integrated to your deployment.

iamctl grant list <username>
  List the grants provided to the specified user. For example: iamctl grant list jdoe. Example output:

  +--------+--------+---------------+
  | SCOPE  | ROLE   | GRANT FQN     |
  +--------+--------+---------------+
  | m:iam  | admin  | m:iam@admin   |
  | m:sl   | viewer | m:sl@viewer   |
  | m:kaas | writer | m:kaas@writer |
  +--------+--------+---------------+

  • m:iam@admin - admin rights in all IAM-related applications
  • m:sl@viewer - viewer rights in all StackLight-related applications
  • m:kaas@writer - writer rights in KaaS

iamctl grant revoke [username] [scope] [role]
  Revoke the grants provided to the user.
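When the grant list is long, it can be convenient to reduce the table to the GRANT FQN column only. The following sketch assumes that the output keeps the exact table layout shown in the example above. The function name grant_fqns is hypothetical, and the function only reads text from standard input, so it does not depend on any iamctl option.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): read a grant table in the
# layout shown in the example output and print one grant FQN per line.
grant_fqns() {
    awk -F'|' '
        NF >= 4 && $2 !~ /SCOPE/ {   # skip border lines and the header row
            gsub(/^[ \t]+|[ \t]+$/, "", $4)   # trim padding around the cell
            if ($4 != "") print $4
        }'
}
```

Under that assumption, piping the output of iamctl grant list jdoe into grant_fqns would print m:iam@admin, m:sl@viewer, and m:kaas@writer on separate lines.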

Service token commands

Usage
  Description

iamctl servicetoken list [--all]
  List the details of all service tokens created by the current user. The output includes the following service token details:

  • ID
  • Alias, for example, nova, jenkins-ci
  • Creation date and time
  • Creation owner
  • Grants
  • Last refresh date and time
  • IP address

iamctl servicetoken show [ID]
  Output the details of a service token with the specified ID.

iamctl servicetoken create [alias] [service] [grant1 grant2...]
  Create a token for a specific service with the specified set of grants. For example, iamctl servicetoken create new-token iam m:iam@viewer.

iamctl servicetoken delete [ID1 ID2...]
  Delete a service token with the specified ID.

User commands

Usage
  Description

iamctl user list
  List user names and emails of all current users.

iamctl user show <username>
  Output the details of the specified user.

Role list

Mirantis KaaS creates the IAM roles in scopes. For each application type, such as iam, k8s, or kaas, KaaS creates a scope in Keycloak. Every scope contains a set of roles such as admin, user, viewer. The default IAM roles can be changed during a KaaS child cluster deployment. You can grant or revoke a role access using the IAM CLI. For details, see: IAM CLI.

Example of the structure of a cluster-admin role in a Kubernetes cluster:

m:k8s:kaas-tenant-name:k8s-cluster-name@cluster-admin

• m - prefix for all IAM roles in Mirantis KaaS
• k8s - application type, Kubernetes
• kaas-tenant-name:k8s-cluster-name - a Kubernetes cluster identifier in KaaS (CLUSTER_ID)
• @ - delimiter between a scope and role
• cluster-admin - name of the role within the Kubernetes scope
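The structure above can be taken apart with plain shell parameter expansion. The following is a minimal sketch: the function names grant_scope, grant_role, and k8s_cluster_id are hypothetical and exist only to illustrate how the @ delimiter and the m:k8s: prefix split a grant into its parts.

```shell
#!/bin/sh
# Hypothetical helpers (illustration only): split a grant at the @ delimiter.
grant_scope() { printf '%s\n' "${1%%@*}"; }   # everything before the first @
grant_role()  { printf '%s\n' "${1#*@}"; }    # everything after the first @

# For a Kubernetes grant, the cluster identifier is the part of the scope
# that follows the "m:k8s:" prefix. Returns 1 for any other scope.
k8s_cluster_id() {
    scope=$(grant_scope "$1")
    case $scope in
        m:k8s:*) printf '%s\n' "${scope#m:k8s:}" ;;
        *) return 1 ;;
    esac
}
```

For the example grant above, grant_role prints cluster-admin and k8s_cluster_id prints kaas-tenant-name:k8s-cluster-name.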

The following tables include the scopes and their roles descriptions by Mirantis KaaS components:

• IAM
• KaaS
• Kubernetes
• StackLight

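IAM

Scope identifier: m:iam
Role name: admin
Grant example: m:iam@admin [1]
Role description: Access Keycloak, the IAM API and web UI.

Scope identifier: m:iam
Role name: user
Grant example: m:iam@user [1]
Role description: Access the IAM API and web UI.

Scope identifier: m:iam
Role name: viewer
Grant example: m:iam@viewer [1]
Role description: Access the data to be used by the monitoring systems.

KaaS

Scope identifier: m:kaas
Role name: reader
Grant example: m:kaas@reader [1]
Role description: List the Kubernetes clusters within the KaaS scope.

Scope identifier: m:kaas
Role name: writer
Grant example: m:kaas@writer [1]
Role description: Create or delete the Kubernetes clusters within the KaaS scope.

Scope identifier: m:kaas:$<CLUSTER_ID>
Role name: reader
Grant example: m:kaas:$<CLUSTER_ID>@reader
Role description: List the Kubernetes clusters within the specified KaaS cluster ID.

Scope identifier: m:kaas:$<CLUSTER_ID>
Role name: writer
Grant example: m:kaas:$<CLUSTER_ID>@writer
Role description: Create or delete the Kubernetes clusters within the specified KaaS cluster ID.

Scope identifier: m:kaas
Role name: operator
Grant example: m:kaas@operator
Role description: Add or delete the bare metal hosts within the KaaS scope.

[1] Grant is available by default. Other grants can be added during a KaaS management and child cluster deployment.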

Kubernetes

Scope identifier: m:k8s:<CLUSTER_ID>
Role name: cluster-admin
Grant example: m:k8s:<CLUSTER_ID>@cluster-admin
Role description: Allow the super-user access to perform any action on any resource on the cluster level. When used in ClusterRoleBinding, provide full control over every resource in a cluster and all Kubernetes namespaces.

StackLight

Scope identifier: m:sl:$<CLUSTER_ID> or m:sl:$<CLUSTER_ID>:<SERVICE_NAME>
Role name: admin
Grant examples:
• m:sl:$<CLUSTER_ID>@admin
• m:sl:$<CLUSTER_ID>:alerta@admin
• m:sl:$<CLUSTER_ID>:alertmngmnt@admin
• m:sl:$<CLUSTER_ID>:kibana@admin
• m:sl:$<CLUSTER_ID>:graphana@admin
• m:sl:$<CLUSTER_ID>:prometheus@admin
Role description: Assign roles to other users within the scope.

Scope identifier: m:sl:$<CLUSTER_ID> or m:sl:$<CLUSTER_ID>:<SERVICE_NAME>
Role name: viewer
Grant examples:
• m:sl:$<CLUSTER_ID>@viewer
• m:sl:$<CLUSTER_ID>:alerta@viewer
• m:sl:$<CLUSTER_ID>:alertmngmnt@viewer
• m:sl:$<CLUSTER_ID>:kibana@viewer
• m:sl:$<CLUSTER_ID>:graphana@viewer
• m:sl:$<CLUSTER_ID>:prometheus@viewer
Role description: Access the specified web UI(s) within the scope. The m:sl:$<CLUSTER_ID>@viewer grant provides access to all StackLight web UIs: Prometheus, Alerta, Alertmanager, Kibana, Grafana.



Manage StackLight

Using StackLight, you can monitor the components deployed in Mirantis KaaS and be quickly notified of critical conditions that may occur in the system to prevent service downtimes.

Access StackLight web UIs

StackLight provides five web UIs including Prometheus, Alertmanager, Alerta, Kibana, and Grafana. This section describes how to access any of these web UIs.

To access a StackLight web UI:

1. Log in to the KaaS web UI.
2. Select the required namespace.
3. In the Clusters tab, click the required cluster.
4. In the dialog box with the cluster information, copy the required endpoint IP from the StackLight endpoints parameter.
5. Paste the copied IP to a web browser and use the default credentials to log in to the web UI.

Once done, you are automatically authenticated to all StackLight web UIs.

See also

• KaaS Reference Architecture: Deployment architecture
• KaaS Reference Architecture: Authentication flow

View Grafana dashboards

Using the Grafana web UI, you can view the visual representation of the metric graphs based on the time series databases.

To view the Grafana dashboards:

1. Log in to the Grafana web UI as described in Access StackLight web UIs.
2. From the drop-down list, select the required dashboard to inspect the status and statistics of the corresponding service in your KaaS management or child cluster:

Component

Dashboard Description




https://docs.mirantis.com/kaas/beta/kaas-ref-arch/components-stack/monitoring/authentication-flow.html

Component: Ceph cluster

Ceph cluster
  Provides the overall health status of the Ceph cluster, capacity, latency, and recovery metrics.

Ceph Nodes (Available since KaaS 1.5.0)
  Provides an overview of the host-related metrics, such as the number of monitors, OSD hosts, average usage of resources across the cluster, network and hosts load.

  Note
  Since KaaS 1.5.0, Ceph hosts overview is renamed to Ceph Nodes.

Ceph OSD (Available since KaaS 1.5.0)
  Provides metrics for Ceph OSDs, including the OSD read and write latencies, distribution of PGs per OSD, Ceph OSDs and physical device performance.

  Note
  Prior to KaaS 1.5.0, Ceph OSDs metrics are included in the Ceph OSDs overview and Ceph OSDs details Grafana dashboards.

Ceph pools (Available since KaaS 1.5.0)
  Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.

  Note
  Since KaaS 1.5.0, Ceph pool overview is renamed to Ceph Pools.

Component: KaaS clusters

Clusters overview (Available since KaaS 1.8.0)
  Represents the main cluster capacity statistics for all clusters of a KaaS deployment where StackLight is installed.

  Note
  This dashboard is not available yet for the bare metal provider.

Component: Kubernetes services

Kubernetes Calico
  Provides metrics of the entire Calico cluster usage, including the cluster status, host status, and Felix resources.

Kubernetes cluster
  Provides metrics for the entire Kubernetes cluster, including the cluster status, host status, and resources consumption.

Kubernetes deployments
  Provides information on the desired and current state of all KaaS cluster service replicas deployed.

Kubernetes namespace
  Provides the pods state summary and the CPU, MEM, network, and IOPS resources consumption per namespace.

Kubernetes node
  Provides charts showing resources consumption per KaaS cluster node.

Kubernetes pod
  Provides charts showing resources consumption per deployed pod.

Component: MongoDB

MongoDB
  Provides the summary for the query operations, information about the database health and resource consumption.

Component: NGINX

NGINX
  Provides the overall status of the NGINX cluster and information about NGINX requests and connections.

Component: StackLight

Alertmanager
  Provides performance metrics on the overall health status of the Prometheus Alertmanager service, the number of firing and resolved alerts received for various periods, the rate of successful and failed notifications, and the resources consumption.

Elasticsearch
  Provides information about the overall health status of the Elasticsearch cluster, including the resources consumption and the state of the shards.

Grafana
  Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory consumption.

Prometheus (Available since KaaS 1.5.0)
  Provides the availability and performance behavior of the Prometheus servers, the sample ingestion rate, and system usage statistics per server. Also, provides statistics about the overall status and uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, and queries duration.

  Note
  Prior to KaaS 1.5.0, Prometheus metrics are included in the Prometheus performances and Prometheus stats Grafana dashboards.

Pushgateway
  Provides performance metrics and the overall health status of the service, the rate of samples received for various periods, and the resources consumption.

Prometheus Relay
  Provides service status and resources consumption metrics.

Telemeter server (Available since KaaS 1.8.0)
  Provides statistics and the overall health status of the Telemeter service.

  Note
  This dashboard is not available yet for the bare metal provider.

Component: System

System
  Provides a detailed resource consumption and operating system information per KaaS cluster node.

View Kibana dashboards

Using the Kibana web UI, you can view the visual representation of logs and Kubernetes events of your deployment.

To view the Kibana dashboards:

1. Log in to the Kibana web UI as described in Access StackLight web UIs.
2. Click the required dashboard to inspect the visualizations or perform a search:

   Dashboard
     Description

   Logs
     Provides visualization and search of logs.

   Kubernetes events (Available since KaaS 1.3.0)
     Provides visualization and search of Kubernetes events.

Available StackLight alerts

This section provides an overview of the available predefined StackLight alerts. To view the alerts, use the Prometheus, Alertmanager, or Alerta web UI.

Alertmanager

This section describes the alerts for the Alertmanager service.

• AlertmanagerFailedReload
• AlertmanagerMembersInconsistent
• AlertmanagerNotificationFailureWarning
• AlertmanagerAlertsInvalidWarning



AlertmanagerFailedReload

Severity

Warning

Summary

Failure to reload the Alertmanager configuration.

Description

Reloading the Alertmanager configuration failed for {{ $labels.namespace }}/{{ $labels.pod }}.

AlertmanagerMembersInconsistent

Severity

Critical

Summary

Alertmanager did not detect all cluster members.

Description

Alertmanager did not detect all other members of the cluster.

AlertmanagerNotificationFailureWarning

Severity

Warning

Summary

Alertmanager has failed notifications.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} notifications on the {{ $labels.instance }} instance fail for 2 minutes.

AlertmanagerAlertsInvalidWarning

Severity

Warning

Summary

Alertmanager has invalid alerts.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} alerts on the {{ $labels.instance }} instance are invalid for 2 minutes.



Calico

This section describes the alerts for Calico.

• CalicoDataplaneFailuresHigh
• CalicoDataplaneAddressMsgBatchSizeHigh
• CalicoDatapaneIfaceMsgBatchSizeHigh
• CalicoIPsetErrorsHigh
• CalicoIptablesSaveErrorsHigh
• CalicoIptablesRestoreErrorsHigh

CalicoDataplaneFailuresHigh

Severity

Warning

Summary

High number of data plane failures within Felix.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} data plane failures within the last hour.

CalicoDataplaneAddressMsgBatchSizeHigh

Severity

Warning

Summary

Felix address message batch size is higher than 5.

Description

The size of the data plane address message batch on the {{ $labels.instance }} Felix instance is {{ $value }}.

CalicoDatapaneIfaceMsgBatchSizeHigh

Severity

Warning

Summary

Felix interface message batch size is higher than 5.



Description

The size of the data plane interface message batch on the {{ $labels.instance }} Felix instance is {{ $value }}.

CalicoIPsetErrorsHigh

Severity

Warning

Summary

More than 5 IPset errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} IPset errors within the last hour.

CalicoIptablesSaveErrorsHigh

Severity

Warning

Summary

More than 5 iptable save errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} iptable save errors within the last hour.

CalicoIptablesRestoreErrorsHigh

Severity

Warning

Summary

More than 5 iptable restore errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} iptable restore errors within the last hour.

Ceph

This section describes the alerts for the Ceph cluster.

• CephClusterHealthMinor
• CephClusterHealthCritical
• CephMonQuorumAtRisk
• CephOsdDownMinor
• CephOSDDiskNotResponding
• CephOSDDiskUnavailable
• CephClusterNearFull
• CephClusterCriticallyFull
• CephOsdPgNumTooHighWarning
• CephOsdPgNumTooHighCritical
• CephMonHighNumberOfLeaderChanges
• CephNodeDown
• CephDataRecoveryTakingTooLong
• CephPGRepairTakingTooLong
• CephOSDVersionMismatch
• CephMonVersionMismatch

CephClusterHealthMinor

Severity

Minor

Summary

Ceph cluster health is WARNING

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.

CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.

CephMonQuorumAtRisk



Severity

Critical

Summary

Storage quorum is at risk.

Description

The storage cluster quorum is low.

CephOsdDownMinor

Severity

Minor

Summary

Ceph OSDs are down.

Description

{{ $value }} of Ceph OSD nodes in the Ceph cluster are down. For details, run ceph osd tree.

CephOSDDiskNotResponding

Severity

Critical

Summary

Disk is not responding.

Description

The {{ $labels.device }} disk device is not responding on the {{ $labels.host }} host.

CephOSDDiskUnavailable

Severity

Critical

Summary

Disk is not accessible.

Description

The {{ $labels.device }} disk device is not accessible on the {{ $labels.host }} host.



CephClusterNearFull

Severity

Warning

Summary

Storage cluster is nearly full. Expansion is required.

Description

The storage cluster capacity is less than 85%.

CephClusterCriticallyFull

Severity

Critical

Summary

Storage cluster is critically full and needs immediate expansion.

Description

The storage cluster capacity is less than 95%.

CephOsdPgNumTooHighWarning

Severity

Warning

Summary

Some Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.

CephOsdPgNumTooHighCritical

Severity

Critical

Summary

Some Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
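The two PG alerts form a tiered check on the same value. The sketch below only illustrates how the 200 and 300 thresholds relate to each other. The function name pg_count_severity is hypothetical, and the real alerts are evaluated by StackLight, not by a script like this.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): map the number of PGs on one
# Ceph OSD to the severity of the alert that the thresholds above imply.
pg_count_severity() {
    pgs=$1
    if [ "$pgs" -gt 300 ]; then
        echo critical    # CephOsdPgNumTooHighCritical: more than 300 PGs
    elif [ "$pgs" -gt 200 ]; then
        echo warning     # CephOsdPgNumTooHighWarning: more than 200 PGs
    else
        echo ok
    fi
}
```

Both thresholds are strict, so an OSD with exactly 200 PGs raises no alert and an OSD with exactly 300 PGs stays at the warning level.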



CephMonHighNumberOfLeaderChanges

Severity

Warning

Summary

Many leader changes occur in the storage cluster.

Description

{{ $value }} leader changes per minute occur for the {{ $labels.instance }} instance of the {{ $labels.job }} Ceph Monitor.

CephNodeDown

Severity

Critical

Summary

Storage node {{ $labels.node }} went down.

Description

The {{ $labels.node }} storage node is down and requires immediate verification.

CephDataRecoveryTakingTooLong

Severity

Warning

Summary

Data recovery is slow.

Description

Data recovery has been active for more than two hours.

CephPGRepairTakingTooLong

Severity

Warning

Summary

Self-heal issues detected.



Description

The self-heal operations take an excessive amount of time.

CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph OSD components are running.

CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph Monitor components are running.

Elasticsearch

This section describes the alerts for the Elasticsearch service.

• ElasticHeapUsageTooHigh
• ElasticHeapUsageWarning
• ElasticClusterRed
• ElasticClusterYellow
• NumberOfRelocationShards
• NumberOfInitializingShards
• NumberOfUnassignedShards
• NumberOfPendingTasks
• ElasticNoNewDocuments

ElasticHeapUsageTooHigh



Severity

Critical

Summary

Elasticsearch heap usage is too high (>90%).

Description

Elasticsearch heap usage is over 90% for 5 minutes.

ElasticHeapUsageWarning

Severity

Warning

Summary

Elasticsearch heap usage is high (>80%).

Description

Elasticsearch heap usage is over 80% for 5 minutes.
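Both heap alerts combine a threshold with a duration: the usage must stay above the limit for 5 minutes. The following sketch illustrates that condition under two assumptions that are not stated in this guide: one heap usage sample is taken per minute, and "for 5 minutes" means that every sample in the window is above the threshold. The function name heap_alert_severity is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): take integer heap usage
# percentages as arguments, oldest first, one sample per minute (assumed),
# and print the severity implied by the last five samples.
heap_alert_severity() {
    # Fewer than five samples cannot cover a 5-minute window.
    if [ "$#" -lt 5 ]; then
        echo none
        return 0
    fi
    # Keep only the five most recent samples.
    while [ "$#" -gt 5 ]; do
        shift
    done
    # The alert condition holds only if the lowest sample is above the limit.
    min=$1
    for v in "$@"; do
        if [ "$v" -lt "$min" ]; then
            min=$v
        fi
    done
    if [ "$min" -gt 90 ]; then
        echo critical    # ElasticHeapUsageTooHigh
    elif [ "$min" -gt 80 ]; then
        echo warning     # ElasticHeapUsageWarning
    else
        echo none
    fi
}
```

A single sample below the threshold inside the window resets the condition, which is why a short dip in heap usage prevents the alert from firing in this model.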

ElasticClusterRed

Severity

Critical

Summary

Elasticsearch cluster is RED.

Description

The Elasticsearch cluster status is RED.

ElasticClusterYellow

Severity

Warning

Summary

Elasticsearch cluster is YELLOW.

Description

The Elasticsearch cluster status is YELLOW.



NumberOfRelocationShards

Severity

Critical

Summary

Shards relocation takes more than 20 minutes.

Description

Elasticsearch has {{ $value }} relocating shards for 20 minutes.

NumberOfInitializingShards

Severity

Critical

Summary

Shards initialization takes more than 10 minutes.

Description

Elasticsearch has {{ $value }} shards being initialized for 10 minutes.

NumberOfUnassignedShards

Severity

Critical

Summary

Shards have unassigned status for 5 minutes.

Description

Elasticsearch has {{ $value }} unassigned shards for 5 minutes.

NumberOfPendingTasks

Severity

Warning

Summary

Tasks have pending state for 10 minutes.

Description

Elasticsearch has {{ $value }} pending tasks for 10 minutes. The cluster works slowly.



ElasticNoNewDocuments

Severity

Warning

Summary

Elasticsearch has no new documents for 10 minutes.

Description

Elasticsearch obtains no new documents for 10 minutes.

etcd

This section describes the alerts for the etcd service.

• etcdInsufficientMembers
• etcdNoLeader
• etcdHighNumberOfLeaderChanges
• etcdGRPCRequestsSlow
• etcdMemberCommunicationSlow
• etcdHighNumberOfFailedProposals
• etcdHighFsyncDurations
• etcdHighCommitDurations

etcdInsufficientMembers

Severity

Critical

Summary

The etcd cluster has insufficient members.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} insufficient members.

etcdNoLeader

Severity

Critical



Summary

The etcd cluster has no leader.

Description

The {{ $labels.instance }} member of the {{ $labels.job }} etcd cluster has no leader.

etcdHighNumberOfLeaderChanges

Severity

Warning

Summary

More than 3 leader changes occurred in the etcd cluster within the last hour.

Description

The {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster has {{ $value }} leader changes within the last hour.

etcdGRPCRequestsSlow

Severity

Critical

Summary

The etcd cluster has slow gRPC requests.

Description

The gRPC requests to {{ $labels.grpc_method }} take {{ $value }}s on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster.

etcdMemberCommunicationSlow

Severity

Warning

Summary

The etcd cluster has slow member communication.

Description

The member communication with {{ $labels.To }} on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster takes {{ $value }}s.

etcdHighNumberOfFailedProposals



Severity

Warning

Summary

The etcd cluster has more than 5 proposal failures.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} proposal failures on the {{ $labels.instance }} etcd instance within the last hour.

etcdHighFsyncDurations

Severity

Warning

Summary

The etcd cluster has high fsync duration.

Description

The duration of 99% of all fsync operations on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster is {{ $value }}s.

etcdHighCommitDurations

Severity

Warning

Summary

The etcd cluster has high commit duration.

Description

The duration of 99% of all commit operations on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster is {{ $value }}s.

General alerts

This section lists the general available alerts.

• TargetDown
• NodeDown
• Watchdog

TargetDown

Severity

Critical



Summary

The {{ $labels.job }} target is down.

Description

The {{ $labels.job }}/{{ $labels.instance }} target is down.

NodeDown

Severity

Critical

Summary

The {{ $labels.node }} node is down.

Description

The {{ $labels.node }} node is down. Kubernetes treats {{ $labels.node }} as notReady and kubelet is not accessible from Prometheus.

Watchdog

Severity

None

Summary

Watchdog alert that is always firing.

Description

This alert ensures that the entire alerting pipeline is functional. This alert should always be firing in Alertmanager against a receiver. Some integrations with various notification mechanisms can send a notification when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.
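The "notify when the alert stops firing" behavior can be illustrated with a minimal sketch. It is not part of StackLight or of any notification integration: the function name, the timestamps, and the 5-minute tolerance are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta


def pipeline_looks_broken(last_watchdog_seen, now, max_silence=timedelta(minutes=5)):
    """Return True if the Watchdog alert has been silent for too long.

    last_watchdog_seen: when the receiver last got the Watchdog notification.
    max_silence: hypothetical tolerance; choose it to match how often the
    notification is re-sent in your environment.
    """
    return now - last_watchdog_seen > max_silence


# Example with fixed timestamps (hypothetical values).
seen = datetime(2020, 1, 1, 12, 0, 0)
recent_check = pipeline_looks_broken(seen, datetime(2020, 1, 1, 12, 3, 0))
late_check = pipeline_looks_broken(seen, datetime(2020, 1, 1, 12, 10, 0))
```

In this sketch, a receiver that heard from Watchdog 3 minutes ago is considered healthy, while 10 minutes of silence is treated as a broken alerting pipeline.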

General node alerts

This section lists the general alerts for Kubernetes nodes.

• SystemCpuFullWarning
• SystemLoadTooHighWarning
• SystemLoadTooHighCritical
• SystemDiskFullWarning
• SystemDiskFullMajor
• SystemMemoryFullWarning
• SystemMemoryFullMajor
• SystemDiskInodesFullWarning
• SystemDiskInodesFullMajor
• SystemDiskErrorsTooHigh

SystemCpuFullWarning

Severity

Warning

Summary

High CPU consumption.

Description

The average CPU consumption on the {{ $labels.node }} node is {{ $value }}% for 2 minutes.

SystemLoadTooHighWarning

Severity

Warning

Summary

System load is more than 1 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.

SystemLoadTooHighCritical

Severity

Critical

Summary

System load is more than 2 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.
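The two load thresholds can be compared side by side in a small sketch. It only restates the "more than 1 per CPU" and "more than 2 per CPU" conditions from the summaries above; the function name is hypothetical, which load average is used is an assumption, and the 5-minute duration requirement is deliberately left out.

```python
def load_severity(load_average, cpu_count):
    """Classify system load per CPU using the thresholds from the summaries.

    Returns "critical" for more than 2 per CPU, "warning" for more than
    1 per CPU, and None otherwise. The 5-minute duration is not modeled.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    per_cpu = load_average / cpu_count
    if per_cpu > 2:
        return "critical"
    if per_cpu > 1:
        return "warning"
    return None


# A 4-CPU node with a load average of 6.0 has 1.5 load per CPU.
severity = load_severity(6.0, 4)
```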

SystemDiskFullWarning

Severity

Warning

Summary

Disk partition {{ $labels.mountpoint }} is 85% full.



Description

The {{ $labels.mountpoint }} partition of the {{ $labels.device }} disk on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.

SystemDiskFullMajor

Severity

Major

Summary

Disk partition {{ $labels.mountpoint }} is 95% full.

Description

The {{ $labels.mountpoint }} partition of the {{ $labels.device }} disk on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.

SystemMemoryFullWarning

Severity

Warning

Summary

More than 90% of memory is used or less than 8 GB is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.

SystemMemoryFullMajor

Severity

Major

Summary

More than 95% of memory is used or less than 4 GB of memory is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.

SystemDiskInodesFullWarning

Severity

Warning



Summary

The {{ $labels.mountpoint }} volume uses 85% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.

SystemDiskInodesFullMajor

Severity

Warning

Summary

The {{ $labels.mountpoint }} volume uses 95% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.

SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk is failing.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node is reporting errors for 5 minutes.

Ironic

This section describes the alerts for Ironic. The alerted events include Ironic API availability and Ironic processes availability.

• IronicMetricsMissing
• IronicApiOutage

IronicMetricsMissing

Severity

Critical

Summary

Ironic metrics missing.



Description

Metrics retrieved from the Ironic API are not available for 2 minutes.

IronicApiOutage

Severity

Critical

Summary

Ironic API outage.

Description

The Ironic API is not accessible.

Kubernetes applications

This section lists the alerts for Kubernetes applications.

• KubePodCrashLooping
• KubePodNotReady
• KubeDeploymentGenerationMismatch
• KubeDeploymentReplicasMismatch
• KubeStatefulSetReplicasMismatch
• KubeStatefulSetGenerationMismatch
• KubeStatefulSetUpdateNotRolledOut
• KubeDaemonSetRolloutStuck
• KubeDaemonSetNotScheduled
• KubeDaemonSetMisScheduled
• KubeCronJobRunning
• KubeJobCompletion
• KubeJobFailed

KubePodCrashLooping

Severity

Critical

Summary

The {{ $labels.pod }} Pod is restarting.



Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times per 5 minutes.

KubePodNotReady

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in the non-ready state.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod is in the non-ready state for longer than an hour.

KubeDeploymentGenerationMismatch

Severity

Critical

Summary

The {{ $labels.deployment }} deployment generation does not match the metadata.

Description

The deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match the metadata, indicating that the deployment failed but has not been rolled back.

KubeDeploymentReplicasMismatch

Severity

Critical

Summary

The {{ $labels.deployment }} deployment has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} deployment does not match the expected number of replicas for longer than one hour.

KubeStatefulSetReplicasMismatch

Severity

Critical



Summary

The {{ $labels.statefulset }} StatefulSet has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet does not match the expected number of replicas for longer than 15 minutes.

KubeStatefulSetGenerationMismatch

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet generation does not match the metadata.

Description

The StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match the metadata, indicating that the StatefulSet failed but has not been rolled back.

KubeStatefulSetUpdateNotRolledOut

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet update has not been rolled out.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet update has not been rolled out.

KubeDaemonSetRolloutStuck

Severity

Critical

Summary

The {{ $labels.daemonset }} DaemonSet is not ready.

Description

Only {{ $value }}% of the desired Pods of the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet are scheduled and ready.

KubeDaemonSetNotScheduled



Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has not scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} not scheduled Pods.

KubeDaemonSetMisScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has incorrectly scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} Pods running where they are not supposed to run.

KubeCronJobRunning

Severity

Warning

Summary

The {{ $labels.cronjob }} CronJob is not ready for more than one hour.

Description

The {{ $labels.namespace }}/{{ $labels.cronjob }} CronJob takes more than one hour to complete.

KubeJobCompletion

Severity

Warning

Summary

The {{ $labels.job_name }} job is not ready for more than one hour.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job takes more than one hour to complete.



KubeJobFailed

Severity

Warning

Summary

The {{ $labels.job_name }} job failed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job failed to complete.

Kubernetes resources

This section lists the alerts for Kubernetes resources.

• KubeCPUOvercommitPods
• KubeMemOvercommitPods
• KubeCPUOvercommitNamespaces
• KubeMemOvercommitNamespaces
• KubeQuotaExceeded
• CPUThrottlingHigh

KubeCPUOvercommitPods

Severity

Warning

Summary

Cluster has overcommitted CPU requests.

Description

The cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.

KubeMemOvercommitPods

Severity

Warning

Summary

Cluster has overcommitted memory requests.

Description

The cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.



KubeCPUOvercommitNamespaces

Severity

Warning

Summary

Cluster has overcommitted CPU requests for namespaces.

Description

The cluster has overcommitted CPU resource requests for namespaces.

KubeMemOvercommitNamespaces

Severity

Warning

Summary

Cluster has overcommitted memory requests for namespaces.

Description

The cluster has overcommitted memory resource requests for namespaces.

KubeQuotaExceeded

Severity

Warning

Summary

The {{ $labels.namespace }} namespace consumes more than 90% of its {{ $labels.resource }} quota.

Description

The {{ $labels.namespace }} namespace consumes {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.

CPUThrottlingHigh

Severity

Warning

Summary

The {{ $labels.pod_name }} Pod has CPU throttling.



Description

The CPU in the {{ $labels.namespace }} namespace for the {{ $labels.container_name }} container in the {{ $labels.pod_name }} Pod has {{ printf "%0.0f" $value }}% throttling.

Kubernetes storage

This section lists the alerts for Kubernetes storage.

• KubePersistentVolumeUsageCritical
• KubePersistentVolumeFullInFourDays
• KubePersistentVolumeErrors

KubePersistentVolumeUsageCritical

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume has less than 3% of free space.

Description

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is only {{ printf "%0.2f" $value }}% free.

KubePersistentVolumeFullInFourDays

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume is expected to fill up in 4 days.

Description

Based on the recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is expected to fill up within four days. Currently, {{ printf "%0.2f" $value }}% of free space is available.
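This guide does not give the expression behind the "expected to fill up within four days" prediction. As a rough illustration only, the sketch below assumes a least-squares linear extrapolation of free space over recent samples, similar in spirit to the PromQL predict_linear() function. The function names and the sample data are hypothetical.

```python
FOUR_DAYS_SECONDS = 4 * 24 * 3600


def predicted_free_bytes(samples, horizon_seconds):
    """Extrapolate free space linearly from (timestamp_seconds, free_bytes) samples.

    Fits a least-squares line through the samples and evaluates it
    horizon_seconds after the newest sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = max(t for t, _ in samples)
    return intercept + slope * (last_t + horizon_seconds)


def fills_within_four_days(samples):
    """Return True if the extrapolated free space drops below zero in 4 days."""
    return predicted_free_bytes(samples, FOUR_DAYS_SECONDS) < 0


# Hypothetical volume losing 100 bytes of free space every hour.
fast_fill = [(0, 1000), (3600, 900), (7200, 800)]
# Hypothetical volume losing only 10 bytes every hour.
slow_fill = [(0, 1000), (3600, 990), (7200, 980)]
```

With these samples, the first volume is predicted to run out of space within four days, while the second one is not.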

KubePersistentVolumeErrors

Severity

Critical

Summary

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.



Description

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Kubernetes system

This section lists the alerts for the Kubernetes system.

• KubeNodeNotReady
• KubeVersionMismatch
• KubeClientErrors
• KubeletTooManyPods
• KubeAPILatencyHighWarning
• KubeAPILatencyHighCritical
• KubeAPIErrorsHighCritical
• KubeAPIErrorsHighWarning
• KubeAPIResourceErrorsHighCritical
• KubeAPIResourceErrorsHighWarning
• KubeClientCertificateExpirationInSevenDays
• KubeClientCertificateExpirationInOneDay
• ContainerScrapeError

KubeNodeNotReady

Severity

Warning

Summary

The {{ $labels.node }} node is not ready for more than one hour.

Description

The Kubernetes {{ $labels.node }} node is not ready for more than one hour.

KubeVersionMismatch

Severity

Warning

Summary

Kubernetes components have mismatching versions.



Description

Kubernetes has components with {{ $value }} different semantic versions running.

KubeClientErrors

Severity

Warning

Summary

Kubernetes API client has more than 1% of error requests.

Description

The {{ $labels.job }}/{{ $labels.instance }} Kubernetes API server client has {{ printf "%0.0f" $value }}% errors.

KubeletTooManyPods

Severity

Warning

Summary

kubelet reached 90% of Pods limit.

Description

The {{ $labels.instance }}/{{ $labels.node }} kubelet runs {{ $value }} Pods, close to the limit of 110.
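As a worked example of the threshold, 90% of the 110 Pods limit is 99 Pods. The sketch below restates that arithmetic; the function name is hypothetical, and treating exactly 90% as "reached" is an assumption.

```python
def kubelet_near_pod_limit(running_pods, pod_limit=110, threshold_percent=90):
    """Return True if running_pods reached threshold_percent of pod_limit.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return running_pods * 100 >= pod_limit * threshold_percent


# With the limit of 110 Pods, the threshold is reached at 99 Pods.
at_98 = kubelet_near_pod_limit(98)
at_99 = kubelet_near_pod_limit(99)
```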

KubeAPILatencyHighWarning

Severity

Warning

Summary

The API server has a 99th percentile latency of more than 1 second.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.

KubeAPILatencyHighCritical

Severity

Critical



Summary

The API server has a 99th percentile latency of more than 4 seconds.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.

KubeAPIErrorsHighCritical

Severity

Critical

Summary

API server returns errors for more than 3% of requests.

Description

The API server returns errors for {{ $value }}% of requests.

KubeAPIErrorsHighWarning

Severity

Warning

Summary

API server returns errors for more than 1% of requests.

Description

The API server returns errors for {{ $value }}% of requests.

KubeAPIResourceErrorsHighCritical

Severity

Critical

Summary

API server returns errors for 10% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.

KubeAPIResourceErrorsHighWarning



Severity

Warning

Summary

API server returns errors for 5% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.

KubeClientCertificateExpirationInSevenDays

Severity

Warning

Summary

An authentication client certificate for the API server expires in less than 7.0 days.

Description

A client certificate used to authenticate to the API server expires in less than 7.0 days.

KubeClientCertificateExpirationInOneDay

Severity

Critical

Summary

An authentication client certificate for the API server expires in less than 24.0 hours.

Description

A client certificate used to authenticate to the API server expires in less than 24.0 hours.

ContainerScrapeError

Severity

Warning

Summary

Failure to get Kubernetes container metrics.

Description

Prometheus was not able to scrape metrics from the container on the {{ $labels.node }} Kubernetes node.

MongoDB

This section lists the alerts for the MongoDB service.



• MongodbCursorsOpenTooMany
• MongodbCursorTimeouts
• MongodbConnectionsTooMany
• MongodbMemoryUsageWarning

MongodbCursorsOpenTooMany

Severity

Warning

Summary

MongoDB has a high number of open cursors.

Description

{{ $value }} MongoDB cursors are open for the {{ $labels.instance }} instance clients.

MongodbCursorTimeouts

Severity

Warning

Summary

MongoDB cursor timeouts.

Description

{{ $value }} MongoDB cursors timed out for the {{ $labels.instance }} instance.

MongodbConnectionsTooMany

Severity

Warning

Summary

Too many connections in MongoDB.

Description

The MongoDB {{ $labels.instance }} instance has {{ $value }} active connections.

MongodbMemoryUsageWarning



Severity

Warning

Summary

MongoDB high memory consumption.

Description

The MongoDB {{ $labels.instance }} instance virtual memory reached 80% of memory available to the container.

Netchecker

Warning

This feature is available starting from the KaaS release version 1.3.0.

This section lists the alerts for the Netchecker service.

• NetCheckerAgentErrors
• NetCheckerReportsMissing
• NetCheckerTCPServerDelay
• NetCheckerDNSSlow

NetCheckerAgentErrors

Severity

Warning

Summary

Netchecker has a high number of errors.

Description

The {{ $labels.agent }} Netchecker agent had {{ $value }} errors within the last hour.

NetCheckerReportsMissing

Severity

Warning

Summary

The number of agent reports is lower than expected.

Description

The {{ $labels.agent }} Netchecker agent has not reported anything for the last 5 minutes.



NetCheckerTCPServerDelay

Severity

Warning

Summary

The TCP connection to Netchecker server takes too much time.

Description

The {{ $labels.agent }} Netchecker agent TCP connection time to the Netchecker server has increased by {{ $value }} within the last 5 minutes.

NetCheckerDNSSlow

Severity

Warning

Summary

The DNS lookup time is too high.

Description

The DNS lookup time on the {{ $labels.agent }} Netchecker agent has increased by {{ $value }} within the last 5 minutes.

NGINX

This section lists the alerts for the NGINX service.

• NginxServiceDown
• NginxDroppedIncomingConnections

NginxServiceDown

Severity

Minor

Summary

The NGINX service is down.

Description

The NGINX service on the {{ $labels.node }} node is down.

NginxDroppedIncomingConnections



Severity

Minor

Summary

NGINX drops incoming connections.

Description

NGINX on the {{ $labels.node }} node drops {{ $value }} accepted connections per second for 5 minutes.

Node network

This section lists the alerts for a Kubernetes node network.

• SystemRxPacketsErrorTooHigh
• SystemTxPacketsErrorTooHigh
• SystemRxPacketsDroppedTooHigh
• SystemTxPacketsDroppedTooHigh
• NodeNetworkInterfaceFlapping

SystemRxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet receive errors.

Description

The {{ $labels.device }} network interface has receive errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

SystemTxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet transmit errors.

Description

The {{ $labels.device }} network interface has transmit errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

SystemRxPacketsDroppedTooHigh



Severity

Warning

Summary

60 or more received packets were dropped.

Description

{{ $value }} packets received by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.

SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

100 transmitted packets were dropped.

Description

{{ $value }} packets transmitted by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.

NodeNetworkInterfaceFlapping

Severity

Warning

Summary

The {{ $labels.node }} node has a flapping interface.

Description

The {{ $labels.device }} network interface often changes its UP status on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

Node time

This section lists the alerts for a Kubernetes node time.

ClockSkewDetected

Severity

Warning

Summary

The NTP offset reached the limit of 0.03 seconds.

Description

Clock skew was detected on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter. Verify that NTP is configured correctly on this host.



Prometheus

This section describes the alerts for the Prometheus service.

• PrometheusConfigReloadFailed
• PrometheusNotificationQueueRunningFull
• PrometheusErrorSendingAlertsWarning
• PrometheusErrorSendingAlertsCritical
• PrometheusNotConnectedToAlertmanagers
• PrometheusTSDBReloadsFailing
• PrometheusTSDBCompactionsFailing
• PrometheusTSDBWALCorruptions
• PrometheusNotIngestingSamples
• PrometheusTargetScrapesDuplicate
• PrometheusRuleEvaluationsFailed

PrometheusConfigReloadFailed

Severity

Warning

Summary

Failure to reload the Prometheus configuration.

Description

Reloading of the Prometheus configuration failed for {{$labels.namespace}}/{{$labels.pod}}.

PrometheusNotificationQueueRunningFull

Severity

Warning

Summary

Prometheus alert notification queue is running full.

Description

The Prometheus alert notification queue is running full for {{$labels.namespace}}/{{ $labels.pod}}.

PrometheusErrorSendingAlertsWarning



Severity

Warning

Summary

Errors occur while sending alerts from Prometheus.

Description

1% of errors occur while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}.

PrometheusErrorSendingAlertsCritical

Severity

Critical

Summary

Errors occur while sending alerts from Prometheus.

Description

3% of errors occur while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}.

PrometheusNotConnectedToAlertmanagers

Severity

Warning

Summary

Prometheus is not connected to Alertmanager.

Description

Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected to any Alertmanager instance.

PrometheusTSDBReloadsFailing

Severity

Warning

Summary

Prometheus has issues reloading data blocks from disk.

Description

The Prometheus server on the {{$labels.instance}} instance has {{$value | humanize}} reload failures over the last four hours.



PrometheusTSDBCompactionsFailing

Severity

Warning

Summary

Prometheus has issues compacting sample blocks.

Description

The Prometheus server on the {{$labels.instance}} instance has {{$value | humanize}} compaction failures over the last four hours.

PrometheusTSDBWALCorruptions

Severity

Warning

Summary

Prometheus write-ahead log is corrupted.

Description

The Prometheus server on the {{$labels.instance}} instance has a corrupted write-ahead log (WAL).

PrometheusNotIngestingSamples

Severity

Warning

Summary

Prometheus does not ingest samples.

Description

Prometheus {{ $labels.namespace }}/{{ $labels.pod}} does not ingest samples.

PrometheusTargetScrapesDuplicate

Severity

Warning

Summary

Prometheus has many rejected samples.



Description

Prometheus {{$labels.namespace}}/{{$labels.pod}} has many rejected samples because of duplicate timestamps but different values.

PrometheusRuleEvaluationsFailed

Severity

Warning

Summary

Prometheus failed to evaluate recording rules.

Description

Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus Web UI.

Salesforce notifier

This section lists the alerts for the Salesforce notifier service.

• SfNotifierDown
• SfNotifierAuthFailure

SfNotifierDown

Severity

Critical

Summary

The sf-notifier service is down.

Description

The sf-notifier service is down for 2 minutes.

SfNotifierAuthFailure

Severity

Critical

Summary

Failure to authenticate to Salesforce.

Description

The sf-notifier service fails to authenticate to Salesforce for 2 minutes.



SMART disks

This section describes the alerts for SMART disks.

• SystemSMARTDiskUDMACrcErrorsTooHigh
• SystemSMARTDiskHealthStatus
• SystemSMARTDiskReadErrorRate
• SystemSMARTDiskSeekErrorRate
• SystemSMARTDiskTemperatureHigh
• SystemSMARTDiskReallocatedSectorsCount
• SystemSMARTDiskCurrentPendingSectors
• SystemSMARTDiskReportedUncorrectableErrors
• SystemSMARTDiskOfflineUncorrectableSectors
• SystemSMARTDiskEndToEndError

SystemSMARTDiskUDMACrcErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk has UDMA CRC errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting SMART UDMA CRC errors for 5 minutes.

SystemSMARTDiskHealthStatus

Severity

Warning

Summary

The {{ $labels.device }} disk has bad health.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting a bad health status for 1 minute.

SystemSMARTDiskReadErrorRate



Severity

Warning

Summary

The {{ $labels.device }} disk has read errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.

SystemSMARTDiskSeekErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has seek errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.

SystemSMARTDiskTemperatureHigh

Severity

Warning

Summary

The {{ $labels.device }} disk temperature is high.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.

SystemSMARTDiskReallocatedSectorsCount

Severity

Major

Summary

The {{ $labels.device }} disk has reallocated sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.



SystemSMARTDiskCurrentPendingSectors

Severity

Major

Summary

The {{ $labels.device }} disk has current pending sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.

SystemSMARTDiskReportedUncorrectableErrors

Severity

Major

Summary

The {{ $labels.device }} disk has reported uncorrectable errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.

SystemSMARTDiskOfflineUncorrectableSectors

Severity

Major

Summary

The {{ $labels.device }} disk has offline uncorrectable sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.

SystemSMARTDiskEndToEndError

Severity

Major

Summary

The {{ $labels.device }} disk has end-to-end errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.



SSL certificates

This section lists the alerts for SSL certificates.

• SSLCertExpirationWarning
• SSLCertExpirationCritical

SSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate expires in 30 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 30 days.

SSLCertExpirationCritical

Severity

Critical

Summary

SSL certificate expires in 10 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 10 days.

Telemeter

Warning

This feature is available starting from the KaaS release 1.8.0.

Caution!

The Telemeter support for the bare metal provider is currently under development and will be announced shortly.



This section describes the alerts for the Telemeter service.

• TelemeterClientAuthenticationFailed
• TelemeterClientFederationFailed

TelemeterClientAuthenticationFailed

Severity

Warning

Summary

Telemeter client failed to authenticate to the server.

Description

Telemeter client has failed to authenticate to the Telemeter server twice for the last 30 minutes. Verify the telemeter-client container logs. Typically, such an error occurs in case of an incorrect ClusterID or Token set in the telemeter-client settings.

TelemeterClientFederationFailed

Severity

Warning

Summary

Telemeter client failed to send data to the server.

Description

Telemeter client has failed to send data to the Telemeter server twice for the last 30 minutes. Verify the telemeter-client container logs.

Disable workload monitoring

Caution!

This feature is available starting from the KaaS release 1.6.0.

On the clusters that run large-scale workloads, the workload monitoring generates a large number of metrics that are resource-consuming. You can disable workload monitoring in the StackLight metrics and monitor infrastructure only to prevent generation of excessive metrics.

The feature is implemented using the metricFilter parameter that enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. The feature is disabled by default. If enabled, you can select the required namespaces to which the filter will apply.



To disable workload monitoring on a KaaS child or management cluster:

1. Log in to the KaaS web UI with the writer permissions.
2. Select the required namespace.
3. In the upper right corner of the KaaS web UI, click the arrow next to your user name to open the drop-down menu.
4. In the drop-down menu, click Download kubeconfig to download kubeconfig of your KaaS management cluster.
5. Log in to any local machine with kubectl installed.
6. Copy the downloaded kubeconfig to this machine.
7. Run one of the following commands:

• For a KaaS management cluster:

kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <NAMESPACE_NAME> cluster <MANAGEMENT_CLUSTER_NAME>

• For a KaaS child cluster:

kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <NAMESPACE_NAME> cluster <CHILD_CLUSTER_NAME>

8. Edit the opened manifest. For example:

spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          metricFilter:
            enabled: true
            action: keep
            namespaces:
              kube-system: true
              stacklight: true
              kaas: true

• enabled - enable or disable metricFilter using true or false
• action - action to take by Prometheus:
  • keep - keep only metrics from namespaces that are defined in the namespaces list
  • drop - ignore metrics from namespaces that are defined in the namespaces list
• namespaces - list of namespaces to keep or drop metrics from regardless of the boolean value for every namespace
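The way these parameters interact can be sketched in a few lines. This is only an illustration of the semantics described above, not StackLight code: the function name is hypothetical, and the dictionary simply mirrors the metricFilter section of the example manifest.

```python
def passes_metric_filter(namespace, metric_filter):
    """Decide whether metrics from `namespace` would be ingested.

    `metric_filter` mirrors the metricFilter section of the manifest.
    """
    if not metric_filter.get("enabled", False):
        # The filter is disabled (the default): nothing is filtered out.
        return True
    # Only the presence of the namespace key matters, not its boolean value.
    listed = namespace in metric_filter.get("namespaces", {})
    action = metric_filter.get("action")
    if action == "keep":
        return listed
    if action == "drop":
        return not listed
    raise ValueError(f"unsupported action: {action!r}")


example = {
    "enabled": True,
    "action": "keep",
    "namespaces": {"kube-system": True, "stacklight": True, "kaas": True},
}
```

With the example values and the keep action, metrics from kube-system, stacklight, and kaas are ingested and metrics from any other namespace are not. Switching action to drop inverts the result.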