When the management interface and edge port are disconnected simultaneously, only the virtual machine HA in the two-node scenario takes effect, and the HA in the three-node scenario does not.
The virtual machine needs to enable the HA function, and other hosts have sufficient resources.
HA is only applicable to HCI clusters; it is not supported for VMware and third-party clusters.
A virtual machine only supports HA when it is running on virtual storage, FC, or iSCSI.

Steps:

Step 1: Log in to the SCP platform and navigate to Reliability > HA. When SCP manage several HCI clusters, In the left-up corner you can select different HCI cluster.

Step 2: Set response methods for scenarios with hardware failures or hardware health issues by Click Hardware.

The hardware failures scenarios are as follows:

Module	Impact	HA Response Method
CPU Failure	Nodes will be restarted or go offline, and the corresponding VMs will be shut down.	No selection. VM will restart and HA to the health node.
Memory Failure	Nodes will be restarted or go offline, and the corresponding VMs will be shut down.	No selection. VM will restart and HA to the health node.
System Disk Failure	Nodes will go offline, and the corresponding VMs will be shut down.	No selection. VM will restart and HA to the health node.
Motherboard Failure	Nodes will go offline, and the corresponding VMs will be shut down.	No selection. VM will restart and HA to the health node.
Encryption Card Failure	The corresponding VMs cannot run properly, causing business service interruption.	Select 1: Restart VM on Healthy Node. Select 2: Ignore.
GPU Failure	The corresponding VMs cannot run properly, causing business service interruption.	Select 1: Restart VM on Healthy Node. Select 2: Ignore.

For the Hardware Health Issues, if you want to trigger the HA you need to enable the Host Health Monitoring first. The following is the step to set the Host Health Monitoring:

Click Health Monitoring in the left-up corner you can select the HCI cluster.

图形用户界面, 应用程序

AI 生成的内容可能不正确。

In the Node Health check Host hardware (CPU, memory, system disk or RAID card) anomaly check the following item:

Low CPU Microcode Version

High CPU Temperature: A node has low reliability if its reliability score is lower than 30 points. When a node encounters memory, disk, or RAID card failures, the reliability score will be reduced based on the frequency and type of these failures.

ECC Memory Isolated Threshold Exceeds: The available value ranges from 5 to 50.

System Disk Lifetime Below Threshold: The available value ranges from 10 to 90.

Only One Power Module Available

Host crash occurs times within n (n stand for an option) days

Low node reliability: A node has low reliability if its reliability score is lower than 30 points. When a node encounters memory, disk, or RAID card failures, the reliability score will be reduced based on the frequency and type of these failures. (for the score can refer to table Node Performance score method in section 3.2)

Check Schedule: Here, you can choose how often to check (in minutes) and whether to check the option "Start check at each host startup or restart.

Recovery Method: Go back to HA to configure recovery methods for host hardware failures and network health issues.

Auto Removal: Auto remove healthy nodes from unhealthy node list.

The Hardware Health Issues scenarios are as follows (if not enable the Host Health Monitoring the Trigger will be show disable):

Module	Trigger	Impact	Response Method
Low CPU Microcode Version	Enabled	Nodes will be exposed to security vulnerabilities and performance risks and may have stability issues.	Select 1: Hot-Migrate VM. Select 2: Ignore.
High CPU Temperature	Enabled	Constant high CPU temperature can easily affect node performance and reliability, and cause system stability issues and hardware failures.	Select 1: Hot-Migrate VM. Select 2: Ignore.
Isolated ECC Memory Over Threshold	Enabled Memory isolated exceeds 20%	Node reliability and available memory resources will decrease dramatically, leading to higher risks of node failures.	Select 1: Hot-Migrate VM. Select 2: Ignore.
System Disk Lifetime Below Threshold	Enabled Lifetime remaining is less than 12%	When a system disk approaches the end of its life, both the hardware performance and platform performance will degrade, and hardware failures are more likely to occur, affecting the node reliability.	Select 1: Hot-Migrate VM. Select 2: Ignore.
Only One Power Module Available	Enabled	When there is no redundant power supply for a node, a power outage will cause the node to go offline directly, and the power supply may also be insufficient, affecting the node reliability.	Select 1: Hot-Migrate VM. Select 2: Ignore.

Step 3: Set response methods for scenarios with response methods for network disconnection and node health issues by Click Platform. You can set the Disconnection Duration for 10 seconds, 1 min, 2 mins, 5mins and 10 mins.

The Network Disconnection Condition scenarios are as follows:

Module	Impact	Response Method
Management Network Disconnection	Nodes can still run normally, but they cannot be managed by the platform. The high availability mechanism will not work, leading to higher business service risks.	Ignore
Business Network Disconnection	VMs running on the nodes cannot provide business services as expected.	Select 1: Restart VM on Healthy Node. Select 2: Hot-Migrate VM. Select 3: Ignore.
VXLAN Network Disconnection	The east-west traffic of the nodes will be interrupted, making business services unavailable.	Select 1: Restart VM on Healthy Node. Select 2: Hot-Migrate VM. Select 3: Ignore.
Storage Network Disconnection	VMs running on the nodes cannot provide business services as expected.	Restart VM on Healthy Node
aDesk Communication Network Disconnection	The VDI VMs running on the nodes will become inaccessible, greatly affecting the experience of end users.	Select 1: Hot-Migrate VM. Select 2: Ignore.

For Node Health, if you want to trigger the HA you need to enable the Host Health Monitoring first. The following is the step to set the Host Health Monitoring:

Click Health Monitoring in the left-up corner you can select the HCI cluster.

图形用户界面, 应用程序

AI 生成的内容可能不正确。

In the Node Health check Host hardware following item:

Host crash occurs: You can set the number of crashes for the physical server within a certain period of time.

Low node reliability: A node has low reliability if its reliability score is lower than 30 points (for the score refers to section 3.2 Table Node Performance score method). When a node encounters memory, disk, or RAID card failures, the reliability score will be reduced based on the frequency and type of these failures.

The Node Health Issues scenarios are as follows (if not enable the Host Health Monitoring the Trigger will be show disable):

Module

Trigger

Impact

Response Method

Frequent Node Crashes

Enabled

A node is deemed unhealthy if it crashes 3 times within 1 day.

A node crashes frequently due to reasons such as poor node quality and end of node lifetime.

Select 1: Hot-Migrate VM.

Select 2: Ignore.

Low Node Reliability

Enabled

The node hardware reliability decreases, with many abnormal hardware indicators detected.

Select 1: Hot-Migrate VM.

Select 2: Ignore.

Step 4: For the Application part When VMs operate abnormally, the platform can automatically restart the VMs for quick business recovery, no need extra setting.

Module	Impact	Response Method
Process Crash	VM business services will be interrupted unexpectedly.	VM will restart and HA to the health node.
I/O Hang	VM business services will be interrupted unexpectedly.	VM will restart and HA to the health node.

Here, Process Crash and I/O Hang refer to the virtual machine processes and I/O suspension at the HCI level, not internal exceptions within the virtual machine.

Step 5: For the Advanced Settings part, you can configure resource reservation, node health monitoring, and live migration schedule for the HA service.

Resource Reservation

Memory and CPU resources can be reserved on a node so that VMs running on other nodes can be recovered immediately to this node in the event of node failures. Click Configure Now go to the new page you can config the resource reservation. Click Add set the related resource reservation.

• Resource Pool: Select the HCI cluster that requires resource reservation.

• Node: Select a physical node within the cluster.

• Total CPU: The total CPU of the HCI cluster you select.

• Reserved CPU: The CPU quota you want to use for HA.

• Host Memory: The total memory of the HCI cluster you select.

• Reserved Memory: The memory quota you want to use for HA

Host Health Monitoring

The system can automatically identify whether a node is healthy. Healthy nodes are preferred for VM startup, HA and other tasks. VMs running on unhealthy nodes will be hot-migrated to or restarted on other healthy nodes to guarantee normal business operation. Click Configure Now go to the new page you can set the Health Monitoring configuration. For the Host hardware (CPU, memory, system disk or RAID card) anomaly part you can refuter to section 3.1 and step 2.

Step 6: For the HA History part. You can check which virtual machines have undergone HA. When viewing the VM HA records, you can select different time periods for filtering.

HA Test Guide

{{ $t('index.defaultHeader.logo') }}

Sangfor Cloud Platform (SCP)

HA