Platform availability management ensures business continuity mainly by monitoring the platform's CPU, memory, storage, and other resources.
Monitor the CPU usage of the HCI platform. Consider expanding the capacity of the HCI platform in any of the following situations:
• Sustained CPU usage exceeds 70%.
• In a common cluster scenario, the redundant CPU resources are insufficient to keep the business system running normally when one node (two-copy) or two nodes (three-copy) fail.
• In a stretched cluster scenario, the redundant CPU resources are insufficient to keep the business system running normally when a node in a fault domain fails.
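The 70% threshold applies to sustained usage, not to short peaks. As a minimal sketch, assuming usage is sampled at a fixed interval and that "sustained" means a run of consecutive samples above the threshold, the check could look like the following. The function names and the default of 12 consecutive samples are illustrative assumptions, not values defined by the HCI platform.

```python
from typing import Sequence


def longest_run_above(samples: Sequence[float], threshold: float) -> int:
    """Return the length of the longest run of consecutive samples above threshold."""
    longest = 0
    current = 0
    for value in samples:
        # Extend the current run, or reset it when a sample drops to or below the threshold.
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples: Sequence[float],
                 threshold: float = 70.0,
                 min_consecutive: int = 12) -> bool:
    """Treat usage as sustained when at least min_consecutive samples in a row exceed threshold.

    min_consecutive is an assumed parameter: with 5-minute samples, 12 means one hour.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    return longest_run_above(samples, threshold) >= min_consecutive


# Hypothetical CPU usage samples in percent.
cpu_samples = [65, 72, 75, 71, 69, 80, 90]
run_length = longest_run_above(cpu_samples, 70.0)  # three samples in a row: 72, 75, 71
```

The same check applies to memory by passing a threshold of 80.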
Monitor the memory configuration and memory usage of the HCI platform. Consider expanding the capacity of the HCI platform in any of the following situations:
• Sustained memory usage exceeds 80%.
• In a common cluster scenario, the redundant memory resources are insufficient to keep the business system running normally when one node (two-copy) or two nodes (three-copy) fail.
• In a stretched cluster scenario, the redundant memory resources are insufficient to keep the business system running normally when a node in a fault domain fails.
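The redundancy conditions for CPU and memory can be estimated with the same worst-case calculation: remove the largest nodes and check whether the remaining capacity still covers the current demand. The sketch below is a simplification under stated assumptions. It ignores per-VM placement constraints and hypervisor overhead, and `survives_node_loss` is a hypothetical helper, not a platform API.

```python
from typing import Sequence


def survives_node_loss(node_capacities: Sequence[float],
                       demand: float,
                       failures: int) -> bool:
    """Return True if demand still fits after the `failures` largest nodes are lost.

    node_capacities: usable capacity per node (for example GHz of CPU or GiB of memory).
    demand: resources currently required by the business system, in the same unit.
    failures: nodes assumed lost; 1 for two-copy, 2 for three-copy, and 1 for a
              single node in a fault domain of a stretched cluster.
    """
    if failures < 0 or failures >= len(node_capacities):
        raise ValueError("failures must be between 0 and the node count minus 1")
    # Worst case: the nodes with the most capacity are the ones that fail.
    remaining = sorted(node_capacities)[: len(node_capacities) - failures]
    return sum(remaining) >= demand


# Hypothetical cluster: four nodes with 256 GiB of memory each, 700 GiB in use.
memory_per_node = [256, 256, 256, 256]
ok_two_copy = survives_node_loss(memory_per_node, demand=700, failures=1)
ok_three_copy = survives_node_loss(memory_per_node, demand=700, failures=2)
```

In this example the cluster tolerates one node failure (768 GiB remain) but not two (512 GiB remain), so a three-copy deployment would call for expansion.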
Monitor the storage capacity utilization of the HCI platform, and consider expanding the storage capacity in any of the following situations:
• Storage usage exceeds 90%.
• The predicted remaining storage capacity is not enough to support the business system for 90 days. The administrator can use the storage capacity usage trend to predict how many days the remaining capacity of the HCI platform can support the business system, and adjust the expansion plan according to the forecast.
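One way to produce such a forecast is a straight-line fit through daily usage samples. The sketch below assumes one used-capacity sample per day and linear growth, which is an assumption of this example and not necessarily the method the platform uses for its own prediction. All function names are illustrative.

```python
import math
from typing import Sequence


def days_until_full(daily_used: Sequence[float], capacity: float) -> float:
    """Estimate the days until storage is full from a least-squares linear trend.

    daily_used: used capacity, one sample per day, oldest first.
    capacity: total usable capacity in the same unit.
    Returns math.inf when usage is flat or shrinking.
    """
    n = len(daily_used)
    if n < 2:
        raise ValueError("at least two daily samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used))
    slope = sxy / sxx  # average growth per day
    if slope <= 0:
        return math.inf
    return max(0.0, (capacity - daily_used[-1]) / slope)


def should_expand_storage(daily_used: Sequence[float],
                          capacity: float,
                          usage_limit: float = 0.90,
                          min_days: float = 90.0) -> bool:
    """Apply both criteria: usage above 90%, or fewer than 90 days of capacity left."""
    over_limit = daily_used[-1] / capacity > usage_limit
    return over_limit or days_until_full(daily_used, capacity) < min_days


# Hypothetical samples in TB: growth of 1 TB per day on a 100 TB pool.
history = [50, 51, 52, 53, 54]
remaining_days = days_until_full(history, capacity=100)  # 46.0 days at this rate
```

Here usage is only 54%, but the trend leaves 46 days of capacity, so the 90-day criterion already indicates that expansion should be planned.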
Monitor the storage performance of the HCI platform, and evaluate whether it meets the business system's needs by analyzing the IOPS of important virtual machines. When the I/O latency of the business system exceeds 20 ms, expand the storage performance of the HCI platform.
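As a rough illustration of the 20 ms criterion, the sketch below flags virtual machines whose average I/O latency exceeds the limit. It assumes latency samples have already been exported per virtual machine. The use of the mean, the data layout, and the VM names are assumptions of this example.

```python
from statistics import mean
from typing import Dict, List, Sequence


def vms_over_latency(latency_ms_by_vm: Dict[str, Sequence[float]],
                     limit_ms: float = 20.0) -> List[str]:
    """Return the names of VMs whose mean I/O latency exceeds limit_ms, sorted by name.

    VMs without samples are skipped rather than treated as healthy or unhealthy.
    """
    return sorted(
        vm for vm, samples in latency_ms_by_vm.items()
        if samples and mean(samples) > limit_ms
    )


# Hypothetical latency samples in milliseconds for three virtual machines.
samples = {
    "db01": [18.0, 25.0, 30.0],
    "web01": [5.0, 6.0, 7.0],
    "idle01": [],
}
slow_vms = vms_over_latency(samples)  # only db01 averages above 20 ms
```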
Monitor the backup and disaster recovery implementation for the business systems on the HCI platform. If necessary, select some business systems for recovery drills to verify that their backup and disaster recovery data are intact, so that in extreme cases the business can be restored from backup or disaster recovery data.
For the specific indicators to monitor, refer to the following table:
| Maintenance Item | Concerns | Maintenance Description |
|---|---|---|
| CPU usage | Sustained CPU usage does not exceed 70%. | View the peak and sustained CPU usage over the last month. |
| CPU resource redundancy | In a common cluster scenario, the redundant CPU resources are sufficient to keep the business system running normally when one node (two-copy) or two nodes (three-copy) fail. In a stretched cluster scenario, the redundant CPU resources are sufficient to keep the business system running normally when a node in a fault domain fails. | Calculate whether the current CPU resource redundancy meets the requirements. |
| Memory usage | Sustained memory usage does not exceed 80%. | View the peak and sustained memory usage over the last month. |
| Memory resource redundancy | In a common cluster scenario, the redundant memory resources are sufficient to keep the business system running normally when one node (two-copy) or two nodes (three-copy) fail. In a stretched cluster scenario, the redundant memory resources are sufficient to keep the business system running normally when a node in a fault domain fails. | Calculate whether the current memory resource redundancy meets the requirements. |
| Storage usage | Storage usage does not exceed 90%, and the predicted remaining storage capacity is sufficient to support the business system for more than 90 days. | View the current storage resource usage and the forecast capacity usage after 90 days. |
| Business backup or disaster recovery | Business backup and disaster recovery tasks run normally. Recovery drills restore the business normally. | Check the execution status of business backup and disaster recovery tasks. Conduct business recovery drills. |
Table 16: Availability Management