Sangfor HCI and aSV provide a unified infrastructure combining compute, storage, networking, and built-in security to simplify deployment, operations, and services.
When a business or platform process runs in a memory region that has an ECC error, the kernel receives the ECC error message reported by the memory. The Sangfor HCI memory isolation mechanism then attempts to isolate the affected memory so that subsequent business or platform processes do not use it again.
Precautions
The memory ECC function requires the server CPU to support MCE (Machine Check Exception). If you need to confirm whether the hardware supports it, contact Sangfor technical support.
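As a rough illustration of what MCE support looks like on a generic x86 Linux host, the sketch below checks whether the CPU flags in /proc/cpuinfo-style text include `mce`. This is an assumption about a general Linux environment, not a documented Sangfor HCI procedure; the function name `cpu_supports_mce` is hypothetical, and Sangfor technical support remains the reference for hardware confirmation.

```python
def cpu_supports_mce(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists 'mce'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only inspect the 'flags' field; other fields may contain the
        # substring 'mce' by coincidence.
        if sep and key.strip() == "flags" and "mce" in value.split():
            return True
    return False


if __name__ == "__main__":
    # Assumes a generic x86 Linux host where /proc/cpuinfo is readable.
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            print("MCE flag present:", cpu_supports_mce(f.read()))
    except OSError:
        print("Could not read /proc/cpuinfo")
```

The check only reports whether the kernel advertises the flag; it does not confirm that ECC reporting is enabled on the platform.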
Prerequisites
None.
Steps
Log in to the HCI platform. When ECC memory isolation occurs, the interface displays an alert such as "The node (XX.XX.XX.XX)’s RAM stick (CPUx-MCx-CHANx-DIMMx) contains too many correctable ECC errors that could not be isolated."
• When the isolation ratio of a single memory module exceeds 10%, a general alert will be generated.
• When the isolation ratio of a single memory module exceeds 25%, very little memory remains available, and an emergency alert will be generated.
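The two thresholds above can be sketched as a small classification function. This is a minimal illustration, not the platform's implementation: it assumes the isolation ratio is the isolated size divided by the module's total size, and the names `classify_isolation`, `isolated_bytes`, and `module_bytes` are hypothetical.

```python
GENERAL_PERCENT = 10    # above this, a general alert is raised
EMERGENCY_PERCENT = 25  # above this, an emergency alert is raised


def classify_isolation(isolated_bytes: int, module_bytes: int) -> str:
    """Map a module's isolated/total size to 'none', 'general', or 'emergency'.

    Assumption: the ratio is isolated size over total module size.
    """
    if module_bytes <= 0:
        raise ValueError("module_bytes must be positive")
    if not 0 <= isolated_bytes <= module_bytes:
        raise ValueError("isolated_bytes must be between 0 and module_bytes")
    # Integer arithmetic avoids floating-point edge cases at exactly 10% or 25%.
    scaled = isolated_bytes * 100
    if scaled > EMERGENCY_PERCENT * module_bytes:
        return "emergency"
    if scaled > GENERAL_PERCENT * module_bytes:
        return "general"
    return "none"
```

Because the text says the ratio must exceed each threshold, a module at exactly 10% or exactly 25% stays in the lower category in this sketch.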
When an alert appears on the platform, it is recommended to contact the service provider to replace the memory module. Before replacing the module, enable node maintenance mode on the corresponding node.
The memory module address in the alert is the logical address of the module. Refer to the server's memory layout diagram to locate the physical module that must be replaced.
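To make the logical address easier to match against a layout diagram, the sketch below splits the CPUx-MCx-CHANx-DIMMx string from an alert into its four fields. It assumes each x is a decimal number; the names `DimmAddress` and `parse_dimm_address` are hypothetical, and mapping the fields to a physical slot still depends on the server's own memory layout diagram.

```python
import re
from typing import NamedTuple


class DimmAddress(NamedTuple):
    cpu: int   # CPU socket index
    mc: int    # memory controller index
    chan: int  # channel index
    dimm: int  # DIMM index within the channel


# Assumption: each field is a decimal number, as in CPU0-MC1-CHAN2-DIMM0.
_ADDRESS = re.compile(r"CPU(\d+)-MC(\d+)-CHAN(\d+)-DIMM(\d+)")


def parse_dimm_address(alert_text: str) -> DimmAddress:
    """Extract the first logical memory module address found in alert text."""
    match = _ADDRESS.search(alert_text)
    if match is None:
        raise ValueError("no CPUx-MCx-CHANx-DIMMx address found")
    return DimmAddress(*(int(group) for group in match.groups()))
```

For example, `parse_dimm_address("RAM stick (CPU0-MC1-CHAN2-DIMM0)")` yields the fields cpu=0, mc=1, chan=2, dimm=0, which can then be looked up by hand in the layout diagram.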