Sangfor HCI and aSV provide a unified infrastructure combining compute, storage, networking, and built-in security to simplify deployment, operations, and services.
When an uncorrectable ECC memory error (UE) occurs in memory used by an application-layer (user-mode) process, the physical node does not go down; only the affected process is killed, because the memory isolation mechanism tries to isolate the faulty memory space. When the UE occurs in memory used by the kernel, the physical node goes down. After the node restarts, the HCI platform receives the ECC UE error message, and the memory isolation mechanism tries to isolate the faulty memory space so that services and platform processes do not use it again.
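The two cases above can be summarized as a small decision model. This is only an illustrative sketch of the behavior described in this section, not Sangfor code; the names FaultLocation, UeOutcome, and ue_outcome are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FaultLocation(Enum):
    """Where the memory hit by the uncorrectable error (UE) is in use."""
    USER_PROCESS = "user"   # application-layer (user-mode) process
    KERNEL = "kernel"       # kernel memory


@dataclass(frozen=True)
class UeOutcome:
    node_goes_down: bool        # does the physical node go down?
    process_killed: bool        # is only the affected process killed?
    isolation_attempted: str    # when isolation of the memory space is attempted


def ue_outcome(location: FaultLocation) -> UeOutcome:
    """Model the platform's reaction to an ECC UE, as described in the text."""
    if location is FaultLocation.USER_PROCESS:
        # Node stays up; the process is killed and the memory space is isolated.
        return UeOutcome(False, True, "while the node keeps running")
    # Kernel memory: the node goes down; isolation is attempted after the restart,
    # once the platform has received the ECC UE error message.
    return UeOutcome(True, False, "after the node restarts")


if __name__ == "__main__":
    for loc in FaultLocation:
        print(loc.name, ue_outcome(loc))
```

In both cases the goal is the same: keep later services and platform processes from touching the faulty memory space again.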
Precautions
The memory UE isolation function has the following hardware requirements:
The CPU must be based on the Purley platform.
The BIOS must be at RC code 610D02 (BKC: Purley BKC IPU 2020.2) or later.
The server firmware must not mask UE/CE (uncorrectable/correctable) errors found by memory scanning or UE/CE errors triggered by user-mode access.
The following BIOS settings apply: System memory poison set to Enabled (UE recovery function, required), Enable patrol scrub turned on (memory hardware scan function, required), and Patrol Scrub Error Downgrade (error downgrade function, optional).
To confirm the hardware information, contact Sangfor technical support.
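The BIOS requirements above can be kept as a simple checklist. The sketch below assumes the settings have already been read manually from the BIOS setup screen and typed into a dictionary; it does not query any BIOS or Sangfor interface, and the function name check_bios_settings and the value "Enabled" for every switch are assumptions made for illustration.

```python
# Setting names are taken from the precautions above.
REQUIRED_SETTINGS = ("System memory poison", "Enable patrol scrub")
OPTIONAL_SETTINGS = ("Patrol Scrub Error Downgrade",)


def check_bios_settings(settings):
    """Return (missing_required, optional_not_enabled) for a dict of settings.

    `settings` maps a setting name to the value noted from the BIOS screen,
    for example {"System memory poison": "Enabled"} (hypothetical input).
    """
    missing_required = [
        name for name in REQUIRED_SETTINGS if settings.get(name) != "Enabled"
    ]
    optional_not_enabled = [
        name for name in OPTIONAL_SETTINGS if settings.get(name) != "Enabled"
    ]
    return missing_required, optional_not_enabled


if __name__ == "__main__":
    noted = {"System memory poison": "Enabled", "Enable patrol scrub": "Disabled"}
    missing, optional_off = check_bios_settings(noted)
    print("Required settings still to enable:", missing)
    print("Optional settings not enabled:", optional_off)
```

An empty first list means both required switches are on; the optional list is informational only.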
Prerequisites
None.
Steps
Log in to the HCI platform. When ECC memory isolation occurs, an alert appears on the interface with the message: "The node (XX.XX.XX.XX)’s RAM stick (CPUx-MCx-CHANx-DIMMx) contains too many uncorrectable ECC errors that could not be isolated."
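If alert text is collected for troubleshooting, the node address and DIMM location can be pulled out of the message shown above. This is a minimal sketch that assumes real alerts follow the same wording with the XX and x placeholders replaced by actual values; the function name parse_ecc_alert and the sample address are hypothetical.

```python
import re

# Pattern built from the alert wording quoted above; accepts either a
# typographic (’) or a plain (') apostrophe after the node address.
ALERT_RE = re.compile(
    r"The node \((?P<node>[^)]+)\)[’']s RAM stick "
    r"\(CPU(?P<cpu>\w+)-MC(?P<mc>\w+)-CHAN(?P<chan>\w+)-DIMM(?P<dimm>\w+)\)"
)


def parse_ecc_alert(message):
    """Return a dict with node, cpu, mc, chan, and dimm, or None if no match."""
    match = ALERT_RE.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Sample message with made-up values in place of the placeholders.
    sample = (
        "The node (192.0.2.10)’s RAM stick (CPU1-MC0-CHAN2-DIMM0) contains "
        "too many uncorrectable ECC errors that could not be isolated."
    )
    print(parse_ecc_alert(sample))
```

The CPU, MC (memory controller), CHAN (channel), and DIMM fields together identify the physical memory module to report to Sangfor technical support.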