6.11.3

O&M Guidelines

Updated: 2025-12-26
  1. Monitoring dimensions: Pay close attention to the following core dimensions so that every layer is monitored in real time:
     1. Node layer: Monitor CPU usage, temperature, and underclocking; memory usage and underclocking; and host swap partition usage. Watch for alerts triggered by resource usage exceeding the threshold or by hardware anomalies, such as CPU usage remaining above 90% or memory underclocking.
     2. Physical network layer: Monitor interface status (offline or abnormal), packet error rate, peak traffic volume, and VXLAN connectivity. Prioritize critical alerts triggered by network disconnection or a high packet error rate.
     3. Storage layer: Monitor IO usage and latency, connections between datastores and nodes, RAID card status, and storage usage (including backup repositories). Pay close attention to alerts triggered by offline datastores, storage usage exceeding 90%, or high IO latency.
     4. VM layer: Monitor CPU and memory usage, image file integrity, backup status, and connectivity to external networks. For VMs that run core services, pay close attention to alerts triggered by session connections exceeding the threshold or by abnormal shutdown.
     5. Cluster and licensing layer: Monitor offline nodes, expired or abnormal license keys, and health check notifications. Make sure that clusters have no single point of failure and that license keys are valid.
  2. Alert configuration principles:
     1. Thresholds: Adjust alert thresholds dynamically based on service loads to avoid both false positive and false negative alerts. It is recommended to follow the SCP best practices. For example, trigger an alert when the node CPU usage has been above 90% for 10 minutes.[1]
     2. Notification methods: It is recommended to send alert notifications by both email and SMS message so that O&M personnel can receive critical alerts 24/7.[2]
     3. Alert priorities: Critical alerts, such as an offline node or datastore, must be fixed within 15 minutes. For regular alerts, such as resource usage exceeding the threshold, adjustment plans to fix them must be provided within 24 hours. After an alert is fixed, it must be marked as Fixed on SCP.[3]
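
The threshold and priority rules above can be illustrated with a minimal Python sketch. It is not SCP code and does not call any SCP API: the names `Sample`, `sustained_breach`, and `handling_deadline` are hypothetical, and the only values taken from the guidelines are the 90% CPU threshold, the 10-minute window, and the 15-minute and 24-hour limits. It assumes that usage samples arrive as timestamped readings and that "above 90% for 10 minutes" means every reading in the most recent 10 minutes exceeds the threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values quoted from the guidelines; everything else here is illustrative.
CPU_THRESHOLD_PERCENT = 90.0
SUSTAINED_WINDOW = timedelta(minutes=10)
HANDLING_LIMITS = {
    "critical": timedelta(minutes=15),  # e.g. node or datastore offline
    "regular": timedelta(hours=24),     # e.g. resource usage over threshold
}


@dataclass
class Sample:
    """One hypothetical CPU usage reading for a node."""
    timestamp: datetime
    cpu_percent: float


def sustained_breach(samples, threshold=CPU_THRESHOLD_PERCENT,
                     window=SUSTAINED_WINDOW):
    """Return True if usage has stayed above `threshold` for at least
    `window`, measured up to the most recent sample."""
    breach_start = None
    latest = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        latest = sample.timestamp
        if sample.cpu_percent > threshold:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = sample.timestamp
        else:
            # Any reading at or below the threshold resets the breach.
            breach_start = None
    if breach_start is None:
        return False
    return latest - breach_start >= window


def handling_deadline(priority, raised_at):
    """Return the time by which an alert of this priority must be handled."""
    return raised_at + HANDLING_LIMITS[priority]
```

Requiring an uninterrupted breach, rather than a single reading above the threshold, is one way to reduce false positives from short load spikes. The window and threshold are parameters so that they can be tuned to the service load.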