Log in to SCP and view the alerts triggered on the current day. Handle critical alerts first, such as node offline, datastore offline, and abnormal VM shutdown. Troubleshooting must be conducted within 15 minutes after a critical alert is triggered. For example, check network connectivity if a node is offline, and check the storage connection if a datastore is offline.
Regular alerts, such as those triggered when CPU usage exceeds the threshold or when packet errors occur at a low rate, must be recorded in the daily O&M log. An adjustment plan must be provided within 24 hours to resolve each regular alert. For example, if memory usage is high, assess whether to expand memory capacity or shut down idle VMs.
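The triage rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alert type names and the Alert class are hypothetical and do not come from SCP, and the two deadlines are the 15-minute and 24-hour limits given in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical alert type names; the identifiers used by SCP may differ.
CRITICAL_TYPES = {"node_offline", "datastore_offline", "vm_abnormal_shutdown"}

# Deadlines taken from the text above.
CRITICAL_DEADLINE = timedelta(minutes=15)  # troubleshooting window
REGULAR_DEADLINE = timedelta(hours=24)     # adjustment plan window


@dataclass
class Alert:
    alert_type: str
    triggered_at: datetime


def triage(alert):
    """Return (priority, due_by) for one alert."""
    if alert.alert_type in CRITICAL_TYPES:
        return "critical", alert.triggered_at + CRITICAL_DEADLINE
    return "regular", alert.triggered_at + REGULAR_DEADLINE


def sort_for_handling(alerts):
    """Put critical alerts first, then order by deadline within each group."""
    def key(alert):
        priority, due_by = triage(alert)
        return (0 if priority == "critical" else 1, due_by)
    return sorted(alerts, key=key)
```

Sorting by priority before deadline keeps a critical alert ahead of an older regular alert, which matches the instruction to prioritize critical alerts.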
Core resource status check:
Host layer: Check the connection status of nodes and the usage of CPU, memory, and storage resources (an alert is triggered when resource usage exceeds the threshold). Pay close attention to the status of the cluster controller and witness nodes.
Storage layer: Check the connection status of nodes and datastores, storage usage (an alert is triggered when usage exceeds 90%), and the I/O status of backup repositories (to avoid backup failures).
VM layer: Check the status of VMs that run core services (such as databases and production systems), their CPU and memory loads, and their connectivity to external networks, to mitigate the risk of response failures and backup failures.
Entry point: View information on the homepage of SCP or in the specific resource module.[7]
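The host-layer and storage-layer checks can be expressed as simple threshold comparisons. The sketch below is an assumption-laden illustration, not an SCP interface: the function names and input shapes are hypothetical, and the values would be read manually from the SCP pages. Only the 90% storage limit comes from the text; host thresholds are not stated there, so the caller supplies them.

```python
# 90% is the storage usage limit given in the text.
STORAGE_USAGE_LIMIT = 90.0


def check_datastore(name, connected, usage_percent):
    """Return a list of findings for one datastore (empty if healthy)."""
    findings = []
    if not connected:
        findings.append(f"{name}: datastore offline")
    if usage_percent > STORAGE_USAGE_LIMIT:
        findings.append(
            f"{name}: storage usage {usage_percent:.1f}% "
            f"exceeds {STORAGE_USAGE_LIMIT:.0f}%"
        )
    return findings


def check_host(name, connected, usage, thresholds):
    """Return findings for one node.

    usage and thresholds map a resource name (for example "cpu") to a
    percentage. The thresholds are caller-supplied assumptions.
    """
    findings = []
    if not connected:
        findings.append(f"{name}: node offline")
    for resource, limit in thresholds.items():
        value = usage.get(resource)
        if value is not None and value > limit:
            findings.append(
                f"{name}: {resource} usage {value:.1f}% exceeds {limit:.1f}%"
            )
    return findings
```

For example, check_datastore("ds1", True, 92.5) reports one finding, while a datastore at 85% with a healthy connection reports none.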
Logs and closed-loop management:
After completing daily O&M, fill in the SCP Daily O&M Form with information about the alerts fixed, resource anomalies, and pending issues. Make sure that all issues are traceable.
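If the form is also kept in machine-readable form, one possible record layout is sketched below. The class and field names are hypothetical and simply mirror the three categories named above; the actual layout of the SCP Daily O&M Form is not specified in this document.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DailyOMForm:
    """Hypothetical record mirroring the categories of the daily form."""
    day: str  # date of the O&M shift, for example "2024-01-01"
    alerts_fixed: list = field(default_factory=list)
    resource_anomalies: list = field(default_factory=list)
    pending_issues: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record so that it can be archived and traced later."""
        return json.dumps(asdict(self), indent=2)
```

Keeping pending issues in their own list makes it easy to carry them over to the next day's form, which supports the traceability requirement.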