Hello everyone,
Before describing my case, I will summarise the issue. I currently have three PVE hosts in a cluster, which I will refer to as A, B and C. Each host has an equal vote in the quorum, and the virtual machines in this cluster are stored on Ceph. We have noticed some strange behaviour from host C: if it loses its connection to A, it tends to reboot on its own (triggered by the watchdog), even though it still has contact with host B.
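For context, here is a minimal sketch (plain Python, purely illustrative, and assuming the default one-vote-per-node, simple-majority behaviour with no qdevice or custom vote settings) of the quorum rule we expect to apply to this 3-vote cluster. It shows why losing only host A should leave B and C quorate:

```python
# Illustrative sketch of a simple-majority quorum rule for a 3-node cluster,
# one vote per node (assumption: defaults, no qdevice, no extra votes).

EXPECTED_VOTES = 3  # hosts A, B and C

def has_quorum(votes_present: int, expected_votes: int = EXPECTED_VOTES) -> bool:
    """A partition is quorate when it holds a strict majority of the expected votes."""
    return votes_present > expected_votes // 2

print(has_quorum(3))  # True  - A, B and C all reachable
print(has_quorum(2))  # True  - only A lost: B + C should stay quorate, no fencing expected
print(has_quorum(1))  # False - an isolated node: self-fencing via the watchdog is expected
```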
Please find below a summary of an incident that occurred earlier this week.
- Initially, we launched multiple actions to move virtual machines from host C to host A. These migrations increased the RAM usage of the destination host until it reached its maximum capacity. (For information, we currently have only 2 GB of swap in place; provisioning more is on our internal optimisation roadmap.)
- 12:11:51: PVE C - VMs begin migrating to host A (an initial batch of around thirty)
- 12:13:11: PVE C - VMs begin migrating to host B (around ten)
- 12:15:00: PVE C - Migration tasks to A are still active and we are waiting; those to PVE B are completing
- 12:17:21: PVE C - Some migration tasks to A are completing, but the majority remain pending
- 12:18:23: PVE A - Low-level Perl errors start appearing in the QemuMigrate.pm code, and migrations are still not progressing
- 12:18:43: PVE A - Ceph logs => all of its OSDs are down and there is no response on its TCP/IP network (neither its LAN IP nor its other IPs). The host appears to be completely down and reboots a few moments later, most likely due to total resource saturation caused by the VM migrations
- 12:18:43: PVE C - Ceph logs => host A is detected as DOWN and its OSDs are unreachable (heartbeat_check: no reply from 10.0.243.1 ..)
- 12:23:08: PVE C - the cluster is still visible with two hosts (B + C), but cascading low-level QemuMigrate errors occur, with 95 VMs left to migrate
- 12:25:09: PVE C - Watchdog timeout and subsequent reboot
- 12:25:26: PVE B - appears alone in the cluster and therefore logically restarts
- 12:27:11: PVE A has finished rebooting and the system is online
- 12:29:05: PVE C has finished rebooting and the system is online
- 12:29:39: PVE B has finished rebooting and the system is online
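To make our confusion concrete, here is the same assumed simple-majority rule walked through the membership changes in the timeline above (the timestamps come from our logs; the rule itself is only our assumption of the default corosync behaviour, not a reproduction of it):

```python
# Walking the incident through the assumed simple-majority quorum rule
# (3 expected votes, one per host). Purely illustrative.

EXPECTED_VOTES = 3

def has_quorum(votes_present: int) -> bool:
    return votes_present > EXPECTED_VOTES // 2

timeline = [
    ("12:18:43 - A down, B and C still see each other", 2),
    ("12:25:09 - C fences itself, B is left alone",     1),
]

for event, votes in timeline:
    print(f"{event}: votes={votes}/{EXPECTED_VOTES}, quorate={has_quorum(votes)}")

# 12:18:43 ...: votes=2/3, quorate=True  -> we would expect C to stay up
# 12:25:09 ...: votes=1/3, quorate=False -> B restarting once alone is the expected fencing
```

Under that reading, B restarting once it found itself alone makes sense to us, whereas C fencing itself at 12:25:09 while B was still reachable is exactly what we cannot explain.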
So this is an example of the kind of case where we do not understand host C's behaviour. Have you ever seen this kind of behaviour, and/or are you aware of any settings that could explain this restart? I am available if you have any questions or need configuration extracts.
Have a nice day,