Strange behaviour of a PVE in connection with Watchdogs

SkyTek

New Member
Jun 6, 2025
Hello everyone,

Before describing my case, I will summarise the issue. I currently have three PVE hosts in a cluster; I will refer to them as A, B and C. Each host has an equal vote in the quorum. The virtual machines in this cluster are stored on Ceph storage. We have noticed some strange behaviour from host C: if it loses its connection with A, it tends to restart on its own (watchdog-triggered), even though it still has contact with host B.

Please find below a summary of an incident that occurred earlier this week.

  • Initially, we launched multiple actions to move virtual machines from host C to host A. These migrations increased the RAM usage of the destination host until it was completely full. (For information, we currently have only 2 GB of swap in place; increasing it is on our internal optimisation roadmap.)
  • 12:11:51: PVE C - VMs begin migrating to host A (an initial batch of around thirty)
  • 12:13:11: PVE C - VMs begin migrating to host B (around ten)
  • 12:15:00: PVE C - Migration tasks to A are still active; we are waiting. Those to PVE B are completing
  • 12:17:21: PVE C - Some migration tasks to A are completing, but the majority remain pending
  • 12:18:23: PVE A - Low-level Perl errors start appearing in the QemuMigrate.pm code, and migrations are still not progressing
  • 12:18:43: PVE A - CEPH logs => all of its OSDs are down and there is no response on its TCP/IP network (neither its LAN IP nor its other IPs). The host appears to be completely down and restarts a few moments later; total resource saturation, likely due to the VM migrations
  • 12:18:43: PVE C - CEPH logs => detection of host A DOWN and its OSDs are unreachable (heartbeat_check: no reply from 10.0.243.1 ..)
  • 12:23:08: PVE C - still visible in a cluster of two hosts (B + C), but cascading low-level errors on QemuMigrate; 95 VMs left to migrate
  • 12:25:09: PVE C - Watchdog timeout and subsequent reboot
  • 12:25:26: PVE B - appears alone in the cluster and therefore logically restarts
  • 12:27:11: PVE A has finished rebooting and the system is online
  • 12:29:05: PVE C has finished rebooting and the system is online
  • 12:29:39: PVE B has finished rebooting and the system is online

This is a good example of the kind of case where we do not understand host C's behaviour. Have you ever seen this kind of behaviour, and/or are you aware of any settings that could explain this restart? I am happy to answer questions or provide configuration extracts.

Have a nice day,
 
I'm not an expert, but this looks like your nodes lost quorum and fenced (rebooted) themselves.

Do you use dedicated networks/NICs for corosync, migration and Ceph?

If not, starting a migration can saturate the network, and because corosync is very latency-sensitive you can run into exactly this issue.

PVE cluster requirements
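
For anyone digging into a similar event: a quick way to see whether the node really lost quorum before the watchdog fired is to look at corosync and the HA/watchdog services around the time of the reboot. This is only a minimal sketch, assuming a standard PVE install (the time window just matches the timeline above):

```
# Current quorum state, expected votes and membership as corosync sees them
pvecm status

# Per-link status of the corosync (knet) links to the other nodes
corosync-cfgtool -s

# Journal entries for the HA services and the watchdog multiplexer, which is
# what actually triggers the reboot once quorum has been lost for too long
journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
    --since "12:10" --until "12:30"
```

If corosync itself could not reach the other nodes over the saturated network, the node would consider itself without quorum even though other traffic (e.g. Ceph) still got through, and the watchdog would eventually reboot it.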
 
Hello, thank you for your feedback.

Indeed, we have a dedicated network for the Ceph part; everything else shares the same network. So if I understand your feedback correctly, is a dedicated network for corosync the ideal, or, in your view, a requirement?

Have a good day,
 
It's a recommendation. You should at least set up multiple corosync rings for redundancy: if one network is saturated, corosync can communicate over another one. If all your networks are saturated and the low-latency requirements of corosync cannot be guaranteed, you should consider using an additional dedicated NIC/network just for corosync.
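
Just to illustrate what that can look like: a second ring is configured per node in /etc/pve/corosync.conf. The excerpt below is only a sketch with made-up node names, addresses and priorities (and config_version has to be increased whenever the file is edited):

```
totem {
  cluster_name: mycluster
  config_version: 5
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
    knet_link_priority: 10   # preferred link, e.g. a dedicated corosync network
  }
  interface {
    linknumber: 1
    knet_link_priority: 5    # fallback link over the existing shared network
  }
}

nodelist {
  node {
    name: pve-a
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated corosync NIC
    ring1_addr: 192.168.1.1   # shared network used as fallback
  }
  # ... same pattern (ring0_addr + ring1_addr) for the other two nodes ...
}
```

With two links, corosync (knet) can fail over to the second network if the first one becomes unusable, which is exactly the redundancy described above.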
 
Hello, thank you again for your feedback.

We will look into it and see, in particular, whether we can add physical network links to create this redundancy.

Have a good day!