This is curious.
Would it be okay for you to gather a bit more information? It seems that, for some reason, the pvestatd service still collects and distributes the old pre-PVE 9 metric format, but under the new key...
So to further see...
Swap is more than just an escape hatch for low-memory situations: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
But given that the host has ~185G of memory, you could consider disabling swap; with that much RAM, if you run out of memory...
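If you do go that way, disabling it is straightforward (assuming a regular swap partition or file listed in /etc/fstab):

swapoff -a

Then remove or comment out the swap entry in /etc/fstab so it stays off after a reboot.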
Hmm, those versions look new enough. Can you please restart the pvestatd service on the hosts? Either in the Node→System panel, or with
systemctl restart pvestatd
Does that help to get rid of the log messages?
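To verify, you could watch the logs of the service for a few minutes after the restart, for example with:

journalctl -u pvestatd --since "5 minutes ago"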
Yeah. I assume you have one interface for everything on the hosts, which goes to the switch, right?
The single point of failure there is the switch.
If you can add a direct cable between the hosts, without a switch in between, you can configure a...
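Just as an illustration, assuming it ends up as a second corosync link on its own subnet (all addresses here are made up), the relevant parts of corosync.conf would look roughly like this:

nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: 192.168.1.11
    # the direct link
    ring1_addr: 10.10.10.1
  }
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 192.168.1.12
    ring1_addr: 10.10.10.2
  }
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

Edit it via /etc/pve/corosync.conf and increase config_version when you do.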
Well, those long timeouts are most likely the explanation. If corosync takes too long to form a new quorum with just the QDevice, it might take longer than the 60s timeout of the LRM!
Please set it back to the defaults; from one of my test clusters...
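For illustration, a totem section without any custom timeouts looks something like this (cluster_name and config_version are placeholders and will differ on your side):

totem {
  cluster_name: testcluster
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}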
Can you please post your /etc/pve/corosync.conf file? And make sure that the /etc/pve/corosync.conf and /etc/corosync/corosync.conf files are the same.
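A quick way to check that they match:

diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

No output means they are identical.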
Any tool that allows booting a live system on the physical host and in the target VM to transfer the disk contents.
That can be just a regular Linux live system and dd + ssh on both sides. Or something more guided like Clonezilla. There are surely also...
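A rough sketch of the dd + ssh route (the device names and target IP are made up; triple-check them, as dd will happily overwrite the wrong disk):

# run on the live system booted on the physical host
dd if=/dev/sda bs=4M status=progress | ssh root@192.168.1.50 'dd of=/dev/sda bs=4M'

The target VM needs to be booted into a live system with sshd running, and its disk should be at least as large as the source.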
Was that host installed a while ago? Because, IIRC, the installer has limited the ARC by default since about 8.1. If you installed earlier, you can manually set a limit on the ARC...
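The limit is a kernel module parameter. For example, to cap the ARC at 8 GiB (the value is in bytes; pick whatever fits your host):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592

Then run update-initramfs -u and reboot, or write the value to /sys/module/zfs/parameters/zfs_arc_max to apply it right away.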
Yeah, if you can do another test, I am interested in the pvecm status output of node pve/Node1 a few seconds after you disconnect pve1/Node2, but before it eventually fences itself (if something is wrong).
I think that info is not yet in...
There is a misunderstanding in how fencing works.
It is handled by the LRM on each node. If it is in "active" mode and the host loses the connection to the quorum for more than 60 seconds, it will not renew the watchdog. Once the watchdog runs...
Can you disable the HA resource, wait ~10 minutes until all LRMs are idle, and then do the following, please (commands sketched after the list)? With no active LRM, the nodes won't fence.
1. get pvecm status while all nodes are up and working
2. disconnect one of the nodes
3. get...
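Roughly, with vm:100 standing in for your actual HA resource ID:

ha-manager set vm:100 --state disabled
ha-manager status    # wait until the LRMs show up as idle
pvecm status         # capture for step 1, then again after disconnecting a node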
Not great, because the Ceph Cluster network is only used for the replication between the OSDs. Everything else, including IO from the guests to their virtual disks, goes via the Ceph Public network. Which is probably why you see the rather meh...
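For reference, that split is defined in the ceph.conf; with made-up subnets it looks like this:

[global]
    public_network = 10.0.0.0/24
    cluster_network = 10.0.1.0/24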
Does the Ceph network actually provide 10Gbit? Check with ethtool whether the NICs autonegotiated to the expected speed, and if so, run some iperf tests between the nodes to see how much bandwidth it can actually deliver.
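For example (the NIC name and peer address are placeholders; I'd use iperf3 these days):

ethtool enp3s0 | grep Speed
iperf3 -s                   # on the first node
iperf3 -c 10.0.0.2 -t 30    # on the second node, against the first one's Ceph IP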
Are both Ceph Public and Ceph Cluster...
Hmm, can you post the output of pveversion -v on the host where you have/had guest 130? Ideally inside of [CODE] tags (or use the formatting options in the buttons above; </> for a code block).
Is guest 130 a VM or a CT?
Okay, that is curious. Are all guests powered on, or are some powered off?
For example, guest VMID 130 in that error message from the first post. Was it on or off at that time?
Since this just came up in the English forum as well, here is my answer there with a few details regarding the now-simpler pinning: https://forum.proxmox.com/threads/network-drops-on-boot.65210/#post-793255