Unfortunately I got excited too soon. It worked fine on one VM, but I just tested another and the Console display keeps timing out on it. It stays at "Connecting..." and nothing further is displayed.
VM 192 qmp command 'change' failed - unable to connect...
Backups time out on some VMs but succeed on others. The same happens in the GUI, for example when trying to open the console. Copying the VM to another hypervisor also fails while it is running; we have to shut the VM down before the copy succeeds. Even then, the same behavior occurs on the new hypervisor.
We once again have a strange problem where random VMs get a 'qmp socket - timeout' when using the Console, or a 'time out' on backups. The issue started when we upgraded Proxmox to the latest version.
We have all the latest updates installed and run an NVMe Ceph cluster.
Some VMs (KVM) work fine and...
We also experience daily crashes of corosync 3, and all the nodes that crash use the BNX2 driver.
We run a separate cluster network, so we don't rely on external networks. It also has its own dedicated switch.
Does Proxmox already have an explanation or a solution for this? It has been ongoing since we moved to PVE 6.
I don't know if that 'increase' in token time is an actual solution.
We never experienced this on Proxmox VE 4; it only started on VE 6.
So it's better to find the actual cause, and since more people are reporting it, a solution needs to be found.
Just to update everyone: the change we made didn't do the trick. Corosync stayed stable for about 24 hours and then started crashing randomly again. We thought we had solved it, but it doesn't seem to be a real solution.
We have been experiencing the same issue: the corosync service kept crashing intermittently on different nodes in our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces:
echo net.ipv6.conf.eno4.disable_ipv6 = 1 >> /etc/sysctl.conf
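Note that the echo above only appends the line to the file; the setting takes effect after a reload (`sysctl -p`) or a reboot, and running it twice adds the line twice. A minimal sketch of a repeat-safe variant follows, assuming a POSIX shell. The helper name `persist_disable_ipv6` is made up for illustration, and `eno4` is just the interface from the command above, so substitute your own cluster interface.

```shell
#!/bin/sh
# Sketch only: persist "disable IPv6" for one interface without
# appending a duplicate line. persist_disable_ipv6 is a hypothetical
# helper name, not a Proxmox or sysctl tool.
persist_disable_ipv6() {
    iface="$1"
    conf="${2:-/etc/sysctl.conf}"   # second argument lets you target another file
    line="net.ipv6.conf.${iface}.disable_ipv6 = 1"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Usage (as root):
#   persist_disable_ipv6 eno4
#   sysctl -p    # reload /etc/sysctl.conf so the setting applies now
```

Whether this is a real fix for the corosync crashes or only a workaround is not something this sketch can tell you.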
I found out why those 4 OSDs were not being used. I had created a CRUSH name outside of the existing pool, so the existing pool never placed data on those OSDs (dumb, I know!).
I increased pg_num to 1024, did the same for pgp_num, marked 2 OSDs as out, and now it's...
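For anyone wanting to repeat the pg_num/pgp_num change, here is a rough sketch of the commands involved, assuming a POSIX shell. The helper `pg_commands` and the pool name `mypool` are placeholders made up for this example; 1024 is the value mentioned above. The function only prints the commands so they can be reviewed before anything touches the cluster.

```shell
#!/bin/sh
# Sketch only: print the ceph commands that set pg_num and pgp_num
# on a pool. pg_commands is a hypothetical helper; nothing is executed.
pg_commands() {
    pool="$1"
    pgs="$2"
    printf 'ceph osd pool set %s pg_num %s\n' "$pool" "$pgs"
    printf 'ceph osd pool set %s pgp_num %s\n' "$pool" "$pgs"
}

# Usage: review the output first, then pipe it to a shell on a Ceph node.
#   pg_commands mypool 1024
#   pg_commands mypool 1024 | sh
# To check which CRUSH root each OSD sits under:
#   ceph osd tree
```

Raising the placement group count triggers data movement, so it is probably best done while the cluster is otherwise healthy.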