We still don't have a solution, and it has started to happen randomly on other VMs as well. They also become unavailable and the console gives a timeout, so there is no way to debug. The logs don't show anything.
@t.lamprecht
Unfortunately I got excited too quickly. It worked fine on one VM, but I just tested another and it keeps timing out; I'm referring to the console display. It stays on "Connecting..." and nothing further is shown.
VM 192 qmp command 'change' failed - unable to connect...
Backups time out on some VMs but succeed on others. It also happens in the GUI, for example when trying to open the console. Copying the VM over to another hypervisor doesn't work either; we have to take the VM down, and then the copy succeeds. But on the new hypervisor the same behavior occurs.
If...
We once again have a strange problem where random VMs get a 'qmp socket - timeout' when using the console, or a 'time out' on backups. The issue started when we upgraded Proxmox to the latest version.
We are running all the latest updates and an NVMe Ceph cluster.
Some VMs (KVM) work fine and...
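In case it helps others debug this: a quick way to check whether QMP itself is wedged is to poke the VM's monitor socket directly. This is only a sketch, assuming the standard Proxmox socket location and using VMID 192 from the qmp error quoted above as an example:

# the QMP socket Proxmox uses for this VM
ls -l /var/run/qemu-server/192.qmp

# qm status --verbose queries the running VM, so on a healthy guest it
# returns quickly; if QMP is stuck it will hang until the timeout kicks in
timeout 10 qm status 192 --verbose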
We also experience daily corosync 3 crashes, and all the nodes that crash use the BNX2 driver.
We run a separate cluster network, so we don't rely on external networks. It also has its own dedicated switch.
Does Proxmox already have an explanation / solution for this? It has been ongoing since we moved to PVE 6.
I don't know if that 'increase' in token time is an actual solution.
We never experienced this on Proxmox VE 4; we only started getting it on VE 6.
So it would be better to find the actual cause, and since more people are reporting it, a solution needs to be found.
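For reference, the 'token increase' that keeps being suggested is an edit to the totem section of /etc/pve/corosync.conf, roughly like the sketch below. The values are only illustrative, and config_version has to be bumped or the change won't propagate across the cluster:

# /etc/pve/corosync.conf - only the relevant part shown, values illustrative
totem {
  cluster_name: ourcluster   # keep whatever is already configured here
  config_version: 5          # must be incremented on every edit
  token: 10000               # the suggested workaround: a larger token timeout, in milliseconds
  version: 2
}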
Just to update everyone: the change we made didn't do the trick, corosync is still crashing randomly (it just stayed stable for about 24 hours and then started again). We thought we had solved it, but it doesn't look like a solution after all.
We have been experiencing the same issue: the corosync service kept crashing on and off on different nodes of our Proxmox VE 6 cluster.
We solved it by simply disabling IPv6 on the cluster network interfaces:
echo "net.ipv6.conf.eno4.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p
...
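To double-check that it actually took effect (simple sanity checks, assuming the cluster interface is eno4 as above):

sysctl net.ipv6.conf.eno4.disable_ipv6   # should now report a value of 1
ip -6 addr show dev eno4                 # should no longer list any IPv6 addresses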
Update:
Found the issue why those 4 OSDs were not getting used: I had created a CRUSH entry outside of the existing pool's hierarchy, so those OSDs were not being used by the existing pool (dumb, I know!).
I increased pg_num to 1024, did the same for pgp_num, marked 2 OSDs as out, and now it's...
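For anyone running into the same thing, this is roughly what those steps look like on the Ceph CLI; the pool name and OSD ids below are placeholders, not our actual ones:

# raise placement groups for the pool
ceph osd pool set mypool pg_num 1024
ceph osd pool set mypool pgp_num 1024

# mark the two overly full OSDs as out so data rebalances away from them
ceph osd out osd.12 osd.13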
We recently started adding extra storage because the main storage nodes were getting full, and offloaded some data. But the cluster is not rebalancing correctly and some OSDs are getting too full (85%). We added new storage a few days ago, but it's not getting filled at all and remains at 0.00 or 0.01%...
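A quick way to see how the data is spread and whether the new OSDs even sit in the CRUSH hierarchy the pool uses (standard Ceph commands, nothing setup-specific):

ceph osd df tree      # per-OSD utilisation, grouped by the CRUSH tree
ceph osd crush tree   # shows where every OSD sits in the hierarchy
ceph df               # per-pool usage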
The cause lay in several hacked files located in /etc/init.d, because this node had been chrooted.
We have removed the files, run the updates, migrated the last templates that were still on it, and it is being wiped / reinstalled now.
Thanks for the help!
I'm trying to run apt-get -y upgrade but it keeps failing. I have tried a hundred things already, but I simply can't get it to work or find a solution.
I run VE 4.4 (pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve))
This is the output:
Reading package lists...
Building dependency tree...
Reading...
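Without the rest of the output it is hard to tell what exactly fails, but note that on Proxmox VE the documented upgrade path is a dist-upgrade rather than a plain upgrade, i.e. something along the lines of:

apt-get update
apt-get dist-upgrade   # unlike plain "upgrade", this may install/remove dependencies as needed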