I have a server running PVE 6.3 with ZFS 2.0.4 that I ran 'zpool upgrade -a' on. I didn't expect it to be able to reboot, but it did.
Now I'm wondering why it worked and if there is something specific that I can check to see if the others will also work?
Is not booting with GRUB...
In the Proxmox VE 6.4 release notes known issues section it says:
Please avoid using zpool upgrade on the "rpool" (root pool) itself, when upgrading to ZFS 2.0 on a system booted by GRUB in legacy mode, as that will break pool import by GRUB.
ZFS 2.0.4 is also available in 6.3-6 which I'm...
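To answer the "is there something specific I can check" question: the warning only applies when GRUB in legacy BIOS mode has to import the rpool itself. A minimal sketch of the check (the echoed messages are mine; on PVE 6.4+ you can also run 'proxmox-boot-tool status' for more detail, which was called 'pve-efiboot-tool' on earlier 6.x releases):

```shell
#!/bin/sh
# Sketch: determine how this node boots before touching 'zpool upgrade' on rpool.
# /sys/firmware/efi only exists when the kernel was booted via UEFI.
if [ -d /sys/firmware/efi ]; then
    echo "Booted via UEFI - systemd-boot (or GRUB in EFI mode) loads the kernel"
else
    echo "Booted in legacy BIOS mode - GRUB imports the rpool itself (the risky case)"
fi
```

If the second message appears, that node is in the case the release notes warn about.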
Just to conclude this thread:
TL;DR
proxmox 6.2-4 -> zfs 0.8.3 normal IO delay
proxmox 6.3-3 -> zfs 0.8.5 has problems causing excessive IO delay, huge performance penalty
proxmox 6.3-6 -> zfs 2.0.4 normal IO delay
I moved all VMs to the identically configured host and had the identical...
so https://forum.proxmox.com/threads/zfs-on-hdd-massive-performance-drop-after-update-from-proxmox-6-2-to-6-3.81820/ doesn't count?
Same before-and-after versions and exactly the same problem.
I will try this, thanks.
Yes, it's unfortunate that I replaced a drive and upgraded at the same time...
Hello Proxmox community,
TL;DR
Huge performance hit after upgrading to 6.3-3, probably having to do with IO delay.
Replication from system to system will stall completely for up to a minute at a time.
This graph shows the sharp rise in IO delay. The rise corresponds precisely with the...
I have the EXACT same symptoms caused by upgrading to 6.3-3 from 6.2-4.
That sharp rise in IO delay happened when I upgraded; the load was the same before and after. I also started having LONG hangs when doing replication between hosts, almost a minute long. I have an extremely fast pool of NVMe...
That's what I thought initially too. But the reason SSH is inaccessible is that sshd needs files that live in the /etc/pve hierarchy. /etc/pve becomes inaccessible when corosync/pve-cluster stop functioning, so anything that touches it hangs.
*and*
restarting...
Hopefully this is still being looked at. I still have nodes that go offline and can be brought back by restarting corosync, then restarting pve-cluster. I have 2 4-node clusters. 1 node in each cluster has never gone offline - they have 128G RAM in them. The other 3 nodes in each cluster...
I just figured out from this post https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/post-262788 that I can bring my 'grey' node back online by restarting corosync and pve-cluster on all nodes in the cluster. After doing this, I can now SSH into and out of the once...
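For anyone else hitting this, a sketch of that recovery procedure as a loop, assuming root SSH access from a healthy node. The node names are placeholders, and DRY_RUN=echo just prints the commands so they can be reviewed first; clear it to actually run them:

```shell
#!/bin/sh
# Restart corosync, then pve-cluster (pmxcfs), on every node in the cluster.
# Node names are placeholders for your cluster; set DRY_RUN= to execute for real.
NODES="pve1 pve2 pve3 pve4"
DRY_RUN=echo

for n in $NODES; do
    # Order matters: corosync first, then pve-cluster.
    $DRY_RUN ssh "root@$n" "systemctl restart corosync && systemctl restart pve-cluster"
done
```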
I'm reviving this thread. I've moved the clustering network to its own physical NIC on each node, going across an isolated switch. This did not fix the problem. I currently have 1 node that is in the 'grey' state - VMs continue to run and function normally, but I can't SSH into or out of...
I am using ZFS as my storage... However, the thread you referenced doesn't seem to be related. When my nodes go offline, the VMs and containers keep running just fine. My heaviest node has 83 VMs/containers that are very active on the filesystem and network. They all keep running... for days...
Yes, all traffic is on X.Y.241 (10GE, but only about 15% utilization max). I will try putting corosync on its own network, which will hopefully fix it - but it seems like a pretty serious bug in corosync if missing a packet can make a server become unresponsive. It seems like missing a packet...
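For reference, corosync 3 (kNet) supports multiple links, so the dedicated network can be added as a second link rather than replacing the existing one. A hedged sketch of the relevant pieces of /etc/pve/corosync.conf (node names and addresses are examples, not from this cluster; remember to increment config_version in the totem block before saving):

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.241.11   # existing shared 10GE network
    ring1_addr: 10.99.99.11   # dedicated corosync-only network
  }
  # ...one node entry per cluster member...
}

totem {
  # ...existing settings; bump config_version here...
}
```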
I would like to conclude this thread by apologizing for wasting anyone's time. It turns out that 'pct snapshot' is reliable.
Here's what was happening:
The script goes through and snapshots all the VMs with 'qm snapshot', then snapshots all the containers with 'pct snapshot'. The script then...
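A sketch of what that kind of snapshot-all loop looks like (reconstructed, not the author's actual script; the snapshot name and the DRY_RUN guard are my additions, while 'qm' and 'pct' are the stock PVE CLIs):

```shell
#!/bin/sh
# Sketch: snapshot all VMs with 'qm', then all containers with 'pct'.
# SNAPNAME and DRY_RUN are illustrative; set DRY_RUN= to actually snapshot.
SNAPNAME="auto-$(date +%Y%m%d-%H%M)"
DRY_RUN=echo

# Both 'qm list' and 'pct list' print a header row, then one guest per
# line with the numeric ID in the first column.
for vmid in $(qm list 2>/dev/null | awk 'NR>1 {print $1}'); do
    $DRY_RUN qm snapshot "$vmid" "$SNAPNAME"
done
for ctid in $(pct list 2>/dev/null | awk 'NR>1 {print $1}'); do
    $DRY_RUN pct snapshot "$ctid" "$SNAPNAME"
done
```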
HA is not enabled.
Another data point, and this may be the most relevant one... I have 14 nodes running Proxmox 6.0:
6 of them are standalone - all stable.
4 of them are in the first cluster - only 3 of them have gone offline
4 of them are in the second cluster - only 3 of them have gone...
another data point:
If I'm logged into the console I can NOT SSH out of the server, but all the VMs and containers continue to run without issue - reading/writing to disk, reading/writing to the network.
Another data point:
If I run 'systemctl restart sshd', sshd restarts, but the old hung sessions are still hung and I still can't SSH into the server.
If there are any other commands you want me to run on the console while it's in this hung state, let me know.
Another data point:
I was able to log into the console and run 'systemctl restart corosync', and corosync seemed to come back up and regain quorum - but I still can't SSH into the box, and that node in the GUI then changes from a grey question mark to a red X, just as if it were powered down.