Hey All,
I can't seem to get a stable Proxmox server. This was my last post on the subject. I got sick of fiddling with one or two variables and then waiting 2 weeks to see if it would hang, so I went ahead and changed almost everything in the system. I upgraded from Proxmox 5.4 to 6, I swapped out all the RAM with known-good RAM from another system, and I swapped out the HD with another SSD. I also moved all the VMs from NFS storage to the local OS SSD, a 2TB Intel 760p formatted with ZFS.
The only thing my new build has in common with the old build is the motherboard/SoC, the SuperMicro M11SDV-8C-LN4F Epyc 3251 motherboard. That and all the VMs/containers I'm running, of course.
Today, after about two weeks of uptime, during a VM backup the hypervisor and VMs hung. I saw this on the console:
"WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended."
I can still enter characters at the login prompt, but then it hangs when I hit "enter". I can't ssh into the Proxmox host and I can't ssh into the VMs. Everything is unresponsive. Recovering the server requires a full power-off and power-on; just rebooting always leads to the same bad state very quickly, if it boots at all.
Here is what I can gather:
- No crash logging in kern.log or syslog. The message above never appears in kern.log, so I assume that as soon as it is hit the system can no longer log failures. I can find no logging from the time of the crash and no indication of the problem. This is the ZFS pool I'm using for the OS and the VMs, so obviously this coincides with the server dying.
- `zpool history rpool` reports "zpool import -N rpool" was the only command run close to the crash
- `zpool status rpool` reports "No known data errors"
- The system died while backing up a Windows VM (that has the GPU passed-through, in case that matters) to an NFS share.
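For anyone who wants to compare notes, here is a rough sketch of how the checks above could be collected in one pass after the power cycle. It is only a sketch under assumptions: the helper names `run` and `collect_diagnostics` are made up for this post, `/dev/nvme0` is a guess at the SSD's device node, `smartctl` needs smartmontools installed, and `journalctl -b -1` only shows the previous boot if the journal is persistent (and nothing will be there if the pool was already suspended when the kernel tried to log).

```shell
#!/bin/sh
# Sketch: gather post-crash diagnostics for a ZFS pool.
# Helper names and the default device path are assumptions, not part of any tool.

run() {
    # Print the command line, then run it only if the tool is installed.
    echo "== $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exited with status $?)"
    else
        echo "(skipped: $1 not found)"
    fi
}

collect_diagnostics() {
    pool="${1:-rpool}"          # pool name from the console warning
    disk="${2:-/dev/nvme0}"     # assumed device node for the SSD

    run zpool status -v "$pool"         # per-device read/write/checksum error counters
    run zpool events -v                 # ZFS event log kept in memory since boot
    run zpool history -il "$pool"       # history including internal events, long format
    run journalctl -k -b -1 --no-pager  # kernel messages from the previous boot, if kept
    run smartctl -a "$disk"             # drive health and error log
}
```

Calling `collect_diagnostics rpool /dev/nvme0 > crash-report.txt` as root right after the machine comes back up would put everything in one file. The `run` wrapper just skips any tool that isn't installed instead of aborting.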