I've been doing some testing on disaster scenarios with ZFS/Proxmox.
I've noticed that across multiple hosts, if a VM has its qcow2 storage on a ZFS pool containing a bad disk (I have a disk for testing that passes all tests but generates occasional read errors), the whole system will hang with no recovery in many cases, and I have to reboot.
For example, in that situation I can enter my credentials at the console, but a prompt never appears after the MOTD.
Please understand: this is a testing environment. It's a long-established working server with recently tested, known-good base hardware into which I am deliberately injecting a known problem.
This PVE node in normal operation (a baseline established over long stretches, between 1 and 24 hours, of solid availability between tests):
Average load: 2.09 on a Xeon D-1518 2.2 GHz processor
Average RAM usage: 11.39 GB of 128 GB.
Two VMs (1 Windows, 1 Ubuntu):
- Each has a single 60 GB qcow2.
- One on the internal SSD zpool, one on the internal HDD zpool.
When I add an external SATA enclosure (the motherboard supports multiport SATA) with 4 drives (one bad), create a zpool on it, and let it sit, it runs indefinitely, and RAM usage does grow roughly in line with the pool size (I've tried different sizes).
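For reference, the test pool is built roughly like this (pool name and device paths are placeholders, and the exact vdev layout varies between tests):

zpool create -f testpool /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool status testpool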
If I back up to it, or attach it to either VM and copy files to it:
The system load jumps to 70 after hitting a delayed read error, and that's the last entry in the log. The whole system hangs. If I try to log in at the console, I don't even get a prompt. All VMs freeze and all I/O seems to halt until a reboot.
I can manage this, but I've not seen this issue before. It's most likely related to how this particular drive is failing, but still: is this expected behavior?
Oh, also: I tested this morning on another PVE node and the problem followed the drive.
Is there any recommended configuration to stave off this situation? I understand why it may be happening, but I'm not sure whether it can be mitigated.
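One knob I've been looking at, though I'm not sure it's the right lever, is the pool's failmode property. The default, wait, is documented to block I/O on a catastrophic pool failure until the device recovers and the errors are cleared, which sounds a lot like the behavior above; continue returns errors to new I/O instead (pool name is a placeholder):

zpool get failmode testpool
zpool set failmode=continue testpool

I haven't confirmed this actually prevents the host-wide hang, so treat it as a guess rather than a recommendation.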