We have a working 4-host PVE 6.3-3 cluster installed with ZFS. All of those machines are enterprise gear in working condition. With these machines we archive, digitize and produce very large video files in different formats. These are often intensive processes that require a lot of resources and generate heavy disk usage.
Three hosts have a 2 TB zpool made of two mirror vdevs (2 × 2 mirrored disks), without a separate log device. The 4th host is special because it contains 3 zpools:
- zpool10, a 54 TiB zpool made of 9 mirror vdevs + 1 NVMe mirror for the log
- zpool11, a 130 TiB zpool made of 7 raidz1 vdevs + 1 NVMe mirror for the log
- zpool12, a 29 TiB zpool made of 3 mirror vdevs with no log device
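For clarity, zpool10's layout is roughly equivalent to a pool created like this; the device names below are placeholders, not our actual disks:

```
# Hypothetical sketch of zpool10's topology (placeholder device names):
# 9 two-way mirror vdevs for data, plus a mirrored NVMe SLOG.
zpool create zpool10 \
  mirror sda sdb \
  mirror sdc sdd \
  mirror sde sdf \
  mirror sdg sdh \
  mirror sdi sdj \
  mirror sdk sdl \
  mirror sdm sdn \
  mirror sdo sdp \
  mirror sdq sdr \
  log mirror nvme0n1 nvme1n1
```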
The 4th host is the central piece of this setup because it serves different purposes. zpool10 contains 2 datasets for VM and LXC storage, and 4 datasets which are network-shared via net usershare (Samba), accessible from our office and mounted on different VMs. zpool11 contains 4 other datasets which are also shared via net usershare. zpool12 contains a dedicated dataset for Borg backups but isn't used at the moment. The shared folders are used by Linux CentOS 7 and macOS High Sierra machines.

For a while now, we have observed different problems. The first thing we noticed is that when a machine does an intensive task, for example transcoding a large video file or even deleting a large ZFS snapshot, the network shares disconnect and become inaccessible. More recently we noticed that some VMs and containers hang for a while during intensive tasks. In those cases we can stop the VM, but when we try to start it again we get a systemd timeout. The "solution" is often to wait for the operation to finish, then try to start the VM again and/or restart smbd.service. We have tried moving the LXC disks to a separate LVM volume, and are considering doing the same for our VMs.

When this happens, all datasets and zpools are affected, regardless of which dataset the intensive operation is running on.
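For reference, the shares mentioned above are exported roughly like this; the share name and path are placeholders, not our real ones:

```
# Hypothetical example of how one of the shares is exported via Samba usershares
# (share name and path are placeholders):
net usershare add videoshare /zpool10/share_video "video archive" Everyone:F guest_ok=y

# List the currently defined usershares and their options
net usershare info --long

# When a share hangs, the usual workaround is simply restarting Samba:
systemctl restart smbd.service
```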
Those problems are very annoying because it means that when we perform a demanding task we must stop using the network shares and potentially some of the VMs. For now, we just want to know whether those problems are caused by a known bug or limitation in Proxmox itself (which I doubt), or whether we set something up wrong and have to rethink our infrastructure. The zpools aren't full, we do not reach 100% IO delay and we do not run out of RAM. It just looks like everything is freezing even though we do not reach the limits of our machines. Most importantly, I can't find any relevant log entries when those problems happen.
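For context, these are roughly the checks we run while a stall is happening (standard tools, nothing specific to our setup), and nothing obvious stands out in any of them:

```
# Pool fill level and health
zpool list
zpool status -v

# Per-vdev I/O while the intensive task runs (5-second interval)
zpool iostat -v zpool10 5

# ARC size / hit rate and overall memory pressure
arc_summary | head -n 40
free -h

# Look for hung-task or ZFS messages in the kernel log and Samba's journal
dmesg -T | grep -iE 'hung|blocked|zfs'
journalctl -u smbd.service --since "1 hour ago"
```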
Do any of you have similar issues with a similar setup? What would you do differently from us?
Thanks!