This issue has been with us since we upgraded our cluster to Proxmox 4.x and converted our guests from OpenVZ to KVM. We have single- and dual-socket Westmere, Sandy Bridge and Ivy Bridge nodes using ZFS RAID10 HDD or ZFS RAIDZ SSD arrays, and every one of them is affected.
Description
When there is high IO load on the ZFS pools during vzdump, restore or migrate operations, guest IO slows down dramatically or even freezes for seconds at a time, resulting in:
- lost network connectivity (Windows guests often lose Remote Desktop connections)
- huge latency in network services
- CPU soft lockups
- RCU (rcu_sched) stalls
- blocked-task warnings in syslog, sometimes with stack dumps
We run monitoring services that poll the websites and other network services served by these guests every minute; that is how we first realized there was a problem, because alerts started arriving during the nightly backups.
This is today's soft lockup in a Debian 7 guest, captured during the restore of another KVM to zfs-local (6x HDD ZFS RAID10) on a single-socket Ivy Bridge system. There was no load on the host apart from the restore; the other guests were mostly idle:
A Windows KVM was also unreachable during that time.
Mitigation steps
We have tried many tweaks to eliminate the problem (the sysctl values we ended up with are sketched after this list):
- disabling C-states on Westmere systems
- enabling performance governor
- recommended swap settings from the ZFS wiki, also vm.swappiness=1
- increasing vm.min_free_kbytes on both hosts and guests
- decreasing vm.dirty_ratio to 5-15, vm.dirty_background_ratio to 1-5
- installing NVMe SSDs as SLOG/L2ARC, and also for swap
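For reference, this is roughly where the VM-related sysctls ended up on our hosts. The file name and the exact numbers below are only illustrative examples taken from the ranges above, not a recommendation:
Code:
# /etc/sysctl.d/90-zfs-host.conf  (hypothetical file; example values from the ranges listed above)
vm.swappiness = 1
vm.min_free_kbytes = 262144
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2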
Some of these tweaks helped a little, but the issue still occurs: somewhat less during backups, but still heavily during restores and migrations. It seems to be connected to the Linux kernel's VM (virtual memory) subsystem, because setting vm.vfs_cache_pressure to a high value (e.g. 1000) in a KVM guest makes the lockups happen much more often. Debian guests also seem more sensitive to it than, for example, Ubuntu, but Windows guests are affected as well.
Help needed
I am looking for input from others who run KVM guests on zfs-local (zvols): do you see the same symptoms? (You won't notice a 1-2 minute freeze unless you run some kind of monitoring; see the sketch below.) I would also welcome advice on how to diagnose this further, to narrow down whether the issue is in QEMU/KVM, ZFS or some other part of the kernel.
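If anyone wants to check whether their guests freeze the same way, even a very simple per-minute probe will catch it. Below is a minimal sketch in Python; the URL and the 10-second threshold are placeholders for whatever service your guest happens to run:
Code:
# Minimal per-minute HTTP probe to catch multi-second guest freezes.
# URL and threshold are placeholders; run this from a machine outside the affected host.
import time
import urllib.request

URL = "http://guest.example.com/"   # hypothetical guest-hosted service
THRESHOLD = 10.0                    # seconds; responses slower than this get flagged

while True:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=120) as resp:
            resp.read(1)            # force at least one byte over the wire
        elapsed = time.time() - start
        status = "OK" if elapsed < THRESHOLD else "SLOW"
    except Exception as exc:
        elapsed = time.time() - start
        status = "FAIL ({})".format(exc)
    print("{} {} {:.1f}s".format(time.strftime("%Y-%m-%d %H:%M:%S"), status, elapsed), flush=True)
    time.sleep(max(0.0, 60.0 - elapsed))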