KVM guests freeze (hung tasks) during backup/restore/migrate

Marcin Kubica

New Member
Feb 13, 2019
One more thing: we have restored on this host before without issues. Could this be the cause?
WARNING: Sum of all thin volume sizes (2.20 TiB) exceeds the size of thin pool fst/vms and the size of whole volume group (2.18 TiB)!
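For reference, the arithmetic behind that warning can be sketched as follows. This is only an illustration using the two figures printed in the warning; the function name is made up for the example. The warning means the thin volumes could outgrow the volume group if every volume were fully written. It does not by itself show that the pool is full, or that it is related to the freezes.

```python
# Figures copied from the LVM warning above (both in TiB).
PROVISIONED_TIB = 2.20   # sum of all thin volume sizes
VG_SIZE_TIB = 2.18       # size of the whole volume group


def overcommit_ratio(provisioned_tib: float, capacity_tib: float) -> float:
    """Return provisioned size divided by available capacity (hypothetical helper)."""
    return provisioned_tib / capacity_tib


ratio = overcommit_ratio(PROVISIONED_TIB, VG_SIZE_TIB)
shortfall_gib = (PROVISIONED_TIB - VG_SIZE_TIB) * 1024  # TiB to GiB

print(f"overcommit ratio: {ratio:.3f}")
print(f"shortfall if every volume were completely full: {shortfall_gib:.1f} GiB")
```

A ratio only slightly above 1 means the pool is barely overprovisioned; the actual fill level of the pool is a separate number that this calculation does not cover.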
 

gkovacs

Well-Known Member
Dec 22, 2008
Budapest, Hungary
I'm new to Proxmox in general, but I'm having a similar problem. Proxmox 5.2-5. Upon restore, all guests froze after 2 minutes and reported SATA failures. One guest was hit pretty badly and lost partition tables on two disks. No indication of any issue in the host logs. How about setting no cache on guest disks? Can this be a cause?
This looks exactly like the problem that I (and many others) reported for years in this thread and others. The problem is most likely a Linux kernel / KVM issue.

What seems to happen is this: when local disks are busy with a restore operation, the host's dirty-page IO is probably blocked, leading to CPU hangs / lockups in KVM guests, which sometimes result in guest kernel panics. Unfortunately, putting the swap partition on a different SSD than the array being restored to does not solve the problem. The problem has existed since at least Proxmox 3 and affects many hardware and software configurations and filesystems (ext4 on LVM, ZFS, etc.).
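One way to check whether dirty pages really pile up on the host during a restore is to watch the `Dirty` and `Writeback` counters in `/proc/meminfo`. The following is a minimal sketch, not a diagnostic tool: the function names are made up for this example and the sample text contains invented numbers, only the `/proc/meminfo` field names are real.

```python
def dirty_writeback_kb(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Lines look like "Dirty:   812344 kB"; keep the numeric field only.
            values[name] = int(rest.split()[0])
    return values


def read_host_counters() -> dict:
    """Read the live counters on a Linux host (not called in this sketch)."""
    with open("/proc/meminfo") as handle:
        return dirty_writeback_kb(handle.read())


# Invented sample, for illustration only.
SAMPLE = """MemTotal:       65842340 kB
Dirty:            812344 kB
Writeback:         10240 kB
WritebackTmp:          0 kB
"""

print(dirty_writeback_kb(SAMPLE))
```

Calling `read_host_counters()` every few seconds while a restore runs would show whether `Dirty` keeps climbing while the guests stall, which would be consistent with the explanation above but would not prove it.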

The issue even hit me a few days ago on a fully updated Proxmox 5 node when restoring a VM from Ceph to local-zfs, even though I had set a restore limit of 100 MB/s on a 4-disk RAIDZ1 SATA SSD pool. Although the system's IO capacity is several times higher than that limit, websites hosted on KVM guests on this node timed out for minutes.

Unfortunately, neither the Proxmox nor the KVM developers acknowledge the issue, let alone own (and investigate) it, so no one is working on a solution. (Also, people fail to read the starter post of this thread, so the discussion quickly gets derailed.)

I have opened a bug in the Proxmox bugzilla, but it was closed by @wolfgang as a "load issue", while it is clearly a bug in the kernel or QEMU/KVM code:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453

Mitigations
I don't see any real solution to this problem on the horizon, apart from the tweaks that I have posted many times before, which don't solve the issue but lessen its impact:
- bandwidth limit backups, restores and migrations
- put swap on an NVMe SSD, also ZIL+L2ARC if you use ZFS
- use recommended swap settings from the ZFS wiki
- use vm.swappiness=1 on host and in guests
- increase vm.min_free_kbytes on both hosts and guests
- decrease vm.dirty_ratio to 2, vm.dirty_background_ratio to 1 on both hosts and guests
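The sysctl values from the list above can be collected into a single fragment, for example with a small helper like this. It is only a sketch: the function name is made up, and since no concrete value for vm.min_free_kbytes is given above, that key is only written if you pass a value yourself.

```python
from typing import Optional

# Values taken from the list above.
POSTED_SETTINGS = {
    "vm.swappiness": 1,
    "vm.dirty_ratio": 2,
    "vm.dirty_background_ratio": 1,
}


def render_sysctl(min_free_kbytes: Optional[int] = None) -> str:
    """Render the settings as sysctl.conf-style 'key = value' lines."""
    settings = dict(POSTED_SETTINGS)
    if min_free_kbytes is not None:
        # No value is recommended in the post; the caller has to choose one.
        settings["vm.min_free_kbytes"] = min_free_kbytes
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


print(render_sysctl(), end="")
```

The output can be saved to a file under /etc/sysctl.d/ (the file name is up to you) and loaded with `sysctl --system`, on the host and in the guests.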
 