KVM guests freeze (hung tasks) during backup/restore/migrate

Marcin Kubica

New Member
Feb 13, 2019
One thing: we seem to have restored on this host before without issues. Could this be the cause?
WARNING: Sum of all thin volume sizes (2.20 TiB) exceeds the size of thin pool fst/vms and the size of whole volume group (2.18 TiB)!
 

gkovacs

Well-Known Member
Dec 22, 2008
Budapest, Hungary
I'm new to Proxmox in general, but I'm having a similar problem on Proxmox 5.2-5. Upon restore, all guests froze after 2 minutes and reported SATA failures. One guest fared pretty badly and lost the partition tables on two disks. There is no indication of any issue in the host logs. How about setting no cache on the guest disks? Could this be a cause?
This looks exactly like the problem that I (and many others) have reported for years in this thread and others. The problem is most likely a Linux kernel / KVM issue.

What seems to happen is that when the local disks are busy due to a restore operation, the dirty page IO of the host system gets blocked, leading to CPU hangs / lockups in KVM guests, which sometimes result in guest kernel panics. Unfortunately, putting the swap partition on a different SSD than the array being restored to does not solve the problem. The problem has been happening since at least Proxmox 3, and it affects many hardware and software configurations and filesystems (ext4 on LVM, ZFS, etc.).

The issue even hit me a few days ago on a fully updated Proxmox 5 node when restoring a VM from Ceph to local-zfs, even though I had set a restore limit of 100 MB/s on a 4-disk RAIDZ1 SATA SSD pool. Despite the system having several times that IO capacity, websites hosted on KVM guests on this node timed out for minutes.

Unfortunately, neither the Proxmox nor the KVM developers acknowledge the issue, let alone own (and investigate) it, so no one is working on a solution. (People also fail to read the starter post of this thread, so the discussion quickly gets derailed.)

I have opened a bug in the Proxmox Bugzilla, but it was closed by @wolfgang as a "load issue", while it is clearly a bug in the kernel or QEMU/KVM code:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453

Mitigations
I don't see any real solution to this problem on the horizon, apart from using the tweaks that I have posted many times before. They don't solve the issue, but they lessen its impact (see the sketch after the list):
- bandwidth-limit backups, restores and migrations
- put swap on an NVMe SSD, and also the ZIL and L2ARC if you use ZFS
- use the recommended swap settings from the ZFS wiki
- set vm.swappiness=1 on the host and in guests
- increase vm.min_free_kbytes on both hosts and guests
- decrease vm.dirty_ratio to 2 and vm.dirty_background_ratio to 1 on both hosts and guests
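
For reference, here is roughly how I persist these sysctl values on a host (the same values go into the guests). The file name is my own choice, and 262144 for vm.min_free_kbytes is simply a value that has helped me, so treat this as a sketch rather than a tuned recommendation:

Code:
# /etc/sysctl.d/90-kvm-io-mitigations.conf
# swap only as a last resort
vm.swappiness = 1
# keep a larger reserve of free memory (value is in KiB)
vm.min_free_kbytes = 262144
# start writeback early and keep the dirty page cache small
vm.dirty_background_ratio = 1
vm.dirty_ratio = 2

Reload with sysctl --system (or reboot) after saving the file.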
 

voluhar

New Member
Jan 29, 2020
Hello, I am new to the forum.
I am pretty happy with Proxmox in my lab and I also run it in production; a few months ago I migrated to it from my VirtualBox setup on the same server.
The server is a Supermicro with a 2x NVMe SSD ZFS mirror, 12 x Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz (1 socket) and 128 GB of RAM.
For backups I attached a folder through NFS on a Synology storage box.
Kernel Version Linux 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100)
PVE Manager Version pve-manager/6.1-5/9bf06119

But last week I enabled backups, and this became a nightmare.
It does not matter which compression I choose (none/lzo/gzip); all of them cause similar issues.
Gzip seems to cause a lot more than lzo and none, but all of them do.
The last thing I tried was limiting the backup speed to 500 Mbps, which decreased the number of CPU stuck errors but still does not resolve the problem.
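
For reference, I set the limit globally in /etc/vzdump.conf, where bwlimit is given in KiB/s; by my conversion 500 Mbps comes out to roughly 61000 KiB/s (adjust as needed):

Code:
# /etc/vzdump.conf
# global backup bandwidth limit in KiB/s (~500 Mbps)
bwlimit: 61000

The same option can also be passed per job, e.g. vzdump 100 --bwlimit 61000.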
All of the VMs have this CPU stuck problem. On one of them we run GitLab on Arch Linux, and it gets stuck every night, so a restart of the VM is required. After the CPU lockup the file system switches to read-only and no one can connect to it anymore.
Another crash we had during this time: on a Windows DC the file system crashed so badly that after a restart it was not possible to boot it again. Of course, restoring from backup was not successful, because the crashed state was already in the oldest backup...
One pfSense router also crashed, but pfSense seems to be more robust than Linux and Windows.
The screenshot is from one of the VMs, the pfSense router.

Was anyone able to resolve this problem without just disabling backups? Limiting to 500 Mbps is already pretty slow.

I will configure backups with zfs send, but I do not see that as a real solution.
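
For the zfs send route I have something like the following in mind; the dataset, snapshot and host names are placeholders for my setup, and the target just needs to be a machine that can receive a ZFS stream:

Code:
# initial full replication of a VM disk dataset
zfs snapshot rpool/data/vm-100-disk-0@backup-1
zfs send rpool/data/vm-100-disk-0@backup-1 | ssh backup-host zfs receive -F backuppool/vm-100-disk-0

# later runs only send the increment between snapshots
zfs snapshot rpool/data/vm-100-disk-0@backup-2
zfs send -i @backup-1 rpool/data/vm-100-disk-0@backup-2 | ssh backup-host zfs receive backuppool/vm-100-disk-0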
 



gkovacs

Well-Known Member
Dec 22, 2008
Budapest, Hungary
Was anyone able to resolve this problem without just disabling backups? Limiting to 500 Mbps is already pretty slow.

I will configure backups with zfs send, but I do not see that as a real solution.
Read my previous posts... this is a years-old bug in the kernel scheduler or KVM (or both), and nobody cares about solving it. The Proxmox devs killed my bug report, and the KVM devs do not even acknowledge the bug report.

What happens is that during heavy IO load on the host (disk or network), KVM guests get starved of memory accesses (or CPU time), even when the load is not using all of the system's resources. For example, a 100 MB/s restore from an NFS backup or a local disk to an SSD can completely starve a KVM guest for minutes at a time, even though the SSD can write at least 400 MB/s sequentially.
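
You can watch the dirty page buildup yourself during a restore; these are standard procfs counters, nothing Proxmox-specific. When the guests start hanging, Dirty typically balloons while Writeback stays pinned:

Code:
# on the host, sample dirty page and writeback volume once per second
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'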

There is no real solution to this bug; you can only mitigate its effects by limiting the load on the host and turning down page cache activity:
- bandwidth-limit backups, restores and migrations
- set vm.swappiness=1 on the host and in guests
- decrease vm.dirty_ratio to 2 and vm.dirty_background_ratio to 1 on both hosts and guests
- increase vm.min_free_kbytes on both hosts and guests (at least 262144 helps)
 

mmenaz

Active Member
Jun 25, 2009
Northern east Italy
[...]
I have opened a bug in the Proxmox Bugzilla, but it was closed by @wolfgang as a "load issue", while it is clearly a bug in the kernel or QEMU/KVM code:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453
[...]
Wolfgang, in his last-but-one post (comment 7), said that new features had solved the issue, and no one replied that this was not the case, so he closed the issue. Maybe if you add a comment there about this, it could help to get the problem reopened and solved?
 

gkovacs

Well-Known Member
Dec 22, 2008
Budapest, Hungary
Wolfgang, in his last-but-one post (comment 7), said that new features had solved the issue, and no one replied that this was not the case, so he closed the issue. Maybe if you add a comment there about this, it could help to get the problem reopened and solved?
Good point, @mmenaz! Unfortunately, none of the new features solved this problem completely, and his observation that this is a load issue is superficial in light of the facts.

I have reopened the bug report and commented with our latest experience. I encourage everyone to do the same.
 

e100

Renowned Member
Nov 6, 2010
Columbus, Ohio
ulbuilder.wordpress.com
@gkovacs After reading every comment in this thread, I am in agreement with you. This problem exists, has existed for years now, and while the sysctl settings mentioned help, they do not entirely resolve it.
I am tired of having outages because of it. The whole point of being able to move disks to different storage and to restore VMs is to avoid outages, not create them!

I just started moving to Proxmox 6.x and I've seen the issue with it too.

I've mostly seen this when using disk move or when restoring a VM; bandwidth limits do not help.
Ideally, if a single process like the restore is consuming all of the IO, that process should be throttled and slowed down to allow other processes to also access storage. But this does not happen. What I find most troubling is that once the kernel gets into this state, not only is disk IO an issue, but things like networking and other tasks that require no disk IO do not get scheduled either. Additionally, if I am doing heavy writes to slow SATA rust and trigger this issue, why would that also halt IO to my NVMe and disrupt networking? (One idea for working around this at the block layer is sketched below.)
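
One workaround I am considering (not a fix, and untested on my side) is forcing the restore process itself into a throttled cgroup, so the cap is enforced at the block layer instead of inside vzdump. On a host with the cgroup v2 io controller, something like this should work; the device path, limits, archive name and VMID are placeholders:

Code:
# run the restore in a transient scope with per-device IO caps
systemd-run --scope \
  -p "IOReadBandwidthMax=/dev/sda 100M" \
  -p "IOWriteBandwidthMax=/dev/sda 100M" \
  qmrestore vzdump-qemu-100.vma.lzo 100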

It does not matter whether I'm using LVM, ext4 or ZFS; it happens regardless.
I've seen it on systems with 16 GB of RAM and with 256 GB of RAM; a fast or slow CPU does not matter either.

I would like to see this fixed, how can I help?
 
