KVM guests freeze (hung tasks) during backup/restore/migrate

guletz · Jun 19, 2017

gkovacs said:
so it's probably the huge volume of sequential IO causing some buffers / caches to fill up and in turn block the IO of KVM guests.

For sure, zfs cand not have large sequential IO. If you have non-mirror zpool, it is very unlikely to have something like zis. Also if the free space in the pool is let say under around 85 % the seek time it will be higher, so IOPs will be bad.

I can not say nothing about qcow2, but I see that the performance /IOPs is very bad compared with raw.

And like I said before, try to avoid the results with zfs send-recive, for example.

good luck

guletz · Jun 20, 2017

Maybe the new zfs upgrade will help.

gkovacs · Jun 20, 2017

guletz said:
Maybe the new zfs upgrade will help.

What ZFS upgrade? Do you have any more information on that?

BTW this forum is full of threads about this issue, and many users are experiencing hung tasks and cpu stalls on LVM+RAW, LVM+ext4 and NFS filesystems when backups, restores and migrations are running, so I don't expect that a ZFS upgrade would solve this.

This is most likely a kernel+KVM IO scheduling issue that has not been fixed for a long time.

guletz · Jun 21, 2017

gkovacs said:
What ZFS upgrade? Do you have any more information on that?

ZoL was make a new release(6 days ago).

gkovacs said:
This is most likely a kernel+KVM IO scheduling issue that has not been fixed for a long time.

... this it is nice

gkovacs · Jun 24, 2017

gkovacs said:
Now I suspect vm.dirty_ratio and vm.dirty_background_ ratio again, will see if I'm right tonight by changing from 15/5 to 50/1.

Increasing vm.dirty_ratio and vm.dirty_background_ ratio did not help, but decreasing them afterwards did help a little: when set to the below values on the host, the KVM guests still slow down during backups (website response times increase 10x), but they less often fully time out (or produce hung tasks):

/etc/sysctl.conf

Code:

vm.dirty_ratio = 5
vm.dirty_background_ ratio = 1

guletz · Jun 24, 2017

Good to know.

proxtom · Jul 11, 2017

Any news on this?
As I am having the same problem on several Proxmox installs with ZFS (2node Clusters, v4.latest, AND even with new v5.0!).
Servers are DELL (different Modells, R815, R630, R410, .. ) with different HBAs, Disk-Drives/-Brands and even CPU-Architecture (AMD, Intel).
Symptom(s) started IFRC showing up with some upgrade in 4.3/4.4.
I always hoped, there will be a solution with a later update and tried to cope with that situation and to avoid actions which caused those system 'hangs'/stalls as much as possible.
But now, as this still shows up with Proxmox v5.x I think I need to reach out for help on how to further debug/analyze this to find a mitigation.
Unfortunately I am more kind of a 'user' rather than a linux debug pro - so I need some helping hand in tackling this down.

googling around revealed some others complaining about High IO wait and stalled systems while doing relatively 'normal' jobs (with Proxmox on ZFS), which could be easily handled by other filesystems (e.g.: ext4) with same hardware. The ZFS guys are getting bug reports about similiar problems/bugs as well. But until now, none of the threads I found offered a proven solution.

I think someone from the proxmox dev team needs to team up with somone from the ZFS team. I think nothing can be done from the normal proxmox user side to find a root cause and to fix it (at least not form my side, as I am not a developer nor a heavily skilled linux problem analyst).

But I am willing to provide info/test results as requested/adviced by others.
So please lend a helping hand and guide me on how to contribute to a solution.
Anyone else having same problem?
Pleas raise your hand, so proxmox team know about this beeing a more 'widespread' problem, not just an 'edge case'.

gkovacs · Jul 13, 2017

proxtom said:
Any news on this?
As I am having the same problem on several Proxmox installs with ZFS (2node Clusters, v4.latest, AND even with new v5.0!).
Servers are DELL (different Modells, R815, R630, R410, .. ) with different HBAs, Disk-Drives/-Brands and even CPU-Architecture (AMD, Intel).
Symptom(s) started IFRC showing up with some upgrade in 4.3/4.4.
I always hoped, there will be a solution with a later update and tried to cope with that situation and to avoid actions which caused those system 'hangs'/stalls as much as possible.
But now, as this still shows up with Proxmox v5.x I think I need to reach out for help on how to further debug/analyze this to find a mitigation.
Unfortunately I am more kind of a 'user' rather than a linux debug pro - so I need some helping hand in tackling this down.

I was really hoping that jumping to 4.10 kernel in PVE 5 would solve this issue, but unfortunately, from the reports pouring in it looks like this kernel / KVM / VirtIO issue is still there.

Let's recap what we know: when the host does heavy IO on the block device where KVM guests are stored (backups, restores, migrations), guests using VirtIO disks are denied of CPU resources for so long that their kernel notices, hence CPU stuck / stall / soft lockup messages appear in syslog, network services get disrupted and time out.

All guest OS types seem to be affected (we have confirmed Linux and Windows, but Debian 7 and 8 are hit the hardest, with both 3.x and 4.x kernels.) All host filesystems seem to be affected (LVM+ext4, XFS, ZFS confirmed by us and others), so it's not just a ZFS issue.

All host hardware architectures seem to be affected (single and dual socket Westmere, Sandy Bridge, Ivy Bridge confirmed by us, more platforms confirmed by others). The problem appears on single drive, hardware RAID and ZFS software RAID as well.

So far there is no solution for the problem apart from connecting the guest's disk to the IDE controller. There are some sysctl settings on the host that lessen the impact of this issue (most important is decreasing vm.dirty_ratio to 5 or less and vm.dirty_background_ ratio to 1), but none of them solve it.

I think someone from the proxmox dev team needs to team up with somone from the ZFS team. I think nothing can be done from the normal proxmox user side to find a root cause and to fix it (at least not form my side, as I am not a developer nor a heavily skilled linux problem analyst).

But I am willing to provide info/test results as requested/adviced by others.
So please lend a helping hand and guide me on how to contribute to a solution.
Anyone else having same problem?
Pleas raise your hand, so proxmox team know about this beeing a more 'widespread' problem, not just an 'edge case'.

I agree with your points raised. The Proxmox dev team @dietmar @tom @fabian needs to acknowledge and test for this issue, as they have much more experience than any of us here, but most likely it's not something they can fix on their own. This is most likely a nasty kernel / KVM / VirtIO issue hitting ZFS users especially hard, but googling around shows it has appeared for many years for KVM users on many platforms.

We are also willing to provide any help, info or test results to someone who can better diagnose this issue, and would welcome other users do the same.

proxtom · Jul 13, 2017

I can add some more/new observations from my side.
While testing of Install of debian-netinst-8.8 into a new VM and simultanously looking at atop on then node where the VM installs:

A) using QCOW2 disk format for IDE/SATA/SCSI/VirtIO
Looking at atop values immediately shows
- disk(s) being busy with 90%-100% all the time when debian install starts to write to disk and for some long time after
- wait goes above 90% (even seen 140%)
- Install of debian base system (ext4, only basic tools and sshd) takes HOURS to complete

B) using RAW OR VMDK for IDE/SATA/SCSI/VirtIO
Looking at atop immediately shows
- disk being busy below 10% most of the time (usually 3%-6%) during times debian install writes do disk.
- wait normally does not go further than 10%
- Install of debian base system (ext4, only basic tools and sshd) takes MINUTES to complete

-> Thus, by using RAW or VMDK for the VMs my PVE cluster node host is much more responsive with unsaturated disk channels and nearly no wait at all and this in turn will help in avoid having those 'CPU stalls' in other VMs running in parallel.

Note: A) is going to happen, even if this is the ONLY VM using host/cluster (each cluster node: 34Cores, 256GB Ram!)

i) QCOW handling seems seriously broken, as this was not the case with older (IFRC 3.x, 4.3) PVE versions and it is not related to using VirtIO only. Or the other way around: switching Controller-Type doesn't help (so it's not related to IDE/SAT/SCSI/VirtIO controller emulation)
ii) A+B are the same, whether I write directly to some ZFS storage or by using a GluserFS volume on top of same ZFS storage
(NO real degradation of GlusterFS over ZFS compared to direct ZFS use). Thus, bad QCOW behavior isn't related to Filesystem (ZFS or GlusterFS in this test)
What's interesting is, that the other GlusterFS/ZFS-PVE node (not hosting the InstallVM) does not have the skyrocketing disk channel busy and wait values. It just seems to handle the replicated disk-io very well.
iii) use of VMDK as disks is comparable to using RAW (its _somewhat_ slower, but not in 10th of magnitudes slower and load-/busy-generation as with QCOW2)

As VMDK usage should be considered 'same use' as of QCOW (usig a file container) it's not clear to me as to why QCOW is soooo worse than VMDK use. -> something gots broken with QCOW handling.

EDIT: Same tests with a debian-netinst-9.0.0 iso showed, that with QCOW2 disk the values are still 'bad', but only at 1/3 of that of debina-8.8.
As far as I can remeber the situation with win7 and win2008 was the same as with debian8.8. I will retetst and post results then.

Thus, this very bad qcow2 behavior with some VMs/OSes seems to contribute to the problem of 'stalled cpus' in time of heavy load as it contributes significantly to the 'load'.

mir · Jul 13, 2017

Could be the double COW is causing the bad IO?

proxtom · Jul 16, 2017

Anyone form Proxmox looking at/into this?

dcsapak · Jul 17, 2017

mir said:
Could be the double COW is causing the bad IO?

yes definitely, double cow causes very bad io.
last time i tried using qcow2 on zfs, i got about 1MB/s on a (consumer) ssd

mir · Jul 17, 2017

dcsapak said:
yes definitely, double cow causes very bad io.
last time i tried using qcow2 on zfs, i got about 1MB/s on a (consumer) ssd

Since there are no advantages using qcow2 for disk images when deploying on zfs backed storage it might be an idea to warn users in the wiki?

dcsapak · Jul 17, 2017

mir said:
Since there are no advantages using qcow2 for disk images when deploying on zfs backed storage it might be an idea to warn users in the wiki?

i even would go as far as to say that using zfs as a dir storage (and not with the zfspool plugin) does not give any advantage, i will look as how we can improve the documentation

gkovacs · Jul 19, 2017

mir said:
Could be the double COW is causing the bad IO?

dcsapak said:
yes definitely, double cow causes very bad io.
last time i tried using qcow2 on zfs, i got about 1MB/s on a (consumer) ssd

mir said:
Since there are no advantages using qcow2 for disk images when deploying on zfs backed storage it might be an idea to warn users in the wiki?

dcsapak said:
i even would go as far as to say that using zfs as a dir storage (and not with the zfspool plugin) does not give any advantage, i will look as how we can improve the documentation

Guys, this thread is about the CPU stalls / hung tasks that happen in KVM guests when there is high IO load on the host. I have no idea why you hijack this thread talking about the effect of double COW, but rest assured: the issue discussed here happens on ZFS+ZVOL nodes, EXT4+LVM nodes, etc. Please stay on topic here, we need an acknowledgement from Proxmox on this issue.

gkovacs · Jul 20, 2017

proxtom said:
Anyone form Proxmox looking at/into this?

I have filed a bugreport, you can add your experience if it's different from ours:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453

micro · Sep 9, 2017

Hi, I think we are experiencing exactly the same issue here:
https://forum.proxmox.com/threads/3000-msec-ping-and-packet-drops-with-virtio-under-load.36687/

Ekkas · Sep 15, 2017

Something similar here:
https://forum.proxmox.com/threads/proxmox-5-ceph-luminous-bluestore-super-slow.36969/

saphirblanc · Oct 24, 2018

We are facing the same issue when we try to restore a backup, even if we limit the read speed. IO are so high that all VMs are getting stuck and are showing up the following message :

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226496.055048] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226524.055243] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226552.055439] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226580.055636] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226608.055896] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226636.056032] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd-cgroups:1464]

Our packages versions are the following one on this node (which is in a cluster) :

proxmox-ve: 5.2-2 (running kernel: 4.15.18-3-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-6
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-25
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-1
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

We restored the VM on an HDD ZFS Pool. However, some VMs which were not on the same pool (SSD) were also affected which is even more weird. At the end we niced the process to reduce the IO however I don't think it's something we should have to do everytime. Is there somewhere to look at else ?

Thank you very much.

EDIT : I had the following messages in the syslog of the host :

Code:

Oct 24 15:21:42 aramis kernel: [4748139.247005] INFO: task systemd-journal:1818 blocked for more than 120 seconds.
Oct 24 15:21:42 aramis kernel: [4748139.247046]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:21:42 aramis kernel: [4748139.247070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:21:42 aramis kernel: [4748139.247102] systemd-journal D    0  1818      1 0x00000104
Oct 24 15:21:42 aramis kernel: [4748139.247106] Call Trace:
Oct 24 15:21:42 aramis kernel: [4748139.247118]  __schedule+0x3e0/0x870
Oct 24 15:21:42 aramis kernel: [4748139.247123]  ? shmem_swapin+0x6d/0xc0
Oct 24 15:21:42 aramis kernel: [4748139.247125]  schedule+0x36/0x80
Oct 24 15:21:42 aramis kernel: [4748139.247129]  io_schedule+0x16/0x40
Oct 24 15:21:42 aramis kernel: [4748139.247132]  __lock_page+0xff/0x140
Oct 24 15:21:42 aramis kernel: [4748139.247134]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:21:42 aramis kernel: [4748139.247137]  shmem_getpage_gfp+0x981/0xcc0
Oct 24 15:21:42 aramis kernel: [4748139.247143]  ? dput+0x34/0x1f0
Oct 24 15:21:42 aramis kernel: [4748139.247145]  shmem_fault+0xa0/0x1e0
Oct 24 15:21:42 aramis kernel: [4748139.247147]  ? file_update_time+0xc9/0x110
Oct 24 15:21:42 aramis kernel: [4748139.247150]  __do_fault+0x24/0xe3
Oct 24 15:21:42 aramis kernel: [4748139.247152]  __handle_mm_fault+0xae5/0x11e0
Oct 24 15:21:42 aramis kernel: [4748139.247154]  handle_mm_fault+0xce/0x1b0
Oct 24 15:21:42 aramis kernel: [4748139.247158]  __do_page_fault+0x25e/0x500
Oct 24 15:21:42 aramis kernel: [4748139.247162]  ? do_sys_open+0x1bc/0x280
Oct 24 15:21:42 aramis kernel: [4748139.247164]  do_page_fault+0x2e/0xe0
Oct 24 15:21:42 aramis kernel: [4748139.247168]  ? page_fault+0x2f/0x50
Oct 24 15:21:42 aramis kernel: [4748139.247169]  page_fault+0x45/0x50
Oct 24 15:21:42 aramis kernel: [4748139.247172] RIP: 0033:0x7fd6db355ea5
Oct 24 15:21:42 aramis kernel: [4748139.247174] RSP: 002b:00007ffd4e5333a0 EFLAGS: 00010297
Oct 24 15:21:42 aramis kernel: [4748139.247175] RAX: 0000000005763f58 RBX: 0000000000001044 RCX: 000000000254adc8
Oct 24 15:21:42 aramis kernel: [4748139.247176] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 00007fd6d4510dc8
Oct 24 15:21:42 aramis kernel: [4748139.247178] RBP: 000000000254adc8 R08: 0000000002557350 R09: 0000557b03126230
Oct 24 15:21:42 aramis kernel: [4748139.247179] R10: be31bf1827fa2d79 R11: 4c0f14fbc6e1bc43 R12: 0000557b031352c0
Oct 24 15:21:42 aramis kernel: [4748139.247180] R13: 00007ffd4e5333c8 R14: 00007ffd4e533460 R15: 00007fd6d4510dc8
Oct 24 15:21:42 aramis kernel: [4748139.247253] INFO: task kworker/3:0:35780 blocked for more than 120 seconds.
Oct 24 15:21:42 aramis kernel: [4748139.247281]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:21:42 aramis kernel: [4748139.247305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:21:42 aramis kernel: [4748139.247336] kworker/3:0     D    0 35780      2 0x80000000
Oct 24 15:21:42 aramis kernel: [4748139.247369] Workqueue: events async_pf_execute [kvm]
Oct 24 15:21:42 aramis kernel: [4748139.247370] Call Trace:
Oct 24 15:21:42 aramis kernel: [4748139.247373]  __schedule+0x3e0/0x870
Oct 24 15:21:42 aramis kernel: [4748139.247376]  schedule+0x36/0x80
Oct 24 15:21:42 aramis kernel: [4748139.247377]  io_schedule+0x16/0x40
Oct 24 15:21:42 aramis kernel: [4748139.247380]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:21:42 aramis kernel: [4748139.247382]  ? find_get_entry+0x1e/0x100
Oct 24 15:21:42 aramis kernel: [4748139.247384]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:21:42 aramis kernel: [4748139.247385]  do_swap_page+0x5e5/0x9b0
Oct 24 15:21:42 aramis kernel: [4748139.247387]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:21:42 aramis kernel: [4748139.247390]  handle_mm_fault+0xce/0x1b0
Oct 24 15:21:42 aramis kernel: [4748139.247393]  __get_user_pages+0x11c/0x6c0
Oct 24 15:21:42 aramis kernel: [4748139.247395]  ? __switch_to_asm+0x34/0x70
Oct 24 15:21:42 aramis kernel: [4748139.247396]  ? __switch_to_asm+0x40/0x70
Oct 24 15:21:42 aramis kernel: [4748139.247399]  get_user_pages_remote+0x126/0x1b0
Oct 24 15:21:42 aramis kernel: [4748139.247412]  async_pf_execute+0x7a/0x190 [kvm]
Oct 24 15:21:42 aramis kernel: [4748139.247414]  process_one_work+0x1e0/0x400
Oct 24 15:21:42 aramis kernel: [4748139.247416]  worker_thread+0x4b/0x420
Oct 24 15:21:42 aramis kernel: [4748139.247419]  kthread+0x105/0x140
Oct 24 15:21:42 aramis kernel: [4748139.247420]  ? process_one_work+0x400/0x400
Oct 24 15:21:42 aramis kernel: [4748139.247422]  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 24 15:21:42 aramis kernel: [4748139.247424]  ret_from_fork+0x35/0x40
Oct 24 15:23:43 aramis kernel: [4748260.079930] INFO: task kvm:29103 blocked for more than 120 seconds.
Oct 24 15:23:43 aramis kernel: [4748260.079965]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:23:43 aramis kernel: [4748260.079984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:23:43 aramis kernel: [4748260.080008] kvm             D    0 29103      1 0x00000000
Oct 24 15:23:43 aramis kernel: [4748260.080011] Call Trace:
Oct 24 15:23:43 aramis kernel: [4748260.080021]  __schedule+0x3e0/0x870
Oct 24 15:23:43 aramis kernel: [4748260.080025]  ? get_swap_bio+0xcf/0x100
Oct 24 15:23:43 aramis kernel: [4748260.080027]  schedule+0x36/0x80
Oct 24 15:23:43 aramis kernel: [4748260.080030]  io_schedule+0x16/0x40
Oct 24 15:23:43 aramis kernel: [4748260.080033]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:23:43 aramis kernel: [4748260.080035]  ? find_get_entry+0x1e/0x100
Oct 24 15:23:43 aramis kernel: [4748260.080037]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:23:43 aramis kernel: [4748260.080039]  do_swap_page+0x5e5/0x9b0
Oct 24 15:23:43 aramis kernel: [4748260.080041]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:23:43 aramis kernel: [4748260.080044]  ? ttwu_do_activate+0x77/0x80
Oct 24 15:23:43 aramis kernel: [4748260.080046]  handle_mm_fault+0xce/0x1b0
Oct 24 15:23:43 aramis kernel: [4748260.080050]  __get_user_pages+0x11c/0x6c0
Oct 24 15:23:43 aramis kernel: [4748260.080053]  get_user_pages_unlocked+0x12e/0x1b0
Oct 24 15:23:43 aramis kernel: [4748260.080072]  __gfn_to_pfn_memslot+0x30e/0x410 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080085]  try_async_pf+0x8d/0x1f0 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080097]  tdp_page_fault+0x12d/0x290 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080108]  kvm_mmu_page_fault+0x62/0x160 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080113]  handle_ept_violation+0xad/0x140 [kvm_intel]
Oct 24 15:23:43 aramis kernel: [4748260.080117]  vmx_handle_exit+0xb5/0x1520 [kvm_intel]
Oct 24 15:23:43 aramis kernel: [4748260.080119]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Oct 24 15:23:43 aramis kernel: [4748260.080122]  ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
Oct 24 15:23:43 aramis kernel: [4748260.080133]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080144]  ? kvm_arch_vcpu_load+0x4d/0x250 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080154]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080162]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080170]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080173]  do_vfs_ioctl+0xa6/0x620
Oct 24 15:23:43 aramis kernel: [4748260.080183]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080185]  SyS_ioctl+0x79/0x90
Oct 24 15:23:43 aramis kernel: [4748260.080187]  ? exit_to_usermode_loop+0xa5/0xd0
Oct 24 15:23:43 aramis kernel: [4748260.080189]  do_syscall_64+0x73/0x130
Oct 24 15:23:43 aramis kernel: [4748260.080191]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:23:43 aramis kernel: [4748260.080194] RIP: 0033:0x7f5c5c59add7
Oct 24 15:23:43 aramis kernel: [4748260.080195] RSP: 002b:00007f5c4fffb538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 15:23:43 aramis kernel: [4748260.080197] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f5c5c59add7
Oct 24 15:23:43 aramis kernel: [4748260.080197] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
Oct 24 15:23:43 aramis kernel: [4748260.080198] RBP: 00007f5c509ec000 R08: 0000560726687350 R09: 000000000000ffff
Oct 24 15:23:43 aramis kernel: [4748260.080199] R10: 00007f5c74ed4000 R11: 0000000000000246 R12: 0000000000000000
Oct 24 15:23:43 aramis kernel: [4748260.080200] R13: 00007f5c74ed3000 R14: 0000000000000000 R15: 00007f5c509ec000
Oct 24 15:23:43 aramis kernel: [4748260.080229] INFO: task kworker/6:1:7682 blocked for more than 120 seconds.
Oct 24 15:23:43 aramis kernel: [4748260.080252]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:23:43 aramis kernel: [4748260.080271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:23:43 aramis kernel: [4748260.080294] kworker/6:1     D    0  7682      2 0x80000000
Oct 24 15:23:43 aramis kernel: [4748260.080307] Workqueue: events async_pf_execute [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080308] Call Trace:
Oct 24 15:23:43 aramis kernel: [4748260.080310]  __schedule+0x3e0/0x870
Oct 24 15:23:43 aramis kernel: [4748260.080313]  schedule+0x36/0x80
Oct 24 15:23:43 aramis kernel: [4748260.080314]  io_schedule+0x16/0x40
Oct 24 15:23:43 aramis kernel: [4748260.080316]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:23:43 aramis kernel: [4748260.080318]  ? find_get_entry+0x1e/0x100
Oct 24 15:23:43 aramis kernel: [4748260.080319]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:23:43 aramis kernel: [4748260.080321]  do_swap_page+0x5e5/0x9b0
Oct 24 15:23:43 aramis kernel: [4748260.080323]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:23:43 aramis kernel: [4748260.080325]  handle_mm_fault+0xce/0x1b0
Oct 24 15:23:43 aramis kernel: [4748260.080327]  __get_user_pages+0x11c/0x6c0
Oct 24 15:23:43 aramis kernel: [4748260.080329]  ? __switch_to_asm+0x34/0x70
Oct 24 15:23:43 aramis kernel: [4748260.080330]  ? __switch_to_asm+0x40/0x70
Oct 24 15:23:43 aramis kernel: [4748260.080332]  get_user_pages_remote+0x126/0x1b0
Oct 24 15:23:43 aramis kernel: [4748260.080341]  async_pf_execute+0x7a/0x190 [kvm]
Oct 24 15:23:43 aramis kernel: [4748260.080344]  process_one_work+0x1e0/0x400
Oct 24 15:23:43 aramis kernel: [4748260.080345]  worker_thread+0x4b/0x420
Oct 24 15:23:43 aramis kernel: [4748260.080348]  kthread+0x105/0x140
Oct 24 15:23:43 aramis kernel: [4748260.080349]  ? process_one_work+0x400/0x400
Oct 24 15:23:43 aramis kernel: [4748260.080351]  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 24 15:23:43 aramis kernel: [4748260.080353]  ret_from_fork+0x35/0x40
Oct 24 15:25:44 aramis kernel: [4748380.912780] INFO: task kvm:11651 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.912811]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.912830] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.912853] kvm             D    0 11651      1 0x80000000
Oct 24 15:25:44 aramis kernel: [4748380.912856] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.912867]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.912871]  ? get_swap_bio+0xcf/0x100
Oct 24 15:25:44 aramis kernel: [4748380.912873]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.912876]  io_schedule+0x16/0x40
Oct 24 15:25:44 aramis kernel: [4748380.912879]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:25:44 aramis kernel: [4748380.912881]  ? find_get_entry+0x1e/0x100
Oct 24 15:25:44 aramis kernel: [4748380.912883]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:25:44 aramis kernel: [4748380.912885]  do_swap_page+0x5e5/0x9b0
Oct 24 15:25:44 aramis kernel: [4748380.912887]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:25:44 aramis kernel: [4748380.912889]  handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.912894]  __get_user_pages+0x11c/0x6c0
Oct 24 15:25:44 aramis kernel: [4748380.912896]  get_user_pages_unlocked+0x12e/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.912916]  __gfn_to_pfn_memslot+0x30e/0x410 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912929]  try_async_pf+0x8d/0x1f0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912940]  tdp_page_fault+0x12d/0x290 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912951]  kvm_mmu_page_fault+0x62/0x160 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912955]  handle_ept_violation+0xad/0x140 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.912959]  vmx_handle_exit+0xb5/0x1520 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.912961]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.912964]  ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.912975]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912986]  ? kvm_arch_vcpu_load+0x4d/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.912995]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913004]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913011]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913014]  do_vfs_ioctl+0xa6/0x620
Oct 24 15:25:44 aramis kernel: [4748380.913024]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913026]  SyS_ioctl+0x79/0x90
Oct 24 15:25:44 aramis kernel: [4748380.913028]  ? exit_to_usermode_loop+0xa5/0xd0
Oct 24 15:25:44 aramis kernel: [4748380.913030]  do_syscall_64+0x73/0x130
Oct 24 15:25:44 aramis kernel: [4748380.913032]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:25:44 aramis kernel: [4748380.913034] RIP: 0033:0x7fae32850dd7
Oct 24 15:25:44 aramis kernel: [4748380.913035] RSP: 002b:00007fae253fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 15:25:44 aramis kernel: [4748380.913037] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fae32850dd7
Oct 24 15:25:44 aramis kernel: [4748380.913038] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
Oct 24 15:25:44 aramis kernel: [4748380.913039] RBP: 00007fae26e6e000 R08: 000055fb22927350 R09: 000000000000ffff
Oct 24 15:25:44 aramis kernel: [4748380.913040] R10: 00007fae4b0a1000 R11: 0000000000000246 R12: 0000000000000000
Oct 24 15:25:44 aramis kernel: [4748380.913040] R13: 00007fae4b0a0000 R14: 0000000000000000 R15: 00007fae26e6e000
Oct 24 15:25:44 aramis kernel: [4748380.913045] INFO: task kvm:36475 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.913066]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.913083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.913106] kvm             D    0 36475      1 0x00000000
Oct 24 15:25:44 aramis kernel: [4748380.913108] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.913111]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.913113]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913120]  cv_wait_common+0x11e/0x140 [spl]
Oct 24 15:25:44 aramis kernel: [4748380.913122]  ? wait_woken+0x80/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913126]  __cv_wait+0x15/0x20 [spl]
Oct 24 15:25:44 aramis kernel: [4748380.913171]  zfs_range_lock+0x44d/0x5c0 [zfs]
Oct 24 15:25:44 aramis kernel: [4748380.913174]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
Oct 24 15:25:44 aramis kernel: [4748380.913204]  zvol_get_data+0x84/0x180 [zfs]
Oct 24 15:25:44 aramis kernel: [4748380.913233]  zil_commit.part.14+0x451/0x8b0 [zfs]
Oct 24 15:25:44 aramis kernel: [4748380.913261]  zil_commit+0x17/0x20 [zfs]
Oct 24 15:25:44 aramis kernel: [4748380.913288]  zvol_request+0xd0/0x300 [zfs]
Oct 24 15:25:44 aramis kernel: [4748380.913292]  generic_make_request+0x123/0x2f0
Oct 24 15:25:44 aramis kernel: [4748380.913294]  submit_bio+0x73/0x150
Oct 24 15:25:44 aramis kernel: [4748380.913295]  ? submit_bio+0x73/0x150
Oct 24 15:25:44 aramis kernel: [4748380.913299]  submit_bio_wait+0x59/0x90
Oct 24 15:25:44 aramis kernel: [4748380.913300]  blkdev_issue_flush+0x85/0xb0
Oct 24 15:25:44 aramis kernel: [4748380.913303]  blkdev_fsync+0x35/0x50
Oct 24 15:25:44 aramis kernel: [4748380.913305]  vfs_fsync_range+0x51/0xb0
Oct 24 15:25:44 aramis kernel: [4748380.913306]  do_fsync+0x3d/0x70
Oct 24 15:25:44 aramis kernel: [4748380.913308]  SyS_fdatasync+0x13/0x20
Oct 24 15:25:44 aramis kernel: [4748380.913310]  do_syscall_64+0x73/0x130
Oct 24 15:25:44 aramis kernel: [4748380.913311]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:25:44 aramis kernel: [4748380.913312] RIP: 0033:0x7f8d76fe160d
Oct 24 15:25:44 aramis kernel: [4748380.913313] RSP: 002b:00007f8cd13fc5f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Oct 24 15:25:44 aramis kernel: [4748380.913315] RAX: ffffffffffffffda RBX: 00000000fffffffb RCX: 00007f8d76fe160d
Oct 24 15:25:44 aramis kernel: [4748380.913315] RDX: 00007f8d6b1a31f0 RSI: 0000557005f832d0 RDI: 0000000000000016
Oct 24 15:25:44 aramis kernel: [4748380.913316] RBP: 00007f8ce4830440 R08: 0000000000000000 R09: 00007f8cd13ff700
Oct 24 15:25:44 aramis kernel: [4748380.913317] R10: 00007f8cd13fc620 R11: 0000000000000293 R12: 00007f8d6b1b4e40
Oct 24 15:25:44 aramis kernel: [4748380.913318] R13: 00007f8d6b1a3258 R14: 00007f8cd13ff700 R15: 00007f8ce4c0e290
Oct 24 15:25:44 aramis kernel: [4748380.913340] INFO: task kvm:29103 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.913360]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.913378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.913401] kvm             D    0 29103      1 0x00000000
Oct 24 15:25:44 aramis kernel: [4748380.913403] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.913405]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.913407]  ? get_swap_bio+0xcf/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913409]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913410]  io_schedule+0x16/0x40
Oct 24 15:25:44 aramis kernel: [4748380.913412]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:25:44 aramis kernel: [4748380.913414]  ? find_get_entry+0x1e/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913416]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:25:44 aramis kernel: [4748380.913418]  do_swap_page+0x5e5/0x9b0
Oct 24 15:25:44 aramis kernel: [4748380.913420]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:25:44 aramis kernel: [4748380.913422]  ? ttwu_do_activate+0x77/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913424]  handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913426]  __get_user_pages+0x11c/0x6c0
Oct 24 15:25:44 aramis kernel: [4748380.913429]  get_user_pages_unlocked+0x12e/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913442]  __gfn_to_pfn_memslot+0x30e/0x410 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913454]  try_async_pf+0x8d/0x1f0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913465]  tdp_page_fault+0x12d/0x290 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913476]  kvm_mmu_page_fault+0x62/0x160 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913480]  handle_ept_violation+0xad/0x140 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913483]  vmx_handle_exit+0xb5/0x1520 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913485]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913488]  ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913499]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913510]  ? kvm_arch_vcpu_load+0x4d/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913519]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913528]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913536]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913538]  do_vfs_ioctl+0xa6/0x620
Oct 24 15:25:44 aramis kernel: [4748380.913548]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913549]  SyS_ioctl+0x79/0x90
Oct 24 15:25:44 aramis kernel: [4748380.913551]  ? exit_to_usermode_loop+0xa5/0xd0
Oct 24 15:25:44 aramis kernel: [4748380.913553]  do_syscall_64+0x73/0x130
Oct 24 15:25:44 aramis kernel: [4748380.913555]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:25:44 aramis kernel: [4748380.913555] RIP: 0033:0x7f5c5c59add7
Oct 24 15:25:44 aramis kernel: [4748380.913556] RSP: 002b:00007f5c4fffb538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 15:25:44 aramis kernel: [4748380.913558] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f5c5c59add7
Oct 24 15:25:44 aramis kernel: [4748380.913558] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
Oct 24 15:25:44 aramis kernel: [4748380.913559] RBP: 00007f5c509ec000 R08: 0000560726687350 R09: 000000000000ffff
Oct 24 15:25:44 aramis kernel: [4748380.913560] R10: 00007f5c74ed4000 R11: 0000000000000246 R12: 0000000000000000
Oct 24 15:25:44 aramis kernel: [4748380.913561] R13: 00007f5c74ed3000 R14: 0000000000000000 R15: 00007f5c509ec000
Oct 24 15:25:44 aramis kernel: [4748380.913563] INFO: task kvm:29104 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.913582]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.913600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.913623] kvm             D    0 29104      1 0x00000002
Oct 24 15:25:44 aramis kernel: [4748380.913625] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.913628]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.913629]  ? get_swap_bio+0xcf/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913631]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913632]  io_schedule+0x16/0x40
Oct 24 15:25:44 aramis kernel: [4748380.913634]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:25:44 aramis kernel: [4748380.913636]  ? find_get_entry+0x1e/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913637]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:25:44 aramis kernel: [4748380.913639]  do_swap_page+0x5e5/0x9b0
Oct 24 15:25:44 aramis kernel: [4748380.913640]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:25:44 aramis kernel: [4748380.913642]  handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913644]  __get_user_pages+0x11c/0x6c0
Oct 24 15:25:44 aramis kernel: [4748380.913647]  get_user_pages_unlocked+0x12e/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913655]  __gfn_to_pfn_memslot+0x30e/0x410 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913667]  try_async_pf+0x8d/0x1f0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913677]  tdp_page_fault+0x12d/0x290 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913687]  kvm_mmu_page_fault+0x62/0x160 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913690]  handle_ept_violation+0xad/0x140 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913693]  vmx_handle_exit+0xb5/0x1520 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913696]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913699]  ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913709]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913719]  ? kvm_arch_vcpu_load+0x4d/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913728]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913737]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913744]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913746]  do_vfs_ioctl+0xa6/0x620
Oct 24 15:25:44 aramis kernel: [4748380.913749]  ? _cond_resched+0x1a/0x50
Oct 24 15:25:44 aramis kernel: [4748380.913758]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913760]  SyS_ioctl+0x79/0x90
Oct 24 15:25:44 aramis kernel: [4748380.913762]  do_syscall_64+0x73/0x130
Oct 24 15:25:44 aramis kernel: [4748380.913763]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:25:44 aramis kernel: [4748380.913764] RIP: 0033:0x7f5c5c59add7
Oct 24 15:25:44 aramis kernel: [4748380.913765] RSP: 002b:00007f5c4effc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 15:25:44 aramis kernel: [4748380.913766] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f5c5c59add7
Oct 24 15:25:44 aramis kernel: [4748380.913767] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
Oct 24 15:25:44 aramis kernel: [4748380.913768] RBP: 00007f5c50a59000 R08: 0000560726687350 R09: 00000000000000ff
Oct 24 15:25:44 aramis kernel: [4748380.913769] R10: 00000000000fe401 R11: 0000000000000246 R12: 0000000000000000
Oct 24 15:25:44 aramis kernel: [4748380.913770] R13: 00007f5c74dea000 R14: 0000000000000000 R15: 00007f5c50a59000
Oct 24 15:25:44 aramis kernel: [4748380.913779] INFO: task kvm:20205 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.913798]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.913816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.913840] kvm             D    0 20205      1 0x00000000
Oct 24 15:25:44 aramis kernel: [4748380.913841] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.913844]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.913845]  ? get_swap_bio+0xcf/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913847]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.913848]  io_schedule+0x16/0x40
Oct 24 15:25:44 aramis kernel: [4748380.913850]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:25:44 aramis kernel: [4748380.913852]  ? find_get_entry+0x1e/0x100
Oct 24 15:25:44 aramis kernel: [4748380.913853]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:25:44 aramis kernel: [4748380.913854]  do_swap_page+0x5e5/0x9b0
Oct 24 15:25:44 aramis kernel: [4748380.913856]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:25:44 aramis kernel: [4748380.913865]  ? gfn_to_hva_memslot_prot+0x1b/0x40 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913867]  handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913869]  __get_user_pages+0x11c/0x6c0
Oct 24 15:25:44 aramis kernel: [4748380.913872]  get_user_pages_unlocked+0x12e/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913880]  __gfn_to_pfn_memslot+0x30e/0x410 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913891]  try_async_pf+0x8d/0x1f0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913901]  tdp_page_fault+0x12d/0x290 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913911]  kvm_mmu_page_fault+0x62/0x160 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913914]  handle_ept_violation+0xad/0x140 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913917]  vmx_handle_exit+0xb5/0x1520 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913919]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913922]  ? vmx_vcpu_run+0x418/0x5e0 [kvm_intel]
Oct 24 15:25:44 aramis kernel: [4748380.913932]  kvm_arch_vcpu_ioctl_run+0x935/0x16c0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913943]  ? kvm_arch_vcpu_load+0x4d/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913952]  ? kvm_arch_vcpu_load+0x68/0x250 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913960]  kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913968]  ? kvm_vcpu_ioctl+0x339/0x620 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913970]  do_vfs_ioctl+0xa6/0x620
Oct 24 15:25:44 aramis kernel: [4748380.913971]  ? handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.913981]  ? kvm_on_user_return+0x70/0xa0 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.913982]  SyS_ioctl+0x79/0x90
Oct 24 15:25:44 aramis kernel: [4748380.913984]  ? exit_to_usermode_loop+0xa5/0xd0
Oct 24 15:25:44 aramis kernel: [4748380.913986]  do_syscall_64+0x73/0x130
Oct 24 15:25:44 aramis kernel: [4748380.913987]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 24 15:25:44 aramis kernel: [4748380.913988] RIP: 0033:0x7f9c688c9dd7
Oct 24 15:25:44 aramis kernel: [4748380.913989] RSP: 002b:00007f9c5c3fb538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 15:25:44 aramis kernel: [4748380.913991] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f9c688c9dd7
Oct 24 15:25:44 aramis kernel: [4748380.913991] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
Oct 24 15:25:44 aramis kernel: [4748380.913992] RBP: 00007f9c5cdec000 R08: 0000558fdc390350 R09: 000000000000ffff
Oct 24 15:25:44 aramis kernel: [4748380.913993] R10: 00007f9c81203000 R11: 0000000000000246 R12: 0000000000000000
Oct 24 15:25:44 aramis kernel: [4748380.913994] R13: 00007f9c81202000 R14: 0000000000000000 R15: 00007f9c5cdec000
Oct 24 15:25:44 aramis kernel: [4748380.914006] INFO: task kworker/12:1:12376 blocked for more than 120 seconds.
Oct 24 15:25:44 aramis kernel: [4748380.914027]       Tainted: P           O     4.15.18-3-pve #1
Oct 24 15:25:44 aramis kernel: [4748380.914044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 15:25:44 aramis kernel: [4748380.914068] kworker/12:1    D    0 12376      2 0x80000000
Oct 24 15:25:44 aramis kernel: [4748380.914080] Workqueue: events async_pf_execute [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.914081] Call Trace:
Oct 24 15:25:44 aramis kernel: [4748380.914084]  __schedule+0x3e0/0x870
Oct 24 15:25:44 aramis kernel: [4748380.914086]  schedule+0x36/0x80
Oct 24 15:25:44 aramis kernel: [4748380.914087]  io_schedule+0x16/0x40
Oct 24 15:25:44 aramis kernel: [4748380.914089]  __lock_page_or_retry+0x2ce/0x2e0
Oct 24 15:25:44 aramis kernel: [4748380.914090]  ? find_get_entry+0x1e/0x100
Oct 24 15:25:44 aramis kernel: [4748380.914092]  ? page_cache_tree_insert+0xe0/0xe0
Oct 24 15:25:44 aramis kernel: [4748380.914093]  do_swap_page+0x5e5/0x9b0
Oct 24 15:25:44 aramis kernel: [4748380.914095]  __handle_mm_fault+0x88d/0x11e0
Oct 24 15:25:44 aramis kernel: [4748380.914097]  handle_mm_fault+0xce/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.914099]  __get_user_pages+0x11c/0x6c0
Oct 24 15:25:44 aramis kernel: [4748380.914101]  ? __switch_to_asm+0x34/0x70
Oct 24 15:25:44 aramis kernel: [4748380.914102]  ? __switch_to_asm+0x40/0x70
Oct 24 15:25:44 aramis kernel: [4748380.914105]  get_user_pages_remote+0x126/0x1b0
Oct 24 15:25:44 aramis kernel: [4748380.914113]  async_pf_execute+0x7a/0x190 [kvm]
Oct 24 15:25:44 aramis kernel: [4748380.914116]  process_one_work+0x1e0/0x400
Oct 24 15:25:44 aramis kernel: [4748380.914118]  worker_thread+0x4b/0x420
Oct 24 15:25:44 aramis kernel: [4748380.914120]  kthread+0x105/0x140
Oct 24 15:25:44 aramis kernel: [4748380.914122]  ? process_one_work+0x400/0x400
Oct 24 15:25:44 aramis kernel: [4748380.914123]  ? kthread_create_worker_on_cpu+0x70/0x70
Oct 24 15:25:44 aramis kernel: [4748380.914125]  ret_from_fork+0x35/0x40

gkovacs · Oct 25, 2018

saphirblanc said:
We are facing the same issue when we try to restore a backup, even if we limit the read speed. IO are so high that all VMs are getting stuck and are showing up the following message :

Message from syslogd@XXX at Oct 24 15:28:26 ...
kernel:[2226496.055048] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-cgroups:1464]

Yes, this problem is still massively affecting many Proxmox users, unfortunately no one has any idea what could be causing it. Actually the Proxmox developers didn't even acknowledge so far that this bug exists (my bugreport is still in NEW status), so I wouldn't hold my breath that this gets fixed anytime soon.

You should add your experience to the bugreport, maybe someone will finally take it seriously and check it out:
https://bugzilla.proxmox.com/show_bug.cgi?id=1453

Also there are a few steps though you can take that would help lessen the impact:

1. Put this in your /etc/sysctl.conf

Code:

vm.swappiness=1
vm.min_free_kbytes=524288
vm.dirty_ratio=2
vm.dirty_background_ratio=1

With these settings the IO blocking happens much later, and for much shorter periods.

2. If you are using an all HDD ZFS pool, add an SSD as a log (ZIL) and cache (L2ARC) to your pool. I recommend using an NVMe SSD, but a SATA will suffice as well. On a 250GB SSD, 10-20GB is more than enough for log, 100-150GB for cache, and leave 30-50% unpartitioned to keep the write speed high. Put these in your /etc/modprobe.d/zfs.conf

Code:

options zfs l2arc_headroom=4
options zfs l2arc_headroom_boost=8
options zfs l2arc_feed_min_ms=10
options zfs l2arc_write_max=134217728
options zfs l2arc_write_boost=268435456
options zfs l2arc_noprefetch=1

3. Put swap on the (NVMe) SSD as well if you can, and remove it from your all HDD rpool.

KVM guests freeze (hung tasks) during backup/restore/migrate

Distinguished Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Distinguished Member

New Member

Renowned Member

New Member

Famous Member

New Member

Proxmox Staff Member

Famous Member

Proxmox Staff Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

We value your privacy