Proxmox V6 Servers freeze, Zvol blocked for more than 120s

Update from GitHub:
behlendorf commented 5 hours ago
Thanks for verifying the proposed workaround. For those interested, this is caused by an interaction between the scrub, which attempted to reopen the devices, and the compatibility code required for the zfs.ko to set the block device's scheduler. Setting zfs_vdev_scheduler=none prevents the issue by instructing the zfs.ko to never modify the scheduler.

I've opened #9317 which proposes to remove the zfs_vdev_scheduler module option and delegate setting the scheduler as needed entirely to user space. Additional information is provided in the PR and discussion is welcome. This change would apply to the next major release, we should be able to do something less disruptive for the current release.

Here you can track zfsonlinux's progress on fixing this issue:
https://github.com/zfsonlinux/zfs/pull/9317
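For anyone who wants to apply the workaround right away, a minimal sketch (assuming a standard Proxmox/ZFS-on-Linux setup; /etc/modprobe.d/zfs.conf is the usual location, adjust as needed):

Code:
# set the parameter on the running module (only affects vdevs opened after the change)
echo none > /sys/module/zfs/parameters/zfs_vdev_scheduler

# make it persistent across reboots
echo "options zfs zfs_vdev_scheduler=none" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k all   # needed when zfs.ko is loaded from the initramfs (e.g. root on ZFS)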
 
Nice that a workaround has been found and a solution is being implemented in zfsonlinux.

I've done some further testing now.
  • Another kernel.log written remotely by rsyslog; however, I entered echo t > /proc/sysrq-trigger only once, right BEFORE triggering zpool scrub rpool (the capture sequence is sketched after this list): Attachment kernel-no-sysrq-during-problem-but-before.zip
    This is my only log that includes any output saying INFO task ... blocked for more than 120 seconds.
    kernel-no-sysrq-during-problem-but-before.png
    • I am starting the scrub right before the message [132667.894204] [UFW BLOCK] appears.
      kernel-no-sysrq-during-problem-but-before-screen2.png

      kernel-no-sysrq-during-problem-but-before-screen3.png
    • The error messages occur at this timestamp:
      [132921.452804] INFO: task txg_sync:566 blocked for more than 120 seconds.
      for the first time, i.e. about 4 minutes 14 seconds after starting the scrub.
      kernel-no-sysrq-during-problem-but-before-screen4.png
    • The load then increases steadily, and the system no longer reacts to a reboot or anything else
    • Unfortunately, I had entered zpool scrub rpool in the wrong terminal, so I couldn't trigger an echo t > /proc/sysrq-trigger afterwards.

  • Another kernel.log written remotely by rsyslog, where I entered echo t > /proc/sysrq-trigger 34 times (many of them during the hang): Attachment kernel-problem-incl-sysrq-during.zip (warning! 35 MB unzipped)

  • This is the kernel.log (not written remotely) from the first time zpool scrub rpool did not hang. Since then, scrub no longer hangs. I haven't even tried your workaround yet, but I can't trigger the bug anymore: kern-manyLogs-but-no-problem-first-time--not-remote.zip (warning! 17 MB unzipped) I entered echo t > /proc/sysrq-trigger about 7 times during the run.

  • Here is my updated zpool information including history: Attachment zpool_scrubbed_success.txt
It's interesting that I can't trigger the bug anymore, although I haven't tried out the workaround yet.
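For reference, this is roughly the capture sequence I used (a sketch; the remote rsyslog forwarding itself was already configured, and rpool is simply my pool name):

Code:
# dump all task states to the kernel log once, BEFORE starting the scrub
echo t > /proc/sysrq-trigger

# start the scrub that (sometimes) triggers the hang
zpool scrub rpool

# watch for the hung-task messages
dmesg -w | grep -i "blocked for more than"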
 

Attachments

  • kernel-problem-incl-sysrq-during.zip
    800.8 KB
  • kernel-no-sysrq-during-problem-but-before.zip
    64.9 KB
  • kern-manyLogs-but-no-problem-first-time--not-remote.zip
    787.9 KB
  • zpool_scrubbed_success.txt
    16.7 KB
When can we expect the patch in the Proxmox repo? Or is it there already?

I am not familiar with the process, but as far as I can see:

 
Is this issue fixed yet? I'm having the same problem on Proxmox VE 6.0.4 (ISO installer), where the kernel freezes when accessing a zvol on a secondary zpool on SSD disks.
 
What's your zfs version?

ZFS 0.8.1, the one that came with the ISO installer.

I use a secondary mirrored zpool (RAID1) on 1 TB enterprise SSD disks, and I could perform a zpool scrub on the pool just fine. It's only when it's idling around 2-3 AM that the kernel and the ZFS zvol froze.

pve 6.0 kernel zfs freezes 2019-11-08_22-15-44.png

Code:
# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
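As an aside, the loaded ZFS version can also be checked directly, independent of pveversion (a quick sketch):

Code:
cat /sys/module/zfs/version   # version of the loaded kernel module
zfs version                   # version of the userspace tools (available since 0.8.0)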
 
ZFS 0.8.1, the one that came with the ISO installer.

Code:
# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
...
zfsutils-linux: 0.8.1-pve1

See my post https://forum.proxmox.com/threads/p...d-for-more-than-120s.57765/page-3#post-271946

ZFS 0.8.2 is the first version to include the bugfix.

5.0.21-7 is the first PVE kernel version that ships ZFS 0.8.2.

You should update your Proxmox installation first. Your ISO does not include the current updates.
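A minimal sketch of the upgrade, assuming the repositories are already set up correctly:

Code:
apt update
apt full-upgrade              # pulls in pve-kernel 5.0.21-7 or newer and zfsutils-linux 0.8.2+
reboot                        # required so the new kernel and the new zfs.ko are actually running

# afterwards, verify:
uname -r
cat /sys/module/zfs/version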
 
See my post https://forum.proxmox.com/threads/p...d-for-more-than-120s.57765/page-3#post-271946

ZFS 0.8.2 is the first version to include the bugfix.

5.0.21-7 is the first PVE kernel version that ships ZFS 0.8.2.

You should update your Proxmox installation first. Your ISO does not include the current updates.

Thanks for the link. Did it solve your issue? And does the newer version introduce more bugs? If it's stable enough, I'm willing to give it a try next Monday.

I also just disabled the Xeon E5 v4 CPU's C6 C-state in the BIOS. Let's see over the weekend whether that helps or not.
 
I believe it hasn't been fixed, even after applying all Proxmox updates. The system crashed at 00:38 on the second Sunday of the month, same as last month. This was the screen we had at that time (see the attached screenshot).
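Side note: the "second Sunday of the month" timing matches the monthly scrub that the stock Debian/Proxmox zfsutils cron job schedules, which would fit the scrub-triggered hang discussed earlier in this thread. A quick way to check (a sketch, assuming the default cron file is still present):

Code:
cat /etc/cron.d/zfsutils-linux   # the default job scrubs all pools on the second Sunday of the month
zpool status | grep -i scan      # shows when the last scrub ran or whether one is in progress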
 

Attachments

  • error_proxmox.PNG
    174.1 KB
I believe it hasn't been fixed, even after applying all Proxmox updates. The system crashed at 00:38 on the second Sunday of the month, same as last month. This was the screen we had at that time (see the attached screenshot).

Is there any chance you are using writeback cache for the VM disks on the ZFS zvol local storage?

And have you tried disabling the C6 CPU state and enabling only C0/C1/C1E in the BIOS?
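For what it's worth, the allowed C-states can also be limited from the OS instead of the BIOS; a sketch for Intel CPUs (the exact values are only examples):

Code:
# show which idle states the CPUs currently expose
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

# limit C-states via the kernel command line, e.g. in /etc/default/grub (GRUB-booted systems):
#   GRUB_CMDLINE_LINUX_DEFAULT="... intel_idle.max_cstate=1 processor.max_cstate=1"
update-grub && reboot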
 
Having the same issue with the latest Proxmox version.
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-11
pve-kernel-5.0: 6.0-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-7
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

txg_sync hung during heavy IO, in my case while benchmarking 2 pools using fio at the same time.
Code:
[  968.292727] INFO: task txg_sync:2130 blocked for more than 120 seconds.
[  968.292762]        Tainted: P           O      5.0.21-4-pve #1
[  968.292784] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  968.292815] txg_sync        D    0  2130      2 0x80000000
[  968.292837] Call Trace:
[  968.292855]  __schedule+0x2d4/0x870
[  968.292872]  schedule+0x2c/0x70
[  968.292887]  schedule_timeout+0x157/0x360
[  968.292906]  ? __next_timer_interrupt+0xd0/0xd0
[  968.292925]  io_schedule_timeout+0x1e/0x50
[  968.292946]  __cv_timedwait_common+0x12f/0x170 [spl]
[  968.292968]  ? wait_woken+0x80/0x80
[  968.292985]  __cv_timedwait_io+0x19/0x20 [spl]
[  968.293049]  zio_wait+0x13a/0x280 [zfs]
[  968.293067]  ? _cond_resched+0x19/0x30
[  968.293115]  dsl_pool_sync+0xdc/0x4f0 [zfs]
[  968.293168]  spa_sync+0x5b2/0xfc0 [zfs]
[  968.293221]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  968.293278]  txg_sync_thread+0x2d9/0x4c0 [zfs]
[  968.293331]  ? txg_thread_exit.isra.11+0x60/0x60 [zfs]
[  968.293355]  thread_generic_wrapper+0x74/0x90 [spl]
[  968.293376]  kthread+0x120/0x140
[  968.293393]  ? __thread_exit+0x20/0x20 [spl]
[  968.293410]  __kthread_parkme+0x70/0x70
[  968.293428]  ret_from_fork+0x35/0x40

I already tried setting these ZFS parameters, but with no effect:
Code:
options zfs zfs_arc_max=25769803776
options zfs zfetch_max_distance=268435456
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
options zfs zfs_vdev_scheduler=none
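In case it matters, the values the running module actually uses can be checked like this (a sketch):

Code:
for p in zfs_arc_max zfetch_max_distance zfs_prefetch_disable zfs_nocacheflush zfs_vdev_scheduler; do
    echo -n "$p = "; cat /sys/module/zfs/parameters/$p
done
# note: changes in /etc/modprobe.d only take effect after update-initramfs -u and a reboot
# when the module is loaded from the initramfs (e.g. root on ZFS)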
 
I believe it hasn't been fixed, even after applying all Proxmox updates. The system crashed at 00:38 on the second Sunday of the month, same as last month. This was the screen we had at that time (see the attached screenshot).

that screenshot shows that you are running an outdated kernel though?
 
Having the same issue with the latest Proxmox version.
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
...
zfsutils-linux: 0.8.2-pve2

txg_sync hung during heavy IO, in my case while benchmarking 2 pools using fio at the same time.
Code:
[  968.292727] INFO: task txg_sync:2130 blocked for more than 120 seconds.
...
[  968.293428]  ret_from_fork+0x35/0x40

I already tried setting these ZFS parameters, but with no effect:
Code:
options zfs zfs_arc_max=25769803776
...
options zfs zfs_vdev_scheduler=none

that is not a panic/deadlock, but the system telling you that your storage is way undersized for the kind of load you throw at it (while benchmarking).
 
that is not a panic/deadlock, but the system telling you that your storage is way undersized for the kind of load you throw at it (while benchmarking).
I'm using ZFS for the root pool, and it's indeed a deadlock. The console is completely frozen; IPMI and softdog triggered after about 10 seconds.
Then I disabled the watchdog and redid the test; this time the system froze for more than one hour before I shut it down.
 
I'm using ZFS for the root pool, and it's indeed a deadlock. The console is completely frozen; IPMI and softdog triggered after about 10 seconds.
Then I disabled the watchdog and redid the test; this time the system froze for more than one hour before I shut it down.

that still doesn't say it's a deadlock. if you overload your system, it can take ages to do the work you requested and recover fully. a deadlock means it makes no progress and never recovers. in any case it's an altogether different issue than this one, so please open a new thread!
 
that still doesn't say it's a deadlock. if you overload your system, it can take ages to do the work you requested and recover fully. a deadlock means it makes no progress and never recovers. in any case it's an altogether different issue than this one, so please open a new thread!

So how do you spot a deadlock?

Is the zvol kernel call trace shown in the post above at https://forum.proxmox.com/threads/p...d-for-more-than-120s.57765/page-3#post-276627 a deadlock?

For your information, after I set the VM disk cache mode to the default (no cache) for the disks on the ZFS zvol, disabled the C6 CPU state, and left only C0/C1E enabled, the Proxmox node has been stable for the last 5 days.
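For reference, the cache mode can be inspected and changed from the CLI as well; a sketch (VM ID 100, disk scsi0, and storage local-zfs are only examples):

Code:
qm config 100 | grep scsi0                              # no cache= option means the default (no cache)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none   # set it explicitly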
 
Hi,

My 2 cents: as a long-time ZFS user, I do not dare to upgrade ZFS from an older version (0.7.x) to the current version (0.8.x) on any production server until it has been out for at least 6 months (or even more).
 
So how do you spot a deadlock?

you check the call stack for hints, ideally in combination with ZFS debugging output. sometimes it's quite obvious that it's just a general overload (e.g., tasks are just waiting for I/O to complete), sometimes it's quite likely that it's a deadlock (e.g., multiple tasks are waiting for locks). in many cases it's not trivial to determine, and you need a reproducer+debug build or careful manual analysis of the code that can produce such call stacks.

Is the zvol kernel call trace shown in the post above at https://forum.proxmox.com/threads/p...d-for-more-than-120s.57765/page-3#post-276627 a deadlock?

For your information, after I set the VM disk cache mode to the default (no cache) for the disks on the ZFS zvol, disabled the C6 CPU state, and left only C0/C1E enabled, the Proxmox node has been stable for the last 5 days.

see above - that's not possible to tell just from that trace. if you can easily reproduce it on a current ZFS+kernel version, a debug-build of the modules might help in further analysis.
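A minimal sketch of that kind of evidence gathering, using the standard sysrq and ZFS debug interfaces (interpreting the output still requires the manual analysis described above):

Code:
# dump the stacks of all blocked (uninterruptible) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | less

# enable and read the ZFS internal debug log
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg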
 
Happened to me today: the /proc/sys/kernel/hung_task_timeout_secs message.

Proxmox 6.0-7 - 5.0.21-2-pve #1 SMP PVE 5.0.21-3 (Thu, 05 Sep 2019 13:56:01 +0200) x86_64 GNU/Linux

This server has been running for about 150 days.

The SSH / Proxmox interface became unresponsive, so I logged in on the console, and after I ran "df -h" the hung_task_timeout_secs message appeared and the console became unresponsive too.

ZFS RAID1 for the system
LVM-thin for the VMs
NFS mount for backups

I noticed that the automatic backup to NFS hasn't worked for about 2 days (so the problem may have started on that date).

The VMs are running fine (one of them is running a critical SQL server). Is the only solution to reboot the Proxmox server?

What is the solution for this? Updating Proxmox to the latest version 6 release?
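Related to the NFS part: with an unreachable NFS server, anything that stats the mount (like df -h) will hang, which looks very similar to this. A sketch for probing that without freezing the shell (the mount point /mnt/pve/backup is only an example):

Code:
grep nfs /proc/mounts                   # list NFS mounts without touching them (unlike df)
timeout 5 stat -t /mnt/pve/backup || echo "NFS mount is not responding"
umount -l /mnt/pve/backup               # a lazy unmount may free hung processes without a reboot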
 

Attachments

  • Captura de tela de 2020-03-05 10-53-35.png
    203.6 KB
  • Captura de tela de 2020-03-05 10-53-09.png
    206.7 KB
