Proxmox VE 6 servers freeze, zvol blocked for more than 120s

Okay, so I managed to move things around a little earlier than expected, so here is the new information from me:

- I stopped the only VM on this machine (my storage VM providing backup space for the Proxmox cluster), so no workloads are running on the node
- I ran "echo 1 > /proc/sys/kernel/sysrq" and "echo t > /proc/sysrq-trigger" with remote syslog enabled

I ran one "echo t" initially, then started the zpool scrub, then managed to run "echo t" several more times before everything started falling apart on the box. At that point I had to power cycle it; it returns to service just fine and shows nothing about the zpool scrub at all, so I guess the scrub never really starts. Trying to run any process on the box that requires disk I/O just stalls and never returns. Anything already running in memory that doesn't need disk I/O seems to keep running just fine.
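For reference, these were the exact commands (the "t" trigger writes the state and stack trace of every task to the kernel log, which the remote syslog then captured):

Code:
# enable all sysrq functions
echo 1 > /proc/sys/kernel/sysrq
# dump the state and stack trace of all tasks to the kernel log
echo t > /proc/sysrq-trigger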

The log is attached - I annotated each section with ">>>" to show where the sysrq-triggers were fired, so hopefully this provides some useful info.

The last "section" of the log shows processes getting stuck because of the lack of I/O, I think. It's almost exactly as @sandor mentioned: it's like the box lost all its disks.

Edit: I forgot to say, this server is fully up to date (with a community subscription).
 

I got the following answer on GitHub from Brian:

Thank you for posting the logs. I believe I see what's causing the reported hangs. As a workaround you can try setting zfs_vdev_scheduler=none. This should avoid the problem; it looks like 4.12 and newer kernels might encounter this.

So we should try setting zfs_vdev_scheduler=none and check whether it works or not.
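For a quick check before making the change persistent, the parameter can be read at runtime, and depending on the ZFS build it may also be writable via sysfs; note that a runtime write likely only affects vdevs opened after the change (e.g. on the next pool import), so the /etc/modprobe.d route described a few posts below is the more reliable test:

Code:
# read the current vdev scheduler setting (noop is the default reported in this thread)
cat /sys/module/zfs/parameters/zfs_vdev_scheduler
# runtime change (assumption: writable on your ZFS build; may not affect already-imported pools)
echo none > /sys/module/zfs/parameters/zfs_vdev_scheduler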
 
Same problem here


Code:
pve:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:11 with 0 errors on Sun Sep  8 00:25:12 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdd3    ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 01:19:34 PM CEST)
        sde3    ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 01:19:34 PM CEST)

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:31:14 with 0 errors on Sun Sep  8 00:55:16 2019
config:

    NAME                        STATE     READ WRITE CKSUM
    storage                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        wwn-0x55cd2e415082e975  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:16 PM CEST)
        wwn-0x55cd2e415064c6dc  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:16 PM CEST)
      mirror-1                  ONLINE       0     0     0
        wwn-0x55cd2e41508346de  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:21 PM CEST)
        wwn-0x55cd2e415083470a  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:21 PM CEST)
      mirror-2                  ONLINE       0     0     0
        wwn-0x55cd2e415064c6e7  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:35 PM CEST)
        wwn-0x55cd2e4150831023  ONLINE       0     0     0  (100% trimmed, completed at Sat 20 Jul 2019 02:04:35 PM CEST)

errors: No known data errors
 


I think I'm in the same boat.

Here is my post and diagnostics: https://forum.proxmox.com/threads/proxmox-ve-6-0-released.56001/post-258777

I tried this https://forum.proxmox.com/threads/proxmox-ve-6-0-released.56001/post-259157 , but it didn't help; the server still crashes when scrubbing after some uptime.

try setting zfs_vdev_scheduler=none and check whether it works or not.

What is the right way to set zfs_vdev_scheduler=none?
Add the line "zfs_vdev_scheduler=none" to "/etc/modprobe.d/zfs.conf" and then run "pve-efiboot-tool refresh"?
 
@vanes thanks for the question, you are right.

Steps (a command-only summary follows this list):
  • Check the current setting: # cat /sys/module/zfs/parameters/zfs_vdev_scheduler
    It should be: noop
  • Open /etc/modprobe.d/zfs.conf with the nano editor (create it if it does not exist): # nano /etc/modprobe.d/zfs.conf
  • Add or modify the options line so that it contains the scheduler; in my case the final content is: options zfs zfs_prefetch_disable=1 zfs_vdev_scheduler=none
  • Save it
  • Update the initramfs of the running kernel: # update-initramfs -u -k `uname -r`
  • Reboot
  • Check the setting again: # cat /sys/module/zfs/parameters/zfs_vdev_scheduler
    It should now be: none
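For convenience, the same procedure as plain commands (a sketch; the append assumes /etc/modprobe.d/zfs.conf does not already contain an "options zfs" line - if it does, edit that line instead of appending a second one):

Code:
cat /sys/module/zfs/parameters/zfs_vdev_scheduler      # should print: noop
echo "options zfs zfs_vdev_scheduler=none" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k "$(uname -r)"
reboot
# after the reboot:
cat /sys/module/zfs/parameters/zfs_vdev_scheduler      # should print: none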
 
Initially, this looks like it does indeed fix the issue... I just managed to start a scrub successfully with zfs_vdev_scheduler set to "none":

Code:
root@proxmox3:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Wed Sep 11 13:28:05 2019
        112G scanned at 1.94G/s, 13.9G issued at 245M/s, 6.95T total
        0B repaired, 0.20% done, 0 days 08:14:31 to go

I guess it'll be tomorrow by the time this scrub finishes, but I'll let it run and see how things go! I will report back any problems and will also report back if this completes successfully :)
 
Awesome, I'll give mine a go as well this evening.
The server has both an SSD pool and a spinning-disk pool, so we'll see how that goes.
 
@pongraczi : Thanks for taking the reports upstream and communicating back! Much appreciated!

@everyone:
Thanks for trying the suggested mitigations and reporting back!
 
Could I ask those for whom setting options zfs zfs_vdev_scheduler=none helped
to share:
* zpool status (pool setup)
* cat /sys/block/$dev/queue/scheduler for each device that is part of the zpool ($dev is something like sdb or nvme0n1) - a quick loop for this is sketched below
(if possible, also the same outputs without the module parameter set)
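Something like this loop should collect all of that in one go (a sketch; adjust the device list to the members of your pool, sda through sdd are just examples):

Code:
zpool status
for dev in sda sdb sdc sdd; do
    echo -n "$dev: "
    cat /sys/block/$dev/queue/scheduler
done
cat /sys/module/zfs/parameters/zfs_vdev_scheduler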

Thanks!
 
@Stoiko Ivanov here's the output from my server as requested - and you are right, it is definitely not all users, since I have 3 other boxes with identical versions of Proxmox (all fully updated) that can scrub without any problems. They each only have 2 disks in a single mirror zpool though, so that is the main difference for me between the "working" and "not working" configurations. No problem at all for me to try things or pull debug info, just let me know what is useful and I will share!

Code:
root@proxmox3:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Wed Sep 11 13:28:05 2019
        5.21T scanned at 343M/s, 5.09T issued at 335M/s, 6.95T total
        0B repaired, 73.30% done, 0 days 01:36:39 to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

root@proxmox3:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
none

root@proxmox3:~# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
root@proxmox3:~# cat /sys/block/sdb/queue/scheduler
[mq-deadline] none
root@proxmox3:~# cat /sys/block/sdc/queue/scheduler
[mq-deadline] none
root@proxmox3:~# cat /sys/block/sdd/queue/scheduler
[mq-deadline] none
 
I have 3 other boxes with identical versions of Proxmox (all fully updated) that can scrub without any problems
What's the difference between those boxes (hardware-wise: which disks, which controllers, how much RAM)?

The contents of /sys/block/sda/queue/scheduler when the module parameter /sys/module/zfs/parameters/zfs_vdev_scheduler is not set would also help!

Thanks!
 
This is my before/after from adding "options zfs zfs_vdev_scheduler=none" to "/etc/modprobe.d/zfs.conf":

Before
Code:
root@C236:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
noop

After
Code:
root@C236:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
none

Before/after (no difference)
Code:
root@C236:~# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
root@C236:~# cat /sys/block/sdb/queue/scheduler
[mq-deadline] none
root@C236:~# cat /sys/block/sdc/queue/scheduler
[mq-deadline] none
root@C236:~# cat /sys/block/sdd/queue/scheduler
[mq-deadline] none

After 5 hours of uptime I successfully ran a scrub on one of my two nodes.
Code:
root@C236:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Wed Sep 11 19:38:00 2019
        520G scanned at 360M/s, 265G issued at 183M/s, 520G total
        0B repaired, 50.90% done, 0 days 00:23:47 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-WDC_WD10EZEX-75WN4A0_WD-WCC6Y5XA8A52-part3  ONLINE       0     0     0
            ata-WDC_WD10EZEX-75WN4A0_WD-WCC6Y6ZZTFS0-part3  ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            ata-WDC_WD10EZEX-08WN4A0_WD-WCC6Y7KN097F        ONLINE       0     0     0
            ata-WDC_WD10EZEX-08WN4A0_WD-WCC6Y5KAY0R8        ONLINE       0     0     0

errors: No known data errors

Let's see what happens next; I will update this post.
 
My scrub just completed:

Code:
root@proxmox3:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 06:21:57 with 0 errors on Wed Sep 11 19:50:02 2019

So, I think we definitely found a good workaround here. However, I am wondering what the performance impact of this change is - anecdotally the I/O feels a little slower than usual now, but I'd rather have slightly slower I/O than lockups. Let's hope the ZFS guys find a fix soon :)
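As a rough sanity check on that performance question (not a proper benchmark), watching per-vdev throughput under a typical workload with and without the setting should show whether anything changed dramatically; rpool is just my pool name here:

Code:
# print per-vdev bandwidth and IOPS every 5 seconds; compare runs with zfs_vdev_scheduler=none vs. the default
zpool iostat -v rpool 5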
 
On this system I did not experience this kind of issue. This one has hard disks in the pool and some SSDs for log/ZIL.
I will check my servers where the "before" situation still exists.

This is the after:
Bash:
root@lm3:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub canceled on Sun Sep  8 15:59:56 2019
config:

    NAME                                           STATE     READ WRITE CKSUM
    rpool                                          ONLINE       0     0     0
      raidz2-0                                     ONLINE       0     0     0
        wwn-0x5000c500ac94f284-part2               ONLINE       0     0     0
        wwn-0x5000c500acf603c0-part2               ONLINE       0     0     0
        wwn-0x5000c500acf609b8-part2               ONLINE       0     0     0
        wwn-0x5000c500acf6079b-part2               ONLINE       0     0     0
        wwn-0x5000c500acf609c4-part2               ONLINE       0     0     0
        wwn-0x5000c500ac0a722d-part2               ONLINE       0     0     0
    logs   
      mirror-1                                     ONLINE       0     0     0
        ata-SanDisk_SDSSDP064G_143094401488-part1  ONLINE       0     0     0
        ata-SanDisk_SDSSDP064G_143906402718-part1  ONLINE       0     0     0
    cache
      ata-SanDisk_SDSSDP064G_143094401488-part2    ONLINE       0     0     0
      ata-SanDisk_SDSSDP064G_143906402718-part2    ONLINE       0     0     0

errors: No known data errors
root@lm3:~# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdb/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdc/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdd/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sde/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdf/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdg/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/block/sdh/queue/scheduler
[mq-deadline] none
root@lm3:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
none

Before (only this is available):
Bash:
root@lm3:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
noop


 
System No. 2.
It locked up 3 times, so it is definitely affected.

Before:
Bash:
root@lm1:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 01:08:43 with 0 errors on Sun Jul 14 01:32:44 2019
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        sde2                                         ONLINE       0     0     0
        sdf2                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        ata-Samsung_SSD_850_PRO_1TB_S252NXAG805442N  ONLINE       0     0     0
        ata-Samsung_SSD_850_PRO_1TB_S252NXAG805106B  ONLINE       0     0     0

errors: No known data errors

  pool: zbackup
 state: ONLINE
  scan: scrub repaired 0B in 1 days 03:32:58 with 0 errors on Mon Jul 15 03:57:00 2019
config:

    NAME                                          STATE     READ WRITE CKSUM
    zbackup                                       ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3UZC1F1  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M7ZESTA2  ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M7ZEST6D  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6AVE83Y  ONLINE       0     0     0

errors: No known data errors

root@lm1:~# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdb/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdc/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdd/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sde/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdf/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdg/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/block/sdh/queue/scheduler
[mq-deadline] none
root@lm1:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
noop


The "after" state is not available yet, but I assume the only difference will be /sys/module/zfs/parameters/zfs_vdev_scheduler.
EDIT: the "after" output is now available:
Bash:
root@lm1:~# for i in a b c d e f g h; do cat /sys/block/sd$i/queue/scheduler; done
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
[mq-deadline] none
root@lm1:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
none
 
You said this happens to 2 nodes in a cluster? At the same time?

STOP RIGHT THERE, stop messing with ZFS, it's not ZFS.

just
Code:
systemctl restart corosync

Another possible fault would be storage shares, NFS and CIFS in particular.
If it's not corosync, deactivate the external storages. A faulty storage, or even a slow one, can hang your kernel too.

What you see is a symptom in the kernel that shows up in ZFS; it's not ZFS that is the cause. It would happen with any filesystem.
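To make that concrete, a rough checklist along those lines (a sketch; the storage ID is a hypothetical example, check /etc/pve/storage.cfg for your real names):

Code:
# check the cluster daemons before restarting anything
systemctl status corosync pve-cluster
systemctl restart corosync
# a hung or slow NFS/CIFS storage usually shows up as inactive here, or the command itself stalls
pvesm status
# temporarily disable a suspect external storage (hypothetical storage ID)
pvesm set nfs-backup --disable 1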
 
@bofh Thank you for your hint! I will check it next time.
In my experience, these two servers were both running a scrub while zfs send/receive was also happening between them on a 5-node cluster, and there was about an hour difference between the freezes.
This was the 2nd or 3rd time.
Even standalone servers froze in the same way.
Next time this happens I will try to check, but I am afraid that when it does, my system will be unresponsive and I won't have a chance to log in and test it.
 
I started a scrub on my server without making any changes to the setup; for safety I stopped all services (like NFS and SMB) and VMs/containers.
Scrubbing rpool (which consists of 2 SSD mirrors) produced no errors, and trying again with all services etc. started, it still didn't bug out.
At the moment I'm scrubbing my tank pool (consisting of 8 spinning disks in RAID-Z2) and it has been going strong for at least half an hour now.

This same server had a kernel panic like the other servers in this thread on Sunday the 8th when starting the monthly scrub.
For good measure I started the scrubs at around 00:00, like the cron job would.
I'll post the result of the scrub tomorrow.
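In case it helps anyone trigger this on demand instead of waiting for the monthly cron job, starting, watching and cancelling a scrub by hand is just (pool names as used in this thread):

Code:
zpool scrub tank        # start a scrub manually
zpool status tank       # check progress
zpool scrub -s tank     # stop the scrub if the box starts to misbehave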

Here's some information about my setup:
Supermicro A2SDi-H-TF
Intel Atom C3758
64 GB 2133 MHz ECC registered memory
4x Intel S3510 480 GB
8x Seagate Enterprise 8 TB
1x Intel Optane M10 16 GB
Kernel PVE 5.0.18-3 (Thu, 8 Aug 2019 09:05:29 +0200)

Code:
root@pve:~# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline
root@pve:~# cat /sys/block/sda/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdb/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdc/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdd/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sde/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdf/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdg/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdh/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdi/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdj/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdk/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdl/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/block/sdm/queue/scheduler
[mq-deadline] none
root@pve:~# cat /sys/module/zfs/parameters/zfs_vdev_scheduler
noop
root@pve:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 01:46:48 with 0 errors on Sun Aug 11 02:10:49 2019
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-INTEL_SSDSC2BB480G4_************480QGN        ONLINE       0     0     0  (untrimmed)
            ata-INTEL_SSDSC2BB480G4_************480QGN-part3  ONLINE       0     0     0  (untrimmed)
          mirror-1                                            ONLINE       0     0     0
            ata-INTEL_SSDSC2BB480G4_************480QGN        ONLINE       0     0     0  (untrimmed)
            ata-INTEL_SSDSC2BB480G4_************480QGN        ONLINE       0     0     0  (untrimmed)

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 07:32:50 with 0 errors on Sun Aug 11 07:56:52 2019
config:

        NAME                                  STATE     READ WRITE CKSUM
        tank                                  ONLINE       0     0     0
          raidz2-0                            ONLINE       0     0     0
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
            ata-ST8000NM0055-1RM112_********  ONLINE       0     0     0  (trim unsupported)
        logs
          nvme0n1                             ONLINE       0     0     0  (untrimmed)

errors: No known data errors
root@pve:~#
 
