Proxmox V6 Servers freeze, Zvol blocked for more than 120s

hashimji2008 · Sep 8, 2019

This has happened twice in the last 30 days (25 days apart), and it hits both servers at the same time. Both have the same configuration (same CPU and RAM). There were no issues before the V6 update.

CPU(s) 16 x Intel(R) Xeon(R) CPU L5520 @ 2.27GHz (2 Sockets)
Kernel Version Linux 5.0.18-1-pve #1 SMP PVE 5.0.18-2 (Fri, 2 Aug 2019 14:51:00 +0200)
PVE Manager Version pve-manager/6.0-5/f8a710d7

KVM images attached.

ishan · Sep 8, 2019

Seeing something similar on an Intel E3-1230v5, 32GB RAM, 4x500GB SSD, ZFS RAID10.

No load on the VMs or on Proxmox, and suddenly everything freezes up. The only option is to restart the Proxmox node.

hashimji2008 · Sep 8, 2019

Both servers have 10 vms each. Average load 5-6
ZFS RAID1
Total RAM 72GB

pongraczi · Sep 8, 2019

It seems I have run into the same situation as you.
Unfortunately, I have experienced this issue on 6 different servers (HP, Dell, Fujitsu and Intel).
All of them use almost the same storage configuration:

- rpool (and production storage) on SSD
- rpool layouts: RAID10 (4 SSDs) or raidz2 (4 SSDs on some servers, 6 on others)
- SSDs (usually each server contains only one kind): Kingston, Western Digital, Samsung
- local backups: RAID1 on spinning rust
- most use motherboard SATA or a cheap PCIe SATA card; one uses a RAID card that supports pass-through
- all KVM drives are set to support trim, hoping for the best
- it turned out that trim is not supported at the kernel level in some cases (due to HBA prerequisites etc., I guess), but this differs from server to server, so it should not be the core problem
- I use zfs-auto-snapshot (hourly, daily etc.) on all production volumes/filesystems
- simplesnap usually runs every hour to a remote server (it usually takes a few minutes and the start times are spread across the hour, so only one remote copy happens at a time)
- daily simplesnap remote copy over the internet to a remote location
- one server has only auto-snapshot, no remote copy
- the zfs pools were upgraded at different times; the servers hung both before and after the upgrade
- when a server hangs, it starts with increasing load; on one it climbed to 1300, which is a little bit high )))))))))
- the same servers worked fine for years without this kind of situation on PVE 3.x, 4.x and 5.x
- it only started to happen with PVE 6.0
- it seems all zfs pool access hangs and the system dies: the kernel still answers pings, but remote connections are impossible; logging in on the physical terminal can work in the beginning, but after a while that becomes impossible, too
- twice the following happened: two of my oldest Intel servers hung with the same symptoms at nearly the same time (they usually simplesnap to each other within 40 minutes, so that could trigger the double failure: the first one dies, the second tries to zfs send/receive and dies, too; but it could be a false lead and not the real reason)
- today (8 Sept 2019) I upgraded to the latest version (pve-manager/6.0-7/28984024, running kernel 5.0.21-1-pve), which contains a new zfsutils-linux package, but I experienced the same situation with different subversions of PVE 6.0 (different kernels) before this version

So the only common point I can see is the zfs filesystem itself, which hangs for some reason.
The first time, it happened after about 1 week of uptime, but it seems to be unpredictable, so I cannot guess when it will happen next.
I will try rebooting these servers once a week and hope that avoids the situation.

When I checked the zpool status -t output, I usually saw that SSDs are reported as untrimmed and spinning disks as trim unsupported, as in the following example:

root@lm4:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:13:31 with 0 errors on Sun Sep 8 00:37:32 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0  (untrimmed)
        sdb2    ONLINE       0     0     0  (untrimmed)

errors: No known data errors

  pool: zbackup
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:08:04 with 0 errors on Sun Sep 8 02:32:07 2019
config:

    NAME                                        STATE     READ WRITE CKSUM
    zbackup                                     ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        ata-ST2000DM001-1CH164_Z1E8CR5S-part2   ONLINE       0     0     0  (trim unsupported)
        ata-TOSHIBA_DT01ACA200_84649YHGS-part2  ONLINE       0     0     0  (trim unsupported)
      mirror-1                                  ONLINE       0     0     0
        ata-ST2000DM001-1ER164_W4Z07B1X         ONLINE       0     0     0  (trim unsupported)
        ata-TOSHIBA_DT01ACA200_84643E8GS        ONLINE       0     0     0  (trim unsupported)
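
To compare trim state across several servers, the per-device annotations in the zpool status -t output can be tallied with a small filter. This is only a sketch: the function name trim_summary is made up here, and it assumes the state appears in parentheses at the end of each device line, as in the output above. Other annotations (for example a completed trim with a timestamp) would simply show up as their own entries.

```shell
# trim_summary: read `zpool status -t` output on stdin and count
# devices per trim state (the text inside the trailing parentheses).
# Hypothetical helper, not part of the zfs tools.
trim_summary() {
    sed -n 's/.*(\(.*\))[[:space:]]*$/\1/p' |
        sort | uniq -c |
        awk '{n=$1; $1=""; sub(/^ /, ""); print $0 ": " n}'
}

# Usage on a live system:
#   zpool status -t | trim_summary
```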

So, I suspect one of the following:

- a general zfs 0.8.x race condition (I have not found a specific issue on the zfs mailing list yet)
- something weird with trim support being switched on and the underlying zfs/SSD

I hope my post contains enough details to get your attention.

Anyway, it is a serious problem that I have never experienced with the PVE/ZFS combination, even in the beginning, when I had to compile zfs myself on PVE 1.x.

logics · Sep 8, 2019

I think I have something similar here: https://forum.proxmox.com/threads/c...hung_task_timeout_secs-problem-mit-zfs.57779/

- I had a backup job (1M IOPS according to the Proxmox monitor) about 1 hour before the event, but I can't really connect that to the crash because there was 1 hour of idle time between the backup job and the crash.
- I had a load of 0.2-0.8 (1-minute average) before the event, when the CPU load suddenly rose quickly due to an I/O problem. I suspect a ZFS problem.

edit:
- For me it happened on Linux XirraDedi 5.0.21-1-pve #1 SMP PVE 5.0.21-2 (Wed, 28 Aug 2019 15:12:18 +0200) x86_64 GNU/Linux
- Until now, it happened only once, about 8.5 hours after the last restart. So I'm not sure if your "reboot once a week" is enough, @pongraczi

edit #2:
Because you mentioned trim: yes, I have activated support for that, too, although my VM has no cronjob executing that command regularly, IIRC.

wasteground · Sep 9, 2019

I also hit this today, and it seems to be triggered (and re-triggerable) by running a zpool scrub (of course, that could just be triggering another issue, but I can consistently trigger exactly this behaviour just by running a zpool scrub). Interestingly, this only happens on one of my four servers, the other 3 can all scrub just fine.

I am not using any discard/trim, all the disks are rust. Happy to share more info if it helps!

mbosma · Sep 9, 2019

I had the same issue yesterday.
Server was running fine until I upgraded my pools to zfs 0.8.1.
The error messages started just after midnight on the first Sunday of the month.
I'll check if I can see what the cron does and if I can reproduce the issue this evening.

logics · Sep 9, 2019

@wasteground That was a good hint. I could trigger it on my machine now with a zpool scrub rpool

8:54 am: I entered zpool scrub rpool on the host. The SSH terminal where I entered that command was stuck immediately, and it never returned. I couldn't CTRL+C out of it, nothing.
8:55 am: At first it was still possible to log in to another shell, but htop could not be opened there. Later I could no longer log in to either the host or the VM.
8:58 am: The first "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" ... "INFO: task ... blocked for more than 120 seconds" messages appeared about 4 minutes after the scrub command.
Here is a screenshot of the terminal output via iKVM/HTML5 (ikvm_html5.png) at 8:58, shortly after the crash messages appeared. What you see on this screenshot is basically the same as what was shown until the reset at 9:24 am. No new messages appeared. The system was frozen:

My monitoring software ("NixStats") was stuck shortly after it. It could only send a few values after the scrub command, you can see it here (nixstats_host.png):

(All CPU goes into "kvm" because java is running in the VM and apparently got stuck due to the scrub command. Java took all the CPU power, so this time it counted as "user" and not "i/o" CPU resources.)

9:24 am: After executing a "Power Reset" command via iKVM/HTML5 on the machine and a completed boot, the syslog showed no hint of what had happened. It's good that I have the terminal output via iKVM/HTML5, because otherwise there would be no way of seeing any error messages. If anyone wants a copy of the syslog, I can provide it though.
I made a video of the whole situation, but it does not show much more information than what is provided here already.
I executed the zpool status -v rpool and zpool history commands before and after the scrub crash. They show no change, so it appears the scrub was never really executed. See "zpool_before_scrub.txt" and "z_pool_after_scrub.txt".

I don't know what triggered the system crash/freeze yesterday though (https://forum.proxmox.com/threads/c...hung_task_timeout_secs-problem-mit-zfs.57779/). I did not schedule or run a scrub explicitly. Is scrub automatically executed on Proxmox 6? So can we be sure it is always the scrub command that causes these problems? Or is there another zfs command that results in the same situation?

Stoiko Ivanov · Sep 9, 2019

logics said:
Is scrub automatically executed on Proxmox 6?

There is a cronjob which, by default, scrubs all healthy pools on the second Sunday of each month:

Code:

/etc/cron.d/zfsutils-linux
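
For anyone wanting to check whether a past freeze lines up with that schedule, the "second Sunday" rule can be expressed as: the day is a Sunday and falls on the 8th to 14th of the month. The following is a minimal sketch, assuming GNU date (as found on a Proxmox host); the function name is made up, and this is not the content of the cron file itself.

```shell
# is_second_sunday DATE: succeed if DATE (YYYY-MM-DD) is the second
# Sunday of its month. Requires GNU date for the -d option.
is_second_sunday() {
    dow=$(date -d "$1" +%w) || return 2   # 0 = Sunday
    dom=$(date -d "$1" +%d) || return 2   # day of month, zero padded
    dom=${dom#0}                          # strip the padding before comparing
    [ "$dow" -eq 0 ] && [ "$dom" -ge 8 ] && [ "$dom" -le 14 ]
}

# Example: Sunday, 8 September 2019, the scrub date shown in this thread
is_second_sunday 2019-09-08 && echo "matches the scrub schedule"
```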

Does anyone have a comprehensive log from when the system becomes unresponsive?
* remote syslog could help in acquiring one
* enabling persistent journalling as well (although it could be that the last log-messages are not persisted to disk if you need to reset the system)
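
Both suggestions can be set up in a few lines. The sketch below only generates the rsyslog forwarding rule; the receiver address 192.0.2.10, port 514, the file name 90-remote.conf and the function name are placeholders, not values from this thread. In rsyslog's classic syntax a single @ forwards via UDP and @@ via TCP.

```shell
# forward_rule HOST PORT [tcp|udp]: print an rsyslog rule that forwards
# all facilities and priorities to a remote receiver.
forward_rule() {
    case "${3:-tcp}" in
        tcp) prefix='@@' ;;
        udp) prefix='@' ;;
        *)   echo "unknown protocol: $3" >&2; return 1 ;;
    esac
    printf '*.* %s%s:%s\n' "$prefix" "$1" "$2"
}

# On the Proxmox host (as root), roughly:
#   forward_rule 192.0.2.10 514 tcp > /etc/rsyslog.d/90-remote.conf
#   systemctl restart rsyslog
#
# Persistent journal: with the default Storage=auto, journald writes to
# disk once this directory exists.
#   mkdir -p /var/log/journal
#   systemctl restart systemd-journald
```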

Semmo · Sep 9, 2019

I have the same problems, but I cannot provide logs at the moment. Just wanted to add myself to the list.

Every 2nd Sunday (I just realized this now while reading this thread) I need to do a hard reset of my remote hosted node.

I have a default setup with ZFS-Raid10

logics · Sep 9, 2019

To save you guys some googling, here is a tutorial on how to make a machine a (sys)log receiver and how to make your Proxmox host send logs to that remote machine: https://vexxhost.com/resources/tuto...tem-logging-with-rsyslog-on-ubuntu-14-04-lts/

I will try it out soon, too.

wasteground · Sep 9, 2019

I configured all my Proxmox boxes with remote syslog now - I can't cause the problem right now since it's during the day, but this evening I will run a scrub, create the issue and then share the logs.

If there are any other useful pieces of data to capture at the same time, just let me know.

Stoiko Ivanov · Sep 9, 2019

wasteground said:
If there are any other useful pieces of data to capture at the same time, just let me know.

Please make sure to remove any data you consider sensitive from the debug-output you post.

A stack trace of all running processes when the hang starts (this might take a while or not be possible at all, depending on how hard the freeze is):
* enable sysrq by writing 1 to '/proc/sys/kernel/sysrq': echo 1 > /proc/sys/kernel/sysrq
* when the problem arises, write 't' to '/proc/sysrq-trigger': echo t > /proc/sysrq-trigger
(this should write the necessary info to the kernel.log, which should be transferred via remote syslog to the other machine)
See https://en.wikipedia.org/wiki/Magic_SysRq_key and https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html for details

gather some perf data around the time the system freezes (try to write it to a disk not part of your ZFS - that should increase the chances of it being written should the system crash):
* perf record -F 99 -a -g -- sleep 60 (records all running processes stacktraces repeatedly and writes to 'perf.data' in current directory)
* perf script > global.perf
(the global.perf contains the data in a processable format)
see https://perf.wiki.kernel.org/index.php/Tutorial , https://jvns.ca/perf-cheat-sheet.pdf and http://www.brendangregg.com/perf.html
for further info
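
Put together, the two steps could be wrapped in a small script kept ready on the host. This is a sketch only: the function names, the OUT_DIR variable and its default are invented here, and the SYSRQ_ENABLE/SYSRQ_TRIGGER overrides exist purely so the script can be dry-run against ordinary files. It must run as root, and the second function needs perf installed.

```shell
# Paths default to the real kernel interfaces; override them for a dry run.
SYSRQ_ENABLE=${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
# Should point at a disk that is NOT on ZFS (placeholder default).
OUT_DIR=${OUT_DIR:-/mnt/nonzfs}

# dump_tasks: enable sysrq, then ask the kernel to log every task's stack.
dump_tasks() {
    echo 1 > "$SYSRQ_ENABLE" && echo t > "$SYSRQ_TRIGGER"
}

# record_perf: sample all CPUs at 99 Hz for 60 s, then export as text.
record_perf() {
    perf record -F 99 -a -g -o "$OUT_DIR/perf.data" -- sleep 60 &&
        perf script -i "$OUT_DIR/perf.data" > "$OUT_DIR/global.perf"
}

# When the hang starts:
#   dump_tasks; record_perf
```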

I hope this helps!

pongraczi · Sep 9, 2019

Unfortunately, the affected system becomes unresponsive and disk access is blocked immediately, so the local log will contain nothing.
I assume there will not be much in the remote log either, because the syslog messages are already printed to the screen and there will not be many resources left to run anything.
I was able to log in to one of my servers and it was incredibly slow (of course) while the load was 1300+, so physically powering it off and on was the solution.

Anyway, I am curious about the result if someone has such a system with remote logging and a non-zfs local OS.

Good luck!

<offtopic>
how can I edit signature to update it? thanks.
</offtopic>

sandor · Sep 9, 2019

We have also experienced two failures that look like this.
We have a local storage which is a single-disk zfs pool.
At the point of failure the PVE GUI and the server were still responsive; only the VMs on the node were unresponsive. Their disks were on the zfs storage.
It looked as if they had lost their disks.
Only a restart of the node solved the problem.
We were unable to reproduce the failure with a scrub.
I attach two syslogs covering the time when the event happened.

ivanm · Sep 9, 2019

Hi,

I have had a similar problem since the v6 update. Sometimes, on the destination pool, the zfs recv -F -- /rpool/... process is not properly killed and remains as a zombie process. Any further pve-zsync command is then queued on the source machine until no new process can be forked. The destination server must be rebooted.

zpool status:

Code:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:38:05 with 0 errors on Sun Sep  8 03:02:06 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0

errors: No known data errors

kernel log:

Code:

Sep  7 11:03:34 hn kernel: [962440.233380]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233441]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233498]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233502]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233503]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233505]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233507]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233508]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233510]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233512]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233514]  ? __switch_to+0x96/0x4e0
Sep  7 11:03:34 hn kernel: [962440.233516]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233518]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233521]  ? __schedule+0x2dc/0x870
Sep  7 11:03:34 hn kernel: [962440.233525]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:03:34 hn kernel: [962440.233535]  ? wake_up_q+0x80/0x80
Sep  7 11:03:34 hn kernel: [962440.233548]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:03:34 hn kernel: [962440.233552]  ret_from_fork+0x35/0x40
Sep  7 11:03:34 hn kernel: [962440.233662]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:03:34 hn kernel: [962440.233724] z_unlinked_drai D    0  2275      2 0x80000000
Sep  7 11:03:34 hn kernel: [962440.233730]  __schedule+0x2d4/0x870
Sep  7 11:03:34 hn kernel: [962440.233740]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:03:34 hn kernel: [962440.233805]  ? zfs_inode_destroy+0xf8/0x110 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233872]  ? destroy_inode+0x3e/0x60
Sep  7 11:03:34 hn kernel: [962440.233876]  ? iput+0x148/0x210
Sep  7 11:03:34 hn kernel: [962440.233934]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:03:34 hn kernel: [962440.234052]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:03:34 hn kernel: [962440.234057]  ? __switch_to+0x471/0x4e0
Sep  7 11:03:34 hn kernel: [962440.234061]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.234065]  ? __schedule+0x2dc/0x870
Sep  7 11:03:34 hn kernel: [962440.234075]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:03:34 hn kernel: [962440.234082]  kthread+0x120/0x140
Sep  7 11:03:34 hn kernel: [962440.234091]  ? __kthread_parkme+0x70/0x70
Sep  7 11:07:36 hn kernel: [962681.900169] INFO: task z_unlinked_drai:473 blocked for more than 120 seconds.
Sep  7 11:07:36 hn kernel: [962681.900241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 11:07:36 hn kernel: [962681.900281] Call Trace:
Sep  7 11:07:36 hn kernel: [962681.900298]  schedule+0x2c/0x70
Sep  7 11:07:36 hn kernel: [962681.900316]  ? spl_kmem_cache_free+0xc0/0x1d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.900446]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:07:36 hn kernel: [962681.900451]  ? evict+0x139/0x1a0
Sep  7 11:07:36 hn kernel: [962681.900454]  ? insert_inode_locked+0x1d8/0x1e0
Sep  7 11:07:36 hn kernel: [962681.900573]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:07:36 hn kernel: [962681.900634]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900637]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900641]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900644]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900649]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900653]  ? __schedule+0x2dc/0x870
Sep  7 11:07:36 hn kernel: [962681.900664]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.900674]  kthread+0x120/0x140
Sep  7 11:07:36 hn kernel: [962681.900683]  ? __kthread_parkme+0x70/0x70
Sep  7 11:07:36 hn kernel: [962681.900757] INFO: task z_unlinked_drai:2275 blocked for more than 120 seconds.
Sep  7 11:07:36 hn kernel: [962681.900819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 11:07:36 hn kernel: [962681.900857] Call Trace:
Sep  7 11:07:36 hn kernel: [962681.900864]  schedule+0x2c/0x70
Sep  7 11:07:36 hn kernel: [962681.900878]  ? spl_kmem_cache_free+0xc0/0x1d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.901000]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:07:36 hn kernel: [962681.901004]  ? evict+0x139/0x1a0
Sep  7 11:07:36 hn kernel: [962681.901007]  ? insert_inode_locked+0x1d8/0x1e0
Sep  7 11:07:36 hn kernel: [962681.901124]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:07:36 hn kernel: [962681.901185]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.901188]  ? __switch_to_asm+0x35/0x70
Sep  7 11:07:36 hn kernel: [962681.901192]  ? __switch_to_asm+0x35/0x70
Sep  7 11:07:36 hn kernel: [962681.901198]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:07:36 hn kernel: [962681.901207]  ? wake_up_q+0x80/0x80
Sep  7 11:07:36 hn kernel: [962681.901218]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:07:36 hn kernel: [962681.901222]  ret_from_fork+0x35/0x40
Sep  7 11:09:37 hn kernel: [962802.734202]  ? __switch_to_asm+0x41/0x70
Sep  7 11:09:37 hn kernel: [962802.734211]  ? __schedule+0x2dc/0x870
Sep  7 11:09:37 hn kernel: [962802.734226]  ? wake_up_q+0x80/0x80
Sep  7 11:09:37 hn kernel: [962802.734241]  ? __kthread_parkme+0x70/0x70
Sep  7 11:09:37 hn kernel: [962802.734351]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:09:37 hn kernel: [962802.734416] Call Trace:
Sep  7 11:09:37 hn kernel: [962802.734430]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:09:37 hn kernel: [962802.734552]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:09:37 hn kernel: [962802.734557]  ? iput+0x148/0x210
Sep  7 11:09:37 hn kernel: [962802.734682]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:09:37 hn kernel: [962802.734744]  ? __switch_to+0x471/0x4e0
Sep  7 11:09:37 hn kernel: [962802.734749]  ? __switch_to_asm+0x35/0x70
Sep  7 11:09:37 hn kernel: [962802.734762]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:09:37 hn kernel: [962802.734775]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:11:38 hn kernel: [962923.567782]  ? __switch_to_asm+0x41/0x70
Sep  7 11:11:38 hn kernel: [962923.567791]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:11:38 hn kernel: [962923.567801]  ? wake_up_q+0x80/0x80
Sep  7 11:11:38 hn kernel: [962923.567813]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:11:38 hn kernel: [962923.567818]  ret_from_fork+0x35/0x40
Sep  7 11:11:38 hn kernel: [962923.567927]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:11:38 hn kernel: [962923.567989] z_unlinked_drai D    0  2275      2 0x80000000
Sep  7 11:11:38 hn kernel: [962923.567996]  __schedule+0x2d4/0x870
Sep  7 11:11:38 hn kernel: [962923.568006]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:11:38 hn kernel: [962923.568072]  ? zfs_inode_destroy+0xf8/0x110 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568131]  ? destroy_inode+0x3e/0x60
Sep  7 11:11:38 hn kernel: [962923.568134]  ? iput+0x148/0x210
Sep  7 11:11:38 hn kernel: [962923.568193]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568310]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568316]  ? __switch_to+0x471/0x4e0
Sep  7 11:11:38 hn kernel: [962923.568319]  ? __switch_to_asm+0x41/0x70
Sep  7 11:11:38 hn kernel: [962923.568324]  ? __schedule+0x2dc/0x870
Sep  7 11:11:38 hn kernel: [962923.568334]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:11:38 hn kernel: [962923.568341]  kthread+0x120/0x140
Sep  7 11:11:38 hn kernel: [962923.568349]  ? __kthread_parkme+0x70/0x70
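
A log like this is easier to compare across hosts once it is reduced to which tasks were reported as blocked and how often. The following is a minimal sketch, assuming the hung-task message format seen above (INFO: task NAME:PID blocked for more than N seconds.); the function name hung_tasks is invented for illustration.

```shell
# hung_tasks: read kernel log lines on stdin and print "COUNT NAME:PID"
# for each task flagged by the kernel's hung-task detector.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9]* seconds.*/\1/p' |
        sort | uniq -c |
        awk '{n=$1; $1=""; sub(/^ /, ""); print n, $0}'
}

# Usage, e.g. against the journal's kernel messages or a syslog file:
#   journalctl -k | hung_tasks
#   hung_tasks < /var/log/kern.log
```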

hashimji2008 · Sep 9, 2019

Another server with zero to 1 VM (average load less than 1 or 0.5) has passed 38 days of uptime. I think server load is somehow related to this freeze, along with zfs.

wasteground · Sep 10, 2019

I didn't get a chance yet to trigger this on my box (accidentally moved a VM to the server that I can't reboot until the weekend, sorry!), however, I noticed that at least in my case, this doesn't seem to be I/O related:

- I accidentally managed to move a VM (2TB of data) to the server, no issues.
- The box in question runs one FreeBSD VM which is used as a storage server and is the backup target for a bunch of other nodes
- all the backups I have ran perfectly fine overnight (4-5TB+ of data moved to this box)
- the server then uploaded 1-2TB to another server for backup

So, at least in my case, I don't feel that this is related to server load or disk throughput/usage - I could be wrong, but it just seems like the scrub is the only trigger here for me.

I'll follow up on recreating the issue on Saturday when I can move the mistake VM again without breaking anything - if there's anything non-destructive I can help with until then, just lmk.

Quick edit: I also found this, not clear if it's related: https://github.com/zfsonlinux/zfs/issues/8664

Rob.

wasteground · Sep 10, 2019

Follow-up thought: it would be interesting to check whether the scrub completes okay when there are no workloads running on the Proxmox node. I can test this at the weekend too (maybe sooner if I can move some things around).

pongraczi · Sep 10, 2019

Rob, you found something on GitHub: #8664.
I am reporting this thread on GitHub.
