Proxmox V6 Servers freeze, Zvol blocked for more than 120s

hashimji2008

Member
Dec 17, 2015
6
1
8
28
This has happened twice in the last 30 days (25 days apart), and each time it hit both servers at the same time. Both have the same configuration (same CPU and RAM). There were no issues before the V6 update.



CPU(s) 16 x Intel(R) Xeon(R) CPU L5520 @ 2.27GHz (2 Sockets)
Kernel Version Linux 5.0.18-1-pve #1 SMP PVE 5.0.18-2 (Fri, 2 Aug 2019 14:51:00 +0200)
PVE Manager Version pve-manager/6.0-5/f8a710d7


KVM images attached.
 


ishan

New Member
Nov 12, 2018
11
1
3
31
Seeing something similar on an Intel E3-1230v5 with 32GB RAM and 4x500GB SSDs in ZFS RAID10.

There is no load on the VMs or on Proxmox, and suddenly everything freezes up. The only option is to restart the Proxmox node.
 

pongraczi

New Member
Oct 23, 2008
17
5
3
Hungary
www.startit.hu
It seems I have run into the same situation as you.
Unfortunately, I have experienced this issue on 6 different servers (HP, Dell, Fujitsu and Intel servers).
All of them use almost the same storage configuration:
  • rpool (and production storage) on SSD
  • rpools are: raidz10 (4 SSDs); raidz2 (4 SSDs on some, 6 SSDs on others)
  • SSDs are Kingston, Western Digital or Samsung (usually one server contains only one kind of SSD)
  • local backups: RAID1 on spinning rust
  • most use motherboard SATA or a cheap PCIe SATA card; one uses a RAID card that supports pass-through
  • all KVM drives are set to support trim, hoping for the best
  • it turned out that trim is not supported at kernel level in some cases (due to HBA etc. prerequisites, I guess), but the situation really differs from server to server, so this should not be the core problem
  • I use zfs-auto-snapshot (hourly, daily etc.) on all production volumes/filesystems
  • simplesnap usually runs every hour to a remote server (it usually takes a few minutes and the start times are spread across the hour, so only one remote copy happens at a time)
  • daily simplesnap remote copy over the internet to a remote location
  • on one server only auto-snapshot, no remote copy
  • the zfs pools were upgraded at different times; the servers hung both before and after the upgrade
  • when a server hangs, it starts with increasing load; on one it went up to 1300, which is a little bit high :)
  • The same servers worked fine for years without this kind of situation under PVE 3.x, 4.x and 5.x
  • It only started to happen with PVE 6.0
  • It seems all zfs pool access hangs and the systems die; only the kernel still answers pings, but remote connections are impossible. Logging in on the physical terminal can work in the beginning, but after a while that becomes impossible, too
  • twice the following happened: two of my oldest Intel servers hung with the same symptoms at nearly the same time (they usually simplesnap to each other within 40 minutes, so this could trigger the double failure: first one of them dies, then the second tries to zfs send/receive and dies, too; but it could be a false lead and not the real reason)
  • today (8th of Sept, 2019) I upgraded to the latest version (pve-manager/6.0-7/28984024 (running kernel: 5.0.21-1-pve)), which contains a new zfsutils-linux package, but I experienced the same situation with different subversions of PVE 6.0 (different kernels) before this version.
So, the only common point I can see is the zfs filesystem itself, which hangs for some reason.
The first time, it happened after about 1 week of uptime, but it seems to be unpredictable, so I cannot guess when it will happen next.
I will try rebooting these servers once a week and hope for the best; maybe that avoids this situation.

When I check the zpool status -t output, I usually see that the SSDs are untrimmed and that trim is unsupported on the spinning rust, as in the following example:

root@lm4:~# zpool status -t
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:13:31 with 0 errors on Sun Sep 8 00:37:32 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0  (untrimmed)
        sdb2    ONLINE       0     0     0  (untrimmed)

errors: No known data errors

  pool: zbackup
 state: ONLINE
  scan: scrub repaired 0B in 0 days 02:08:04 with 0 errors on Sun Sep 8 02:32:07 2019
config:

    NAME                                        STATE     READ WRITE CKSUM
    zbackup                                     ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        ata-ST2000DM001-1CH164_Z1E8CR5S-part2   ONLINE       0     0     0  (trim unsupported)
        ata-TOSHIBA_DT01ACA200_84649YHGS-part2  ONLINE       0     0     0  (trim unsupported)
      mirror-1                                  ONLINE       0     0     0
        ata-ST2000DM001-1ER164_W4Z07B1X         ONLINE       0     0     0  (trim unsupported)
        ata-TOSHIBA_DT01ACA200_84643E8GS        ONLINE       0     0     0  (trim unsupported)
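
To compare the TRIM state across several servers, output like the above can be condensed with a small filter. This is only a sketch and not part of the original post: the function name trim_summary is made up for illustration, and it assumes the TRIM state is printed in parentheses on each device line, as in the example.

```shell
#!/bin/sh
# trim_summary is a hypothetical helper name.
# It reads `zpool status -t` text on stdin and prints one line per distinct
# TRIM state, prefixed by the number of devices reporting that state.
trim_summary() {
    grep -o '([^()]*trim[^()]*)' |   # keep only the "(...trim...)" part
        tr -d '()' |                 # drop the parentheses
        sort | uniq -c |             # count identical states
        sed 's/^ *//'                # strip the padding uniq adds
}

# Usage on a live system (the -t flag needs ZFS 0.8 or newer):
#   zpool status -t | trim_summary
```

For the two pools shown, this should report two untrimmed devices and four with trim unsupported.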



So, I suspect the following:
  • a general zfs 0.8.x race condition (I have not found a specific issue on the zfs mailing list yet)
  • something weird with trim support switched on and the underlying zfs/SSD
I hope my post contains enough details to get your attention :)
Anyway, it is a serious problem that I have never experienced with the PVE/ZFS combination, not even in the beginning, when I had to compile zfs myself on PVE 1.x
 
Sep 8, 2019
12
1
3
29
I think I have something similar here: https://forum.proxmox.com/threads/crash-proc-sys-kernel-hung_task_timeout_secs-problem-mit-zfs.57779/

- I had a backup job (1M IOPS according to the Proxmox monitor) about 1 hour before the event, but I can't really connect that to the crash, because the system was idle for 1 hour between the backup job and the crash.
- I had a load of 0.2-0.8 (1 minute avg) before the event, then the CPU load suddenly rose quickly due to an I/O problem. I suspect a ZFS problem.

edit:
- For me it happened on Linux XirraDedi 5.0.21-1-pve #1 SMP PVE 5.0.21-2 (Wed, 28 Aug 2019 15:12:18 +0200) x86_64 GNU/Linux
- So far, it has happened only once, about 8.5 hours after the last restart. So I'm not sure if your "reboot once a week" is enough, @pongraczi

edit #2:
Because you mentioned trim: yes, I have activated support for that, too, although my VM has no cronjob executing that command regularly, IIRC.
trim.png
 
Aug 6, 2019
10
2
3
36
I also hit this today, and it seems to be triggered (and re-triggerable) by running a zpool scrub (of course, that could just be triggering another issue, but I can consistently trigger exactly this behaviour just by running a zpool scrub). Interestingly, this only happens on one of my four servers, the other 3 can all scrub just fine.

I am not using any discard/trim, all the disks are rust. Happy to share more info if it helps!
 

mbosma

Member
Dec 3, 2018
41
4
8
25
I had the same issue yesterday.
Server was running fine until I upgraded my pools to zfs 0.8.1.
The error messages started just after midnight on the first Sunday of the month.
I'll check what the cron job does and whether I can reproduce the issue this evening.
 
Sep 8, 2019
12
1
3
29
@wasteground That was a good hint. I could trigger it on my machine now with a zpool scrub rpool

  • 8:54 am: I entered zpool scrub rpool on the host. The SSH terminal where I entered that command got stuck immediately and never returned. I couldn't even CTRL+C out of it.
  • 8:55 am: At first it was still possible to log in to another shell, but htop could not be opened there. Later I could no longer log in to either the host or the VM.
  • 8:58 am: The first messages "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" ... INFO: task ... blocked for more than 120 seconds appeared about 4 minutes after the scrub command.
  • Here is a screenshot of the terminal output via iKVM/HTML5 (ikvm_html5.png) at 8:58, shortly after the crash messages appeared. What you see in this screenshot is basically all that happened until the reset at 9:24 am. No new messages appeared. The system was frozen:
ikvm_html5.png
  • My monitoring software ("NixStats") got stuck shortly afterwards. It could only send a few values after the scrub command, as you can see here (nixstats_host.png): nixstats_host.png
(All CPU goes to "kvm" because Java is running in the VM and apparently got stuck due to the scrub command. Java took all the CPU power, so this time it counted as "user" and not as "i/o" CPU resources.)
  • 9:24 am: After I executed a "Power Reset" command via iKVM/HTML5 and the machine had booted up completely, the syslog showed no hint of what had happened. It's good that I have the terminal output via iKVM/HTML5, because otherwise there would be no way of seeing any error messages. If anyone wants a copy of the syslog, I can provide it.
  • I made a video of the whole situation, but it does not show much more information than what is already provided here.
  • I executed the zpool status -v rpool and zpool history commands before and after the scrub crash. They show no change, so it appears the scrub was never really executed. See "zpool_before_scrub.txt" and "z_pool_after_scrub.txt".
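
A before/after comparison like the one in the last bullet can be reduced to the single line that matters. The following is a minimal sketch, not what the poster actually ran: scan_line is a made-up helper name, and the file path in the usage comment is only an example (ideally on a disk outside the pool).

```shell
#!/bin/sh
# scan_line is a hypothetical helper name.
# It reads `zpool status` text on stdin and prints only the content of the
# "scan:" line(s), i.e. the result of the last scrub or resilver.
scan_line() {
    sed -n 's/^[[:space:]]*scan:[[:space:]]*//p'
}

# Usage sketch:
#   zpool status rpool | scan_line > /mnt/nonzfs/scan_before.txt
#   ... trigger the scrub, reset the machine if it hangs ...
#   zpool status rpool | scan_line | diff /mnt/nonzfs/scan_before.txt -
# `zpool history rpool | grep scrub` additionally lists the scrub commands
# that were recorded in the pool history.
```

If the diff is empty after the reset, the scan line did not change, which matches the observation that the scrub never really started.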

I don't know what triggered the system crash/freeze yesterday though (https://forum.proxmox.com/threads/crash-proc-sys-kernel-hung_task_timeout_secs-problem-mit-zfs.57779/). I did not schedule or run a scrub explicitly. Is scrub automatically executed on Proxmox 6? So can we be sure it is always the scrub command that causes these problems? Or is there another zfs command that results in the same situation?
 


Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,136
223
63
Quote: "Is scrub automatically executed on Proxmox 6?"
There is a cronjob, which by default scrubs all healthy pools on the second Sunday of a month:
Code:
/etc/cron.d/zfsutils-linux
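
On Debian-based systems the entry in that file typically limits the job to days 8-14 of the month and then tests for Sunday; the file on your own system is authoritative. The sketch below mirrors that logic to check whether a given date is a scrub day. The function name is made up, and GNU date is assumed for the -d option.

```shell
#!/bin/sh
# is_second_sunday is a hypothetical helper name.
# It succeeds when the given date (anything GNU `date -d` understands,
# default: today) is a Sunday with a day-of-month between 8 and 14,
# which is the second Sunday of that month.
is_second_sunday() {
    d=${1:-today}
    dow=$(date -d "$d" +%w) || return 2   # day of week, 0 = Sunday
    dom=$(date -d "$d" +%d) || return 2   # day of month, 01..31
    dom=${dom#0}                          # strip a leading zero before comparing
    [ "$dow" -eq 0 ] && [ "$dom" -ge 8 ] && [ "$dom" -le 14 ]
}

# Usage:
#   is_second_sunday 2019-09-08 && echo "scrub day"
#   cat /etc/cron.d/zfsutils-linux    # shows the actual schedule
```

Sunday, 8 September 2019 passes this check, which fits the scrub timestamps in the zpool status outputs posted in this thread.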
Does anyone have a comprehensive log from when the system becomes unresponsive?
* remote syslog could help in acquiring one
* enabling persistent journaling as well (although the last log messages may not be persisted to disk if you need to reset the system)
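
Both suggestions can be set up in a few lines. This is a rough sketch under assumptions: rsyslog is the syslog daemon in use, the log host name and the file name under /etc/rsyslog.d/ are placeholders, and forward_rule is a made-up helper that only prints the forwarding rule.

```shell
#!/bin/sh
# Hypothetical log collector; replace with your own host.
LOGHOST=loghost.example.com

# forward_rule HOST [PORT] prints an rsyslog rule that forwards all messages
# to HOST over TCP ("@@"; a single "@" would mean UDP). PORT defaults to 514.
forward_rule() {
    printf '*.* @@%s:%s\n' "$1" "${2:-514}"
}

# To apply (as root); the file name 90-remote.conf is an assumption:
#   forward_rule "$LOGHOST" > /etc/rsyslog.d/90-remote.conf
#   systemctl restart rsyslog
#
# Persistent journal: with journald's default Storage=auto, logs are kept on
# disk as soon as the directory /var/log/journal exists.
#   mkdir -p /var/log/journal
#   systemctl restart systemd-journald
```

The receiving host needs a syslog daemon that accepts TCP input on the chosen port.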
 

Semmo

New Member
May 27, 2019
26
0
1
33
I have the same problems, but I cannot provide logs atm. Just wanted to add myself to the list ;) Every 2nd Sunday (I only realized this now while reading this thread) I need to do a hard reset of my remotely hosted node.

I have a default setup with ZFS-Raid10
 
Aug 6, 2019
10
2
3
36
I configured all my Proxmox boxes with remote syslog now - I can't cause the problem right now since it's during the day, but this evening I will run a scrub, create the issue and then share the logs.

If there are any other useful pieces of data to capture at the same time, just let me know.
 

Stoiko Ivanov

Quote: "If there are any other useful pieces of data to capture at the same time, just let me know."
Please make sure to remove any data you consider sensitive from the debug-output you post.

A stack trace of all running processes when the hang starts (this might take a while or not be possible at all, depending on how hard the freeze is):
* enable sysrq by writing 1 to '/proc/sys/kernel/sysrq': echo 1 > /proc/sys/kernel/sysrq
* when the problem arises, write 't' to '/proc/sysrq-trigger': echo t > /proc/sysrq-trigger
(this should write the necessary info to the kernel log, which should then be transferred via remote syslog to the other machine)
See https://en.wikipedia.org/wiki/Magic_SysRq_key and https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html for details
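
The two steps above can be wrapped in one function so that a single command is enough once the machine starts to hang. This is a sketch only: dump_tasks and the two overridable variables are made up for illustration. It also sends 'w', which per the kernel SysRq documentation dumps only the tasks in uninterruptible (blocked) state and is usually shorter to read than the full 't' dump.

```shell
#!/bin/sh
# Paths can be overridden, e.g. to point at scratch files for a dry run.
SYSRQ_CTL=${SYSRQ_CTL:-/proc/sys/kernel/sysrq}
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}

# dump_tasks is a hypothetical helper name; it must run as root.
dump_tasks() {
    echo 1 > "$SYSRQ_CTL" || return 1      # enable all SysRq functions
    echo t > "$SYSRQ_TRIGGER" || return 1  # 't': dump all tasks with stacks to the kernel log
    echo w > "$SYSRQ_TRIGGER"              # 'w': dump only blocked (D state) tasks
}

# Usage during a hang, from a root shell that still responds:
#   dump_tasks
#   dmesg | tail -n 200
```

Keeping a root shell open in advance helps, since new logins tend to fail once the pool stops responding.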

Gather some perf data around the time the system freezes (try to write it to a disk that is not part of your ZFS pool; that should increase the chances of it being written should the system crash):
* perf record -F 99 -a -g -- sleep 60 (repeatedly records the stack traces of all running processes and writes them to 'perf.data' in the current directory)
* perf script > global.perf
(global.perf contains the data in a processable format)
For further info, see https://perf.wiki.kernel.org/index.php/Tutorial , https://jvns.ca/perf-cheat-sheet.pdf and http://www.brendangregg.com/perf.html
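
To make sure the perf output really lands outside ZFS, the target directory can be checked first. A minimal sketch, assuming GNU df (for --output) and an installed perf; on_zfs and record_perf are made-up names and the mount point in the usage comment is a placeholder.

```shell
#!/bin/sh
# on_zfs PATH: succeeds if PATH lives on a ZFS filesystem (hypothetical helper).
on_zfs() {
    [ "$(df --output=fstype "$1" 2>/dev/null | tail -n 1)" = zfs ]
}

# record_perf OUTDIR: sample all CPUs at 99 Hz for 60 seconds and convert the
# result to text, refusing to write into a ZFS-backed directory.
record_perf() {
    outdir=$1
    if on_zfs "$outdir"; then
        echo "refusing to write perf data to ZFS-backed $outdir" >&2
        return 1
    fi
    perf record -F 99 -a -g -o "$outdir/perf.data" -- sleep 60 &&
        perf script -i "$outdir/perf.data" > "$outdir/global.perf"
}

# Usage (the mount point is a placeholder for any non-ZFS disk):
#   record_perf /mnt/usbstick
```

On a root-on-ZFS installation, a USB stick or a small ext4 partition would be candidates for the output directory.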

I hope this helps!
 

pongraczi

Unfortunately, the affected system becomes unresponsive and disk access is blocked immediately, so the local log will contain nothing.
I assume there will not be much in the remote log either, because the syslog messages were already printed to the screen and there will not be many resources left to run anything.
I was able to log in to one of my servers and it was incredibly slow (of course) while the load was 1300+, so physically turning it off and on was the solution.

Anyway, I am curious about the result if someone has such a system with remote logging and a non-zfs local OS.

Good luck!

<offtopic>
how can I edit signature to update it? thanks.
</offtopic>
 
Aug 12, 2019
4
0
1
25
We have also experienced two failures that look like this.
We have a local storage which is a single-disk zfs.
At the point of failure, the PVE GUI and the server were still responsive; only the VMs on the node were unresponsive. Their disks were on the zfs storage.
It seems like they lost their disks.
Only a node restart solved the problem.
We were unable to reproduce the failure with a scrub.
I attach two syslogs covering the time when the event happened.
 


Jun 9, 2018
1
0
1
50
Hi,

I have had a similar problem since the v6 update. Sometimes, on the destination pool, the zfs recv -F -- /rpool/... process is not properly killed and remains as a zombie process. Any further pve-zsync command is then queued on the source machine until no new process can be forked. The destination server must be rebooted.
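
Processes that are stuck like this can be listed by their state. The snippet is a sketch and not part of the original report: stuck_procs is a made-up helper that keeps only zombie (Z) processes and processes in uninterruptible sleep (D) from ps output.

```shell
#!/bin/sh
# stuck_procs is a hypothetical helper name.
# It reads `ps -eo pid,stat,comm` output on stdin, skips the header line and
# prints only processes whose state starts with D (uninterruptible sleep)
# or Z (zombie).
stuck_procs() {
    awk 'NR > 1 && $2 ~ /^[DZ]/'
}

# Usage:
#   ps -eo pid,stat,comm | stuck_procs
```

Processes in D state cannot be killed until the I/O they wait for completes, which fits the observation that only a reboot of the destination server helps.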

zpool status:
Code:
  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 02:38:05 with 0 errors on Sun Sep  8 03:02:06 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda2    ONLINE       0     0     0
        sdb2    ONLINE       0     0     0
        sdc2    ONLINE       0     0     0
        sdd2    ONLINE       0     0     0

errors: No known data errors
kernel log:
Code:
Sep  7 11:03:34 hn kernel: [962440.233380]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233441]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233498]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233502]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233503]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233505]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233507]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233508]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233510]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233512]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233514]  ? __switch_to+0x96/0x4e0
Sep  7 11:03:34 hn kernel: [962440.233516]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.233518]  ? __switch_to_asm+0x35/0x70
Sep  7 11:03:34 hn kernel: [962440.233521]  ? __schedule+0x2dc/0x870
Sep  7 11:03:34 hn kernel: [962440.233525]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:03:34 hn kernel: [962440.233535]  ? wake_up_q+0x80/0x80
Sep  7 11:03:34 hn kernel: [962440.233548]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:03:34 hn kernel: [962440.233552]  ret_from_fork+0x35/0x40
Sep  7 11:03:34 hn kernel: [962440.233662]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:03:34 hn kernel: [962440.233724] z_unlinked_drai D    0  2275      2 0x80000000
Sep  7 11:03:34 hn kernel: [962440.233730]  __schedule+0x2d4/0x870
Sep  7 11:03:34 hn kernel: [962440.233740]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:03:34 hn kernel: [962440.233805]  ? zfs_inode_destroy+0xf8/0x110 [zfs]
Sep  7 11:03:34 hn kernel: [962440.233872]  ? destroy_inode+0x3e/0x60
Sep  7 11:03:34 hn kernel: [962440.233876]  ? iput+0x148/0x210
Sep  7 11:03:34 hn kernel: [962440.233934]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:03:34 hn kernel: [962440.234052]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:03:34 hn kernel: [962440.234057]  ? __switch_to+0x471/0x4e0
Sep  7 11:03:34 hn kernel: [962440.234061]  ? __switch_to_asm+0x41/0x70
Sep  7 11:03:34 hn kernel: [962440.234065]  ? __schedule+0x2dc/0x870
Sep  7 11:03:34 hn kernel: [962440.234075]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:03:34 hn kernel: [962440.234082]  kthread+0x120/0x140
Sep  7 11:03:34 hn kernel: [962440.234091]  ? __kthread_parkme+0x70/0x70
Sep  7 11:07:36 hn kernel: [962681.900169] INFO: task z_unlinked_drai:473 blocked for more than 120 seconds.
Sep  7 11:07:36 hn kernel: [962681.900241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 11:07:36 hn kernel: [962681.900281] Call Trace:
Sep  7 11:07:36 hn kernel: [962681.900298]  schedule+0x2c/0x70
Sep  7 11:07:36 hn kernel: [962681.900316]  ? spl_kmem_cache_free+0xc0/0x1d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.900446]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:07:36 hn kernel: [962681.900451]  ? evict+0x139/0x1a0
Sep  7 11:07:36 hn kernel: [962681.900454]  ? insert_inode_locked+0x1d8/0x1e0
Sep  7 11:07:36 hn kernel: [962681.900573]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:07:36 hn kernel: [962681.900634]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900637]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900641]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900644]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900649]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.900653]  ? __schedule+0x2dc/0x870
Sep  7 11:07:36 hn kernel: [962681.900664]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.900674]  kthread+0x120/0x140
Sep  7 11:07:36 hn kernel: [962681.900683]  ? __kthread_parkme+0x70/0x70
Sep  7 11:07:36 hn kernel: [962681.900757] INFO: task z_unlinked_drai:2275 blocked for more than 120 seconds.
Sep  7 11:07:36 hn kernel: [962681.900819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 11:07:36 hn kernel: [962681.900857] Call Trace:
Sep  7 11:07:36 hn kernel: [962681.900864]  schedule+0x2c/0x70
Sep  7 11:07:36 hn kernel: [962681.900878]  ? spl_kmem_cache_free+0xc0/0x1d0 [spl]
Sep  7 11:07:36 hn kernel: [962681.901000]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:07:36 hn kernel: [962681.901004]  ? evict+0x139/0x1a0
Sep  7 11:07:36 hn kernel: [962681.901007]  ? insert_inode_locked+0x1d8/0x1e0
Sep  7 11:07:36 hn kernel: [962681.901124]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:07:36 hn kernel: [962681.901185]  ? __switch_to_asm+0x41/0x70
Sep  7 11:07:36 hn kernel: [962681.901188]  ? __switch_to_asm+0x35/0x70
Sep  7 11:07:36 hn kernel: [962681.901192]  ? __switch_to_asm+0x35/0x70
Sep  7 11:07:36 hn kernel: [962681.901198]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:07:36 hn kernel: [962681.901207]  ? wake_up_q+0x80/0x80
Sep  7 11:07:36 hn kernel: [962681.901218]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:07:36 hn kernel: [962681.901222]  ret_from_fork+0x35/0x40
Sep  7 11:09:37 hn kernel: [962802.734202]  ? __switch_to_asm+0x41/0x70
Sep  7 11:09:37 hn kernel: [962802.734211]  ? __schedule+0x2dc/0x870
Sep  7 11:09:37 hn kernel: [962802.734226]  ? wake_up_q+0x80/0x80
Sep  7 11:09:37 hn kernel: [962802.734241]  ? __kthread_parkme+0x70/0x70
Sep  7 11:09:37 hn kernel: [962802.734351]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:09:37 hn kernel: [962802.734416] Call Trace:
Sep  7 11:09:37 hn kernel: [962802.734430]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:09:37 hn kernel: [962802.734552]  ? zpl_inode_destroy+0xe/0x10 [zfs]
Sep  7 11:09:37 hn kernel: [962802.734557]  ? iput+0x148/0x210
Sep  7 11:09:37 hn kernel: [962802.734682]  zfs_zget+0x1ad/0x240 [zfs]
Sep  7 11:09:37 hn kernel: [962802.734744]  ? __switch_to+0x471/0x4e0
Sep  7 11:09:37 hn kernel: [962802.734749]  ? __switch_to_asm+0x35/0x70
Sep  7 11:09:37 hn kernel: [962802.734762]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:09:37 hn kernel: [962802.734775]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:11:38 hn kernel: [962923.567782]  ? __switch_to_asm+0x41/0x70
Sep  7 11:11:38 hn kernel: [962923.567791]  ? remove_wait_queue+0x4d/0x60
Sep  7 11:11:38 hn kernel: [962923.567801]  ? wake_up_q+0x80/0x80
Sep  7 11:11:38 hn kernel: [962923.567813]  ? task_done+0xb0/0xb0 [spl]
Sep  7 11:11:38 hn kernel: [962923.567818]  ret_from_fork+0x35/0x40
Sep  7 11:11:38 hn kernel: [962923.567927]       Tainted: P           O      5.0.18-1-pve #1
Sep  7 11:11:38 hn kernel: [962923.567989] z_unlinked_drai D    0  2275      2 0x80000000
Sep  7 11:11:38 hn kernel: [962923.567996]  __schedule+0x2d4/0x870
Sep  7 11:11:38 hn kernel: [962923.568006]  spl_panic+0xf9/0xfb [spl]
Sep  7 11:11:38 hn kernel: [962923.568072]  ? zfs_inode_destroy+0xf8/0x110 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568131]  ? destroy_inode+0x3e/0x60
Sep  7 11:11:38 hn kernel: [962923.568134]  ? iput+0x148/0x210
Sep  7 11:11:38 hn kernel: [962923.568193]  zfs_znode_alloc+0x625/0x680 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568310]  zfs_unlinked_drain_task+0x74/0x100 [zfs]
Sep  7 11:11:38 hn kernel: [962923.568316]  ? __switch_to+0x471/0x4e0
Sep  7 11:11:38 hn kernel: [962923.568319]  ? __switch_to_asm+0x41/0x70
Sep  7 11:11:38 hn kernel: [962923.568324]  ? __schedule+0x2dc/0x870
Sep  7 11:11:38 hn kernel: [962923.568334]  taskq_thread+0x2ec/0x4d0 [spl]
Sep  7 11:11:38 hn kernel: [962923.568341]  kthread+0x120/0x140
Sep  7 11:11:38 hn kernel: [962923.568349]  ? __kthread_parkme+0x70/0x70
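
In a long log like this, it can help to first list which tasks the hung-task detector reported and how often. A minimal sketch with a made-up function name; it relies only on the "INFO: task ... blocked for more than N seconds" wording visible in the log above.

```shell
#!/bin/sh
# hung_tasks is a hypothetical helper name.
# It reads kernel log text on stdin and prints each reported task (name:pid)
# together with the number of hung-task reports for it.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9]* seconds.*/\1/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Usage:
#   journalctl -k | hung_tasks
#   hung_tasks < /var/log/kern.log
```

For the excerpt above this would list z_unlinked_drai:473 and z_unlinked_drai:2275.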
 

hashimji2008

Another server with zero to 1 VMs (average load less than 1, or even 0.5) has passed 38 days of uptime. I think server load is somehow related to this freeze, along with zfs.
 
Aug 6, 2019
10
2
3
36
I didn't get a chance yet to trigger this on my box (accidentally moved a VM to the server that I can't reboot until the weekend, sorry!), however, I noticed that at least in my case, this doesn't seem to be I/O related:

- I accidentally managed to move a VM (2TB of data) to the server, no issues.
- The box in question runs one FreeBSD VM which is used as a storage server and is the backup target for a bunch of other nodes
- all the backups I have ran perfectly fine overnight (4-5TB+ of data moved to this box)
- the server then uploaded 1-2TB to another server for backup

So, at least in my case, I don't feel that this is related to server load or disk throughput/usage - I could be wrong, but it just seems like the scrub is the only trigger here for me.

I'll follow up on recreating the issue on Saturday when I can move the mistake VM again without breaking anything - if there's anything non-destructive I can help with until then, just lmk.

Quick edit: I also found this, not clear if it's related: https://github.com/zfsonlinux/zfs/issues/8664

Rob.
 
Aug 6, 2019
10
2
3
36
Follow-up thought - it would be interesting to check whether the scrub completes okay when there are no workloads running on the Proxmox node. I can test this at the weekend too (maybe sooner if I can move some things around).
 
