[SOLVED] Delete stuck snapshot

CFQ was removed/replaced by BFQ in kernel 5.0. The wiki article hadn't been updated since 2019. But I'd try using a bandwidth limit first.

EDIT: The wiki article has now been updated.
 
Also note that you might need to modprobe bfq (configure it in /etc/modules-load.d to make it persistent across boots) to have it show up.
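A minimal sketch of both steps (the file name under /etc/modules-load.d is arbitrary):

modprobe bfq
echo bfq > /etc/modules-load.d/bfq.conf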
 
So Unspec, maybe you should try to change the scheduler algorithm to CFQ?
nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... elevator=bfq"

update-grub
Unfortunately, this is not possible anymore either. The elevator kernel command-line option no longer has any effect; nowadays this is done via a udev rule instead. See the (now updated) wiki article: https://pve.proxmox.com/wiki/IO_Scheduler
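A minimal sketch of such a rule (the file name and the sd[a-z] match are just examples - adjust them to your disks and treat the wiki as authoritative):

# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"

Afterwards, apply it with udevadm control --reload && udevadm trigger, and check the active scheduler with cat /sys/block/sda/queue/scheduler.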
 
So far, limiting bandwidth to 50 MB/s has prevented any failures. Will continue to monitor. Note that my failures only ever affected CTs - the VM snapshot backups never failed.
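For anyone wanting to do the same, one place to set a node-wide limit is /etc/vzdump.conf, where bwlimit is given in KiB/s (so roughly 51200 for 50 MB/s) - a sketch:

# /etc/vzdump.conf
bwlimit: 51200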
 
Sigh. Didn't work. Failed again today at 7pm, same problem. Seems like after the failure, my IO delay stays at around 8-10%.
 
Hmmm... And what activity is generating that IO?

It seems like it's entirely centered around my uptime-kuma container. If I shut it down, IO delay stays below 2% consistently. It's doing something like 60 MB/s of reads. After multiple restarts of that container, it's down to 20 MB/s now - no idea why that container is being such a disk hog.

I've limited backup bandwidth to 25 MB/s and will continue monitoring.

Edit: Failed even at 25 MB/s.
 
It seems like it's entirely centered around my uptime-kuma container. If I shut it down, IO delay stays below 2% consistently. It's doing something like 60 MB/s of reads. After multiple restarts of that container, it's down to 20 MB/s now - no idea why that container is being such a disk hog.

I've limited backup bandwidth to 25 MB/s and will continue monitoring.
What is the output of zpool status -v? How full is your pool? Is there anything in the system logs/journal around the time of the issue? Could you share the configuration of the problematic container pct config <ID>?
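For the pool usage and the logs, something like this would do (the journalctl time window is only an example - adjust it to the time of the failed backup):

zpool list
journalctl --since "2025-02-09 18:30" --until "2025-02-09 20:00"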
 
What is the output of zpool status -v? How full is your pool? Is there anything in the system logs/journal around the time of the issue? Could you share the configuration of the problematic container pct config <ID>?

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:11:19 with 0 errors on Sun Feb 9 00:35:20 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda3       ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: rtank
 state: ONLINE
  scan: scrub repaired 0B in 03:48:04 with 0 errors on Sun Feb 9 04:12:06 2025
config:

        NAME                                 STATE     READ WRITE CKSUM
        rtank                                ONLINE       0     0     0
          ata-ST20000NM007D-3DJ103_ZVT5SS7C  ONLINE       0     0     0

errors: No known data errors

There is no single pct config to share, really - I'd have to send you my entire list of containers, since it's not one particular container that has issues; it hits a random container each time.
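If it would help, something like this could dump them all in one go (a sketch using the standard pct tooling):

for ct in $(pct list | awk 'NR>1 {print $1}'); do echo "== CT $ct =="; pct config $ct; done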

I will try kernel 6.11. It does feel like this issue started happening out of nowhere within the last few months, with no real changes on my end.
 
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:11:19 with 0 errors on Sun Feb 9 00:35:20 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda3       ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
Are the disk speeds very different between these two? Of course, it shouldn't lead to issues, but I also wouldn't recommend having them in a mirror in that case.

I also noticed that your storage is called pbs_local - is PBS running as a stand-alone node, a VM, or a container?
 
Are the disk speeds very different between these two? Of course, it shouldn't lead to issues, but I also wouldn't recommend having them in a mirror in that case.

I also noticed that your storage is called pbs_local - is PBS running as a stand-alone node, a VM, or a container?

The NVMe is running off a PCIe 1x riser card, so it is definitely faster, but as far as I am aware there should be no issues with doing so besides "wasting" the NVMe speed.

Standalone node. I called it local because it's on my local network, as opposed to a remote site.
 
Kernel 6.11 did not fix it. I am stumped at this point. It looks like there's potentially an underlying ZFS bug.
 
The next time the issue pops up, could you run fgrep -e vzsnap -e vzdump /proc/*/mounts on the host? This is to check whether there is still a process using the dataset (of course, if a backup is still running, it will show up there too), so you should check which process the PID belongs to afterwards.
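For example, if the match came from /proc/12345/mounts (a hypothetical PID), you could then identify the process with:

ps -fp 12345
readlink /proc/12345/exe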