[SOLVED] Delete stuck snapshot

CFQ was removed/replaced by BFQ in kernel 5.0. The wiki article hadn't been updated since 2019. But I'd try using a bandwidth limit first.

EDIT: The wiki article has now been updated.
 
Also note that you might need to modprobe bfq (configure it in /etc/modules-load.d to make it persistent across boots) to have it show up.
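A minimal sketch of both steps (the file name under /etc/modules-load.d is arbitrary):

modprobe bfq
echo bfq > /etc/modules-load.d/bfq.conf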
 
So Unspec, maybe you should try to change the scheduler algorithm to CFQ?
nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... elevator=bfq"

update-grub
Unfortunately, this is not possible anymore either. The elevator kernel command-line option no longer has any effect; nowadays this is done via a udev rule instead. See the (now updated) wiki article: https://pve.proxmox.com/wiki/IO_Scheduler
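A minimal sketch of such a rule (the file name and the sd[a-z] match are just examples - adjust them to your disks and treat the wiki as authoritative):

# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"

Afterwards, apply it with udevadm control --reload && udevadm trigger, and check the active scheduler with cat /sys/block/sda/queue/scheduler.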
 
So far, limiting bandwidth to 50 MB/s has prevented any failures. Will continue to monitor. Note that my failures only ever affected CTs - the VM snapshot backups never failed.
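For anyone wanting to do the same, one place to set a node-wide limit is /etc/vzdump.conf, where bwlimit is given in KiB/s (so roughly 51200 for 50 MB/s) - a sketch:

# /etc/vzdump.conf
bwlimit: 51200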
 
Sigh. Didn't work. Failed again today at 7pm, same problem. Seems like after the failure, my IO delay stays at around 8-10%.
 
Hmmm... And what activity is generating that IO?

It seems like it's entirely centered around my uptime-kuma container. If I shut it down, IO delay stays below 2% consistently. It's doing something like 60 MB/s of reads. After multiple restarts of that container, it's down to 20 MB/s now - no idea why that container is being such a disk hog.

I've limited backup bandwidth to 25 MB/s and will continue monitoring.

Edit: Failed even at 25 MB/s.
 
It seems like it's entirely centered around my uptime-kuma container. If I shut it down, IO delay stays below 2% consistently. It's doing something like 60 MB/s of reads. After multiple restarts of that container, it's down to 20 MB/s now - no idea why that container is being such a disk hog.

I've limited backup bandwidth to 25 MB/s and will continue monitoring.
What is the output of zpool status -v? How full is your pool? Is there anything in the system logs/journal around the time of the issue? Could you share the configuration of the problematic container pct config <ID>?
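For the pool usage and the logs, something like this would do (the journalctl time window is only an example - adjust it to the time of the failed backup):

zpool list
journalctl --since "2025-02-09 18:30" --until "2025-02-09 20:00"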
 
What is the output of zpool status -v? How full is your pool? Is there anything in the system logs/journal around the time of the issue? Could you share the configuration of the problematic container pct config <ID>?

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:11:19 with 0 errors on Sun Feb 9 00:35:20 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda3       ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: rtank
 state: ONLINE
  scan: scrub repaired 0B in 03:48:04 with 0 errors on Sun Feb 9 04:12:06 2025
config:

        NAME                                 STATE     READ WRITE CKSUM
        rtank                                ONLINE       0     0     0
          ata-ST20000NM007D-3DJ103_ZVT5SS7C  ONLINE       0     0     0

errors: No known data errors

There is no single pct config to share, really - I'd have to send you my entire list of containers, since it's not one particular container that has issues; it hits a random container each time.
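If it would help, something like this could dump them all in one go (a sketch using the standard pct tooling):

for ct in $(pct list | awk 'NR>1 {print $1}'); do echo "== CT $ct =="; pct config $ct; done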

I will try kernel 6.11. It does feel like this issue started happening out of nowhere within the last few months, with no real changes on my end.
 
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:11:19 with 0 errors on Sun Feb 9 00:35:20 2025
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sda3       ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
Are the disk speeds very different between these two? Of course, it shouldn't lead to issues, but I also wouldn't recommend having them in a mirror in that case.

I also noticed that your storage is called pbs_local - is PBS running as a stand-alone node, a VM, or a container?
 
Are the disk speeds very different between these two? Of course, it shouldn't lead to issues, but I also wouldn't recommend having them in a mirror in that case.

I also noticed that your storage is called pbs_local - is PBS running as a stand-alone node, a VM, or a container?

The NVMe is running off a PCIe 1x riser card, so it is definitely faster, but as far as I am aware there should be no issues with doing so besides "wasting" the NVMe speed.

Standalone node. I called it local because it's on my local network, as opposed to a remote site.
 
Kernel 6.11 did not fix it. I am stumped at this point. It looks like there's potentially an underlying ZFS bug.
 
The next time the issue pops up, could you run fgrep -e vzsnap -e vzdump /proc/*/mounts on the host? This is to check whether there is still a process using the dataset (of course, if a backup is still running, it will show up there too), so you should check which process the PID belongs to afterwards.
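For example, if the match came from /proc/12345/mounts (a hypothetical PID), you could then identify the process with:

ps -fp 12345
readlink /proc/12345/exe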