Disable fs-freeze on snapshot backups

When I see this, it is usually because fs-freeze cannot freeze the processes running inside the VM. The VM then hangs while it waits for a response from the fs-freeze command that never arrives. Usually I see this on cPanel servers: cPanel secures the /tmp folder, which prevents fs-freeze from working. Somewhere you have a process that cannot be frozen. If cPanel is in use, try running /scripts/securetmp, answer N, Y, N, and take a backup again.
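
For reference, the guest-agent fs-freeze behaviour from the thread title can be disabled per VM from the CLI as well as in the GUI; a minimal sketch (VMID 100 is a placeholder, and the agent property string is the same one that appears in the VM configs further down this thread):
Bash:
# keep the guest agent enabled, but skip fs-freeze/fs-thaw during snapshot backups
qm set 100 --agent enabled=1,freeze-fs-on-backup=0
# verify the setting
qm config 100 | grep ^agent
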
Thanks for the additional insight.

No cPanel in use in this case, so unfortunately no smoking gun.

The only changes in our case are some additional sticks of server RAM, a Ceph upgrade and a Proxmox upgrade. Guests unchanged.
 
I still think it will be a process that's running all of the time inside the VM, especially since disabling fs-freeze works for you.
 
A short update from me.
I moved the local disks of the affected VMs from Ceph RBD to local ZFS, and since then I have had no more problems.
 
We have the same problem while taking snapshots: VMs freeze and can't thaw anymore.
The VM disks are on Ceph RBD, and we tried:
  • qemu-guest-agent on & off
  • the guest agent freeze/thaw option on & off
  • QEMU async I/O native/threads/io_uring
None of it worked. PVE has been updated to 8.1; before, we were on 7.3 and did not experience these frequent freezes on snapshot.
For now we have to disable snapshots, which is not comfortable at all.
 
Well yes, I think so too: we have a problem between QEMU <-> Ceph.
Comment #9 from drjaymz@ put me on the right track.

Did some tests: start the VM, wait for it to boot, make a snapshot, then try to run top or any other program, then reboot/power off from the guest:
  • VirtIO SCSI Single + IOThread + AIO Native : KO
  • VirtIO SCSI Single + IOThread + AIO Threads : KO
  • VirtIO SCSI Single + IOThread + AIO io_uring : KO
  • VirtIO SCSI + IOThread (not used/warning) + AIO Native : OK
  • VirtIO SCSI + IOThread (not used/warning) + AIO Threads : OK
  • VirtIO SCSI + IOThread (not used/warning) + AIO io_uring : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO Native : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO Threads : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO io_uring : OK


With QEMU emulator version 8.1.2 (pve-qemu-kvm_8.1.2-5) on PVE 8.1.3, this mode is OK:

SCSI Controller Type: VirtIO SCSI single
Disk options: do not check IO Thread (yes, I know this was OK before and for years)
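
For reference, the same change can be made from the CLI instead of the GUI; a rough sketch assuming VMID 100 and a Ceph-backed scsi0 (keep your own volume name and disk options, only flip iothread), followed by a full power off/on so QEMU restarts with the new settings:
Bash:
# keep the "VirtIO SCSI single" controller, but re-declare the disk with iothread=0
qm set 100 --scsi0 cephpool1:vm-100-disk-0,aio=native,cache=writeback,discard=on,iothread=0,ssd=1
# alternatively, switch to plain "VirtIO SCSI": qm set 100 --scsihw virtio-scsi-pci
# stop and start the VM (a reboot from inside the guest is not enough)
qm stop 100 && qm start 100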
 

This is terrific data. Thanks so much for sharing.

Can I confirm that you're on Ceph 17.2.7?
 
Ceph is running the Reef release from Proxmox (ceph --version):
ceph version 18.2.0 (d724bab467c1c4e2a89e8070f01037ae589a37ca) reef (stable)

This sounds interesting. If you can, run a test with at least one VM: uncheck iothread, then power the VM off and boot it again for the change to take effect.
 
Will test tomorrow and report back.

I have a support ticket open and linked this thread, specifically your test results, so I am hopeful for additional info soon and for the devs to attempt to replicate it.
 
Unfortunately, I am unable to replicate it. I can't even trigger it when I flip freeze/thaw back on. I'm still hunting for the exact cause, though it does appear to affect my Windows guests more than my Linux ones.
 
It's working today with the guest agent off. I will see tonight if it is still working for one VM.
I will not test any further for now, because the VMs have been getting locked up far too often over the last 20 days...
 
Hi,
may I join the club?
I think I hit the same issue as you.

Short version: I use local ZFS and proxmox-autosnap to create snapshots every hour on PVE 8.1.
My guest UCS installation (kernel 4.19 or similar) produces dead jbd2/sdc1-8 processes around this hourly snapshot.
The affected drive is random (I have 7 drives) in this setup.
Sometimes I have a live SSH session open and can see the filesystem issues (jbd2 dead, then the processes depending on it also die: systemd-journald, slapd, and so on, depending on which drive "died" and which system function used that drive, e.g. logs in a separate partition, slapd in a separate partition, etc.).

The longer version is on the Univention help forum.

I have this issue with and without qemu-guest-agent.
I have a test running now with the following mount options inside the guest (discard removed):
Code:
/  ext4  errors=remount-ro,user_xattr
/boot/efi  vfat  umask=0077
/home  ext4  noatime,user_xattr,usrquota
/var/flexshares  ext4  noatime,user_xattr
/var/lib/univention-ldap  ext4  noatime,user_xattr
/var/log  ext4  noatime,user_xattr
/var/univention-backup  ext4  noatime,user_xattr
none  swap  sw
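
Since discard was removed from the mounts, trimming can still be done out of band; a small sketch using standard util-linux tooling inside the guest (assuming fstrim and its systemd timer are available there):
Code:
# one-off trim of all mounted filesystems that support it
fstrim -av
# or enable the periodic systemd timer instead of the discard mount option
systemctl enable --now fstrim.timer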

In my case, creating a pure ZFS snapshot with zfs snap rpool@whatever does not cause trouble; at least, I do not remember any issue with it in the past 10+ years.
With backups, and PBS backups in particular, I have had issues in the past, especially that they stop the VM for a while, which is not good when I run services that need 100% uptime.

(proxmox-autosnap is useful because those snapshots appear in the web GUI and the admin can delete or roll them back there; one admin has no SSH access, so he cannot remove snapshots that are invisible in the GUI. Not related to the original issue, just a note on why we need autosnap.)
 
Could you post your VM config, so we can compare?

Here are two VM configs, one Linux and one Windows, which were defective and are now working
(before, with iothread=1, the VMs froze):

VM Linux:
Bash:
root@pve-xxx-xxx-xxx-xxx:~# qm config 203
agent: 1
balloon: 2048
bootdisk: scsi0
cores: 4
cpu: host
description: xxx
ide2: none,media=cdrom
memory: 4096
name: xxx
net0: virtio=08:00:27:EA:EF:A6,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
parent: qm-auto-snap-daily-2024-01-09-1058
protection: 1
scsi0: cephpool1:vm-203-disk-0,aio=native,cache=writeback,discard=on,iothread=0,size=80G,ssd=1
scsi1: cephpool1:vm-203-disk-1,aio=native,cache=writeback,discard=on,iothread=0,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1bf057a1-8f6e-4535-81a9-011330b25a1a
sockets: 1
startup: order=99,up=60
vmgenid: 6362bcac-0065-468a-9ba3-9252afcfdf64

VM Windows:
Bash:
root@pve-yyy-yyy-yyy-yyy:~# qm config 206
agent: enabled=0,freeze-fs-on-backup=0
balloon: 4096
boot: cdn
bootdisk: scsi0
cores: 4
cpu: host
description: yyy
ide2: none,media=cdrom
ide3: none,media=cdrom
memory: 8192
name: yyy
net0: virtio=08:00:27:49:C2:8B,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
parent: qm-auto-snap-daily-2024-01-09-0501
protection: 1
scsi0: cephpool1:vm-206-disk-1,aio=native,cache=writeback,discard=on,iothread=0,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=2c757712-9725-4b1b-8892-4ee46416ffb3
sockets: 1
startup: order=99,up=60
vmgenid: 72429d1f-689e-4458-97b3-a3a06757a9e4
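
One quick way to double-check what QEMU will actually be started with for these two VMs (the VMIDs are the ones from the configs above; run on the PVE node):
Bash:
# inspect the generated QEMU command line for iothread/aio settings
qm showcmd 203 | tr ' ' '\n' | grep -E 'iothread|aio='
qm showcmd 206 | tr ' ' '\n' | grep -E 'iothread|aio='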
 
@roms2000

Here is my config:
Code:
root@xxxx:/etc/cron.d# qm config 100
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
efidisk0: local-zfs:vm-3441-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
memory: 6144
meta: creation-qemu=8.1.2,ctime=1704366686
name: unidc
net0: virtio=BC:xx:xx:xx:xx:xx,bridge=vmbr10,firewall=1,mtu=1400
numa: 0
ostype: l26
parent: autohourly_2024_01_09T19_05_28
scsi0: local-zfs:vm-3441-disk-1,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
scsi1: local-zfs:vm-3441-disk-2,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
scsi2: local-zfs:vm-3441-disk-3,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi3: local-zfs:vm-3441-disk-4,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi4: local-zfs:vm-3441-disk-5,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi5: local-zfs:vm-3441-disk-6,cache=writeback,discard=on,iothread=1,size=16G,ssd=1
scsi6: local-zfs:vm-3441-disk-7,cache=writeback,discard=on,iothread=1,size=256G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7b903100-f85d-4fe2-80f4-c391595b510d
sockets: 1
tablet: 0
vmgenid: fxxxxxxxx-yyyy-dddd-tttt-00000000xxx

At this moment the hourly proxmox-autosnap has run about 10 times since this morning; so far, so good.
I removed the discard mount option from all filesystems in the guest's fstab.
 
@roms2000
have you checked the syslog after reboot? Is there anything related to event ID 129 in the Windows event log?
For Windows, yes, I have a lot of these from vioscsi: "Reset to device, \Device\RaidPort0, was issued."
On Linux, I have messages like "INFO: task XXX blocked for more than 120 seconds".
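
On the Linux side, those hung-task messages can be pulled out of the guest's kernel log; a small sketch (run inside the guest):
Bash:
# kernel ring buffer with human-readable timestamps
dmesg -T | grep -iE 'blocked for more than|hung_task'
# same search against the journal, if persistent logging is enabled
journalctl -k | grep -iE 'blocked for more than|hung_task'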

Did you try unchecking IO Thread in the disk options of the config you posted? Or switching to "VirtIO SCSI" instead of "VirtIO SCSI Single"? Then power off the VM and start it again to try it.
 
For Windows, yes, I have a lot of these from vioscsi: "Reset to device, \Device\RaidPort0, was issued."
On Linux, I have messages like "INFO: task XXX blocked for more than 120 seconds".

Then check this thread and the links to the virtio driver git (there is some advice from the devs there).
If you are able to reproduce this issue easily, it would be very helpful for finding a solution or a workaround.
 
@Whatever: I will look at the thread.


Also, the changelog for the latest Proxmox QEMU update says:
Code:
pve-qemu-kvm (8.1.2-6) bookworm; urgency=medium

  * revert attempted fix to avoid rare issue with stuck guest IO when using
    iothread, because it caused a much more common issue with iothreads
    consuming too much CPU

 -- Proxmox Support Team <support@proxmox.com>  Fri, 15 Dec 2023 14:22:06 +0100

pve-qemu-kvm (8.1.2-5) bookworm; urgency=medium

  * backport workaround for stuck guest IO with iothread and VirtIO block/SCSI
    in some rare edge cases

  * backport fix for potential deadlock when issuing the "resize" QMP command
    for a disk that is using iothread

 -- Proxmox Support Team <support@proxmox.com>  Mon, 11 Dec 2023 16:58:27 +0100

So I will update too.

EDIT: the update to pve-qemu-kvm 8.1.2-6 seems to work for now, even with IO Thread enabled as described above.
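
For anyone else following along, checking the installed build and pulling that update is standard apt on the PVE node (package name taken from the changelog above); note that running VMs keep using the old QEMU binary until they are powered off and started again:
Bash:
pveversion -v | grep pve-qemu-kvm
apt update && apt install pve-qemu-kvm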
 
