Disable fs-freeze on snapshot backups

When I see this, it is usually because fs-freeze cannot freeze the processes running inside the VM. The VM then hangs while it waits for a response from the fs-freeze command that never arrives. Usually I see this on cPanel servers: cPanel secures the /tmp folder, which prevents fs-freeze from working. Somewhere you have a process that cannot be frozen. If cPanel is in use, try running /scripts/securetmp, answer N, Y, N, and take a backup again.
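
For reference, the guest-agent fs-freeze behaviour from the thread title can be disabled per VM from the CLI as well as in the GUI; a minimal sketch (VMID 100 is a placeholder, and the agent property string is the same one that appears in the VM configs further down this thread):
Bash:
# keep the guest agent enabled, but skip fs-freeze/fs-thaw during snapshot backups
qm set 100 --agent enabled=1,freeze-fs-on-backup=0
# verify the setting
qm config 100 | grep ^agent
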
Thanks for the additional insight.

No cPanel in use in this case, so unfortunately no smoking gun.

The only changes in our case are some additional sticks of server RAM, a Ceph upgrade and a Proxmox upgrade. Guests unchanged.
 
I still think it will be a process that's running all of the time inside the VM, especially since disabling fs-freeze works for you.
 
A short update from me.
I moved the local disks of the affected VMs from Ceph RBD to local ZFS, and since then I have had no more problems.
 
We have the same problem while taking snapshots: VMs freeze and can't thaw anymore.
The VM disks are on Ceph RBD, and we tried:
  • qemu-guest-agent on & off
  • the guest agent freeze/thaw option on & off
  • QEMU async I/O native/threads/io_uring
None of it worked. PVE has been updated to 8.1; before, we were on 7.3 and did not experience these frequent freezes on snapshot.
For now we have to disable snapshots, which is not comfortable at all.
 
Well yes, I think so too: we have a problem between QEMU <-> Ceph.
Comment #9 from drjaymz@ put me on the right track.

Did some tests: start the VM, wait for it to boot, make a snapshot, then try to run top or any other program, then reboot/power off from the guest:
  • VirtIO SCSI Single + IOThread + AIO Native : KO
  • VirtIO SCSI Single + IOThread + AIO Threads : KO
  • VirtIO SCSI Single + IOThread + AIO io_uring : KO
  • VirtIO SCSI + IOThread (not used/warning) + AIO Native : OK
  • VirtIO SCSI + IOThread (not used/warning) + AIO Threads : OK
  • VirtIO SCSI + IOThread (not used/warning) + AIO io_uring : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO Native : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO Threads : OK
  • VirtIO SCSI Single + IOThread unchecked + AIO io_uring : OK


With QEMU emulator version 8.1.2 (pve-qemu-kvm_8.1.2-5) on PVE 8.1.3, this mode is OK:

SCSI Controller Type: VirtIO SCSI single
Disk options: do not check IO Thread (yes, I know this was OK before and for years)
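
For reference, the same change can be made from the CLI instead of the GUI; a rough sketch assuming VMID 100 and a Ceph-backed scsi0 (keep your own volume name and disk options, only flip iothread), followed by a full power off/on so QEMU restarts with the new settings:
Bash:
# keep the "VirtIO SCSI single" controller, but re-declare the disk with iothread=0
qm set 100 --scsi0 cephpool1:vm-100-disk-0,aio=native,cache=writeback,discard=on,iothread=0,ssd=1
# alternatively, switch to plain "VirtIO SCSI": qm set 100 --scsihw virtio-scsi-pci
# stop and start the VM (a reboot from inside the guest is not enough)
qm stop 100 && qm start 100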
 

This is terrific data. Thanks so much for sharing.

Can I confirm that you're on Ceph 17.2.7?
 
Ceph is running the Reef release from Proxmox (ceph --version):
ceph version 18.2.0 (d724bab467c1c4e2a89e8070f01037ae589a37ca) reef (stable)

This sounds interesting. If you can, run a test with at least one VM: uncheck iothread, then power the VM off and boot it again for the change to take effect.
 
Will test tomorrow and report back.

I have a support ticket open and linked this thread, specifically your test results, so I am hopeful for additional info soon and for the devs to attempt to replicate it.
 
Unfortunately, I am unable to replicate it. I can't even trigger it when I flip freeze/thaw back on. I'm still hunting for the exact cause, though it does appear to affect my Windows guests more than my Linux ones.
 
It's working today with the guest agent off. I will see tonight if it is still working for one VM.
I will not test any further for now, because the VMs have been getting locked up far too often over the last 20 days...
 
Hi,
may I join the club?
I think I hit the same issue as you.

Short version: I use local ZFS and proxmox-autosnap to create snapshots every hour on PVE 8.1.
My guest UCS installation (kernel 4.19 or similar) produces dead jbd2/sdc1-8 processes around this hourly snapshot.
The affected drive is random (I have 7 drives) in this setup.
Sometimes I have a live SSH session open and can see the filesystem issues (jbd2 dead, then the processes depending on it also die: systemd-journald, slapd, and so on, depending on which drive "died" and which system function used that drive, e.g. logs in a separate partition, slapd in a separate partition, etc.).

The longer version is on the Univention help forum.

I have this issue with and without qemu-guest-agent.
I have a test running now with the following mount options inside the guest (discard removed):
Code:
/  ext4  errors=remount-ro,user_xattr
/boot/efi  vfat  umask=0077
/home  ext4  noatime,user_xattr,usrquota
/var/flexshares  ext4  noatime,user_xattr
/var/lib/univention-ldap  ext4  noatime,user_xattr
/var/log  ext4  noatime,user_xattr
/var/univention-backup  ext4  noatime,user_xattr
none  swap  sw
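
Since discard was removed from the mounts, trimming can still be done out of band; a small sketch using standard util-linux tooling inside the guest (assuming fstrim and its systemd timer are available there):
Code:
# one-off trim of all mounted filesystems that support it
fstrim -av
# or enable the periodic systemd timer instead of the discard mount option
systemctl enable --now fstrim.timer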

In my case, creating a pure ZFS snapshot with zfs snap rpool@whatever does not cause trouble; at least, I do not remember any issue with it in the past 10+ years.
With backups, and PBS backups in particular, I have had issues in the past, especially that they stop the VM for a while, which is not good when I run services that need 100% uptime.

(proxmox-autosnap is useful because those snapshots appear in the web GUI and the admin can delete or roll them back there; one admin has no SSH access, so he cannot remove snapshots that are invisible in the GUI. Not related to the original issue, just a note on why we need autosnap.)
 
Could you post your VM config, so we can compare?

Here are two VM configs, one Linux and one Windows, which were defective and are now working
(before, with iothread=1, the VMs froze):

VM Linux:
Bash:
root@pve-xxx-xxx-xxx-xxx:~# qm config 203
agent: 1
balloon: 2048
bootdisk: scsi0
cores: 4
cpu: host
description: xxx
ide2: none,media=cdrom
memory: 4096
name: xxx
net0: virtio=08:00:27:EA:EF:A6,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
parent: qm-auto-snap-daily-2024-01-09-1058
protection: 1
scsi0: cephpool1:vm-203-disk-0,aio=native,cache=writeback,discard=on,iothread=0,size=80G,ssd=1
scsi1: cephpool1:vm-203-disk-1,aio=native,cache=writeback,discard=on,iothread=0,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1bf057a1-8f6e-4535-81a9-011330b25a1a
sockets: 1
startup: order=99,up=60
vmgenid: 6362bcac-0065-468a-9ba3-9252afcfdf64

VM Windows:
Bash:
root@pve-yyy-yyy-yyy-yyy:~# qm config 206
agent: enabled=0,freeze-fs-on-backup=0
balloon: 4096
boot: cdn
bootdisk: scsi0
cores: 4
cpu: host
description: yyy
ide2: none,media=cdrom
ide3: none,media=cdrom
memory: 8192
name: yyy
net0: virtio=08:00:27:49:C2:8B,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
parent: qm-auto-snap-daily-2024-01-09-0501
protection: 1
scsi0: cephpool1:vm-206-disk-1,aio=native,cache=writeback,discard=on,iothread=0,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=2c757712-9725-4b1b-8892-4ee46416ffb3
sockets: 1
startup: order=99,up=60
vmgenid: 72429d1f-689e-4458-97b3-a3a06757a9e4
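
One quick way to double-check what QEMU will actually be started with for these two VMs (the VMIDs are the ones from the configs above; run on the PVE node):
Bash:
# inspect the generated QEMU command line for iothread/aio settings
qm showcmd 203 | tr ' ' '\n' | grep -E 'iothread|aio='
qm showcmd 206 | tr ' ' '\n' | grep -E 'iothread|aio='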
 
@roms2000

Here is my config:
Code:
root@xxxx:/etc/cron.d# qm config 100
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
efidisk0: local-zfs:vm-3441-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
memory: 6144
meta: creation-qemu=8.1.2,ctime=1704366686
name: unidc
net0: virtio=BC:xx:xx:xx:xx:xx,bridge=vmbr10,firewall=1,mtu=1400
numa: 0
ostype: l26
parent: autohourly_2024_01_09T19_05_28
scsi0: local-zfs:vm-3441-disk-1,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
scsi1: local-zfs:vm-3441-disk-2,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
scsi2: local-zfs:vm-3441-disk-3,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi3: local-zfs:vm-3441-disk-4,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi4: local-zfs:vm-3441-disk-5,cache=writeback,discard=on,iothread=1,size=32G,ssd=1
scsi5: local-zfs:vm-3441-disk-6,cache=writeback,discard=on,iothread=1,size=16G,ssd=1
scsi6: local-zfs:vm-3441-disk-7,cache=writeback,discard=on,iothread=1,size=256G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7b903100-f85d-4fe2-80f4-c391595b510d
sockets: 1
tablet: 0
vmgenid: fxxxxxxxx-yyyy-dddd-tttt-00000000xxx

At this moment the hourly proxmox-autosnap has run about 10 times since this morning; so far, so good.
I removed the discard mount option from all filesystems in the guest's fstab.
 
@roms2000
have you checked the syslog after reboot? Is there anything related to event ID 129 in the Windows event log?
For Windows, yes, I have a lot of these from vioscsi: "Reset to device, \Device\RaidPort0, was issued."
On Linux, I have messages like "INFO: task XXX blocked for more than 120 seconds".
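
On the Linux side, those hung-task messages can be pulled out of the guest's kernel log; a small sketch (run inside the guest):
Bash:
# kernel ring buffer with human-readable timestamps
dmesg -T | grep -iE 'blocked for more than|hung_task'
# same search against the journal, if persistent logging is enabled
journalctl -k | grep -iE 'blocked for more than|hung_task'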

Did you try unchecking IO Thread in the disk options of the config you posted? Or switching to "VirtIO SCSI" instead of "VirtIO SCSI Single"? Then power off the VM and start it again to try it.
 
For Windows, yes, I have a lot of these from vioscsi: "Reset to device, \Device\RaidPort0, was issued."
On Linux, I have messages like "INFO: task XXX blocked for more than 120 seconds".

Then check this thread and the links to the virtio driver git (there is some advice from the devs there).
If you are able to reproduce this issue easily, it would be very helpful for finding a solution or a workaround.
 
@Whatever: I will look at the thread.


Also, the changelog for the latest Proxmox QEMU update says:
Code:
pve-qemu-kvm (8.1.2-6) bookworm; urgency=medium

  * revert attempted fix to avoid rare issue with stuck guest IO when using
    iothread, because it caused a much more common issue with iothreads
    consuming too much CPU

 -- Proxmox Support Team <support@proxmox.com>  Fri, 15 Dec 2023 14:22:06 +0100

pve-qemu-kvm (8.1.2-5) bookworm; urgency=medium

  * backport workaround for stuck guest IO with iothread and VirtIO block/SCSI
    in some rare edge cases

  * backport fix for potential deadlock when issuing the "resize" QMP command
    for a disk that is using iothread

 -- Proxmox Support Team <support@proxmox.com>  Mon, 11 Dec 2023 16:58:27 +0100

So I will update too.

EDIT: the update to pve-qemu-kvm 8.1.2-6 seems to work for now, even with IO Thread enabled as described above.
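
For anyone else following along, checking the installed build and pulling that update is standard apt on the PVE node (package name taken from the changelog above); note that running VMs keep using the old QEMU binary until they are powered off and started again:
Bash:
pveversion -v | grep pve-qemu-kvm
apt update && apt install pve-qemu-kvm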
 
