Disable fs-freeze on snapshot backups

pongraczi · Jan 10, 2024

@roms2000
Thanks for the update!
After I removed the discard option from the guest's mount options, 25 snapshots have run so far and the server is still up and running. I even issued several fstrim commands, which completed in almost no time.
Maybe this only lowers the risk of stuck I/O.

Anyway, based on the changelog and your results, it could explain some weird issues I experienced, so a PVE upgrade is scheduled and I need to rethink all my KVM guest configs.

privnote · Jan 11, 2024

Hi,
I have had no more problems since moving the disks of the problematic VMs to local storage.
The 3-4 VMs that had problems are still on a local ZFS storage.
Almost all other VMs are on Ceph RBD HDD / NVMe storage.

I make a daily backup with PBS and occasionally take snapshots, sometimes of up to 50 VMs within minutes.
I have never had any problems when creating snapshots, only after creating a backup with PBS.

I am using Proxmox 8.1 and Ceph 17.2
Debian Buster is installed on most VMs.
For the problematic VMs it was Debian Buster and Bullseye

davemcl · Jan 11, 2024

The last 2 times I had this issue I was only backing up 2 VMs concurrently to PBS. The hardware is also over-specced for this task, so I'm not convinced it's a performance issue.

pongraczi · Jan 11, 2024

Just an update: the KVM server has been up and running for 49 hours without any jbd2 hangups or disk lockups. It gets hourly qm snapshots; earlier, the server would already have died after 12-24 hours. I have not upgraded Proxmox yet (still PVE 8.1.3).
Reminder: I removed the discard from the mount options inside the kvm guest.

Code:

base-files/stable 12.4+deb12u4 amd64 [upgradable from: 12.4+deb12u2]
curl/stable-security 7.88.1-10+deb12u5 amd64 [upgradable from: 7.88.1-10+deb12u4]
distro-info-data/stable 0.58+deb12u1 all [upgradable from: 0.58]
gnutls-bin/stable 3.7.9-2+deb12u1 amd64 [upgradable from: 3.7.9-2]
ifupdown2/stable 3.2.0-1+pmx8 all [upgradable from: 3.2.0-1+pmx7]
libcurl3-gnutls/stable-security 7.88.1-10+deb12u5 amd64 [upgradable from: 7.88.1-10+deb12u4]
libcurl4/stable-security 7.88.1-10+deb12u5 amd64 [upgradable from: 7.88.1-10+deb12u4]
libgnutls-dane0/stable 3.7.9-2+deb12u1 amd64 [upgradable from: 3.7.9-2]
libgnutls30/stable 3.7.9-2+deb12u1 amd64 [upgradable from: 3.7.9-2]
libgnutlsxx30/stable 3.7.9-2+deb12u1 amd64 [upgradable from: 3.7.9-2]
libnss-systemd/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
libnvpair3linux/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]
libpam-systemd/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
libperl5.36/stable 5.36.0-7+deb12u1 amd64 [upgradable from: 5.36.0-7]
libproxmox-rs-perl/stable 0.3.3 amd64 [upgradable from: 0.3.1]
libsystemd-shared/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
libsystemd0/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
libudev1/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
libuutil3linux/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]
libzfs4linux/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]
libzpool5linux/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]
lxcfs/stable 5.0.3-pve4 amd64 [upgradable from: 5.0.3-pve3]
openssh-client/stable-security 1:9.2p1-2+deb12u2 amd64 [upgradable from: 1:9.2p1-2+deb12u1]
openssh-server/stable-security 1:9.2p1-2+deb12u2 amd64 [upgradable from: 1:9.2p1-2+deb12u1]
openssh-sftp-server/stable-security 1:9.2p1-2+deb12u2 amd64 [upgradable from: 1:9.2p1-2+deb12u1]
perl-base/stable 5.36.0-7+deb12u1 amd64 [upgradable from: 5.36.0-7]
perl-modules-5.36/stable 5.36.0-7+deb12u1 all [upgradable from: 5.36.0-7]
perl/stable 5.36.0-7+deb12u1 amd64 [upgradable from: 5.36.0-7]
postfix/stable-updates 3.7.9-0+deb12u1 amd64 [upgradable from: 3.7.6-0+deb12u2]
proxmox-kernel-6.2/stable 6.2.16-20 all [upgradable from: 6.2.16-19]
proxmox-kernel-6.5/stable 6.5.11-7 all [upgradable from: 6.5.11-6]
pve-i18n/stable 3.1.5 all [upgradable from: 3.1.4]
pve-qemu-kvm/stable 8.1.2-6 amd64 [upgradable from: 8.1.2-4]
pve-xtermjs/stable 5.3.0-3 all [upgradable from: 5.3.0-2]
spl/stable 2.2.2-pve1 all [upgradable from: 2.2.0-pve4]
ssh/stable-security 1:9.2p1-2+deb12u2 all [upgradable from: 1:9.2p1-2+deb12u1]
systemd-boot-efi/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
systemd-boot/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
systemd-sysv/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
systemd/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
tzdata/stable 2023c-5+deb12u1 all [upgradable from: 2023c-5]
udev/stable 252.19-1~deb12u1 amd64 [upgradable from: 252.17-1~deb12u1]
zfs-initramfs/stable 2.2.2-pve1 all [upgradable from: 2.2.0-pve4]
zfs-zed/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]
zfsutils-linux/stable 2.2.2-pve1 amd64 [upgradable from: 2.2.0-pve4]

pongraczi · Jan 12, 2024

Just for the record:
The test guest server ran for 58 hours without any jbd2/IO lockups after I removed the discard mount option from the guest OS fstab.
As I use autosnap (qm snapshot) hourly, that means there were 58 occasions on which the lockup did not happen, even though before this particular test the guest server died randomly after 12-48 hours, mostly in less than 28 hours.

Last night I upgraded this Proxmox server and started this guest OS again. It seems to be running (only 11+ hours of uptime so far).
I will keep watching and report back if anything negative happens.
Thanks for playing.

pongraczi · Jan 12, 2024

Update:
After the upgrade, my guest server died again. It had been running for 15.5 hours.

Guest OS related data:

Code:

Jan 12 13:12:28 ucs-6743 kernel: [55583.858674] jbd2/sda2-8     D    0   319      2 0x80000000
Jan 12 13:12:28 ucs-6743 kernel: [55583.858674] Call Trace:
Jan 12 13:12:28 ucs-6743 kernel: [55583.858676]  __schedule+0x29f/0x840
Jan 12 13:12:28 ucs-6743 kernel: [55583.858677]  ? blk_mq_sched_insert_requests+0x80/0xa0
Jan 12 13:12:28 ucs-6743 kernel: [55583.858678]  ? bit_wait_timeout+0x90/0x90
Jan 12 13:12:28 ucs-6743 kernel: [55583.858678]  schedule+0x28/0x80
Jan 12 13:12:28 ucs-6743 kernel: [55583.858679]  io_schedule+0x12/0x40
Jan 12 13:12:28 ucs-6743 kernel: [55583.858679]  bit_wait_io+0xd/0x50
Jan 12 13:12:28 ucs-6743 kernel: [55583.858680]  __wait_on_bit+0x73/0x90
Jan 12 13:12:28 ucs-6743 kernel: [55583.858680]  out_of_line_wait_on_bit+0x91/0xb0
Jan 12 13:12:28 ucs-6743 kernel: [55583.858681]  ? init_wait_var_entry+0x40/0x40
Jan 12 13:12:28 ucs-6743 kernel: [55583.858682]  jbd2_journal_commit_transaction+0xf9c/0x1840 [jbd2]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858685]  kjournald2+0xbd/0x270 [jbd2]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858686]  ? finish_wait+0x80/0x80
Jan 12 13:12:28 ucs-6743 kernel: [55583.858687]  ? commit_timeout+0x10/0x10 [jbd2]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858688]  kthread+0x112/0x130
Jan 12 13:12:28 ucs-6743 kernel: [55583.858689]  ? kthread_bind+0x30/0x30
Jan 12 13:12:28 ucs-6743 kernel: [55583.858690]  ret_from_fork+0x35/0x40
Jan 12 13:12:28 ucs-6743 kernel: [55583.858711] dockerd         D    0  1519      1 0x00000000
Jan 12 13:12:28 ucs-6743 kernel: [55583.858712] Call Trace:
Jan 12 13:12:28 ucs-6743 kernel: [55583.858712]  __schedule+0x29f/0x840
Jan 12 13:12:28 ucs-6743 kernel: [55583.858713]  ? bit_wait_timeout+0x90/0x90
Jan 12 13:12:28 ucs-6743 kernel: [55583.858713]  schedule+0x28/0x80
Jan 12 13:12:28 ucs-6743 kernel: [55583.858714]  io_schedule+0x12/0x40
Jan 12 13:12:28 ucs-6743 kernel: [55583.858714]  bit_wait_io+0xd/0x50
Jan 12 13:12:28 ucs-6743 kernel: [55583.858714]  __wait_on_bit+0x73/0x90
Jan 12 13:12:28 ucs-6743 kernel: [55583.858715]  out_of_line_wait_on_bit+0x91/0xb0
Jan 12 13:12:28 ucs-6743 kernel: [55583.858716]  ? init_wait_var_entry+0x40/0x40
Jan 12 13:12:28 ucs-6743 kernel: [55583.858717]  do_get_write_access+0x297/0x410 [jbd2]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858718]  jbd2_journal_get_write_access+0x57/0x70 [jbd2]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858721]  __ext4_journal_get_write_access+0x36/0x70 [ext4]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858724]  __ext4_new_inode+0xa9a/0x1580 [ext4]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858725]  ? d_splice_alias+0x153/0x3c0
Jan 12 13:12:28 ucs-6743 kernel: [55583.858729]  ext4_create+0xe0/0x1c0 [ext4]
Jan 12 13:12:28 ucs-6743 kernel: [55583.858729]  path_openat+0x117e/0x1480
Jan 12 13:12:28 ucs-6743 kernel: [55583.858730]  do_filp_open+0x93/0x100
Jan 12 13:12:28 ucs-6743 kernel: [55583.858732]  ? __check_object_size+0x162/0x180
Jan 12 13:12:28 ucs-6743 kernel: [55583.858733]  do_sys_open+0x186/0x210
Jan 12 13:12:28 ucs-6743 kernel: [55583.858734]  do_syscall_64+0x53/0x110
Jan 12 13:12:28 ucs-6743 kernel: [55583.858735]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1

Code:

 319 ?        D      0:00 [jbd2/sda2-8]
 998 ?        Ds     0:00 postgres: 11/main: walwriter
 1022 ?        Ds     0:02 /usr/sbin/nmbd -D
 8881 pts/2    S+     0:09 watch -n 5 ps ax | grep "  D"
 9094 pts/3    D+     0:00 mc
17694 ?        D      0:00 /usr/sbin/nmbd -D
17713 ?        D      0:00 /usr/sbin/nmbd -D
17731 ?        D      0:00 /usr/sbin/nmbd -D
17750 ?        D      0:00 /usr/sbin/nmbd -D
17769 ?        D      0:00 /usr/sbin/nmbd -D
17787 ?        D      0:00 /usr/sbin/nmbd -D
17805 ?        D      0:00 /usr/sbin/nmbd -D
17828 ?        D      0:00 /usr/sbin/nmbd -D
17846 ?        D      0:00 /usr/sbin/nmbd -D
17865 ?        D      0:00 /usr/sbin/nmbd -D
17882 ?        D      0:00 /usr/bin/python3 /usr/share/univention-monitoring-client/scripts//check_univention_dns
17908 ?        D      0:00 /usr/sbin/nmbd -D
17948 ?        D      0:00 /usr/sbin/nmbd -D
17966 ?        D      0:00 /usr/sbin/nmbd -D
17985 ?        D      0:00 /usr/sbin/nmbd -D
18003 ?        D      0:00 /usr/sbin/nmbd -D
18021 ?        D      0:00 /usr/sbin/nmbd -D
18040 ?        D      0:00 /usr/sbin/nmbd -D
18058 ?        D      0:00 /usr/sbin/nmbd -D
18076 ?        D      0:00 /usr/sbin/nmbd -D
18095 ?        D      0:00 /usr/sbin/nmbd -D
18113 ?        D      0:00 /usr/sbin/nmbd -D
18131 ?        D      0:00 /usr/sbin/nmbd -D
18150 ?        D      0:00 /usr/sbin/nmbd -D
18163 ?        D      0:00 /usr/sbin/nmbd -D
18180 ?        D      0:00 /usr/bin/python3 /usr/sbin/univention-config-registry commit /etc/apt/sources.list.d/20_ucs-online-component.list
18195 ?        D      0:00 /usr/sbin/nmbd -D
18217 ?        D      0:00 /usr/bin/python3 /usr/share/univention-monitoring-client/scripts//check_univention_dns
18261 ?        D      0:00 /usr/sbin/nmbd -D
18301 ?        D      0:00 /usr/sbin/nmbd -D
18565 ?        D      0:00 /usr/bin/python3 /usr/share/univention-monitoring-client/scripts//check_univention_dns
18935 ?        D      0:00 /usr/bin/python3 /usr/share/univention-monitoring-client/scripts//check_univention_dns
19206 pts/2    S+     0:00 watch -n 5 ps ax | grep "  D"
19207 pts/2    S+     0:00 sh -c ps ax | grep "  D"
19209 pts/2    S+     0:00 grep   D
22312 ?        Ds     0:01 /usr/sbin/univention-directory-listener -F -d 2 -b dc=intranet,dc=logmaster,dc=hu -m /usr/lib/univention-directory-listener/system -c /var/lib/univention-directory-listener -ZZ -x -D cn=admin,dc=intranet,dc=logmaster,dc=hu -y /etc/ldap.secret

Load: 38.98 36.66 25.48 (it is not really interesting, because it is a consequence of other problems).
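As a side note on the listing above: `grep "  D"` also matches the `watch` and `grep` lines themselves, because their command text contains the pattern. A minimal sketch that filters on the STAT column instead is shown here. It assumes the default five-column `ps ax` layout (PID, TTY, STAT, TIME, COMMAND); the function name and the sample lines are only illustrative.

```python
from collections import Counter


def blocked_processes(ps_text):
    """Count processes in uninterruptible sleep (STAT starting with 'D').

    Expects text in the default `ps ax` layout: PID TTY STAT TIME COMMAND.
    Returns a Counter keyed by the first word of the command.
    """
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split(None, 4)
        # Skip the header line and anything that is not a process row.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        stat, command = fields[2], fields[4]
        if stat.startswith("D"):
            counts[command.split()[0]] += 1
    return counts


# Illustrative sample, shortened from the kind of output shown above.
SAMPLE = """\
  319 ?        D      0:00 [jbd2/sda2-8]
  998 ?        Ds     0:00 postgres: 11/main: walwriter
 8881 pts/2    S+     0:09 watch -n 5 ps ax | grep "  D"
 1022 ?        Ds     0:02 /usr/sbin/nmbd -D
17694 ?        D      0:00 /usr/sbin/nmbd -D
"""

if __name__ == "__main__":
    for name, count in blocked_processes(SAMPLE).most_common():
        print(count, name)
```

Matching on the state column keeps the `watch` line (state S+) out of the result, so the count reflects only tasks that are really blocked on I/O.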

Guest OS kernel: 4.19.0-25-amd64

I restarted the guest OS, with cache disabled at the Proxmox level for the guest.

Proxmox related information:

Underlying filesystem: ZFS on local nvme ssd mirror

Code:

qm config 3441
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
efidisk0: local-zfs:vm-3441-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
memory: 6144
meta: creation-qemu=8.1.2,ctime=1704366686
name: unidc
net0: virtio=BC:24:11:3D:19:CA,bridge=vmbr4000,firewall=1,mtu=1400
numa: 0
ostype: l26
parent: autohourly_2024_01_12T13_05_26
scsi0: local-zfs:vm-3441-disk-1,discard=on,iothread=1,size=128G,ssd=1
scsi1: local-zfs:vm-3441-disk-2,discard=on,iothread=1,size=128G,ssd=1
scsi2: local-zfs:vm-3441-disk-3,discard=on,iothread=1,size=32G,ssd=1
scsi3: local-zfs:vm-3441-disk-4,discard=on,iothread=1,size=32G,ssd=1
scsi4: local-zfs:vm-3441-disk-5,discard=on,iothread=1,size=32G,ssd=1
scsi5: local-zfs:vm-3441-disk-6,discard=on,iothread=1,size=16G,ssd=1
scsi6: local-zfs:vm-3441-disk-7,discard=on,iothread=1,size=256G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=xxxxxxxxxxxx
sockets: 1
tablet: 0
vmgenid: xxxxxxxxxx

Code:

UUID=4c309d7c-5d69-4581-a49c-f35a7c6ced53       /       ext4    errors=remount-ro,user_xattr    0       1
# /boot/efi was on /dev/sda1 during installation
UUID=D483-9E27  /boot/efi       vfat    umask=0077      0       1
# /home was on /dev/sdb1 during installation
UUID=3a3e5fb7-965e-4924-8fbb-e8818b34df1a       /home   ext4    noatime,user_xattr,usrquota     0       2
# /var/flexshares was on /dev/sdg1 during installation
UUID=e778d05f-564e-47ae-9935-b7ecb8562084       /var/flexshares ext4    noatime,user_xattr      0       2
# /var/lib/univention-ldap was on /dev/sde1 during installation
UUID=a9a2869f-9c6b-4bec-8dc2-baa1e5c14413       /var/lib/univention-ldap        ext4    noatime,user_xattr      0       2
# /var/log was on /dev/sdc1 during installation
UUID=1e7c8eab-f0cf-4317-8529-4c053ccf4535       /var/log        ext4    noatime,user_xattr      0       2
# /var/univention-backup was on /dev/sdd1 during installation
UUID=f9a4e15f-78ff-4693-bf4d-fd114fd62c69       /var/univention-backup  ext4    noatime,user_xattr      0       2
# swap was on /dev/sdf1 during installation
UUID=b099fa49-b643-4989-bb37-55c2c65c4e18       none    swap    sw      0       0
/dev/sr0        /media/cdrom0   udf,iso9660     user,noauto     0       0
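To double-check that no entry still carries the discard mount option, something like the following sketch could be run against the fstab contents. It only looks at the fourth (options) field of each non-comment line; the function name and the sample entries are made up for illustration.

```python
def mounts_with_discard(fstab_text):
    """Return the mount points whose option field contains 'discard'."""
    hits = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Ignore blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        # Field 2 is the mount point, field 4 the comma-separated options.
        if "discard" in fields[3].split(","):
            hits.append(fields[1])
    return hits


# Hypothetical sample: one entry with discard, one without.
SAMPLE = """\
UUID=aaaa  /      ext4  errors=remount-ro,discard  0  1
# /home was on /dev/sdb1 during installation
UUID=bbbb  /home  ext4  noatime,user_xattr         0  2
"""

if __name__ == "__main__":
    print(mounts_with_discard(SAMPLE))
```

For the fstab shown above this would return an empty list, which matches the statement that discard was removed inside the guest.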

pongraczi · Jan 15, 2024

Update: after the server died in less than 12 hours, I restarted the test with new changes: I removed iothread from all disks.
Virtio-scsi-single is still in use.
Qemu-guest-agent is installed, just to push the limits and increase the risk of a crash.

It has been working perfectly for 2 days and 22+ hours now. Promising.

Code:

 qm config 3441
agent: 1
balloon: 4096
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: host
efidisk0: local-zfs:vm-3441-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
memory: 8192
meta: creation-qemu=8.1.2,ctime=1704366686
name: unidc
net0: virtio=xx:xx:xx:xx:xx:xx,bridge=vmbr4000,firewall=1,mtu=1400
numa: 0
ostype: l26
parent: autohourly_2024_01_15T17_05_29
scsi0: local-zfs:vm-3441-disk-1,discard=on,size=128G,ssd=1
scsi1: local-zfs:vm-3441-disk-2,discard=on,size=128G,ssd=1
scsi2: local-zfs:vm-3441-disk-3,discard=on,size=32G,ssd=1
scsi3: local-zfs:vm-3441-disk-4,discard=on,size=32G,ssd=1
scsi4: local-zfs:vm-3441-disk-5,discard=on,size=32G,ssd=1
scsi5: local-zfs:vm-3441-disk-6,discard=on,size=16G,ssd=1
scsi6: local-zfs:vm-3441-disk-7,discard=on,size=256G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=yyyyyyyyyyyyyyyyy
sockets: 1
tablet: 0
vmgenid: xxxxxxxxxxxxxxxxx

roms2000 · Jan 15, 2024

pongraczi said:

Update: after the server died in less than 12 hours, I restarted the test with new changes: I removed iothread from all disks.
Virtio-scsi-single is still in use.
Qemu-guest-agent is installed, just to push the limits and increase the risk of a crash.

It has been working perfectly for 2 days and 22+ hours now. Promising.

This is a new record!

Nice to see the improvement, and to see that it's working without iothreads. It seems there is something wrong with them.

What version of QEMU is installed?

pongraczi · Jan 15, 2024

roms2000 said:
This is a new record!
Nice to see the improvement, and to see that it's working without iothreads. It seems there is something wrong with them.

What version of QEMU is installed?

Exactly.
It could prove the following:
- guest-level discard has nothing to do with the issue
- cache probably has no effect (at least on its own)
- iothread enabled together with virtio-scsi-single seems problematic; at least so far, the issue has happened with this combination (note that the guest kernel is 4.x)
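To see at a glance which disks of a guest still have the flag set, the output of `qm config <vmid>` can be scanned for `iothread=1`. The sketch below is only a text filter over that output, not a Proxmox API; the function name is made up and the sample is shortened from the configs posted in this thread.

```python
def disks_with_iothread(config_text):
    """Return (scsihw, [disk keys]) for disks that carry iothread=1.

    config_text is the plain "key: value" output of `qm config <vmid>`.
    """
    scsihw = None
    disks = []
    for line in config_text.splitlines():
        # Split on the first colon only; values may contain colons too.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        value = value.strip()
        if key == "scsihw":
            scsihw = value
        elif "iothread=1" in value.split(","):
            disks.append(key)
    return scsihw, disks


# Shortened sample in the format shown earlier in the thread.
SAMPLE = """\
net0: virtio=xx:xx:xx:xx:xx:xx,bridge=vmbr4000,firewall=1,mtu=1400
scsi0: local-zfs:vm-3441-disk-1,discard=on,iothread=1,size=128G,ssd=1
scsi1: local-zfs:vm-3441-disk-2,discard=on,size=128G,ssd=1
scsihw: virtio-scsi-single
"""

if __name__ == "__main__":
    controller, flagged = disks_with_iothread(SAMPLE)
    print(controller, flagged)
```

An empty disk list together with `virtio-scsi-single` as the controller would correspond to the configuration that has been stable in this test.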

I will report back.

pongraczi · Jan 16, 2024

roms2000 said:
What version of QEMU is installed?

Sorry, I forgot to answer.

Code:

proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

Uptime: 3 days + 17 hours
Definitely a record; it is probably safe to consider the server stable.

pongraczi · Jan 18, 2024

Update: with the uptime now at 5 days, 19:13 hours, I can say that the problem was solved by removing iothread from the KVM config (virtio-scsi-single is still in use).
This setting solved my problem.

roms2000 · Jan 18, 2024

For me it is the same

It's much less stressful knowing that when you get up in the morning, there's no need to rush to restart all the VMs

privnote · Jan 23, 2024

I suddenly had problems again with the problematic VMs, the ones that had shown no more problems with PBS after the move from Ceph to ZFS.

But when I tried to take snapshots of these VMs last week, various VMs kept freezing.
I removed the iothread flag and the snapshots could be created again without any problems

roms2000 · Jan 23, 2024

iothreads seem to be the culprit.

For the record, I launched a backup of 42 VMs to PBS yesterday evening. The backup is still in progress, with a PG rebalance on Ceph running on top of it.
Nothing to report

Everything is going well with the iothreads unchecked/off.

privnote · Jan 28, 2024

But iothreads do have their advantages.
It would therefore be a great loss if we had to leave them permanently deactivated.
Does anyone know whether the Proxmox team is reading along here and when a fix can be expected?
@cheiss

pvefanmark · Feb 12, 2024

I can confirm that in my case, removing iothreads from the kvm config has no effect in preventing vm freeze.

What's more, backing up my NAS VM to PBS worked for two days with no issues. Then it suddenly stopped working. So the problem is sporadic.
What's even more interesting is that without using PBS, backing up this NAS VM to a NAS used to have the same problem (freeze). But it works now.
So it's a total crapshoot for me. Sometimes backup without PBS works, sometimes backup with PBS works.

privnote · Feb 12, 2024

It has continued to work for me without any problems since I removed iothreads from the VMs that were causing problems.
And I also use a daily / weekly PBS backup for over 100 VMs.

pvefanmark · Feb 12, 2024

pvefanmark said:
I can confirm that in my case, removing iothreads from the kvm config has no effect in preventing vm freeze.

What's more, backing up my NAS VM to PBS worked for two days with no issues. Then it suddenly stopped working. So the problem is sporadic.
What's even more interesting is that without using PBS, backing up this NAS VM to a NAS used to have the same problem (freeze). But it works now.
So it's a total crapshoot for me. Sometimes backup without PBS works, sometimes backup with PBS works.

Sorry, I should clarify. It's possible my case is different from others.

I am backing up a NAS vm to either

1. a network share from the same NAS VM; or
2. a pbs in LXC that uses a network share from the same NAS VM.

I know, I know, this is not the best practice. But my home lab is not an enterprise solution. I can live with a little downtime.

( I prefer less power consumption with one pve host as long as I have one good backup as the NAS vm rarely changes. )

What I found is that there is a race condition. I'd expect the backup to create a VM snapshot and unfreeze the VM right away, and then begin backing up the snapshot. By then, the VM has been thawed, so the backup should proceed without problems.

Instead, I believe what happens is that the backup freezes the VM and contacts the PBS or NAS at the same time. If the NAS is frozen and the PBS or NAS is unreachable as a result, the whole backup fails.

However, it doesn't fail all the time. In fact, the backup of the NAS VM to the NAS directly used to fail all the time, but now it doesn't. But the backup of the NAS VM to PBS with the NAS backend fails all the time after being okay for 2 days.

I believe the backup process can be improved to eliminate the race condition. The VM should be frozen and thawed first, resulting in a snapshot ready to be backed up. Then the backup process should contact the NAS directly or via PBS. This is a better design.

BTW, backing up the PBS LXC to the same PBS itself never fails. In this case, I believe there is no freeze/unfreeze command issued. A snapshot must have been made that doesn't have a race condition with connecting to the PBS LXC.

=================================================

This is what the failed backup job looks like:

INFO: starting new backup job: vzdump 104 --mode snapshot --node pve3 --notes-template '{{guestname}} test' --notification-mode auto --storage pbsRepo --remove 0
INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2024-02-12 09:49:45
INFO: status = running
INFO: VM Name: fs
INFO: include disk 'scsi0' 'local-lvm:vm-104-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/104/2024-02-12T14:49:45Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 104 qmp command 'backup' failed - backup connect failed: command error: http upgrade request timed out
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 104 failed - VM 104 qmp command 'backup' failed - backup connect failed: command error: http upgrade request timed out
INFO: Failed at 2024-02-12 09:51:45
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors

This is what the failed job looks like on PBS:

2024-02-12T14:51:45+00:00: starting new backup on datastore 'pbsRepo' from ::ffff:192.168.1.99: "ns/pve3/vm/104/2024-02-12T14:49:45Z"
2024-02-12T14:51:45+00:00: backup failed: connection error: Broken pipe (os error 32)
2024-02-12T14:51:45+00:00: removing failed backup
2024-02-12T14:51:45+00:00: TASK ERROR: connection error: Broken pipe (os error 32)
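The timestamps in the two logs above can be used to check how long the job waited before giving up. The sketch below just subtracts the "Backup started at" time from the "Failed at" time in the vzdump log; the function name is made up and the sample is cut down from the log above.

```python
import re
from datetime import datetime

STAMP = "%Y-%m-%d %H:%M:%S"
STAMP_RE = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"


def seconds_until_failure(log_text):
    """Return seconds between 'Backup started at' and 'Failed at', or None."""
    started = re.search(r"Backup started at " + STAMP_RE, log_text)
    failed = re.search(r"Failed at " + STAMP_RE, log_text)
    if not (started and failed):
        return None
    begin = datetime.strptime(started.group(1), STAMP)
    end = datetime.strptime(failed.group(1), STAMP)
    return (end - begin).total_seconds()


# Cut-down copy of the failed job log.
LOG = """\
INFO: Backup started at 2024-02-12 09:49:45
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 104 qmp command 'backup' failed - backup connect failed: command error: http upgrade request timed out
INFO: Failed at 2024-02-12 09:51:45
"""

if __name__ == "__main__":
    print(seconds_until_failure(LOG))
```

For this job the gap is 120 seconds. If the archive name (14:49:45Z) is the UTC form of the 09:49:45 local start time, then the first PBS entry at 14:51:45 lines up with the moment the job failed, which would fit the idea that PBS could not be reached while the NAS VM was frozen. That reading is an assumption drawn from the timestamps, not something the logs state directly.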

privnote · Feb 13, 2024

pvefanmark said:
Sorry, I should clarify. It's possible my case is different from others.

Yes, your problem does indeed sound different and probably has a different cause.

I never had any problems with the PBS backup, it was always successful.
It was only with the VMs that I sometimes had freezes
