I haven't been able to find any case online of this specifically happening.
I've made previous posts about similar problems (when using NFS the issues were even worse, see https://forum.proxmox.com/threads/pve-backups-causing-vm-stalls.178941/#post-849687 ), but I have a more complete picture now. It turns out my issues aren't over after switching to the SMB protocol for storage; after trying/checking all the variables, there's something deeper going on.
First, let me preface this by saying that the exact same hardware setup had been working flawlessly for years. The VM lock-ups only began after we migrated from proprietary virtualization solutions to Proxmox, so the problem is somewhere in the backup software.
As far as I can tell, the issues stem from the backup process being very flaky and unable to handle errors that the commercial solutions handled gracefully before. With large enough volumes of data, those errors eventually become common enough that backups fail regularly. Others report this on much larger systems than ours, but even at the scale of a terabyte or so of data things go awry. Note that the speed/quality of the hardware seems to affect the severity as well: the longer a backup takes, the greater the risk of an error, and the greater the risk of everything falling apart due to how fragile the process is.
The very worst part is the failure mode. Rather than occasionally missing a backup, the entire VM effectively crashes and freezes, which causes regular outages for all VMs with sufficiently large disks. I've also (much more rarely) seen VMs with smaller disks fail the same way, so my rough estimate is that the problem strikes about once per ~5 TB of backed-up data. Put differently, there seems to be roughly a 1 in 1,000,000 chance that a given chunk fails to write, and rather than gracefully retrying a few times, the result is akin to 'halt and catch fire'.
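As a very rough sanity check on those two numbers (the ~4 MiB chunk size below is purely my own assumption for illustration; I haven't verified what size the backup actually uses):
Code:
# Back-of-the-envelope: a 1-in-1,000,000 per-chunk failure rate at an
# assumed ~4 MiB per chunk works out to roughly one failure per
# 1,000,000 x 4 MiB ~= 3.8 TiB written, the same ballpark as my ~5 TB estimate.
echo $(( 1000000 * 4 / 1024 ))   # GiB between failures, ~3906 GiB ~= 3.8 TiB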
I've done more digging and turned up various logs that may be of interest. Technical details below.
After a failure, the VM is effectively bricked until reboot, because according to its kernel logs it can no longer write to its boot drive; the logs contain many, many entries of the form:
Code:
[44930.639950] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm systemd: Detected aborted journal
I do find it somewhat interesting that it can apparently still write said kernel logs (presumably because the kernel's ring buffer lives in memory), while the disk is otherwise read-only.
In every case of other Proxmox users reporting "aborted journal" that I could turn up through web searches, it was some form of disk corruption, and the issue happened at the hypervisor level.
But in my case it happens at the VM level. The hypervisor is fine and its disks report healthy; it's inside some of the VMs that I see these recurring read-only errors. The most likely moment for it to happen is during a backup, though it's not exactly guaranteed.
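For context, "healthy" on the hypervisor side means checks along these lines come back clean (device name is just an example; smartmontools is part of the standard PVE install and the VM disks live on ZFS):
Code:
# On the PVE node: physical disk and pool health
smartctl -H /dev/sda    # reports "PASSED" when the drive considers itself healthy
zpool status -x         # reports "all pools are healthy" when nothing is degraded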
I'm also observing that the problem is more common with larger VMs than with smaller ones; the larger ones have a higher failure rate. When the failure happens, the machine is stuck with a read-only disk and has to be rebooted and fsck'ed (at which point fsck deletes a number of orphaned inodes), and after that the machine tends to run until the next backup failure (days to weeks, depending on luck and on how 'big' the VM is). I'm not entirely confident the issue is solely due to backups, given that the read/write load of regular use is quite low compared to the read load of a backup.
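For completeness, recovery after such a failure looks roughly like this (device name is an example, and the filesystem has to be unmounted, e.g. from a rescue shell or during boot):
Code:
# Clears the aborted journal and removes the orphaned inodes fsck complains about
e2fsck -f -y /dev/sdb1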
pveversion -v:
Code:
pveversion -v
proxmox-ve: 9.1.0 (running kernel: 6.17.13-2-pve)
pve-manager: 9.1.7 (running version: 9.1.7/16b139a017452f16)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17: 6.17.13-2
proxmox-kernel-6.17.13-2-pve-signed: 6.17.13-2
proxmox-kernel-6.14: 6.14.11-6
proxmox-kernel-6.14.11-6-pve-signed: 6.14.11-6
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
ceph-fuse: 19.2.3-pve1
corosync: 3.1.10-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx12
intel-microcode: 3.20251111.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.1
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.6
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.1
libpve-cluster-perl: 9.1.1
libpve-common-perl: 9.1.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.5
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.1
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-4
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.5-1
proxmox-backup-file-restore: 4.1.5-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.9
pve-cluster: 9.1.1
pve-container: 6.1.2
pve-docs: 9.1.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-2
pve-ha-manager: 5.1.3
pve-i18n: 3.7.0
pve-qemu-kvm: 10.1.2-7
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.6
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.4.1-pve1
qm config 123
Code:
boot: order=scsi0;ide2;net0
cores: 4
cpu: Haswell-noTSX-IBRS
ide2: none,media=cdrom
memory: 32768
meta: creation-qemu=10.1.2,ctime=1766157030
name: ********
net0: virtio=BC:24:11:9D:5C:FB,bridge=vmbr1,firewall=1,mtu=1500
net1: virtio=BC:24:11:3D:0F:ED,bridge=vmbr0,firewall=1,mtu=1500
numa: 0
ostype: l26
scsi0: zpool:vm-123-disk-0,cache=writeback,format=raw,iothread=1,size=16G
scsi1: zpool:vm-123-disk-1,cache=writeback,format=raw,iothread=1,size=1T
scsihw: virtio-scsi-single
smbios1: uuid=72d8ec5f-3177-4153-9340-0c07803994da
sockets: 2
unused0: vm_mover:123/vm-123-disk-0.vmdk
unused1: vm_mover:123/vm-123-disk-1.vmdk
vmgenid: 5b32a5d4-55b1-4602-8f85-942cddf6960a
I'm using a networked backup target, but I can't enable fleecing. The system runs out of space if I do.
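For reference, this is the option I mean; if I read the docs correctly the fleecing image can be placed on a chosen storage, but none of my local storages are large enough to hold fleecing images for the big disks (the storage name below is just a placeholder):
Code:
# What I'd like to run, but can't for lack of local space
vzdump 123 --mode snapshot --storage pve_backups --fleecing enabled=1,storage=local-zfs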
The controller is set to VirtIO SCSI single, which, according to the docs at https://10.0.49.11:8006/pve-docs/chapter-qm.html#qm_hard_disk , is the most recent and best supported configuration, so I shouldn't be seeing this.
If I look at the kernel logs inside the VM itself, I can see some stuck processes dumping stack traces in dmesg, followed by some interesting lines, and then a whole lot of repeats of being unable to write due to aborted journals on the filesystem. I've put the interesting part of the log below:
Code:
(Some similar program crashes omitted)
[44709.346983] INFO: task apache2:80651 blocked for more than 120 seconds.
[44709.347008] Not tainted 6.1.0-44-amd64 #1 Debian 6.1.164-1
[44709.347032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[44709.347059] task:apache2 state:D stack:0 pid:80651 ppid:919 flags:0x00000002
[44709.347063] Call Trace:
[44709.347065] <TASK>
[44709.347068] __schedule+0x34d/0x9e0
[44709.347073] schedule+0x5a/0xd0
[44709.347077] schedule_preempt_disabled+0x11/0x20
[44709.347081] __mutex_lock.constprop.0+0x399/0x700
[44709.347087] __fdget_pos+0x4a/0x70
[44709.347094] ksys_write+0x2a/0xf0
[44709.347103] do_syscall_64+0x55/0xb0
[44709.347110] ? __x64_sys_setitimer+0x14f/0x170
[44709.347118] ? exit_to_user_mode_prepare+0x44/0x240
[44709.347125] ? syscall_exit_to_user_mode+0x1e/0x40
[44709.347147] ? do_syscall_64+0x61/0xb0
[44709.347153] ? do_syscall_64+0x61/0xb0
[44709.347160] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[44709.347167] RIP: 0033:0x7fd49e249340
[44709.347170] RSP: 002b:00007ffd4be90d18 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[44709.347175] RAX: ffffffffffffffda RBX: 00007ffd4be90d78 RCX: 00007fd49e249340
[44709.347177] RDX: 0000000000000138 RSI: 00007fd49631d2b8 RDI: 0000000000000009
[44709.347180] RBP: 00007fd49e0ad968 R08: 0000000000000017 R09: 0000000000000138
[44709.347183] R10: 00007fd49631d198 R11: 0000000000000202 R12: 00007fd49631d2b8
[44709.347186] R13: 0000000000000000 R14: 0000000000000138 R15: 00007fd49e0ae188
[44709.347191] </TASK>
[44925.627009] sd 3:0:0:1: [sdb] tag#139 timing out command, waited 180s
[44925.627010] sd 3:0:0:1: [sdb] tag#3 timing out command, waited 180s
[44925.627068] sd 3:0:0:1: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=383s
[44925.627093] sd 3:0:0:1: [sdb] tag#139 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=387s
[44925.627093] sd 3:0:0:1: [sdb] tag#246 timing out command, waited 180s
[44925.627107] sd 3:0:0:1: [sdb] tag#246 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=383s
[44925.627111] sd 3:0:0:1: [sdb] tag#246 Sense Key : Aborted Command [current]
[44925.627119] sd 3:0:0:1: [sdb] tag#3 Sense Key : Aborted Command [current]
[44925.627128] sd 3:0:0:1: [sdb] tag#246 Add. Sense: I/O process terminated
[44925.627142] sd 3:0:0:1: [sdb] tag#246 CDB: Write(10) 2a 00 31 90 73 f8 00 00 10 00
[44925.627147] sd 3:0:0:1: [sdb] tag#139 Sense Key : Aborted Command [current]
[44925.627149] sd 3:0:0:1: [sdb] tag#3 Add. Sense: I/O process terminated
[44925.627150] I/O error, dev sdb, sector 831550456 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 2
[44925.627152] sd 3:0:0:1: [sdb] tag#139 Add. Sense: I/O process terminated
[44925.627154] sd 3:0:0:1: [sdb] tag#3 CDB: Write(10) 2a 00 3f ca 6b 80 00 00 28 00
[44925.627157] sd 3:0:0:1: [sdb] tag#139 CDB: Write(10) 2a 00 7c f5 a9 f0 00 00 2b 00
[44925.627158] I/O error, dev sdb, sector 1070230400 op 0x1:(WRITE) flags 0x9800 phys_seg 5 prio class 2
[44925.627204] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 44826635 starting block 103943807)
[44925.627245] Aborting journal on device sdb1-8.
[44925.627250] I/O error, dev sdb, sector 2096474608 op 0x1:(WRITE) flags 0x8800 phys_seg 4 prio class 2
[44925.627259] Buffer I/O error on device sdb1, logical block 103943551
[44925.627269] sd 2:0:0:0: [sda] tag#118 timing out command, waited 180s
[44925.627276] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 44826635 starting block 103943808)
[44925.627284] sd 2:0:0:0: [sda] tag#118 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=383s
[44925.627290] sd 2:0:0:0: [sda] tag#118 Sense Key : Aborted Command [current]
[44925.627295] sd 2:0:0:0: [sda] tag#118 Add. Sense: I/O process terminated
[44925.627302] sd 2:0:0:0: [sda] tag#118 CDB: Write(10) 2a 00 00 c4 28 e8 00 00 58 00
[44925.627297] EXT4-fs error (device sdb1) in ext4_reserve_inode_write:5933: Journal has aborted
[44925.627305] I/O error, dev sda, sector 12855528 op 0x1:(WRITE) flags 0x9800 phys_seg 11 prio class 2
[44925.627322] EXT4-fs error (device sdb1): ext4_dirty_inode:6137: inode #44826635: comm zabbix_server: mark_inode_dirty error
[44925.627442] Aborting journal on device sda1-8.
[44925.627462] EXT4-fs error (device sdb1) in ext4_dirty_inode:6138: Journal has aborted
[44925.627582] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm kworker/u16:2: Detected aborted journal
[44925.627599] EXT4-fs error (device sdb1): ext4_journal_check_start:83: comm kworker/u16:0: Detected aborted journal
[44925.627600] EXT4-fs error (device sdb1): ext4_journal_check_start:83: comm zabbix_server: Detected aborted journal
(hundreds of thousands of repeated lines like this removed)
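My reading of this log is: the backup stalls, guest I/O on the virtio-scsi disks exceeds the SCSI command timeout (the "waited 180s" lines), the pending commands are aborted, and once ext4 aborts its journal it refuses all further writes, which is why the disk ends up read-only. Inside the guest, the per-device timeout can at least be inspected and raised via sysfs (the path matches the device in the log above; whether raising it actually helps is speculation on my part, since it doesn't address the underlying stall):
Code:
# Inside the guest: the SCSI command timeout for the affected disk, in seconds
cat /sys/block/sdb/device/timeout
# Give stalled backup-time I/O more headroom before commands get aborted
echo 600 > /sys/block/sdb/device/timeout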
Meanwhile, the backup produces the following:
Code:
INFO: starting new backup job: vzdump 100 105 108 109 110 111 112 114 115 116 117 118 119 120 121 102 103 104 106 101 113 129 128 127 122 125 124 130 131 132 133 136 123 126 137 139 141 --mode snapshot --prune-backups 'keep-daily=3,keep-monthly=2' --compress zstd --notes-template '{{guestname}}' --mailto <mailaddr> --quiet 1 --zstd 0 --fleecing 0 --notification-mode legacy-sendmail --mailnotification always --storage pve_backups
INFO: skip external VMs: 100, 101, 102, 103, 104, 105, 106, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 136, 137, 139, 141
INFO: Starting Backup of VM 123 (qemu)
INFO: Backup started at 2026-04-29 22:15:04
INFO: status = running
INFO: VM Name: vmname
INFO: include disk 'scsi0' 'zpool:vm-123-disk-0' 16G
INFO: include disk 'scsi1' 'zpool:vm-123-disk-1' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'vm_mover:123/vm-123-disk-0.vmdk' (not included into backup)
INFO: skip unused drive 'vm_mover:123/vm-123-disk-1.vmdk' (not included into backup)
INFO: creating vzdump archive '/mnt/pve/pve_backups/dump/vzdump-qemu-123-2026_04_29-22_15_04.vma.zst'
INFO: skipping guest-agent 'fs-freeze', agent not configured in VM options
INFO: started backup task '5f4b0b9a-88e9-4435-b153-13cf4b10b7a6'
INFO: resuming VM again
INFO: 0% (1.0 GiB of 1.0 TiB) in 3s, read: 344.7 MiB/s, write: 319.1 MiB/s
INFO: 1% (10.6 GiB of 1.0 TiB) in 32s, read: 337.5 MiB/s, write: 313.8 MiB/s
INFO: 2% (20.9 GiB of 1.0 TiB) in 1m 6s, read: 311.5 MiB/s, write: 292.1 MiB/s
INFO: 3% (31.3 GiB of 1.0 TiB) in 1m 38s, read: 333.2 MiB/s, write: 310.4 MiB/s
INFO: 4% (41.7 GiB of 1.0 TiB) in 2m 10s, read: 332.6 MiB/s, write: 311.4 MiB/s
zstd: error 70 : Write error : cannot write block : Bad address
INFO: 4% (50.3 GiB of 1.0 TiB) in 9m 3s, read: 21.2 MiB/s, write: 19.3 MiB/s
ERROR: vma_queue_write: write error - Broken pipe
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
INFO: Failed at 2026-04-29 22:24:08
INFO: Backup job finished with errors
INFO: notified via target `<mailaddr>`
TASK ERROR: job errors
I've come across one more highly interesting observation: one of the VMs that has been failing quite often is the monitoring server (Zabbix). It's the heaviest VM in terms of usage/traffic, and it currently runs all by itself on its own node. This is interesting because it suggests the problem occurs regardless of whether other VMs are present on the same node; in fact, a node that contains only a single 'heavy' VM seems to see that VM break down more often. Of course, with only a small handful of servers I can't draw any real statistics from this; there are just too few VMs to say anything conclusive. Still, it's a bit unexpected: other servers are much more heavily utilized, yet it is this VM that breaks down most often, while the others only do so occasionally.