I/O stall while backup to PBS is running and fdatasync failure after backup fails

Mar 26, 2026
Hi,
I have a Debian Trixie PostgreSQL 17.9 VM running on PVE 9.1.6.

PVE started a scheduled backup of the PG VM to PBS; the backup slowed down and eventually failed, and while it was running the PG VM observed I/O stalls and errors. The other VMs did not experience the stall.

I tried to collect as much information as possible, but I don't understand why this happened given the very small load on the server at that time. All VMs have very little load at night, and the PVE backup job and a pgbackrest incremental backup job inside the VM are the main resource consumers. Also, squinting at the logs, it looks like writes in the VM may slow down when the PBS backup slows down.

The timeline is as follows:
  1. 02:30:02 a scheduled backup job started backing up the PG VM to PBS. The PVE machine and the PBS machine talk to each other via Wireguard.
  2. 02:39:25 the application started experiencing timeouts accessing the PostgreSQL server.
  3. 02:40:51 two commits in the PostgreSQL VM finished after taking 48 s. Other simple write queries also took 45-80 s to finish.
  4. 02:41:08 the application timeouts went away.
  5. 02:43:30 the PBS machine completely lost internet connectivity.
  6. 03:00:42 the backup job failed. At that exact same time the PostgreSQL process experienced an fdatasync failure and crashed. It recovered successfully afterwards.
  7. 03:00:42 PVE tried to back up the rest of the VMs, but failed after 10 s with the error: command 'backup' failed - backup connect failed: command error: client error (Connect)

The PVE machine has two disks, a SAMSUNG MZQL23T8HCLS-00A07 and a Micron_7300_MTFDHBE3T8TDF, which form the ZFS mirror. It's running PVE 9.1.6 with kernel 6.17.13-2-pve and zfs-2.4.0-pve1. The PVE machine has a dedicated 1 Gbit/s connection, while the PBS machine only has 100 Mbit/s down, 20 Mbit/s up.

Backup job logs:

Code:
2026-03-26 02:30:02 INFO: Starting Backup of VM 450 (qemu)
2026-03-26 02:30:02 INFO: status = running
2026-03-26 02:30:02 INFO: VM Name: REDACTED
2026-03-26 02:30:02 INFO: include disk 'scsi0' 'local-zfs:vm-450-disk-0' 256G
2026-03-26 02:30:02 INFO: backup mode: snapshot
2026-03-26 02:30:02 INFO: bandwidth limit: 409600 KiB/s
2026-03-26 02:30:02 INFO: ionice priority: 7
2026-03-26 02:30:02 INFO: creating Proxmox Backup Server archive 'vm/450/2026-03-26T02:30:02Z'
2026-03-26 02:30:02 INFO: issuing guest-agent 'fs-freeze' command
2026-03-26 02:30:05 INFO: issuing guest-agent 'fs-thaw' command
2026-03-26 02:30:05 INFO: started backup task '4ba9b6b9-2c3a-431e-81d5-5631d5c9202e'
2026-03-26 02:30:05 INFO: resuming VM again
2026-03-26 02:30:05 INFO: scsi0: dirty-bitmap status: OK (45.4 GiB of 256.0 GiB dirty)
2026-03-26 02:30:05 INFO: using fast incremental mode (dirty-bitmap), 45.4 GiB dirty of 256.0 GiB total
2026-03-26 02:30:08 INFO:   0% (240.0 MiB of 45.4 GiB) in 3s, read: 80.0 MiB/s, write: 72.0 MiB/s
2026-03-26 02:30:23 INFO:   1% (508.0 MiB of 45.4 GiB) in 18s, read: 17.9 MiB/s, write: 17.9 MiB/s
2026-03-26 02:30:38 INFO:   2% (1020.0 MiB of 45.4 GiB) in 33s, read: 34.1 MiB/s, write: 23.5 MiB/s
2026-03-26 02:30:56 INFO:   3% (1.4 GiB of 45.4 GiB) in 51s, read: 21.6 MiB/s, write: 21.1 MiB/s
2026-03-26 02:31:07 INFO:   4% (1.9 GiB of 45.4 GiB) in 1m 2s, read: 48.0 MiB/s, write: 31.6 MiB/s
2026-03-26 02:31:16 INFO:   5% (2.4 GiB of 45.4 GiB) in 1m 11s, read: 52.4 MiB/s, write: 43.6 MiB/s
2026-03-26 02:31:23 INFO:   6% (2.7 GiB of 45.4 GiB) in 1m 18s, read: 56.6 MiB/s, write: 42.9 MiB/s
2026-03-26 02:31:31 INFO:   7% (3.2 GiB of 45.4 GiB) in 1m 26s, read: 59.5 MiB/s, write: 31.5 MiB/s
2026-03-26 02:31:41 INFO:   8% (3.7 GiB of 45.4 GiB) in 1m 36s, read: 52.8 MiB/s, write: 48.4 MiB/s
2026-03-26 02:31:50 INFO:   9% (4.2 GiB of 45.4 GiB) in 1m 45s, read: 52.9 MiB/s, write: 52.9 MiB/s
2026-03-26 02:31:59 INFO:  10% (4.7 GiB of 45.4 GiB) in 1m 54s, read: 57.8 MiB/s, write: 56.4 MiB/s
2026-03-26 02:32:06 INFO:  11% (5.1 GiB of 45.4 GiB) in 2m 1s, read: 60.6 MiB/s, write: 56.0 MiB/s
2026-03-26 02:32:14 INFO:  12% (5.5 GiB of 45.4 GiB) in 2m 9s, read: 46.5 MiB/s, write: 41.0 MiB/s
2026-03-26 02:32:22 INFO:  13% (6.0 GiB of 45.4 GiB) in 2m 17s, read: 70.0 MiB/s, write: 68.5 MiB/s
2026-03-26 02:32:30 INFO:  14% (6.4 GiB of 45.4 GiB) in 2m 25s, read: 48.0 MiB/s, write: 42.0 MiB/s
2026-03-26 02:32:42 INFO:  15% (6.9 GiB of 45.4 GiB) in 2m 37s, read: 39.7 MiB/s, write: 18.7 MiB/s
2026-03-26 02:32:52 INFO:  16% (7.3 GiB of 45.4 GiB) in 2m 47s, read: 42.0 MiB/s, write: 22.4 MiB/s
2026-03-26 02:32:59 INFO:  17% (7.8 GiB of 45.4 GiB) in 2m 54s, read: 77.7 MiB/s, write: 40.0 MiB/s
2026-03-26 02:33:29 INFO:  18% (8.2 GiB of 45.4 GiB) in 3m 24s, read: 14.9 MiB/s, write: 8.8 MiB/s
2026-03-26 02:33:43 INFO:  19% (8.8 GiB of 45.4 GiB) in 3m 38s, read: 40.9 MiB/s, write: 19.1 MiB/s
2026-03-26 02:33:52 INFO:  20% (9.2 GiB of 45.4 GiB) in 3m 47s, read: 47.6 MiB/s, write: 22.7 MiB/s
2026-03-26 02:34:02 INFO:  21% (9.6 GiB of 45.4 GiB) in 3m 57s, read: 39.6 MiB/s, write: 15.2 MiB/s
2026-03-26 02:34:13 INFO:  22% (10.1 GiB of 45.4 GiB) in 4m 8s, read: 44.4 MiB/s, write: 19.3 MiB/s
2026-03-26 02:34:26 INFO:  23% (10.7 GiB of 45.4 GiB) in 4m 21s, read: 45.5 MiB/s, write: 21.8 MiB/s
2026-03-26 02:34:30 INFO:  24% (10.9 GiB of 45.4 GiB) in 4m 25s, read: 61.0 MiB/s, write: 39.0 MiB/s
2026-03-26 02:34:44 INFO:  25% (11.4 GiB of 45.4 GiB) in 4m 39s, read: 36.6 MiB/s, write: 17.4 MiB/s
2026-03-26 02:34:51 INFO:  26% (11.8 GiB of 45.4 GiB) in 4m 46s, read: 62.3 MiB/s, write: 28.0 MiB/s
2026-03-26 02:35:00 INFO:  27% (12.3 GiB of 45.4 GiB) in 4m 55s, read: 56.4 MiB/s, write: 28.9 MiB/s
2026-03-26 02:35:13 INFO:  28% (12.8 GiB of 45.4 GiB) in 5m 8s, read: 38.8 MiB/s, write: 20.9 MiB/s
2026-03-26 02:35:34 INFO:  29% (13.2 GiB of 45.4 GiB) in 5m 29s, read: 20.2 MiB/s, write: 20.2 MiB/s
2026-03-26 02:35:58 INFO:  30% (13.7 GiB of 45.4 GiB) in 5m 53s, read: 22.2 MiB/s, write: 22.2 MiB/s
2026-03-26 02:36:20 INFO:  31% (14.1 GiB of 45.4 GiB) in 6m 15s, read: 19.1 MiB/s, write: 17.3 MiB/s
2026-03-26 02:36:36 INFO:  32% (14.6 GiB of 45.4 GiB) in 6m 31s, read: 27.8 MiB/s, write: 12.0 MiB/s
2026-03-26 02:36:44 INFO:  33% (15.0 GiB of 45.4 GiB) in 6m 39s, read: 57.5 MiB/s, write: 56.0 MiB/s
2026-03-26 02:37:00 INFO:  34% (15.5 GiB of 45.4 GiB) in 6m 55s, read: 32.5 MiB/s, write: 31.8 MiB/s
2026-03-26 02:37:10 INFO:  35% (15.9 GiB of 45.4 GiB) in 7m 5s, read: 35.6 MiB/s, write: 23.6 MiB/s
2026-03-26 02:37:28 INFO:  36% (16.4 GiB of 45.4 GiB) in 7m 23s, read: 28.7 MiB/s, write: 27.6 MiB/s
2026-03-26 02:37:48 INFO:  37% (16.8 GiB of 45.4 GiB) in 7m 43s, read: 22.2 MiB/s, write: 22.2 MiB/s
2026-03-26 02:38:05 INFO:  38% (17.4 GiB of 45.4 GiB) in 8m, read: 33.6 MiB/s, write: 33.6 MiB/s
2026-03-26 02:40:53 INFO:  39% (17.8 GiB of 45.4 GiB) in 10m 48s, read: 2.5 MiB/s, write: 2.5 MiB/s
2026-03-26 02:41:12 INFO:  40% (18.2 GiB of 45.4 GiB) in 11m 7s, read: 21.7 MiB/s, write: 21.7 MiB/s
2026-03-26 02:41:32 INFO:  41% (18.7 GiB of 45.4 GiB) in 11m 27s, read: 27.8 MiB/s, write: 27.8 MiB/s
2026-03-26 02:41:45 INFO:  42% (19.1 GiB of 45.4 GiB) in 11m 40s, read: 25.5 MiB/s, write: 25.5 MiB/s
2026-03-26 02:43:22 INFO:  43% (19.6 GiB of 45.4 GiB) in 13m 17s, read: 5.5 MiB/s, write: 5.5 MiB/s
2026-03-26 03:00:42 INFO:  43% (19.9 GiB of 45.4 GiB) in 30m 37s, read: 295.4 KiB/s, write: 295.4 KiB/s
2026-03-26 03:00:42 ERROR: backup write data failed: command error: protocol canceled
2026-03-26 03:00:42 INFO: aborting backup job
2026-03-26 03:00:42 INFO: resuming VM again
2026-03-26 03:00:42 ERROR: Backup of VM 450 failed - backup write data failed: command error: protocol canceled

PostgreSQL logs: https://gist.github.com/paolobarbolini/158f82b95632c5763305ebfc9ca34b2c

dmesg on the VM also shows the problem: https://gist.github.com/paolobarbolini/c0a49efd8cab4bb348e60ef847330797

No dmesg logs around that date on the PVE host.

zpool status shows no errors (the zpool upgrade message is from the recent ZFS 2.3 -> ZFS 2.4 upgrade).

Code:
zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:13:17 with 0 errors on Thu Mar 26 03:57:59 2026
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p3  ONLINE       0     0     0
            nvme1n1p3  ONLINE       0     0     0

errors: No known data errors

Host graphs (time in GMT):

full(1).png

VM graphs (time in GMT):

vm_redacted(1).png

VM hardware config:

Screenshot 2026-03-26 at 06-18-23 pve01 - Proxmox Virtual Environment(1).png

Happy to provide more information.

Thanks,
Paolo
 
Try enabling the fleecing option in the backup job's advanced options.

When a backup is running, new writes first have to go to the PBS storage, and that can give you slowdowns if the PBS is slow, has higher latency, or has limited bandwidth.

The fleecing option creates a temporary disk on a storage (it can be the same local-zfs), which works like a buffer.
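
In case it helps, here is a minimal sketch of how that can be set outside the GUI. The syntax is from my reading of the vzdump documentation, so double-check it against the vzdump man page; <pbs-storage> is a placeholder for your PBS storage ID.

Code:
# one-off backup of the VM with fleecing on the same local-zfs storage (sketch, verify the options)
vzdump 450 --storage <pbs-storage> --fleecing enabled=1,storage=local-zfs

# or as a node-wide default in /etc/vzdump.conf
fleecing: enabled=1,storage=local-zfs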
 
The PVE machine and the PBS machine talk to each other via Wireguard.
Why? Transmission is encrypted, you don't need Wireguard.

When a backup is running, new writes first have to go to the PBS storage, and that can give you slowdowns if the PBS is slow, has higher latency, or has limited bandwidth.
Let me reframe that so it's easier for a beginner to understand. I know this because I only learnt it a few days ago :)

When the backup is running and the VM wants to do a write, the write causes a system slowdown, because it has to go to PBS first before the local VM disk can be written to.
But why does the write have to go to PBS first?

If we look at a timeline, we start the backup at time_0.
We want to back up the data that the system had at exactly time_0.

The backup started, and we are now at time_1.
The VM tries to write to block_1.
Block_1 has already been backed up to PBS, so the VM can write directly to the local VM disk.
No slowdown here.

We are now at time_2.
The VM tries to write to block_2. That block is not yet in PBS.
Hypothetically, if we overwrote block_2 with the new write from the VM, and at time_8 the backup read that changed block and wrote it to PBS, the backup would not be consistent. The backup would be from time_0 for block_1 and from time_2 for block_2.

So what we actually do at time_2 is this:
The VM wants to write block_2. We say "ok, but since block_2 is not in PBS yet, we first need to write its time_0 state to PBS!"
We back up block_2 to PBS and then let the VM do the write.
That is a slowdown, since we need to wait for PBS.

How does fleecing fit into this?
Well, instead of waiting for the high-latency, low-bandwidth, or otherwise slow connection to PBS to finish the write of block_2, we write block_2 (in its time_0 state) to the fleecing device. That way PVE can come back at time_8 and read block_2 from the fleecing device instead of our already changed VM disk.
The VM still has to wait until the write to the fleecing device is done, but since this is a fast local SSD, the wait will potentially be much shorter than waiting for PBS.
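
To put rough numbers on the difference, assuming PBS's 4 MiB fixed-size chunks for VM images and the 20 Mbit/s upload to PBS from the first post: one 4 MiB chunk is roughly 33.6 Mbit, so copying it to PBS before the guest write can complete takes around 1.7 s even when the link is healthy, and far longer once the link degrades like in your log. Writing the same 4 MiB to a local NVMe fleecing image takes only a few milliseconds, so the guest barely notices.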

PS: I have clients with copper and slow uploads. While an SSD is of course always better, even a single slow 3.5" HDD as a fleecing device can be faster than a remote PBS.
 
Why? Transmission is encrypted, you don't need Wireguard.
The current PBS does not have a public IP address. I'm planning to add a second PBS with a good connection, and then have the original one sync from the new one. That would make it possible to restore the bigger VMs without having to wait almost half a day.

Try enabling the fleecing option in the backup job's advanced options.
Let me reframe that so it's easier for a beginner to understand. I know this because I only learnt it a few days ago :)
Thank you both for your quick replies and explanations. This cleared up my understanding of how backups work.

I have now enabled Fleecing and configured it to use the same ZFS storage as the VMs, then manually triggered the backup job. This time, the VM did not experience any I/O delays.
 
The current PBS does not have a public IP address.
It probably does not need one. If you have Wireguard (probably over IPv4), there is already a public IPv4, and you can probably forward port 8007 or something like that, NATed to the local IPv4 of your PBS.

Or even better, most ISPs offer static IPv6 /56 prefixes. Simply use IPv6. Then you can even make great firewall rules like
"Allow all TCP traffic from the static IPv6 of PVE to the static IPv6 of PBS on destination port 8007".
I have now enabled Fleecing and configured it to use the same ZFS storage as the VMs, then manually triggered the backup job. This time, the VM did not experience any I/O delays.
Nice!
 
It probably does not need one. If you have Wireguard (probably over IPv4), there is already a public IPv4, and you can probably forward port 8007 or something like that, NATed to the local IPv4 of your PBS.

Or even better, most ISPs offer static IPv6 /56 prefixes. Simply use IPv6. Then you can even make great firewall rules like
"Allow all TCP traffic from the static IPv6 of PVE to the static IPv6 of PBS on destination port 8007".
I don't disagree, I could use the second solution given that PBS is behind NAT, but the current config is just more convenient.

For example, I have IPv6 on all servers, at the office and at home, but AFAIK none of the mobile ISPs in Italy offer IPv6 (including when roaming) with a large enough prefix to also work via the hotspot, so if I fully relied on IPv6 there would be some cases in which I wouldn't be able to connect remotely to the PBS myself. I could then use IPv4 + Dynamic DNS, but that's too annoying when I can simply use Wireguard and, with no extra effort, also avoid opening ports to the entire world.

The only real issue Wireguard has caused so far, I guess, was making PVE take 15 minutes to realize PBS was gone, but now that I have enabled fleecing I don't expect that to matter anymore.
 
which I wouldn't be able to connect remotely to the PBS myself.
I am not trying to take away your wireguard :)
I just don't get why you would want traffic that is already encrypted (PVE to PBS) to be encrypted again and routed through a tunnel, when there is no need for it.

That is regardless of whether you use IPv4 or IPv6 for it.
IPv6 is just nicer in terms of firewall rules: no pesky NAT and no dyndns needed, because you get a static prefix for free, while for IPv4 you have to pay for a static IP. But even without IPv6, I would not use Wireguard to tunnel my IPv4 traffic from PVE to PBS.