WARNING! Do not use PBS with Kernel 7 on a Synology NAS

jacotec

Member
Nov 19, 2024
Kerpen, DE
Hi Community,

Although it is neither officially supported nor best practice, I know that many people (like me) run PBS in a VM on a Synology NAS to back up their homelabs. I had zero issues with this over the last 18 months.

Two days ago a backup session suddenly failed with a non-writable datastore. The datastore mount had fallen back to emergency-ro with no further reason given. I ran e2fsck on the datastore, which found no issues, and rebooted the VM.

Yesterday evening, when I was already in bed, my monitoring raised the alarm: PBS down, backup datastore mounts on the host down, Synology down. The complete DS923+ had crashed. It did not even react to a shutdown via the button, so I had to force a power-off. I rebooted, ran e2fsck on the datastore, remounted it, and started the failed backup session manually.

That backup session again failed after 3 minutes. Datastore was back in emergency-ro.
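
To spot the emergency-ro state without waiting for a backup job to fail, something like the following could be run on the PBS guest. This is only a minimal sketch: the function name `list_ro_ext4` is made up for illustration, and it assumes the datastore is an ext4 filesystem that shows up in `/proc/mounts`.

```shell
# Hypothetical helper: reads mount entries in /proc/mounts format on stdin
# and prints "device mountpoint" for every ext4 filesystem mounted read-only.
# Field 3 is the filesystem type, field 4 the comma-separated mount options.
list_ro_ext4() {
    awk '$3 == "ext4" && $4 ~ /(^|,)ro(,|$)/ { print $1, $2 }'
}

# Example usage on the guest:
#   list_ro_ext4 < /proc/mounts
```

An empty result means no ext4 filesystem is currently read-only. The option `errors=remount-ro` on a healthy mount is deliberately not matched, since only a standalone `ro` option indicates the remount has happened.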

What was happening?

Three days ago I updated PBS. The update brought the new V7 kernel, which was then in use. That was the start of all the issues.

The crash only happens under a certain load, which in my case means all 3 hosts in my homelab running a backup to PBS. This leads to DID_BAD_TARGET I/O errors in the PBS VM:

Code:
   sd 1:0:1:0: [sdb] tag#X FAILED Result: hostbyte=DID_BAD_TARGET
   I/O error, dev sdb, sector ...
   EXT4-fs (sdb1): Remounting filesystem read-only

There are no disk errors etc. on the host (Synology) side, and e2fsck on the virtual datastore disk never found errors. The failure is reproducible with every backup that causes significant load.
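
To check whether a PBS guest is hitting the same failure, the kernel log can be searched for the two signatures from the excerpt above. The following is a rough sketch under that assumption; `count_datastore_errors` is a hypothetical name, and the patterns are taken only from the log lines quoted in this post.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the number
# of lines containing either failure signature from the excerpt above.
# grep -c exits non-zero when it finds nothing, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_datastore_errors() {
    grep -c -E 'hostbyte=DID_BAD_TARGET|Remounting filesystem read-only' || true
}

# Example usage on the guest (needs permission to read the kernel log):
#   dmesg | count_datastore_errors
#   journalctl -k -b | count_datastore_errors
```

A result of 0 after a full concurrent backup run would suggest the guest is not affected, although it does not prove it.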

My Synology:
DS923+, DSM 7.3.2-86009 (Kernel 4.4)
Guest: PBS-VM with Kernel 7.0.2-4-pve
Storage: ext4 on virtio-scsi/blk, btrfs/SHR on the Synology host

Solution: Pin the kernel to V6. In my case:

Code:
proxmox-boot-tool kernel pin 6.17.13-9-pve

All is running well here since then.
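
The version string above is specific to this system. As a hedged sketch of how the pin could be chosen on another installation, the helper below picks the highest installed kernel of a given major version. The name `newest_kernel_for_major` is invented for illustration, and it assumes the installed kernels are listed as directory names under `/lib/modules` and that GNU `sort -V` is available.

```shell
# Hypothetical helper: reads kernel version strings on stdin (one per line)
# and prints the highest one whose major version equals the first argument.
newest_kernel_for_major() {
    grep -E "^$1\." | sort -V | tail -n 1
}

# Example usage on the PBS guest (run as root):
#   ls /lib/modules | newest_kernel_for_major 6     # show the candidate
#   proxmox-boot-tool kernel list                   # list kernels known to the boot tool
#   proxmox-boot-tool kernel pin "$(ls /lib/modules | newest_kernel_for_major 6)"
#   reboot
#   uname -r                                        # confirm the running kernel
#   proxmox-boot-tool kernel unpin                  # later, to return to the default
```

Checking `uname -r` after the reboot matters, because the pin only takes effect on the next boot.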

I guess the virtio/virtio-scsi behavior in Kernel 7 might have some compatibility issues with the QEMU/Kernel-4.4 in DSM.

EDIT: I don't think it's related only to DSM. I found this thread, which describes an identical issue on a PVE host, with everything pointing to the io_uring changes in Kernel 7: https://forum.proxmox.com/threads/hung-on-restore-since-upgrade-to-kernel-7-proxmox-9.183717/

Take care,
Marco
 
Thanks for sharing this. I can confirm I’ve seen very similar behavior after upgrading to kernel 7 on a PBS VM running on non-native hardware. In my case, the issue also appeared only under higher I/O load, especially during concurrent backup jobs.


The read-only remount with DID_BAD_TARGET errors is a good catch. I initially suspected disk or controller issues, but like you mentioned, filesystem checks came back clean. Rolling back to a 6.x kernel stabilized everything for me as well.


It does seem like there could be a compatibility gap with virtio-scsi under certain host environments, especially when the underlying system is not a standard Linux kernel (like DSM or older kernels). The io_uring changes in kernel 7 might be a factor here.


For now, pinning the kernel feels like the safest workaround until there’s more clarity or a fix upstream.
 
OK, I am experiencing kernel 7 issues as well with PBS on a bare metal server. Pinning the kernel to 6.17 fixes the problem.

The PBS runs fine UNTIL I run a backup. Ha ha. It starts out normally, then partway through it gets VERY slow, and finally it locks up. The slowness makes me think there is a memory leak or an SSD is overheating. But since kernel 6.17 doesn't have that issue, we can rule out the SSD.

In exploring this further, it appears that kernel 7.0.10 fixes a memory leak related to networking. Apparently, earlier 7.0 kernels leak memory when networking is run to the limit. The latest kernel Proxmox has released is 7.0.6, so we may need to wait until they release the 7.0.10 kernel for PBS.

Incidentally, I have another PBS that runs kernel 7.0.6 just fine, however the hardware is quite a bit beefier on that unit so it's probably able to keep up with the network flow. But now I'm worried about it. And my PVE instances. Hopefully Proxmox releases a 7.0.10 kernel Real Soon Now.
 