LXC Boot and vzdump Failures on Ceph RBD after Upgrade: fsconfig() failed ... Can't lookup blockdev (Exit Code 32)

chrispage1

Hi,

After all of the CVEs and security disclosures over the past week or two, I thought it'd be useful to upgrade my Proxmox nodes to the latest packages. I recently updated everything to PVE 9 & Ceph 19, and it had been working without fault for a few weeks. Since today's package updates, I am getting sporadic LXC boot and vzdump failures with the error Can't lookup blockdev (exit code 32).

Environment:

- PVE 9.1.9 / Linux 6.17.13-9-pve (2026-05-15T08:46Z)
- Ceph 19.2.3 - RBD (krbd mapped)
- Guest LXC container 119

---

After upgrading a cluster node, I hit two distinct but identically rooted issues involving LXC container storage mapping. The system fails during automated storage operations with a mount exit code 32, specifically spitting out an underlying kernel fsconfig() error.

I suspected it might be limited to the one node because it was ahead of the others, so I continued the upgrade on the rest. I can now replicate the issue on more than one node.

Symptom 1: Container Fails to Start (HA or Local)

When attempting to boot the container, it immediately crashes during the pre-start phase. Checking the container debug logs (lxc-start -n 119 -F -l DEBUG -o /tmp/lxc.log) gave:

Code:
lxc-start produced output: mount: /var/lib/lxc/.pve-staged-mounts/mp0: fsconfig() failed: /dev/rbd-pve/[FSID]/ceph_data/vm-119-disk-1: Can't lookup blockdev.
dmesg(1) may have more information after failed mount system call.
command 'mount /dev/rbd-pve/[FSID]/ceph_data/vm-119-disk-1 /var/lib/lxc/.pve-staged-mounts/mp0' failed: exit code 32
ERROR: Failed to run lxc.hook.pre-start for container "119"

Manually running rbd map to force the symlink to generate, leaving it mapped, and then executing pct start allowed the container to boot on a subsequent attempt.
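The manual workaround above could be scripted roughly as follows. This is only a sketch: map_then_start and run are hypothetical helper names, the pool name ceph_data is taken from the device path in the log, and it assumes rbd can reach the cluster with its default config and keyring. It leaves the image mapped, as in the manual procedure.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround: map the RBD image first, let udev
# finish creating its symlinks, then start the container.

# run CMD...: prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# map_then_start CTID POOL IMAGE (hypothetical helper, not part of PVE)
map_then_start() {
    local ctid="$1" pool="$2" image="$3"
    run rbd map "${pool}/${image}" || return 1
    # Wait for the udev event queue to drain so the /dev symlinks exist.
    run udevadm settle --timeout=10 || return 1
    run pct start "$ctid"
}

# Example (prints the commands without running them):
# DRY_RUN=1 map_then_start 119 ceph_data vm-119-disk-1
```

Running it with DRY_RUN=1 first shows the exact commands before anything touches the cluster.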

Symptom 2: vzdump Snapshot Backups Fail

Even with the container successfully running, executing an automated vzdump backup in snapshot mode to a Proxmox Backup Server (PBS) fails instantly at the mount phase:

Code:
INFO: create storage snapshot 'vzdump'
Creating snap: 100% complete...done.
/dev/rbd4
mount: /mnt/vzsnap0: fsconfig() failed: /dev/rbd-pve/[FSID]/ceph_data/vm-119-disk-0@vzdump: Can't lookup blockdev.
umount: /mnt/vzsnap0/: not mounted.
ERROR: Backup of VM 119 failed - command 'mount -o ro,noload /dev/rbd-pve/[FSID]/ceph_data/vm-119-disk-0@vzdump /mnt/vzsnap0//' failed: exit code 32

After some consultation with AI, it appears to be an asynchronous race condition between the kernel mapping the RBD and udevd generating the links inside /dev/rbd-pve/[FSID]/...
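If the race hypothesis is right, waiting for the symlink before calling mount should avoid the failure. A minimal sketch of such a wait, assuming a hypothetical helper wait_for_path (this is not how PVE itself handles it, and the race is unconfirmed):

```shell
#!/usr/bin/env bash
# Sketch only: poll until a device path exists before mounting it.

# wait_for_path PATH [TIMEOUT_SECONDS]
# Checks every 0.1 s whether PATH exists (symlinks are followed).
# Returns 0 as soon as it does, non-zero if the timeout expires first.
wait_for_path() {
    local path="$1"
    local timeout="${2:-10}"
    local tries=$(( timeout * 10 ))
    local i
    for (( i = 0; i < tries; i++ )); do
        if [ -e "$path" ]; then
            return 0
        fi
        sleep 0.1
    done
    # One last check after the final sleep.
    [ -e "$path" ]
}

# Example (commented out; path copied from the log, [FSID] left as-is):
# dev='/dev/rbd-pve/[FSID]/ceph_data/vm-119-disk-1'
# udevadm settle --timeout=10 --exit-if-exists="$dev"
# wait_for_path "$dev" 10 && mount "$dev" /var/lib/lxc/.pve-staged-mounts/mp0
```

udevadm settle with --exit-if-exists does much the same job using udev's own event queue; the polling loop is just a fallback that needs nothing beyond the shell.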

Is this a known issue or has anyone else experienced this?
 
I did try to reproduce this but couldn't; more information about your setup (CT/Ceph config etc.) would be useful.

Does this persist after you reboot your nodes?
 
Hi Dominik,

Thanks for your reply!

Sure - happy to supply as much information as I can. So this has only begun since the latest update and reboot and can be replicated across nodes.

This cluster has gone all the way from PVE 6 with Ceph 16 (from memory) and has been upgraded step by step to PVE 9 with Ceph 19. The issues seem spontaneous, like some sort of race condition. I can replicate them with multiple containers.

[Attachment: 1779284761517.png]

Here are the exact upgrade version changes from my dist-upgrade log:
  • Proxmox Kernel: upgraded to proxmox-kernel-6.17.13-9-pve-signed (6.17.13-9)
  • pve-container: upgraded from 6.1.4 to 6.1.5
  • udev / systemd: upgraded from 257.9-1~deb13u1 to 257.13-1~deb13u1
  • libc6: upgraded from 2.41-12+deb13u2 to 2.41-12+deb13u3

Here is my entire LXC config (some bits redacted):

Code:
root@pve03:/etc/pve/lxc# cat 119.conf
arch: amd64
cores: 6
features: nesting=1
hostname: search-server
memory: 10240
mp0: ceph_data:vm-119-disk-1,mp=/etc/meilisearch,backup=1,size=20G
nameserver: 8.8.8.8 8.8.4.4
net0: name=eth0,bridge=BRIDGE,firewall=1,gw=XXX.XX.XXX.XXX,hwaddr=02:61:D4:8F:94:07,ip=XXX.XX.XXX.XXX/27,type=veth
ostype: ubuntu
rootfs: ceph_data:vm-119-disk-0,size=20G
swap: 0
tags: services
unprivileged: 1
 
Possibly related observation from our environment.

Same symptom: mount: ... fsconfig() failed: /dev/rbd-pve/[FSID]/[pool]/[image]@vzdump: Can't lookup blockdev during vzdump --mode snapshot of LXC on Ceph RBD. Persists across systemctl restart systemd-udevd and reboot.

Versions:
- pve-manager 9.1.19, pve-container 6.1.10, ceph-common 19.2.3-pve4
- Running kernel: proxmox-kernel-7.0.2-6-pve (same symptom on this kernel as OP saw on 6.17.x)

Tests point to a NUMA-related trigger:

In a 7-node cluster running an identical software stack, our tests show the bug triggers reliably only on the single dual-socket / dual-NUMA-node host. All 6 single-socket hosts appear unaffected. Migrating the failing CT to a single-socket host and running vzdump there: works. Migrating back: failure resumes.

Tests also suggest it's the timing race described:
- After vzdump's mount fails, the expected symlink /dev/rbd-pve/[FSID]/[pool]/[image]@vzdump exists and points correctly to the mapped /dev/rbdN
- Running the same mount -o ro,noload manually after the failure succeeds
- ceph-rbdnamer-pve returns the correct path when invoked manually on the still-mapped snap device
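The symlink check from the list above can be repeated right after a failed mount with something like the following. check_link is a hypothetical helper; it only tests that the link resolves to an existing path, so on a real host one would additionally confirm the target is a block device (test -b).

```shell
#!/usr/bin/env bash
# Sketch only: report whether a /dev symlink exists and what it resolves to.

# check_link PATH
# Prints "ok -> TARGET" and returns 0 if PATH is a symlink whose target exists.
# Prints a short reason and returns 1 otherwise.
check_link() {
    local path="$1"
    local target
    if [ ! -L "$path" ]; then
        echo "missing-or-not-symlink"
        return 1
    fi
    target="$(readlink -f -- "$path")"
    if [ -e "$target" ]; then
        echo "ok -> $target"
    else
        echo "dangling -> $target"
        return 1
    fi
}

# Example (commented out; placeholders from the post left as-is):
# check_link '/dev/rbd-pve/[FSID]/[pool]/[image]@vzdump'
```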

Tentative hypothesis (not verified):
On dual-socket NUMA systems, cross-socket scheduling between the kernel RBD-map event path and the systemd-udevd worker running ceph-rbdnamer-pve may extend symlink-creation latency past vzdump's mount syscall. On single-socket hosts the worker tends to stay colocated and wins the race.
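For anyone wanting to compare their own hosts against this hypothesis, the number of NUMA nodes can be read from sysfs. A small sketch, with count_numa_nodes as a hypothetical helper; the optional directory argument exists only so the function can be tried against a test tree:

```shell
#!/usr/bin/env bash
# Sketch only: count NUMA nodes by listing nodeN directories in sysfs.

# count_numa_nodes [SYSFS_NODE_DIR]
# Prints the number of nodeN directories under the given directory
# (default: /sys/devices/system/node). Prints 0 if none are found.
count_numa_nodes() {
    local dir="${1:-/sys/devices/system/node}"
    local n=0
    local d
    for d in "$dir"/node[0-9]*; do
        if [ -d "$d" ]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

# Example (commented out):
# echo "NUMA nodes on this host: $(count_numa_nodes)"
```

A result above 1 would put a host in the same category as the dual-socket machine described here; it says nothing by itself about whether the race actually occurs.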