rbd - vm image files inaccessible after update to 7.4-16

stirling (New Member), Sep 18, 2023

Four-server cluster, each server with 4 Ceph OSDs. Ran updates on the servers without rebooting.
Rebooted server1; it came back online and looks fine.
Rebooted server2, but the external interface wouldn't come back up. Ceph complained about clock skew because the node couldn't reach the NTP server to sync its time.
Eventually solved that after a few days and got server2 back up (vmbr0 had to be removed from the config and then recreated in the GUI).
server3 and server4 haven't been rebooted yet, but ceph was restarted after server2 was back online.
The VMs that were running on server1, server3, and server4 are still running. VMs that were powered off, and those on server2, won't start (rbd error: can't open disk).

Accessing the rbd storage in the gui gives:
rbd error: rbd: listing images failed: (2) No such file or directory (500)
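Before touching anything, a few read-only checks might help narrow this down, e.g. whether the cluster itself is healthy and whether the half-rebooted nodes are running mixed daemon versions after the update (nothing here changes any state):

# ceph -s
# ceph health detail
# ceph versions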

The common solution I see in the forum for that error is to delete the disk images in rbd (assuming they're not important).
I'm unsure how to proceed or how to recover the VM disk images on rbd.
There's a chance rebooting servers 3 and 4 could bring things back, but there are VMs running that I'd rather not bring down until I have a good plan.
Maybe there's a service that can be restarted instead to get rbd access back.
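If it turns out to be a stuck daemon rather than missing data, restarting the Proxmox status daemons and the local Ceph services on one node at a time might be enough, without a full reboot. This is only a sketch; I'd do it node by node and check ceph -s in between:

# systemctl restart pvestatd pvedaemon pveproxy
# systemctl restart ceph-mon.target
# systemctl restart ceph-mgr.target

(ceph-mon.target and ceph-mgr.target only restart the monitor/manager instances on the node where the command is run.)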

More info:

# ceph osd pool stats
pool .mgr id 1
client io 3.7 KiB/s wr, 0 op/s rd, 0 op/s wr

pool cephfs_data id 2
nothing is going on

pool cephfs_metadata id 3
nothing is going on


# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content backup,vztmpl,iso

zfspool: local-zfs
pool rpool/data
content rootdir,images
sparse 1

cephfs: cephfs
path /mnt/pve/cephfs
content iso,backup,vztmpl
fs-name cephfs

rbd: cephrbd
content images,rootdir
krbd 0
pool .mgr

pbs: pbs
datastore backup
server pbs
content backup
fingerprint ..
prune-backups keep-all=1
username pmve@pbs
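As a cross-check, it might be worth confirming how the pool the cephrbd storage points at (.mgr in the config above) is set up on the Ceph side; in recent Ceph releases .mgr is normally the manager's own internal pool, so it may be worth double-checking which pool the VM disks were meant to live on. These commands only read state:

# ceph osd pool ls detail
# ceph osd pool application get .mgr
# ceph df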


# pvesm status
Name Type Status Total Used Available %
cephfs cephfs active 4392787968 29970432 4362817536 0.68%
cephrbd rbd active 4602766863 239945743 4362821120 5.21%
local dir active 1129769472 37028864 1092740608 3.28%
local-zfs zfspool active 1092740750 24 1092740726 0.00%
pbs pbs active 618149024 130858672 455816688 21.17%


# rbd -p .mgr list
base-109-disk-0
vm-100-disk-0
vm-101-disk-0
vm-102-disk-0
vm-102-state-upgrade
vm-103-disk-0
vm-104-disk-0
vm-105-disk-0
vm-106-disk-0
vm-107-disk-0
vm-108-disk-0
vm-110-disk-0
vm-111-disk-0
vm-112-disk-0


# rbd -p .mgr list --long
rbd: error opening base-109-disk-0: (2) No such file or directory
rbd: error opening vm-100-disk-0: (2) No such file or directory
rbd: error opening vm-101-disk-0: (2) No such file or directory
rbd: error opening vm-102-disk-0: (2) No such file or directory
rbd: error opening vm-102-state-upgrade: (2) No such file or directory
rbd: error opening vm-103-disk-0: (2) No such file or directory
rbd: error opening vm-104-disk-0: (2) No such file or directory
rbd: error opening vm-105-disk-0: (2) No such file or directory
rbd: error opening vm-106-disk-0: (2) No such file or directory
rbd: error opening vm-107-disk-0: (2) No such file or directory
rbd: error opening vm-108-disk-0: (2) No such file or directory
rbd: error opening vm-110-disk-0: (2) No such file or directory
rbd: error opening vm-111-disk-0: (2) No such file or directory
rbd: error opening vm-112-disk-0: (2) No such file or directory
NAME SIZE PARENT FMT PROT LOCK
rbd: listing images failed: (2) No such file or directory
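It might be worth checking at the RADOS level whether the image metadata is actually gone or whether only the listing is broken. Something along these lines (read-only; .mgr is the pool name from storage.cfg above, vm-100-disk-0 is one of the images listed):

# rados -p .mgr ls | grep -E 'rbd_directory|rbd_header' | head
# rados -p .mgr listomapvals rbd_directory
# rbd info .mgr/vm-100-disk-0

If the rbd_header.* objects are still there, the image data is probably intact and only the metadata lookup is failing; if they're missing, the problem is deeper.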
 
Hi. I found out I had this problem too after upgrading from 7.3 to the latest 7.4-16. I don't really know if the upgrade is the culprit, or if the problem was only noticed after Oct 6th. In my case I can list the images with "sudo ceph rbd ls", and when I fed that output to "rbd info" I found that the "rbd: error opening image vm-xxx-disk-3: (2) No such file or directory" error occurred only on one particular image. That image is referenced as an unused disk on a VM... The question is: how to fix this?
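In case it helps others reproduce the check: a loop like this prints only the images that rbd info cannot open (POOL is just a placeholder for whatever pool name is in use):

# for img in $(rbd ls -p POOL); do rbd info POOL/"$img" >/dev/null 2>&1 || echo "broken: $img"; done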

Any feedback from Proxmox team?

UPDATE: I managed to remove the "stuck" image with the command "rbd rm -p xxx vm-xxx-disk-3". I can now see the "VM Disks" listing in the storage dashboard for that Ceph rbd. Still doing some more tests.
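For anyone following along: after removing a leftover image like that, the matching unusedN entry may still sit in the VM config. Something like the following should clean it up (the VMID 105 and the unused0 key are only examples here; check qm config first to see which entry actually points at the removed image):

# qm config 105
# qm set 105 --delete unused0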
 
