I have a small 5-node Ceph (Hammer) test cluster. Every node runs Proxmox, a Ceph MON, and one or two OSDs. There are two pools defined: one keeping two copies of the data (pool2) and one keeping three (pool3). Ceph has a dedicated 1 Gbps network. A few RAW disks are currently stored on pool2, belonging to an OpenMediaVault KVM guest among others.
Last night one of the nodes spontaneously rebooted during backups (probably due to memory exhaustion on ZFS; nothing in the logs points to a cause), and since then the RAW disk on Ceph pool2 that is attached to a KVM guest has vanished.
When I try to start the KVM guest, it fails with a "no such file" error:
Code:
root@proxmox:~# qm start 126
kvm: -drive file=rbd:rbd/vm-126-disk-1:mon_host=192.168.0.7;192.168.0.6;192.168.0.5:id=admin:auth_supported=cephx:keyring=/etc/pve/priv/ceph/pool2.keyring,if=none,id=drive-virtio2,cache=writeback,format=raw,aio=threads,detect-zeroes=on: error reading header from vm-126-disk-1: No such file or directory
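To rule out the Proxmox GUI as the problem, I also checked the image directly with the rbd CLI (note the VM config points at a pool named rbd, per the error above; adjust the pool name and keyring to match your setup):

```shell
# List all RBD images in the pool the VM config references
rbd ls rbd

# If the image is listed, inspect its header (this is what KVM failed to read)
rbd info rbd/vm-126-disk-1
```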
When I check Ceph from the Proxmox web interface, everything looks fine: all monitors are running with quorum, all OSDs are up and in, and all PGs are active+clean. ceph status shows nothing beyond the long-standing "too many PGs per OSD" warning:
Code:
root@proxmox:~# ceph status
    cluster 98c9d762-ea24-4e28-88b9-a0a585d53cfd
     health HEALTH_WARN
            too many PGs per OSD (537 > max 300)
     monmap e5: 5 mons at {0=192.168.0.3:6789/0,1=192.168.0.4:6789/0,2=192.168.0.5:6789/0,3=192.168.0.6:6789/0,4=192.168.0.7:6789/0}
            election epoch 368, quorum 0,1,2,3,4 0,1,2,3,4
     osdmap e329: 5 osds: 5 up, 5 in
      pgmap v1053656: 1088 pgs, 3 pools, 2764 GB data, 691 kobjects
            5536 GB used, 4679 GB / 10216 GB avail
                1088 active+clean
If I check the pool2 storage in the web interface, it shows the expected usage of 54.19% (5.41 TiB of 9.98 TiB), yet under Content no RAW disks are visible:
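For completeness, pool usage versus visible images can also be compared from the CLI, bypassing the GUI entirely (assuming pool2 is the Ceph pool name behind the Proxmox storage):

```shell
# Per-pool object counts and space usage, to confirm the data is still there
rados df

# Long listing of RBD images (with sizes) in pool2
rbd ls -l pool2
```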
The Ceph logs show nothing in particular. Has anyone seen anything like this? Where did my data go?