Greetings, I'm almost at my wits' end with an issue we are seeing on Proxmox (Ceph storage)... it is probably down to the fact that I'm quite new to Proxmox (I practically learned about it a month ago).
The setup is a 3 node cluster.
Storage is managed by Ceph, and it comprises 12 SSDs (1 TB each), default replication.
About a month ago (space used was about 1.3 TB, with around 2.2 TB free) a VM with a (faulty) massive amount of assigned disk space was cloned, but (as you may guess from the embedded image) it went horribly south: the task got stuck at a given percentage and, once we realized what was happening, we stopped it (stopped: unexpected status).
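For context, the GUI clone corresponds roughly to the following CLI call; the VM IDs below are placeholders, not our real ones:

# full clone of source VM 100 into new VM 101; this is the task that got stuck
qm clone 100 101 --full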
The first try was to remove the clone gone awry, which showed in the GUI as locked: I unlocked it via the CLI and tried to remove it, to no avail... that is when we hit the first of many instances (on every clone/migrate/destroy action on any VM) of the following error message:
TASK ERROR: rbd error: rbd: listing images failed: (2) No such file or directory
Furthermore, the same message was the only output we got in the VM section of the Proxmox GUI.
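As far as I understand, the GUI content listing maps to something like the command below under the hood, which is where that listing error comes from (the pool name is a placeholder for ours):

# list the RBD images in the Ceph pool backing the VM storage ("vm-pool" is a placeholder)
rbd ls -p vm-pool --long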
Trying to remove it through the CLI reported:
rbd: delete error: (117) Structure needs cleaning.
error: image is pending moving to the trash.
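Concretely, the removal attempts looked roughly like this (VM ID 101 and the pool/image names are placeholders, not our real ones):

# the Proxmox way, after unlocking the VM
qm unlock 101
qm destroy 101
# and directly against the RBD image behind it, which is what returned error (117)
rbd rm vm-pool/vm-101-disk-0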
It took a lot of reading and interpreting of documentation and similar user cases, but eventually we fixed that error: after checking for snapshots and trash entries both at the Ceph and at the RADOS level, clearing a leftover index entry at the RADOS level unstuck "something" and all actions on VMs and the GUI went back to normal... it was a solution, though I'm still not sure it was the correct one.
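For anyone hitting the same wall, the checks and the final cleanup were roughly along these lines; I'm reconstructing from notes, so treat the pool, image, and key names as placeholders and double-check the omap values before removing anything:

# leftover snapshots and trash entries on the broken image
rbd snap ls vm-pool/vm-101-disk-0
rbd trash ls -p vm-pool --all
# down at the RADOS level, the index that maps image names to ids
rados -p vm-pool listomapkeys rbd_directory
rados -p vm-pool listomapvals rbd_directory
# removing the stale entry left behind by the half-created clone is what unstuck things
rados -p vm-pool rmomapkey rbd_directory name_vm-101-disk-0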
All in all, the faulty clone was finally removed.
We also had quite a ruckus of HEALTH_WARNs, since a lot of OSDs got the "nearfull" status... after a day of rebalancing the situation got a little better, but I still cannot for the life of me understand why all that disk space reads as used.
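While it rebalanced, this is what we kept an eye on:

# which OSDs are nearfull and how unevenly they are filled
ceph health detail
ceph osd df tree
# overall cluster state and recovery progress
ceph -s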
So far the system is working: we have no more health warnings, all OSDs are green and all PGs are active+clean... still I can't figure out why the space usage is so high. Adding up all VMs we sit at 1.44 TB not counting replication, while ceph df reports 8.8 TB used and 2.1 TB available, replication included.
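To put numbers next to that, this is how I have been comparing what Ceph reports against what the images should actually hold (the pool name is a placeholder):

# pool-level usage as Ceph sees it (where the 8.8 TB figure comes from)
ceph df detail
# per-image provisioned vs actually used space
rbd du -p vm-pool
# and a check that nothing is still sitting in the RBD trash
rbd trash ls -p vm-pool --all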
Is there a way to have Ceph recheck the space it has assigned to the existing VMs?
Hope I was clear enough in explaining what happened.