Can't remove image when migrated with CEPH + KRBD

hybrid512

Hi,

I just found a nasty bug when using Proxmox 4.2 in a clustered setup with Ceph and KRBD-enabled storage.

With KRBD, a /dev/rbdXX device is created on the server to give access to the RBD image.
When migrating a VM that uses such a volume from server A to server B, the /dev/rbdXX device stays mapped on server A and is then mapped again on server B.
If you then try to delete this VM, Ceph complains because there are still watchers on the image.
In fact, the "watcher" is the /dev/rbdXX device that is still mapped on server A.
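For illustration, here is roughly the KRBD lifecycle involved (just a sketch, with an example image name in the default "rbd" pool; the device number depends on mapping order):

Code:
# KRBD maps the RBD image into the kernel and exposes it as a block device:
rbd map rbd/vm-100-disk-1        # creates e.g. /dev/rbd0 and registers a watcher
rbd showmapped                   # lists the images currently mapped on this node
# Migration maps the image again on server B, but the old mapping stays on
# server A because the corresponding unmap is never run there:
rbd unmap /dev/rbd0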

To be able to remove the image, you first have to unmap this device on server A.

So, to conclude: when migrating a VM between servers with KRBD images involved, Proxmox should not forget to properly unmap the device on the source node.

Regards.
 
Which Ceph version are you using? I cannot reproduce this here using Hammer (both client & cluster).
 
I ran into this problem when I cloned a VM with Ceph storage devices to another host. The clone itself worked fine, but I couldn't remove the cloned VM afterwards because there were still RBD devices mapped on the host the VM was cloned from.

Running Ceph 0.94.7 and the latest Proxmox (free) updates.
 
Package versions

proxmox-ve: 4.2-54 (running kernel: 4.4.10-1-pve)
pve-manager: 4.2-15 (running version: 4.2-15/6669ad2c)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.8-1-pve: 4.4.8-52
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-42
qemu-server: 4.0-81
pve-firmware: 1.1-8
libpve-common-perl: 4.0-68
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-55
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-19
pve-container: 1.0-68
pve-firewall: 2.0-29
pve-ha-manager: 1.0-32
ksm-control-daemon: 1.2-1
glusterfs-client: 3.6.9-2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
ceph: 0.94.7-1~bpo80+1
 
Here is a way to fix the problem when you run into it.

On any node with access to Ceph, first get the image's internal ID from its block_name_prefix:

Code:
rbd info vm-100-disk-1
rbd image 'vm-100-disk-1':
        size 1024 TB in 268435456 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.82072ae8944a
        format: 2
        features: layering

Then list the watchers on the image's header object (same ID, but with the rbd_header. prefix):

Code:
rados -p <ceph pool name> listwatchers rbd_header.82072ae8944a

This returns a line like the following, where the address after "watcher=" is the IP of the node on which the image is still mapped.

Code:
watcher=172.16.10.123:0/2563416157 client.4043115 cookie=1

Connect to that node and type:

Code:
rbd showmapped
id pool image          snap device    
0  rbd  vm-100-disk-1 -    /dev/rbd0  
1  rbd  vm-101-disk-1 -    /dev/rbd1  

rbd unmap /dev/rbd0

Once unmapped, you can delete the image (by deleting the corresponding VM, for example).
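For what it's worth, the manual steps above can be strung together in a small shell sketch. This is just an illustration, not a supported tool: it assumes passwordless root SSH to the cluster nodes, a format 2 image (so the header object is rbd_header.<id>), and it takes the pool and image name as arguments. Run it only after the VM has been stopped, since it unmaps the image on every node that still watches it.

Code:
#!/bin/bash
# Usage: ./unmap-stale.sh <pool> <image>    e.g. ./unmap-stale.sh rbd vm-100-disk-1
set -e
POOL="$1"; IMAGE="$2"

# Extract the internal image ID from the block_name_prefix (rbd_data.<id>)
ID=$(rbd info "$POOL/$IMAGE" | awk -F'rbd_data.' '/block_name_prefix/ {print $2}')

# Each watcher line looks like: watcher=172.16.10.123:0/2563416157 client.4043115 cookie=1
for IP in $(rados -p "$POOL" listwatchers "rbd_header.$ID" | sed 's/watcher=//; s/:.*//'); do
    echo "Stale mapping on $IP, unmapping ..."
    # On that node, find the /dev/rbdX device belonging to this image and unmap it
    ssh "root@$IP" "rbd showmapped | awk -v img=$IMAGE '\$3==img {print \$5}' | xargs -r rbd unmap"
done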
 
Thanks! A patch that fixes this is on pve-devel.
 
