[SOLVED] ceph can not remove image - watchers

RobFantini

Hello,
I've checked other threads and the ceph lists.

For some reason a Ceph disk is listed on two different Ceph storages.

I backed up the LXC, deleted it, and restored it to ZFS.

Now, on the PVE storage content list, the disk still shows up in both places.

So I spent some hours trying to remove it. Not done yet, and I may end up deleting the pool in a day or two anyway.
I thought I'd post this info.

Code:
# rbd status -p ceph vm-213-disk-1
Watchers:
        watcher=10.11.12.3:0/2474997724 client.78184312 cookie=18446462598732840963


# rbd  info vm-213-disk-1   -p ceph
rbd image 'vm-213-disk-1':
        size 101GiB in 25856 objects
        order 22 (4MiB objects)
        block_name_prefix: rbd_data.77ad306b8b4567
        format: 2
        features: layering
        flags:
        create_timestamp: Sat Aug 25 04:44:22 2018

# rados -p ceph listwatchers rbd_header.77ad306b8b4567
watcher=10.11.12.3:0/2474997724 client.78184312 cookie=18446462598732840963

# I tried removing the mon. Did not fix; added it back.

# rbd showmapped
id pool image         snap device  
... ceph vm-213-disk-1 -    /dev/rbd0
1  ceph vm-213-disk-1 -    /dev/rbd1

# this worked
rbd unmap /dev/rbd0

# rbd unmap /dev/rbd1
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy


# rbd showmapped
id pool image         snap device  
1  ceph vm-213-disk-1 -    /dev/rbd1


cat /sys/kernel/debug/ceph/220b9a53-4556-48e3-a73c-28deff665e45.client78184312/osdc

REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732840963    osd9    13.d511aa64     13.264  [9,49,40]/9     [9,49,40]/9     e76294  rbd_header.77ad306b8b4567    0x20     2       WC/0
BACKOFFS


# I tried stopping osd9. That did not fix; another OSD showed up.

# cat /sys/kernel/debug/ceph/220b9a53-4556-48e3-a73c-28deff665e45.client78184312/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732840963    osd49   13.d511aa64     13.264  [49,40]/49      [49,40]/49      e76296  rbd_header.77ad306b8b4567    0x20     3       WC/0
BACKOFFS


# Started osd9 again after a few minutes.

# cat /sys/kernel/debug/ceph/220b9a53-4556-48e3-a73c-28deff665e45.client78184312/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732840963    osd9    13.d511aa64     13.264  [9,49,40]/9     [9,49,40]/9     e76298  rbd_header.77ad306b8b4567    0x20     4       WC/0
BACKOFFS

And that is where I left off.
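
To tie the output above together: the watcher is the kernel RBD client on 10.11.12.3 (client.78184312), which is why restarting mons or OSDs only moves the linger request to another OSD instead of clearing it. A minimal recap of how the pieces match up, reusing the IDs from the output above (the same commands as in this thread, just side by side):

Code:
# the watcher line names the client holding the watch on the image header
rados -p ceph listwatchers rbd_header.77ad306b8b4567

# on 10.11.12.3, that client ID matches a kernel client instance in debugfs
ls /sys/kernel/debug/ceph/ | grep client78184312

# and the stale mapping that keeps the watch alive
rbd showmapped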
 
Code:
pve3  ~ # ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
            noout flag(s) set
  services:
    mon: 3 daemons, quorum pve3,sys8,pve10
    mgr: pve3(active), standbys: sys8, pve10
    osd: 65 osds: 65 up, 65 in
         flags noout
  data:
    pools:   2 pools, 1088 pgs
    objects: 32.70k objects, 124GiB
    usage:   436GiB used, 25.1TiB / 25.5TiB avail
    pgs:     1088 active+clean
  io:
    client:   43.9KiB/s wr, 0op/s rd, 10op/s wr

Code:
# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve2~bpo1
 
The attached screenshots show the Ceph disk appearing under two different storage IDs.
 

Attachments

  • pve3_Proxmox_Virtual_Environment.png
  • pve3_Proxmox_Virtual_Environment-1-.png
Generally, if you're unable to release a mapped RBD, AND you're sure there is no outstanding IO, you can use the force switch (e.g. rbd unmap -o force /dev/rbd0).

If that doesn't work, you'll need to reboot the node. (Edit: the node would not go down quietly; you'd need to shoot it in the head.)
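
For reference, a minimal sketch of that sequence on the node holding the mapping, reusing the device, pool and image names from this thread:

Code:
# list kernel mappings on this node
rbd showmapped

# force the unmap (only if you are sure there is no outstanding IO)
rbd unmap -o force /dev/rbd1

# confirm the watcher is gone, then retry the delete
rados -p ceph listwatchers rbd_header.77ad306b8b4567
rbd rm -p ceph vm-213-disk-1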
 
This removed the watcher, thank you.
Code:
rbd unmap -o force /dev/rbd1

Which also fixed these:
Code:
rados -p ceph listwatchers rbd_header.77ad306b8b4567

cat /sys/kernel/debug/ceph/220b9a53-4556-48e3-a73c-28deff665e45.client78184312/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

rbd showmapped

Next -

vm-213-disk-1 still shows up in both the ceph_vm and ceph_ct storage screens.

Do you have a suggestion to remove those?


Thank you for the help.
 
I am trying to get info on the disk or disks. In progress...
Code:
rbd --pool ceph  info vm-213-disk-1
rbd image 'vm-213-disk-1':
        size 101GiB in 25856 objects
        order 22 (4MiB objects)
        block_name_prefix: rbd_data.77ad306b8b4567
        format: 2
        features: layering
        flags:
        create_timestamp: Sat Aug 25 04:44:22 2018


I am not sure if that is the disk at ceph_ct or ceph_vm ... or if there is only one disk being reported at both storages.
 
This is the expected behavior, and the rest of your disks should be showing in both places as well.

Both of your Proxmox storage definitions point to the same Ceph pool, one in krbd (kernel) mode for containers (ct) and the other in librbd (userspace) mode for VMs.
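
As a sketch of how that looks in the config: two entries in /etc/pve/storage.cfg pointing at the same pool, differing only in content type and the krbd flag. The storage IDs and pool name below are from this thread; the remaining options are illustrative:

Code:
rbd: ceph_vm
        pool ceph
        content images
        krbd 0

rbd: ceph_ct
        pool ceph
        content rootdir
        krbd 1

Since both entries reference the same pool, every image in it is listed under both storage IDs.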
 

OK thank you.

-----------------------------------------

Now I am still trying to remove the stranded disk.
Code:
# rbd rm -f  -p ceph   vm-213-disk-1
2018-11-27 14:22:02.709825 7f6d5effd700 -1 librbd::image::RemoveRequest: 0x5559bfb876e0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
pve3  ~ #

The watchers are back:
Code:
# cat /sys/kernel/debug/ceph/220b9a53-4556-48e3-a73c-28deff665e45.client78498966/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732840961    osd9    13.d511aa64     13.264  [9,49,40]/9     [9,49,40]/9     e76420  rbd_header.77ad306b8b4567       0x20    0       WC/0
BACKOFFS
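
The changed client ID (client78498966 here vs. client78184312 earlier) suggests the image was opened or mapped again on that node. A hedged sketch of the usual next checks, reusing the names from this thread; the blacklist step was not tried here and is only a common fallback for stuck watchers:

Code:
# see whether the image got mapped again by the kernel client
rbd showmapped

# if so, force-unmap once more and retry the delete right away
rbd unmap -o force /dev/rbd1
rbd rm -p ceph vm-213-disk-1

# fallback (not tried in this thread): blocklist the stuck client on the Ceph side,
# using the watcher address from a fresh listwatchers run
rados -p ceph listwatchers rbd_header.77ad306b8b4567
ceph osd blacklist add 10.11.12.3:0/2474997724   # example address from earlier output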

That disk is the only one left on Ceph. We moved all our VMs to ZFS while we upgrade the storage network.
We had bad Ceph slow request issues, and I wonder if that is related to the LINGER REQUESTS / BACKOFFS.

Is it normal to have 'LINGER REQUESTS'?
 
