LXC container on Ceph won't start

Gastondc
I launched a task to move a volume from one Ceph pool to another. The task was left hanging, so I stopped it, and now the CT won't start again. The problem is with the Ceph volume.



Code:
root@pve2:~# pct start 103 
run_buffer: 314 Script exited with status 32
lxc_init: 798 Failed to run lxc.hook.pre-start for container "103"
__lxc_start: 1945 Failed to initialize container "103"
startup for container '103' failed



root@pve2:~# lxc-start 103 
lxc-start: 103: lxccontainer.c: wait_on_daemonized_start: 851 No such file or directory - Failed to receive the container state
lxc-start: 103: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 103: tools/lxc_start.c: main: 311 To get more details, run the container in foreground mode
lxc-start: 103: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options


root@pve2:~# pct mount 103 
/dev/rbd5
mount: /var/lib/lxc/103/rootfs/mnt/DS01: /dev/rbd1 already mounted or mount point busy.
mounting container failed
command 'mount /dev/rbd1 /var/lib/lxc/103/rootfs//mnt/DS01' failed: exit code 32
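
The "already mounted or mount point busy" message suggests the mapping left over from the aborted move is still in place. A few diagnostic commands that should show what is holding the device (a sketch, not verified on this setup; the device and path are taken from the error above):

Code:
rbd showmapped                                # list the RBD images currently mapped on this node
findmnt /dev/rbd1                             # is /dev/rbd1 already mounted somewhere?
fuser -vm /var/lib/lxc/103/rootfs/mnt/DS01    # what is keeping the mount point busy?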

Code:
root@pve2:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.101-1-pve)
pve-manager: 6.4-11 (running version: 6.4-11/28d576c2)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Code:
root@pve2:~# ceph -v 
ceph version 15.2.13 (1f5c7871ec0e36ade641773b9b05b6211c308b9d) octopus (stable)

Code:
root@pve2:~# pct config 103
arch: amd64
cores: 4
hostname: PBS01
memory: 8192
mp0: ceph_wi3tb:vm-103-disk-1,mp=/mnt/DS01,size=3001G
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.0.1,hwaddr=0E:63:00:AB:73:A2,ip=192.168.0.201/22,type=veth
ostype: debian
rootfs: ceph_wi3tb:vm-103-disk-0,size=8G
swap: 0
 
I found this process:

2184875 ? D 91:55 rsync --stats -X -A --numeric-ids -aH --whole-file --sparse --one-file-system --bwlimit=0 /var/lib/lxc/103/.copy-volume-2/ /var/lib/lxc/103/.copy-volume-1

I tried to kill it with -9, but nothing happened, and I can't restart the node.

Any idea?
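
For context: a process in state D (uninterruptible sleep) ignores every signal, including SIGKILL; it only goes away once the I/O it is blocked on completes or the node is rebooted. A few commands to confirm what it is stuck on (a sketch using the PID from the listing above):

Code:
ps -o pid,stat,wchan:32,cmd -p 2184875    # STAT "D" means uninterruptible sleep
cat /proc/2184875/stack                   # kernel call the process is blocked in (run as root)
findmnt | grep copy-volume                # are the temporary copy-volume mounts still present?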
 
I deleted the two hidden folders:

/var/lib/lxc/103/.copy-volume-2/
/var/lib/lxc/103/.copy-volume-1/

And now the CT starts without problems!

But I still have the dead process in the system.
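
For anyone cleaning up after a similar aborted move: it is safer to check first whether those .copy-volume directories are still mount points and unmount them before deleting anything (a sketch, not verified here):

Code:
findmnt | grep '/var/lib/lxc/103/.copy-volume'    # are the temporary directories still mounted?
umount /var/lib/lxc/103/.copy-volume-1            # if so, unmount them before removing
umount /var/lib/lxc/103/.copy-volume-2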
 
Now I can't delete the failed destination of my copy.

I tried:

Code:
rbd rm replicated_1tb/vm-103-disk-0

but it doesn't work.


Code:
root@pve2:/var/lib/lxc/103# rbd info replicated_1tb/vm-103-disk-0
rbd image 'vm-103-disk-0':
    size 2.9 TiB in 768000 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 1d5bdf68c29eab
    block_name_prefix: rbd_data.1d5bdf68c29eab
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Tue Jul 20 16:04:08 2021
    access_timestamp: Tue Jul 20 16:04:08 2021
    modify_timestamp: Tue Jul 20 16:04:08 2021
 
Now I want to free up some space on the pool, and I get an error:


Code:
root@pve2:# rbd unmap replicated_1tb/vm-103-disk-0
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy


Code:
root@pve2:# rbd rm replicated_1tb/vm-103-disk-0
2021-07-21T13:53:04.425-0300 7fc3e57fa700 -1 librbd::image::PreRemoveRequest: 0x558d80c4c550 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.


Any idea?
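
In case it helps someone: the unmap fails because something on this node still holds the mapped device. This is roughly what I would check (a sketch; rbdX is a placeholder, take the real device from showmapped):

Code:
rbd showmapped | grep vm-103-disk-0               # which /dev/rbdX the image is mapped to
findmnt /dev/rbdX                                 # is that device still mounted?
fuser -v /dev/rbdX                                # is any process holding the block device open?
rbd unmap -o force replicated_1tb/vm-103-disk-0   # last resort: forced unmap via the krbd "force" option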
 
I found out that I can't remove the image because it's still open:

root@pve2:~# cat /sys/kernel/debug/ceph/80e7521d-57fb-4683-9d67-943eef4a91b5.client17544956/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
18446462598732841075 osd2 22.3beeb5b 22.1b [2,5,7]/2 [2,5,7]/2 e16330 rbd_header.2067f444687225 0x20 0 WC/0
18446462598732841027 osd3 22.dceff552 22.12 [3,28,7]/3 [3,28,7]/3 e16330 rbd_header.1d5bdf68c29eab 0x20 5 WC/0
18446462598732841029 osd21 7.61a1d11f 7.1f [21,24,29]/21 [21,24,29]/21 e16330 rbd_header.3d4b0727c6040d 0x20 0 WC/0
18446462598732840984 osd22 7.25253c4b 7.4b [22,16,4]/22 [22,16,4]/22 e16330 rbd_header.0dc3ab83ea3912 0x20 7 WC/0
18446462598732840982 osd23 7.7b8d1e46 7.46 [23,26,20]/23 [23,26,20]/23 e16330 rbd_header.0dc35da1fc23af 0x20 7 WC/0
18446462598732840974 osd29 7.1cb9d71 7.71 [29,14,13]/29 [29,14,13]/29 e16330 rbd_header.10c8e5417a3b6f 0x20 7 WC/0
18446462598732841069 osd29 7.65271274 7.74 [29,15,21]/29 [29,15,21]/29 e16330 rbd_header.3d476edbca1a4 0x20 3 WC/0
BACKOFFS


Any idea how to close this watcher?

Thanks!
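
For reference: the second LINGER REQUEST above (rbd_header.1d5bdf68c29eab) matches the id shown by rbd info for vm-103-disk-0, so the watcher is this node's own kernel client (client17544956); it should disappear once the image is unmapped here or the node is rebooted. A rough sequence that is often used to clear a stale watcher (a sketch; the address placeholder has to come from the rbd status output, and on this Octopus release the subcommand is still "blacklist", not "blocklist"):

Code:
rbd status replicated_1tb/vm-103-disk-0    # lists the watchers with their client address
ceph osd blacklist add <client_addr>       # block the stale client so its watch times out (placeholder address)
rbd rm replicated_1tb/vm-103-disk-0        # retry the removal once the watcher is gone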