Unable to remove VMs from GUI, CLI, or API

hanoon

Renowned Member
Jul 1, 2014
We have been using Proxmox for years and have a 3-node cluster that had been healthy for a long time. Recently we noticed that VMs can no longer be deleted (destroyed) by any means (GUI, CLI, or the API).

We don't see any storage issues that would point to a file-locking problem. The VM disk image does get deleted, but the configuration file stays, so the VM remains in the VM list even though no disk is attached. Disk capacity is only 5% utilized, and CPU and memory are not taxed.


We are using local disks on each server, as well as a simple Ceph cluster with 2 RBDs per node. The network is stable and has been tested for a long time.

What we have tested so far:
- Destroying from GUI, CLI, and API
- Restarting the PVE, cluster, and Ceph services
- Upgrading the node to the latest version
- Rebooting all the nodes (painful, as we had to migrate the VMs)

Note: after a reboot, destroy worked for a short time and then stopped working again.


Checking the logs gives no errors or clues.

Any hint or assistance would be appreciated
---------------------

pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.174-2-pve)
pve-manager: 6.4-14 (running version: 6.4-14/15e2bf61)
pve-kernel-5.4: 6.4-15
pve-kernel-helper: 6.4-15
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-network-perl: 0.6.0
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
 

Please post the task log of an attempted VM removal.
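For reference, the log of a (hung) destroy task can also be read directly on the node; a rough sketch, assuming the default task-log location under /var/log/pve/tasks and using <vmid> as a placeholder:

# find the UPID(s) of the destroy attempts for the VM
# (the index/active files list recently started and still-running tasks)
grep "qmdestroy:<vmid>" /var/log/pve/tasks/index /var/log/pve/tasks/active

# dump the matching per-task log files
find /var/log/pve/tasks -name "*qmdestroy:<vmid>*" -exec cat {} +

The same log is shown in the GUI when double-clicking the task in the node's task list.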
 
Thanks for the response - I don't see any logs.

pvesh delete /node/<nodename>/qemu-image/<vmid> shows nothing - the calls just hang.

When stopping the destroy task [it just keeps spinning and never ends] and then trying to delete again, this is logged:
disk image '/var/lib/vz/images/33058/vm-33058-disk-0.qcow2' does not exist

The only issue we could find in the logs concerns the Ceph RADOS Gateway - we think it may be connected.

==================
# cat /var/log/ceph/client.radosgw.node01.log
2022-05-18T11:56:55.203-0400 7ff87034b400 0 ceph version 15.2.15 (4b7a17f73998a0b4d9bd233cda1db482107e5908) octopus (stable), process radosgw, pid 224265
2022-05-18T11:56:55.203-0400 7ff87034b400 0 framework: civetweb
2022-05-18T11:56:55.203-0400 7ff87034b400 0 framework conf key: port, val: xx.xx.xx.xx:23336
2022-05-18T11:56:55.203-0400 7ff87034b400 1 radosgw_Main not setting numa affinity
2022-05-18T11:56:55.203-0400 7ff87034b400 0 pidfile_write: ignore empty --pid-file
2022-05-18T12:01:55.202-0400 7ff86eadd700 -1 Initialization timeout, failed to initialize

==================
/var/log/syslog

May 18 12:12:10 node01 ceph-osd[4147874]: 2022-05-18T12:12:10.977-0400 7f7b9e0e0700 -1 osd.1 3793 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.138400702.0:19 3.11 3.3e6a8311 (undecoded) ondisk+read+known_if_redirected e3793)
May 18 12:12:11 node01 ceph-mon[4147822]: 2022-05-18T12:12:11.113-0400 7faa22bc8700 -1 mon.node01@0(leader) e3 get_health_metrics reporting 8 slow ops, oldest is osd_failure(failed timeout osd.0 [v2:xxx.xxx.xxx.240:6800/4147870,v1:xxx.xxx.xxx.240:6801/4147870] for 20sec e3793 v3793)
May 18 12:12:11 node01 ceph-osd[4147870]: 2022-05-18T12:12:11.469-0400 7fa47290c700 -1 osd.0 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6802 osd.4 ever on either front or back, first ping sent 2022-05-18T12:03:27.581748-0400 (oldest deadline 2022-05-18T12:03:47.581748-0400)
May 18 12:12:11 node01 ceph-osd[4147870]: 2022-05-18T12:12:11.469-0400 7fa47290c700 -1 osd.0 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6806 osd.5 ever on either front or back, first ping sent 2022-05-18T12:03:27.581748-0400 (oldest deadline 2022-05-18T12:03:47.581748-0400)
May 18 12:12:11 node01 ceph-osd[4147870]: 2022-05-18T12:12:11.469-0400 7fa47290c700 -1 osd.0 3793 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.138400702.0:17 2.32 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e3793)
May 18 12:12:12 node01 ceph-osd[4147874]: 2022-05-18T12:12:12.013-0400 7f7b9e0e0700 -1 osd.1 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6802 osd.4 ever on either front or back, first ping sent 2022-05-18T12:03:27.742478-0400 (oldest deadline 2022-05-18T12:03:47.742478-0400)
May 18 12:12:12 node01 ceph-osd[4147874]: 2022-05-18T12:12:12.013-0400 7f7b9e0e0700 -1 osd.1 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6806 osd.5 ever on either front or back, first ping sent 2022-05-18T12:03:27.742478-0400 (oldest deadline 2022-05-18T12:03:47.742478-0400)
May 18 12:12:12 node01 ceph-osd[4147874]: 2022-05-18T12:12:12.013-0400 7f7b9e0e0700 -1 osd.1 3793 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.138400702.0:19 3.11 3.3e6a8311 (undecoded) ondisk+read+known_if_redirected e3793)
May 18 12:12:12 node01 ceph-osd[4147870]: 2022-05-18T12:12:12.473-0400 7fa47290c700 -1 osd.0 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6802 osd.4 ever on either front or back, first ping sent 2022-05-18T12:03:27.581748-0400 (oldest deadline 2022-05-18T12:03:47.581748-0400)
May 18 12:12:12 node01 ceph-osd[4147870]: 2022-05-18T12:12:12.473-0400 7fa47290c700 -1 osd.0 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6806 osd.5 ever on either front or back, first ping sent 2022-05-18T12:03:27.581748-0400 (oldest deadline 2022-05-18T12:03:47.581748-0400)
May 18 12:12:12 node01 ceph-osd[4147870]: 2022-05-18T12:12:12.473-0400 7fa47290c700 -1 osd.0 3793 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.138400702.0:17 2.32 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e3793)
May 18 12:12:13 node01 ceph-osd[4147874]: 2022-05-18T12:12:13.049-0400 7f7b9e0e0700 -1 osd.1 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6802 osd.4 ever on either front or back, first ping sent 2022-05-18T12:03:27.742478-0400 (oldest deadline 2022-05-18T12:03:47.742478-0400)
May 18 12:12:13 node01 ceph-osd[4147874]: 2022-05-18T12:12:13.049-0400 7f7b9e0e0700 -1 osd.1 3793 heartbeat_check: no reply from xxx.xxx.xxx.242:6806 osd.5 ever on either front or back, first ping sent 2022-05-18T12:03:27.742478-0400 (oldest deadline 2022-05-18T12:03:47.742478-0400)


If there are specific logs that could help, I can collect them.
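Given the OSD heartbeat failures and slow ops in the syslog excerpt above, a quick look at the overall Ceph state may help narrow this down; these are the standard Ceph status commands, nothing specific to this cluster:

# overall cluster status (health, OSD up/in counts, slow ops)
ceph -s
# expand any HEALTH_WARN / HEALTH_ERR conditions
ceph health detail
# show which OSDs are down/out and on which host
ceph osd tree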
 
Thanks for the response - I don't see any logs.

pvesh delete /node/<nodename>/qemu-image/<vmid> shows nothing - the calls just hang.

that's not a valid API path?
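For comparison, VM removal goes through the qemu endpoint of the API; a minimal sketch with <nodename> and <vmid> as placeholders:

# via the API client
pvesh delete /nodes/<nodename>/qemu/<vmid>

# or directly on the node that owns the VM
qm destroy <vmid>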

When stopping the destroy task [it just keeps spinning and never ends] and then trying to delete again, this is logged:
disk image '/var/lib/vz/images/33058/vm-33058-disk-0.qcow2' does not exist

okay, that means it came as far as deleting that disk at least ;)

The only issue we could find in the logs concerns the Ceph RADOS Gateway - we think it may be connected.

okay, could you send
- the config of a VM you are trying to remove
- your storage.cfg
- 'pvesm status' output

thanks!
 
Thanks for the answers - the requested details are below:

cat /etc/pve/qemu-server/30313.conf
agent: 1
args: -vnc unix:/var/run/qemu-server/30313.vnc
balloon: 512
boot: order=virtio0;net0
cores: 20
hotplug: disk,network,usb,memory,cpu
ide0: none,media=cdrom
memory: 2048
name: test-vm
net0: virtio=72:FE:CC:6A:04:E7,bridge=vmbr10,firewall=1,link_down=1,rate=4
numa: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=3b58bcd6-4cf1-45bd-851e-829f1fbfad69
sockets: 1
vcpus: 1
virtio0: local:30313/vm-30313-disk-0.qcow2,format=qcow2,size=40G
vmgenid: f2ac764f-6b08-4ad8-970d-d4ca2b6d14c3


============================
cat /etc/pve/storage.cfg
rbd: ssdpool
content rootdir,images
krbd 0
pool ssdpool

dir: local
path /var/lib/vz
content rootdir,vztmpl,backup,iso,images
prune-backups keep-all=1
shared 0

rbd: hddpool
content rootdir
krbd 0
pool hddpool

pbs: nybackup02
datastore zfsstorage
server x.x.x.229
content backup
encryption-key 1a:d3:33:77:3e:4d:20:e1:5e:be:4d:10:58:f2:79:fd:ce:85:41:d8:76:ae:9d:7e:c1:1b:40:88:63:8d:a2:fd
fingerprint be:53:a9:7f:e4:5f:b0:51:19:92:be:0c:d0:90:61:ee:ce:0b:b5:ea:cb:95:63:69:37:42:e9:ab:7c:74:d4:d1
prune-backups keep-all=1
username root@pam

nfs: ny2storage
export /storage
path /mnt/pve/storage
server x.x.x.110
content iso,backup
prune-backups keep-all=1

rbd: default.rgw.buckets.index
disable
content rootdir
krbd 0
pool default.rgw.buckets.index

rbd: default.rgw.buckets.data
disable
content rootdir
krbd 0
pool default.rgw.buckets.data

nfs: shared-iso-images
export /shared-iso
path /mnt/pve/shared-iso-images
server x.x.x.111
content iso
prune-backups keep-all=1

nfs: templates
export /shared
path /mnt/pve/templates
server x.x.x.111
content images
prune-backups keep-all=1


=============

pvesm status
Name                         Type     Status         Total         Used    Available        %
default.rgw.buckets.data      rbd   disabled             0            0            0      N/A
default.rgw.buckets.index     rbd   disabled             0            0            0      N/A
hddpool                       rbd     active    4636365454      2167950   4634197504    0.05%
local                         dir     active    6842713112    280499940   6217289424    4.10%
storage                       nfs     active    3508733440    227856384   3134340096    6.49%
nybackup02                    pbs     active     100218752     15564544     84654208   15.53%
shared-iso-images             nfs     active    2112646144     15971328   1989284352    0.76%
ssdpool                       rbd     active    3561405905      1863377   3559542528    0.05%
templates                     nfs     active     324307456    204975104    106237952   63.20%
 
On VM removal we scan all storages for disks belonging to the VM but not referenced in the config; most likely that scan fails and blocks the task after the first (few) referenced disks have already been removed.
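If that is what's happening, checking each configured storage one by one should show which listing stalls; a rough sketch, assuming the storage names from the storage.cfg posted above (the two rgw pools are already disabled, so they are skipped):

# list the volumes on every enabled storage; whichever call hangs or times
# out here is the storage most likely blocking the unreferenced-disk scan
for s in local ssdpool hddpool ny2storage nybackup02 shared-iso-images templates; do
    echo "== $s =="
    timeout 30 pvesm list "$s" || echo "listing failed or timed out"
done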
 
