Stuck VM on removed host

bagels1b

I have a new (not yet live) Proxmox cluster with Ceph (4 hosts) and I'm doing some experimenting. I had a host in the cluster with a VM (102) on it, and I went through the procedure to delete that host from the cluster (remove OSD/monitor/manager, then delnode) while the VM was still running.

Now the host still shows up in the cluster, and the VM is there with a '?' next to it. More -> Remove asks me to confirm the VMID but then doesn't delete it. If I try to migrate the VM to an existing host I get a popup: "cluster not ready - no quorum? 500". On the removed host the 102.conf file still exists. On the other hosts, the VM from the deleted host is still listed in /etc/pve/.vmlist, and it also still shows up in 'rbd du -p <pool>'.

Is there a way to delete this VM and, hopefully as a side effect, get the removed host to stop showing up in the GUI?
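For reference, the removal steps were roughly along these lines (a sketch, not necessarily the exact commands; the OSD ID is an example, and the node name prox03 comes from the outputs below):

Bash:
# on the node being removed: tear down its Ceph services first
ceph osd out osd.2                 # example OSD ID on that node
systemctl stop ceph-osd@2
pveceph osd destroy 2
pveceph mon destroy prox03
pveceph mgr destroy prox03

# then power the node off and, from a remaining node:
pvecm delnode prox03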

Code:
cat .vmlist
{
  "version": 1,
  "ids": {
    "102": { "node": "prox03", "type": "qemu", "version": 1 }
  }
}

Code:
rbd du -p pool01
NAME           PROVISIONED  USED
vm-101-disk-0  32 GiB       7.7 GiB
vm-102-disk-0  16 GiB       4.9 GiB
<TOTAL>        48 GiB       13 GiB

Thanks
 
Can you post the output of `ceph -s`?



Do you still need this VM 102?
You should be able to move the 102.conf file from the deleted host to another host:

Code:
mv /etc/pve/nodes/host-deleted/qemu-server/102.conf /etc/pve/nodes/host-still-existing/qemu-server/102.conf

Then try deleting /etc/pve/nodes/host-deleted/
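For example, something like this ("host-deleted" is the placeholder name from above, so substitute the real node name and double-check the path first):

Bash:
rm -r /etc/pve/nodes/host-deleted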
 
Hello,

Is the node you deleted powered off, as described in our docs [0]? Regarding the VM: if you really don't need it anymore, you can remove the config file for '102', which you will find at `/etc/pve/qemu-server/102.conf`, and then remove the VM image from your RBD pool with the following command:

Code:
rbd rm <POOL>/vm-102-disk-0
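With the pool name shown earlier in the thread, that would be, for example:

Bash:
rbd rm pool01/vm-102-disk-0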


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
 
I don't need to keep the VM. The node is currently powered on; it was powered off, and the node still showed up in the cluster GUI with the VM.

After I go through the process of removing the node, all/most files under /etc/pve lose write permissions. Even as root I can't change the permissions back to writable, so I can't delete 102.conf. Not sure if that's due to the way /etc/pve is mounted.

Code:
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

chmod 660 102.conf
chmod: changing permissions of '102.conf': Operation not permitted

lsattr 102.conf
lsattr: Operation not supported While reading flags on 102.conf

Code:
ceph -s
  cluster:
    id:     b7a5d9ec-33cf-4faf-9295-7f28422416cf
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum prox01,prox02 (age 89m)
    mgr: prox02(active, since 20h), standbys: prox01
    osd: 3 osds: 3 up (since 89m), 3 in (since 115m)

  data:
    pools:   2 pools, 33 pgs
    objects: 3.23k objects, 12 GiB
    usage:   37 GiB used, 59 GiB / 96 GiB avail
    pgs:     33 active+clean
 
Hi,

Thank you for the output!

I don't need to keep the VM. The node is currently powered on. It was powered off and the node still showed up in the cluster gui with the VM.
The removed node should be powered off!!

Could you please post the output of `pvecm status` from a node of your cluster, and provide us with the syslog since yesterday? You can generate the syslog using the `journalctl` tool with the following command:

Bash:
journalctl --since yesterday | gzip > $(hostname)-syslog.txt.gz

By the way, you don't have to edit the permissions!
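A minimal sketch of that removal, run from a node that is still in the (quorate) cluster, where /etc/pve stays writable (node name prox03 taken from the .vmlist output above):

Bash:
# /etc/pve is writable here because this node still has quorum
rm /etc/pve/nodes/prox03/qemu-server/102.conf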
 
Hi Moayad,

The pvecm status output is below. After experimenting with this Proxmox cluster I needed to get it back to a stable state, so I rebuilt it and no longer have this situation. Thanks for your help in debugging. One last question, if you don't mind, regarding the removed node being powered off: you mentioned deleting the 102.conf file and then removing the image from the RBD pool. 102.conf is only on the removed node, so I had to power it on to try to delete it; that's when I discovered I no longer have write permissions. Would you expect 102.conf to also be on one of the remaining hosts still in the cluster?

Code:
pvecm status

Cluster information
-------------------
Name:             Cluster
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Apr 16 16:22:08 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.f4
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.116.101 (local)
0x00000002          1 172.16.116.102
0x00000004          1 172.16.116.104
 
Thank you for the output!

Yes, I would check whether the 102.conf file is still present on the other cluster nodes with `find /etc/pve -name 102.conf`; if it exists, you can remove it.
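Putting the thread's pieces together, the cleanup from a quorate node could look roughly like this (a sketch; the node name prox03 and pool pool01 come from the outputs above, so adjust the path to whatever `find` actually reports):

Bash:
# find any stale config left behind by the removed node
find /etc/pve -name 102.conf

# remove it (example path, based on .vmlist showing node prox03)
rm /etc/pve/nodes/prox03/qemu-server/102.conf

# then delete the orphaned disk image from the Ceph pool
rbd rm pool01/vm-102-disk-0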
 
