Hello,
I'm using a 3-node Proxmox cluster (5.3.9), connected to a remote Ceph cluster via a dedicated 10G network.
Everything works fine and it's very reliable, but a problem occurs when a Proxmox node crashes.
Proxmox's HA moves the VMs from the crashed node to the other nodes and starts them. The VMs are seen as started by Proxmox, but the RBD images are still locked on the Ceph side by the crashed node. So each VM ends up in a state where its disk is locked by another process, and it fails to boot correctly.
Example:
proxmox1: 192.168.171.61/24 (172.18.7.61 on Ceph side)
proxmox2: 192.168.171.62/24 (172.18.7.62 on Ceph side)
proxmox3: 192.168.171.63/24 (172.18.7.63 on Ceph side)
ceph1: 172.18.7.51/24
ceph2: 172.18.7.52/24
ceph3: 172.18.7.53/24
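For reference, the remote cluster is defined in /etc/pve/storage.cfg roughly like this (storage name and username are placeholders, and I'm assuming the monitors run on the three Ceph nodes; the pool is the one used below):
Code:
rbd: ceph-remote
        content images
        monhost 172.18.7.51 172.18.7.52 172.18.7.53
        pool c7000-pxmx1-am7
        username admin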
VM 201 is running on proxmox1. On the Ceph side, I can see that the RBD image is locked by proxmox1's address:
Code:
root@ceph-am7-1:~# rbd lock ls --pool c7000-pxmx1-am7 vm-201-disk-0
There is 1 exclusive lock on this image.
Locker ID Address
client.70464 auto 140450841490944 172.18.7.61:0/2839087142
proxmox1 crashes. Proxmox's HA moves the VM to another node (proxmox3 in this case).
On the Ceph side:
Code:
root@ceph-am7-1:~# rbd lock ls --pool c7000-pxmx1-am7 vm-201-disk-0
There is 1 exclusive lock on this image.
Locker ID Address
client.70464 auto 140450841490944 172.18.7.61:0/2839087142
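This is how I check whether other images in the pool are affected, i.e. still carry a lock from the dead node's address (just a quick sketch using the same rbd commands as above; pool name and address are from this example):
Code:
# list every image in the pool whose locks still reference the
# crashed node's Ceph-side address (172.18.7.61 here)
POOL=c7000-pxmx1-am7
DEAD_ADDR=172.18.7.61
for img in $(rbd ls --pool "$POOL"); do
    if rbd lock ls --pool "$POOL" "$img" | grep -q "$DEAD_ADDR:"; then
        echo "$img is still locked by $DEAD_ADDR"
    fi
done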
On the VM console, I see:
The only way to make the VM work again is to unlock it on the Ceph side:
Code:
root@ceph-am7-1:~# rbd lock remove --pool c7000-pxmx1-am7 vm-201-disk-0 "auto 140450841490944" client.70464
As soon as I unlock the RBD, fsck works and the VM starts.
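If several VMs are affected, the manual unlock can in principle be scripted. This is only a rough sketch built from the rbd commands above, not a tested fencing hook (and the lock should obviously only be removed once the node is really dead):
Code:
#!/bin/bash
# remove every RBD lock in the pool held from the crashed node's address;
# the lock ID is the "auto <cookie>" pair and the locker is the client.<id> field
POOL=c7000-pxmx1-am7
DEAD_ADDR=172.18.7.61
for img in $(rbd ls --pool "$POOL"); do
    rbd lock ls --pool "$POOL" "$img" | \
      awk -v addr="$DEAD_ADDR:" '$NF ~ "^"addr {print $1, $2, $3}' | \
      while read -r locker id1 id2; do
        echo "removing lock '$id1 $id2' held by $locker on $img"
        rbd lock remove --pool "$POOL" "$img" "$id1 $id2" "$locker"
      done
done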
What would be the way to resolve this issue? As I still have quorum, could fencing send commands to the Ceph cluster to unlock the RBDs?
Regards,
Cédric