Hi!
I think everything is under control, but I would like to ask for confirmation that what follows is the expected behaviour of a PVE cluster. The events happened in a roughly 20-node, non-HA v6.4 cluster (VMs are not configured to start automatically). Starting point: all nodes are present, storage is shared (LVM over iSCSI), and each node runs some KVM virtual machines. Then one node crashes due to a physical memory error (the cause only became known some time after the crash happened), so the virtual machines residing on the crashed node stop working and are not automatically migrated to other nodes.
In this situation I followed the advice from https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)#_recovery, i.e. on a surviving node I issued the following, where xxx is the respective VMID (I think the wiki is missing the 'qemu-server' directory element in the destination path):
Code:
# mv /etc/pve/nodes/crashed-node/qemu-server/xxx.conf /etc/pve/nodes/existing-node/qemu-server/xxx.conf
In response to this change, the PVE web GUI shows xxx.conf under existing-node in a non-running state, and since the storage is shared, I can successfully start this virtual machine. I treated the others the same way (except one which is not needed). So far, all good.
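For reference, the per-machine sequence I used looked roughly like this (the VMIDs in the loop are just placeholders for my affected guests):

Code:
# start the moved VM on the surviving node
qm start xxx

# repeat the config move for the other affected VMIDs, e.g.:
for vmid in 101 102 103; do
    mv /etc/pve/nodes/crashed-node/qemu-server/${vmid}.conf \
       /etc/pve/nodes/existing-node/qemu-server/
done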
After some days the crashed node got repaired, which amounted to moving its hard disk into a new physical computer, adjusting /etc/network/interfaces so it works with the new hardware, and so on.
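For illustration, the adjusted /etc/network/interfaces looked something like the sketch below; the NIC name and addresses here are made-up placeholders, the only thing that mattered was keeping the node's original cluster address:

Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual
# eno1 is the NIC name on the new hardware

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0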
The last step would be to make this node reappear on its old cluster/service/etc. networks. My thinking is that pmxcfs, and PVE in general, take good care of this reappearance: they see that there is a conflict about which node xxx.conf belongs under (and I am especially concerned because the cluster has xxx.conf running), and the conflict is resolved in favour of the node where xxx.conf is currently running, because that side has quorum, its configuration database is newer, etc. I think the part of the cluster configuration I am worried about, the stale state carried in the reappearing node's local sqlite database, simply gets discarded, and in the early PVE startup phases the reappearing node starts to use the configuration it receives from the cluster.
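If it helps to check this, my understanding is that the locally cached copy of the cluster state lives in the pmxcfs sqlite database, and quorum can be verified once the node is back; the path and command below are what I looked at, please correct me if this is not the right place:

Code:
# local (possibly stale) pmxcfs state on the reappearing node
ls -l /var/lib/pve-cluster/config.db

# after rejoin, check that corosync sees quorum again
pvecm status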
I actually tested this sequence of events as well, and it behaves as described here; everything works. In particular, the virtual machines that were manually moved to other nodes stay there and keep running after the lost node reappears, those machines do not show up on the reappeared node, and the node itself is present in the cluster normally (no node rejoin needed, etc.).
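These are the checks I did after the node came back, in case someone wants to reproduce the test:

Code:
pvecm nodes                        # reappeared node is listed and online
qm list                            # run on a target node: moved VMs show as running
ls /etc/pve/nodes/*/qemu-server/   # each xxx.conf exists under exactly one node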
Although I expressed my concern with a lot of words, I hope it is still possible to understand and follow. I would be thankful if somebody could comment on whether the procedure I presented for this node-gets-lost, cluster-conf-changes, node-reappears-with-locally-saved-stale-conf situation is adequate, and perhaps explain at some length the behind-the-scenes logic that makes this situation resolvable, as it seems to be.
Best regards,
Imre