recommended recovery strategy?

Carsten Bleek · Jun 11, 2019

I have a problem with a 4-node Proxmox 4.4 cluster. The mainboard of one node was replaced, and since then that node no longer boots from its ZFS root partition. On the remaining nodes, /etc/pve is read-only.
I don't need the cluster feature. I wanted to break the cluster and build a new one with 5.4.

I see the following possibilities.

1) Break the cluster.
No replication is configured. All VMs run on the remaining nodes.

2) Remove the failed node from the cluster.
In the hope that /etc/pve will become rw again.

3) Restore failed node.
The hardware is hosted at OVH with a standard OVH installation. I can use a rescue system to import the ZFS pool, chroot into the system, and update GRUB. But when I watch the boot process, no "grub" entry ever shows up. I am afraid that the new board only supports UEFI.

Does anyone have a hint how to solve such a problem in the safest way?

alexskysilk · Jun 11, 2019

Since you've detached the node from the cluster, not only do you not NEED to recover it, you SHOULDN'T. Just reinstall and use the default (not ZFS) on your boot partition.

Carsten Bleek · Jun 11, 2019

I don't think reinstalling the missing node will help. The missing node was not detached (via pvecm remove). It just went down. The bindnetaddr 10.11.12.1 is the IP of the offline node. I thought it would be easiest to get node3 booting again. But the new board may need a BIOS configuration change, and I think that could be difficult to figure out.

The corosync.conf of the remaining nodes looks like this:

nodelist {
  node {
    name: node3
    nodeid: 1
    quorum_votes: 2
    ring0_addr: node3
  }

  node {
    name: node1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node1
  }

  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: node4
  }

  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
}

quorum {
  expected_votes: 2
  provider: corosync_votequorum
}

totem {
  cluster_name: CLUSTER
  config_version: 22
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
  interface {
    bindnetaddr: 10.11.12.1
    ringnumber: 0
  }
}

pvecm status on the remaining nodes shows:

root@node1:~# pvecm status
Quorum information
------------------
Date: Tue Jun 11 22:32:53 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 3/35184
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000003 1 10.11.12.2 (local)
0x00000002 1 10.11.12.3
0x00000004 1 10.11.12.4

Which is better?
- try to break the cluster and reach all nodes individually?
- try to remove node3 from the cluster?
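
For reference, the 5 expected votes reported by pvecm status match the sum of the quorum_votes entries in the nodelist above (2 + 1 + 1 + 1). A minimal sketch of that sum, assuming the config is readable at /etc/pve/corosync.conf; the function name sum_quorum_votes is made up for illustration:

```shell
# Sum the quorum_votes values of a corosync.conf read from stdin.
# sum_quorum_votes is a hypothetical helper, not part of pvecm or corosync.
sum_quorum_votes() {
    awk '$1 == "quorum_votes:" { sum += $2 } END { print sum + 0 }'
}

# Example use on a node (assumed path):
# sum_quorum_votes < /etc/pve/corosync.conf
```

Because node3 alone carries 2 of those 5 votes, losing it costs more than losing any other single node.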

alexskysilk · Jun 11, 2019

As long as you have three survivors, this should pose no trouble at all. Do you have quorum on the surviving nodes?
# pvecm status

If you have quorum, you can easily evict the malfunctioning node from any survivor:
# pvecm delnode node3
(substitute actual node name for node3)

You can then remove the node safely and reinstall. Full instructions and an explanation can be found here: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
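
The two steps above can be chained so that the eviction only runs when the local node reports quorum. This is just a sketch: is_quorate is a made-up helper that looks for the "Quorate: Yes" line in the pvecm status output, and node3 stands for the failed node's name.

```shell
# Succeeds (exit status 0) if the pvecm status text on stdin reports "Quorate: Yes".
# is_quorate is a hypothetical helper name, not a pvecm subcommand.
is_quorate() {
    grep -Eq '^[[:space:]]*Quorate:[[:space:]]*Yes'
}

# Example use on a surviving node:
# pvecm status | is_quorate && pvecm delnode node3
```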

t.lamprecht · Jun 12, 2019

Carsten Bleek said:
quorum_votes: 2

why do you have two votes for this one?

Carsten Bleek said:
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 3
Quorum: 3
Flags: Quorate

Also, your remaining nodes are still quorate, so /etc/pve should already be, or rather should still be, writable there?

If you plan for the failed node never to come up again in its current state, just reinstall it and issue a

Code:

pvecm delnode node3

on one of the remaining nodes (apart from corosync.conf, /etc/pve won't be touched here), and re-add the newly installed node. If, for whatever reason, you cannot fully reinstall the broken node, at least ensure that everything under /etc/corosync/ is removed and /etc/pve/corosync.conf is gone too, on the broken node only, naturally.
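
If the broken node can be booted again but not reinstalled, the cleanup described above could look roughly like the following. This is a sketch modeled on the "separate a node without reinstalling" steps in the Proxmox cluster manager documentation and has not been verified on 4.4. It assumes a normally booted node with systemd (not a rescue chroot, where /etc/pve is not mounted), and it must only ever be run on the broken node. The helper names run and detach_cluster_config are made up, and by default the commands are only printed.

```shell
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: wipe the cluster configuration on the BROKEN node only.
detach_cluster_config() {
    run systemctl stop pve-cluster
    run systemctl stop corosync
    # Start the cluster filesystem in local mode so /etc/pve is writable without quorum.
    run pmxcfs -l
    run rm /etc/pve/corosync.conf
    # Empty /etc/corosync but keep the directory itself.
    run find /etc/corosync -mindepth 1 -delete
    run killall pmxcfs
    run systemctl start pve-cluster
}

# Print the plan first; rerun with DRY_RUN=0 only after checking it:
# detach_cluster_config
```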

Carsten Bleek · Jun 12, 2019

node3 has 2 quorum_votes because I started with a two-node cluster some time ago.

The remaining 3 nodes are still quorate, but only one of them has a writable /etc/pve.

On the other 2 nodes, /etc/pve is no longer accessible. The containers are running, but e.g. pct list hangs.

I've removed the broken node, and all remaining nodes now show:

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
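
The Quorum values in the two outputs are consistent with the usual majority rule, quorum = floor(expected_votes / 2) + 1: 5 expected votes give 3, and 3 expected votes give 2. A minimal sketch of that arithmetic (the function name is made up, and special votequorum options such as two_node are ignored):

```shell
# Majority threshold for a given number of expected votes.
# quorum_threshold is a hypothetical helper name.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

quorum_threshold 5   # before removing node3: prints 3
quorum_threshold 3   # after removing node3: prints 2
```

With 3 expected votes the cluster can now lose one more single-vote node and stay quorate.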

The problem with the locked /etc/pve may have something to do with a blocked resource. I have some vzdump processes that I can't abort with kill -9. The processes are trying to back up containers to OVH backup storage via NFS.

In the past I could only solve such a problem by restarting a node.
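
Processes that ignore kill -9 are typically stuck in uninterruptible sleep (state D), which is common when an NFS server stops responding. Whether that is the case here is an assumption, since the thread does not show the process states. A small sketch to list such processes (both function names are made up):

```shell
# Keep only lines whose second column (the process state) starts with D.
filter_uninterruptible() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line of D-state processes.
list_uninterruptible() {
    ps -eo pid=,stat=,wchan=,args= | filter_uninterruptible
}

# Example use:
# list_uninterruptible | grep vzdump
```

If the vzdump processes show up here, a signal cannot end them until the blocked I/O returns, which would fit the observation that only a reboot helped.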

Carsten Bleek · Jun 12, 2019

I've rebooted one node. /etc/pve is available again, but it does not contain the last change I made.

It currently does not start the VMs.

They did not start because a storage that I had removed via the frontend was still present in /etc/pve/storage.cfg.

I've removed the storage from /etc/pve/storage.cfg and rebooted the node again. The VMs came up again.

It seems that the last remaining node has started to run the remaining vzdump processes. If they do not complete within the expected time, I'll reboot that node, too.

Carsten Bleek · Jun 13, 2019

I had to reboot the last node, too. Everything is OK now.

TWIMC:

After rebooting one node, it seems that the pending vzdump jobs start executing their remaining vzdump processes. One job did not complete as expected. It was identified by:

root@node2:~# pct list
VMID Status Lock Name
.....
123 running backup stackoverflow

Unlocking the CT did not change anything; I had to reboot the node.
