recommended recovery strategy?

Carsten Bleek

I have a problem with a 4-node Proxmox 4.4 cluster. On one node the mainboard was replaced, and since then the system no longer boots from its ZFS root partition. The remaining nodes can only read from /etc/pve.
I don't need the cluster feature. I wanted to break the cluster and build a new one with 5.4.

I see the following possibilities.

1) Break the cluster
No replication is configured. All VMs run on the remaining nodes.

2) Remove the failed node from the cluster
In the hope that /etc/pve becomes read-write again.

3) Restore the failed node
The hardware is hosted at OVH (a standard OVH installation). I can use a rescue system to import the ZFS pool, chroot into the system, and update GRUB from there (a rough sketch of what I tried is below). But watching the boot process, I never see a GRUB entry. I am afraid the new board only supports UEFI.
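For reference, these are roughly the steps I run from the OVH rescue system. The pool name rpool is just the Proxmox default and the grub-install target /dev/sda is an assumption; both may differ on a given install:
Code:
# import the ZFS pool under /mnt (rpool is the default Proxmox pool name)
zpool import -f -R /mnt rpool
# bind the virtual filesystems needed inside the chroot
mount --rbind /dev  /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys  /mnt/sys
chroot /mnt /bin/bash
# inside the chroot: regenerate the GRUB config and reinstall the boot loader
update-grub
grub-install /dev/sda   # legacy/BIOS boot; adjust the target disk
# if /sys/firmware/efi exists in the rescue system, the board booted via UEFI
# and a legacy GRUB install on the MBR will not be picked up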

Does anyone have a hint on how to solve such a problem in the safest way?
 
I don't think reinstalling the missing node will help. The missing node was not removed cleanly (via pvecm delnode); it just went down. The bindnetaddr 10.11.12.1 is the IP of the offline node. I thought it would be easiest to just bring node3 back up, but it might be that the new board needs some BIOS configuration first, and I think that could be difficult to figure out.

The corosync.conf on the remaining nodes looks like this:

nodelist {
  node {
    name: node3
    nodeid: 1
    quorum_votes: 2
    ring0_addr: node3
  }

  node {
    name: node1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node1
  }

  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: node4
  }

  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
}

quorum {
  expected_votes: 2
  provider: corosync_votequorum
}

totem {
  cluster_name: CLUSTER
  config_version: 22
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2

  interface {
    bindnetaddr: 10.11.12.1
    ringnumber: 0
  }
}

pvecm status on the remaining nodes shows:

root@node1:~# pvecm status
Quorum information
------------------
Date: Tue Jun 11 22:32:53 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 3/35184
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.11.12.2 (local)
0x00000002          1 10.11.12.3
0x00000004          1 10.11.12.4


Which is better:
- try to break the cluster and manage each node individually, or
- try to remove node3 from the cluster?
 
As long as you have three survivors this should pose no trouble at all. Do you have quorum on the surviving nodes?
# pvecm status

If you have quorum, you can easily evict the malfunctioning node from any survivor:
# pvecm delnode node3
(substitute the actual node name for node3)

You can then remove the node safely and reinstall it. Full instructions and an explanation can be found here: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
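For completeness, a minimal sequence run on one of the quorate survivors might look like this (node3 being the node name from the config above):
Code:
# confirm the survivors are quorate before touching anything
pvecm status
# evict the dead node; only corosync.conf inside /etc/pve is changed
pvecm delnode node3
# verify the remaining membership
pvecm nodes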
 
quorum_votes: 2

why do you have two votes for this one?

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 3
Quorum: 3
Flags: Quorate

Also, your remaining nodes are still quorate, so /etc/pve should already be, or rather still be, read-writable there?

If you expect the failed node never to come up again, just re-install it, issue a
Code:
pvecm delnode node3
on one of the remaining nodes (nothing in /etc/pve besides corosync.conf is touched by this), and re-add the newly installed node. If, for whatever reason, you cannot fully re-install the broken node, ensure at least that the whole of /etc/corosync/* is emptied and that /etc/pve/corosync.conf is gone too, on the broken node only, naturally.
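A rough sketch of that cleanup on the broken node, assuming it can still be booted into its installed system; pmxcfs -l starts the cluster filesystem in local mode so /etc/pve/corosync.conf can be removed without quorum:
Code:
# stop the cluster services on the broken node
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l
# remove the corosync configuration
rm /etc/corosync/*
rm /etc/pve/corosync.conf
# stop the local-mode pmxcfs and bring the regular service back up
killall pmxcfs
systemctl start pve-cluster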
 
node3 has 2 quorum_votes because I started with a two-node cluster some time ago.

The remaining 3 nodes are still quorate, but only one of them has a read-writable /etc/pve.

On 2 nodes /etc/pve is not accessible anymore. The containers are running, but e.g. pct list hangs.

I've removed the broken node, and all remaining nodes now show:


Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate



The problem with the locked /etc/pve may have something to do with a blocked resource. I have some vzdump processes that I can't abort even with kill -9. The processes try to back up containers to OVH backup space via NFS.

In the past I could only solve such a problem by restarting a node.
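A quick way to confirm that theory is to look for processes stuck in uninterruptible sleep (D state), which is typical for hung NFS I/O and would explain why kill -9 has no effect:
Code:
# list processes in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'
# show which NFS mounts are in play
mount | grep nfs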
 
I've rebooted one node. /etc/pve is available, but it does not contain the last change I made.

It currently does not start the VMs.

They did not start because a storage that I had removed via the frontend was still present in
/etc/pve/storage.cfg

I've removed the storage from /etc/pve/storage.cfg and rebooted the node again. The VMs came up again.
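For anyone hitting the same thing: the stale entry was an NFS storage definition in /etc/pve/storage.cfg roughly along these lines (the storage ID, server address and export path here are made up for illustration, not my actual values):
Code:
nfs: ovh-backup
	export /export/ftpbackup
	path /mnt/pve/ovh-backup
	server 203.0.113.10
	content backup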

It seems that the last remaining node now starts to run the remaining vzdump processes. If they do not complete within the expected time, I'll reboot the last remaining node, too.
 
I had to reboot the last node, too, and everything is OK now.

TWIMC:

After rebooting one node, it seems that the remaining vzdump jobs start to execute again. There was one job that did not complete as expected, identified by:

root@node2:~# pct list
VMID       Status     Lock         Name
.....
123        running    backup       stackoverflow


Unlocking the CT did not change anything; I had to reboot the node.
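For reference, the unlock attempt was just the standard command (123 being the VMID from the output above); checking the config afterwards shows whether the lock entry is actually gone:
Code:
# remove the backup lock from the container
pct unlock 123
# check whether a lock is still set in the CT configuration
pct config 123 | grep -i lock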
 
