Recovering from HW failure

robnavrey

I have a small two-node development cluster.

Today, one of the nodes has died.

It seems that the PERC controller is damaged. Luckily, I have replication enabled, and after lowering the quorum requirement on the remaining node I was able to start the VMs that had been running on the failed node.

My question: what should I do when I recover the failed node?

If I'm not mistaken, the failed node will still think that it owns the migrated VMs.

Regards,

Roberto
 
Hi @showip,

Thanks for the link, but I'm already in a split-brain situation.

I can't add another node. When I try to execute "pvecm qdevice setup 172.16.0.130 -f" it complains that the other node is down.

Anyway, even if I were able to execute the command, what would happen when I bring the other node back?

Regards,

Roberto.
 
The cluster might complain about VMs that exist twice.
What you can do: remove the old PVE node from the cluster before you bring it back online.
See this link -> https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node , section "Separate A Node Without Reinstalling"

Bring the crashed node back up without network access and delete all the VMs that you have migrated to the running node.
Then plug it back into the network and join the cluster.
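
To make the order of operations concrete, here is a rough dry-run sketch. It only prints the commands, it does not run them. The function name print_rejoin_plan, the node name and the VM IDs are placeholders I made up, and the separation commands are my recollection of the wiki section linked above, so please check them against the wiki before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints a possible command sequence, executes nothing.
# Usage: print_rejoin_plan <failed-node-name> [vmid ...]
# The node name and VM IDs are placeholders supplied by the caller.
print_rejoin_plan() {
    node=$1
    shift
    echo "# On the surviving node: drop the failed node from the cluster"
    echo "pvecm delnode $node"
    echo "# On the recovered node, booted WITHOUT network access:"
    echo "systemctl stop pve-cluster corosync"
    echo "pmxcfs -l"
    echo "rm /etc/pve/corosync.conf"
    echo "rm -r /etc/corosync/*"
    echo "killall pmxcfs"
    echo "systemctl start pve-cluster"
    echo "# Still offline: remove the stale copies of the migrated VMs"
    for vmid in "$@"; do
        echo "qm destroy $vmid"
    done
    echo "# Only now reconnect the network and join the cluster again"
}
```

For example, `print_rejoin_plan pve2 100 101` would list the steps for a failed node called pve2 that used to host VMs 100 and 101. Printing first instead of executing is deliberate, since several of these steps are destructive.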
 
Hello again,

I'm unable to add the qdevice:

# pvecm qdevice setup 172.16.0.130 -f
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key. This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key. This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt

INFO: copying CA cert and initializing on all nodes
Host key verification failed.

INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n pve-cluster' failed: exit code 1
#
 
If I remember correctly, it took a bit of playing around and some research on the internet.
Unfortunately, I can't remember exactly what I did.
 
OK, looking at Datacenter -> node name/cluster name -> System -> Certificates, I found that the IP of the main node was outdated (I installed the server with a different IP and changed it later).

I just ran: pvecm updatecerts --force

And after that, I was able to add the qdevice.

Thanks for your help,

Roberto
 
Ok, something still seems to be wrong. The number of votes seems strange:

# pvecm status
Cluster information
-------------------
Name:             pve-cluster
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Sep 22 16:38:25 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.178
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW 172.16.0.140 (local)
0x00000000          0            Qdevice (votes 0)
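
In case it helps anyone checking the same thing, a small sketch that pulls the Qdevice vote count out of output like the above. The function name qdevice_votes is made up, and it assumes the Qdevice row starts with a hex node ID followed by the vote count, as it does in the output shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `pvecm status` output on stdin and prints the
# vote count from the Qdevice row of the membership table.
# Returns a non-zero exit status if no Qdevice row is present.
qdevice_votes() {
    awk '
        # Membership rows start with a hex node ID; the Qdevice row also
        # contains the word "Qdevice". The second field is the vote count.
        $1 ~ /^0x/ && /Qdevice/ { print $2; found = 1 }
        END { if (!found) exit 1 }
    '
}
```

It would be used as `pvecm status | qdevice_votes`. A result of 0 means the Qdevice is registered but is not contributing a vote.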
 
The QDevice only works if you have a two-node cluster; otherwise it would not make much sense.

Mine looks like this:
Code:
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1    A,V,NMW 10.0.5.1
0x00000003          1    A,V,NMW 10.0.5.2 (local)
0x00000000          1            Qdevice
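
The "Quorum" line in both outputs follows from the expected votes by a simple majority rule, as far as I understand votequorum. A minimal sketch of the arithmetic (the function name quorum_for is made up for illustration):

```shell
#!/bin/sh
# Majority rule as I understand corosync votequorum:
#   quorum = floor(expected_votes / 2) + 1
# Shell integer division already rounds down.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}
```

`quorum_for 3` gives 2, matching the two nodes plus Qdevice above, and `quorum_for 1` gives 1, matching the single-node output earlier in the thread. `quorum_for 2` also gives 2, which is why a plain two-node cluster without a Qdevice loses quorum as soon as one node dies.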