Recovering from HW failure

robnavrey

I have a small two-node development cluster.

Today, one of the nodes died.

It seems that the PERC controller is damaged. Luckily, I have replication enabled, and after lowering the quorum requirement on the remaining node I was able to start the VMs that had been running on the failed node.
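For readers following along, here is a rough sketch of the vote arithmetic behind this step. It assumes the usual votequorum rule that a partition needs more than half of the expected votes. The function name `quorum_needed` is made up for illustration, and the `pvecm expected 1` line in the comment is an assumption, since the post does not say which command was used to lower the quorum.

```shell
# Votes needed for quorum: more than half of the expected votes
# (integer division, as an illustration of the majority rule).
quorum_needed() {
    expected_votes=$1
    echo $(( $expected_votes / 2 + 1 ))
}

quorum_needed 2   # two nodes, no QDevice: both votes are needed
quorum_needed 3   # two nodes plus one QDevice vote: one node may fail
quorum_needed 1   # expected votes lowered to 1 on the surviving node

# Assumption: on the surviving node the override would look like
#   pvecm expected 1
```

With two expected votes the threshold is two, which is why a single surviving node stays blocked until the expected votes are lowered or a third vote (such as a QDevice) is added.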

My question is: what should I do when I recover the failed node?

If I'm not wrong, the failed node will still think that it owns the migrated VMs.

Regards,

Roberto
 
Hi @showip,

Thanks for the link. But I'm already in a split-brain situation.

I can't add another node. When I try to execute "pvecm qdevice setup 172.16.0.130 -f" it complains that the other node is down.

Anyway, even if I were able to execute the command, what would happen when I bring the other node back?

Regards,

Roberto.
 
The cluster may complain about VMs that exist twice.
What you can do: remove the old PVE node from the cluster before you bring it back online.
See this link -> https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node , chapter "Separate A Node Without Reinstalling"

Bring the crashed node back up without network access and delete all the VMs that you have migrated to the up-and-running node.
Then plug it back into the network and join the cluster.
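The steps above could be sketched roughly as follows. This is a dry run only: the helper `plan_rejoin` is hypothetical and just prints commands so they can be reviewed first. The node name `pve2` and the VM IDs `100` and `101` are placeholders; `172.16.0.140` is the surviving node's address as it appears later in this thread. The separation steps from the linked wiki chapter are not reproduced here and would still have to be followed on the old node before it can delete guests or join again.

```shell
# Print (never execute) the recovery steps in order.
# Arguments: old node name, surviving node IP, then the migrated VM IDs.
plan_rejoin() {
    old_node=$1
    survivor_ip=$2
    shift 2
    echo "# on the surviving node, while the old node is still offline"
    echo "pvecm delnode $old_node"
    echo "# on the old node, booted without network, after separating it"
    echo "# from the cluster as described in the wiki chapter"
    for vmid in "$@"; do
        echo "qm destroy $vmid"
    done
    echo "# on the old node, once it is back on the network"
    echo "pvecm add $survivor_ip"
}

# All values below are placeholders for illustration.
plan_rejoin pve2 172.16.0.140 100 101
```

Printing the plan instead of running it is deliberate: `qm destroy` removes the guest and its disks, so the list of VM IDs should be checked against what is actually running on the surviving node.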
 
Hello again,

I'm unable to add the qdevice:

# pvecm qdevice setup 172.16.0.130 -f
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key. This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key. This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt

INFO: copying CA cert and initializing on all nodes
Host key verification failed.

INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n pve-cluster' failed: exit code 1
#
 
If I remember correctly, it took a bit of playing around and some research on the internet.
Unfortunately, I can't remember exactly what I did.
 
OK, looking at Datacenter -> node name/cluster name -> System -> Certificates, I found that the IP of the main node was outdated (I installed the server with a different IP and changed it later).

Just ran: pvecm updatecerts --force

And after that, I was able to add the qdevice.

Thanks for your help,

Roberto
 
Ok, something still seems to be wrong. The number of votes seems strange:

# pvecm status
Cluster information
-------------------
Name: pve-cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Sep 22 16:38:25 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.178
Quorate: Yes

Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate Qdevice

Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 A,NV,NMW 172.16.0.140 (local)
0x00000000 0 Qdevice (votes 0)
 
The QDevice only works if you have a two-node cluster; otherwise it would not make much sense.

Mine looks like this:
Code:
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1    A,V,NMW 10.0.5.1
0x00000003          1    A,V,NMW 10.0.5.2 (local)
0x00000000          1            Qdevice
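Comparing the two outputs in this thread: the problem case shows the node flagged `A,NV,NMW` and the Qdevice row with 0 votes, while the working case shows `A,V,NMW` and 1 vote. As far as I understand the flags, `V`/`NV` indicates whether the QDevice vote is being cast. A minimal sketch of a check for this, assuming the `pvecm status` layout shown above (the helper name `qdevice_votes_ok` is made up):

```shell
# Read `pvecm status` output on stdin. Succeed only if the Qdevice row
# carries at least one vote and no node row shows the NV flag.
qdevice_votes_ok() {
    awk '
        $1 ~ /^0x/ {
            if ($3 == "Qdevice") {
                seen = 1
                qvotes = $2 + 0
            } else {
                n = split($3, flags, ",")
                for (i = 1; i <= n; i++)
                    if (flags[i] == "NV") nv = 1
            }
        }
        END { exit !(seen && qvotes > 0 && !nv) }
    '
}

# Possible usage on a cluster node (not run here):
#   pvecm status | qdevice_votes_ok && echo "QDevice is voting"
```

The check only looks at the membership rows, so it would not explain why a vote is missing; it just makes the difference between the two outputs explicit.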
 
