Hi all,
I have a two-node cluster that started misbehaving a couple of days ago. No matter what I did, I couldn't get things back up and happy again. The second node always hung while starting pveproxy, stuck in the `pvecm updatecerts` command. Once that hung, I wasn't able to recover.
As the second node only has two small VMs for testing, I took a `dd` backup of the VM disks, copied them off, and did a full, clean install of PVE 8.0.3. I then followed the directions under the heading 'Separate a Node Without Reinstalling' to manually remove the cluster configuration from the first node: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
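For reference, this is roughly what that wiki section has you run on the node that stays; treat it as a sketch of what I did, not a substitute for the wiki steps:

```bash
# Stop the cluster stack on the node that stays standalone
systemctl stop pve-cluster corosync

# Start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l

# Delete the corosync configuration
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*

# Kill the local-mode pmxcfs and bring the normal service back up
killall pmxcfs
systemctl start pve-cluster
```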
This removed the cluster configuration and left me with the first node standalone and a freshly installed second node.
I cleaned up the `authorized_keys` and `known_hosts` files in `/etc/pve/priv/` and cleaned up `/root/.ssh` as well. So far so good.
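In case the details matter, the cleanup looked roughly like this (`pve2` and the IP are placeholders for the second node, not my real names):

```bash
# Drop the old node's host keys from the shared and local known_hosts
# ("pve2" and the IP are placeholders for the second node)
ssh-keygen -R pve2 -f /etc/pve/priv/known_hosts
ssh-keygen -R 192.168.1.12 -f /etc/pve/priv/known_hosts
ssh-keygen -R pve2 -f /root/.ssh/known_hosts

# Then I removed the old node's key line from the shared
# authorized_keys by hand
nano /etc/pve/priv/authorized_keys
```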
Now, when I try to join the two nodes together again in a new cluster, things get wacky. Corosync says both nodes are fine, but the PVE layers are not happy at all: there's strange behaviour when accessing things in `/etc/pve/`, and on the second, clean node pveproxy fails to launch, again stuck on the updatecerts script.
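For completeness, this is essentially all I'm doing to form the new cluster (cluster name and IP are just examples):

```bash
# On the first (standalone) node: create a new cluster
pvecm create testcluster

# On the freshly installed second node: join it, pointing at
# the first node's address
pvecm add 192.168.1.11

# Then check membership/quorum from either side
pvecm status
```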
I have the two nodes running in standalone at the moment, and everything is operating just fine.
What am I missing?
EDIT: Some other strange observations:
* In certain subdirectories of `/etc/pve/` I can't even get a directory listing on the second node, but I can without issue on the first.
* On the newly joined, misbehaving node, I get errors in dmesg about a hung task, which makes it look like the underlying filesystem (pmxcfs) has gone away (see the commands after this list).
* In the web UI on the first node, I can see the second node and even get stats from it, but no details pages work (probably because pveproxy is failing on the second node). It does, however, show a green tick to indicate it's online.
* From the web UI on the first node, I can open a shell to the second node perfectly.
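In case it helps anyone reproduce or diagnose this, these are roughly the commands I've been using to poke at the second node; nothing exotic:

```bash
# Service state of the cluster filesystem and the web proxy
systemctl status pve-cluster pveproxy

# Logs for both units since boot
journalctl -b -u pve-cluster -u pveproxy

# Corosync's view of membership and quorum
pvecm status

# Reproduce the hang pveproxy trips over at startup
pvecm updatecerts

# Kernel hung-task traces mentioned above
dmesg | grep -B2 -A8 'blocked for more than'
```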