Proxmox GUI dead, VMs running, quorum ok, cannot write anything to /etc/pve

Hello,

so one of my PVE clusters got ugly :) and the common error on all nodes is that nothing can write to /etc/pve. I can read it, but neither root nor the PVE services can write:

Jun 1 18:25:53 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Device or resource busy
Jun 1 18:50:20 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Transport endpoint is not connected
Jun 1 19:25:46 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Permission denied

Jun 1 18:51:27 node1 pmxcfs[22934]: [main] crit: fuse_mount error: Transport endpoint is not connected

and so on.
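
A quick way to see whether the pmxcfs FUSE mount is still there and what the services think about it (nothing fancy, just the standard tools; output will of course differ per node):

# is /etc/pve still mounted by pmxcfs?
mount | grep /etc/pve
# state of the cluster filesystem and corosync services
systemctl status pve-cluster corosync --no-pager
# recent messages from both
journalctl -u pve-cluster -u corosync --since "1 hour ago" --no-pager | tail -n 50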

quorum looks ok:

Cluster information
-------------------
Name: clustername
Config Version: 5
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Jun 1 19:33:24 2021
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000001
Ring ID: 1.246a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.11.198.11 (local)
0x00000002 1 10.11.198.12
0x00000003 1 10.11.198.13
0x00000004 1 10.11.198.14
0x00000005 1 10.11.198.15
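
Even with quorum reported as fine, the individual knet links can still be flapping. On corosync 3, corosync-cfgtool shows the per-link state (just the standard tool, shown here as a hint rather than my actual output):

corosync-cfgtool -s   # status of the local node's links
corosync-cfgtool -n   # nodes and link status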

I found a possible solution on the commitandquit forum, where they suggest (if the quorum is OK) to:

  • On every node do
    systemctl stop pve-cluster
    This may take a while
  • On every node do
    sudo rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
  • On each node, one by one, do
    systemctl start pve-cluster

Is it safe to stop this service on all nodes? Could this help make /etc/pve writable again?
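
Whatever I end up doing, I will first take a copy of the cluster DB on every node. A minimal sketch (the target path is just an example, and the sqlite3 CLI is assumed to be installed, e.g. via apt install sqlite3):

# consistent copy via sqlite3
sqlite3 /var/lib/pve-cluster/config.db ".backup '/root/config.db.backup'"
# or a plain copy once pve-cluster is stopped
cp -a /var/lib/pve-cluster/config.db /root/config.db.backup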

Thank you. I will add any information needed.
 
My suspicion is a corrupted cluster DB (/var/lib/pve-cluster/config.db), since quorum is OK but /etc/pve is still read-only.
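
If the DB file itself were corrupted, I would expect SQLite to complain. A quick check, assuming the sqlite3 CLI is installed and the file is a plain SQLite database as documented:

sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'
# should print "ok" if the file itself is intact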

Also, on all nodes I can see this error:

Jun 1 23:00:32 node3 pmxcfs[23003]: [dcdb] crit: ignore sync request from wrong member 3/23003
Jun 1 23:00:32 node3 pmxcfs[23003]: [status] crit: ignore sync request from wrong member 3/23003


And the actual culprit seems to be this:

May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 5 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 4 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 2 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 1 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 5 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 4 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 2 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 1 has no active links
May 30 00:25:44 node3 corosync[2898]: [TOTEM ] Token has not been received in 2212 ms

The link eventually came back:

May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] notice: start cluster connection
May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] crit: cpg_join failed: 14
May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] crit: can't initialize service
May 30 01:02:18 node3 corosync[2898]: [TOTEM ] A new membership (3.1b8c) was formed. Members
May 30 01:02:18 node3 corosync[2898]: [QUORUM] Members[1]: 3
May 30 01:02:18 node3 corosync[2898]: [MAIN ] Completed service synchronization, ready to provide service.
May 30 01:02:18 node3 corosync[2898]: [TOTEM ] A new membership (1.1b90) was formed. Members joined: 1 2 4 5
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: members: 1/2990, 2/2523, 3/2793, 4/2885, 5/2973
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: starting data syncronisation
May 30 01:02:18 node3 corosync[2898]: [QUORUM] This node is within the primary component and will provide service.
May 30 01:02:18 node3 corosync[2898]: [QUORUM] Members[5]: 1 2 3 4 5
May 30 01:02:18 node3 corosync[2898]: [MAIN ] Completed service synchronization, ready to provide service.
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: node has quorum
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received sync request (epoch 1/2990/000005FD)
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received sync request (epoch 1/2990/000005FE)
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received all states
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: all data is up to date


And since then there have been errors like this:

Jun 1 23:25:52 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Device or resource busy
Jun 1 23:25:53 node3 pvesr[15007]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
 
As written above, I found out that the source of the problem is faulty hardware - the 10GbE NIC on the motherboard. For reference, this is the /etc/network/interfaces of the affected node:

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface enp216s0f0 inet manual

iface enp216s0f1 inet manual

auto bond0
iface bond0 inet static
    address 10.11.198.13
    netmask 255.255.255.0
    gateway 10.11.198.1
    slaves eno1 enp216s0f0
    bond-primary eno1
    bond_miimon 100
    bond_mode active-backup
    bond_downdelay 200
    bond_updelay 200

auto bond1
iface bond1 inet manual
    slaves eno2 enp216s0f1
    bond-primary eno2
    bond-miimon 100
    bond-mode active-backup
    bond_downdelay 200
    bond_updelay 200

auto vmbr0
iface vmbr0 inet manual
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0


I tried plugging the eno1 cable (eno1 being the primary port for the cluster network) into a different port on the switch, and since then I have not seen any errors on that interface. But suddenly the eno2 port (primary for the VM network) started to flap from time to time.
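
For anyone debugging similar flapping, the error counters and bonding state can be checked with standard tools (interface and bond names here are from my config above):

ip -s link show eno1             # RX/TX errors and drops
ethtool eno1 | grep -i 'link detected'
cat /proc/net/bonding/bond0      # which slave is currently active
dmesg -T | grep -iE 'eno1|eno2'  # driver-level link up/down messages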

Well, I hope to fix this by replacing the motherboard with the integrated NIC, but first I have to move the VMs to other nodes.

The problem is that the web GUI is still unavailable and I cannot write to /etc/pve.
Sadly, I now have three different versions of /var/lib/pve-cluster/config.db across the cluster:
- node1 = node2
- node4 = node5
- node3
Each group has a different version.

Is there any way to know which version is the newest or the correct one?
What is the suggested method of distributing the correct DB across the cluster?
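
In case it helps anyone in the same spot: config.db is a plain SQLite file, and assuming the usual pmxcfs schema (a tree table with version and mtime columns), something like this at least shows which copy has seen the most recent writes. It is only a hint, not an official answer:

sqlite3 /var/lib/pve-cluster/config.db \
  "SELECT MAX(version), datetime(MAX(mtime), 'unixepoch') FROM tree;"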


Thank you.
 
Hello to all of you who are checking this thread,

I was able to solve this issue after opening a ticket (buy those standard subscriptions, guys :) ), and the problem was found in corosync. The procedure was in the end quite simple: we stopped the pve-cluster service on all nodes, unmounted /etc/pve (which is mounted via FUSE) and restarted corosync. Then, after starting pve-cluster again and a few extra kicks to pveproxy and pvestatd, everything was up and running.
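
Roughly the commands involved (from memory, so treat this as a sketch rather than an exact transcript):

# on every node
systemctl stop pve-cluster
fusermount -u /etc/pve          # only if the FUSE mount is left behind
systemctl restart corosync

# then, node by node
systemctl start pve-cluster
systemctl restart pveproxy pvestatd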


Watch out!! To do this you must not have HA configured.
If HA had been enabled, the nodes that lost quorum and access to /etc/pve would have fenced themselves a long time ago.

It was confusing because pvecm status showed everything as fine, but all turned out OK in the end.
 
Thanks, that saved my life!
I was just adding a node to an existing 2-node cluster (v7.x) and ended up with the GUI not accessible (root account unavailable), corosync error messages, qm list not responding on node1/node2, pvecm commands on node1/node2 taking a long time, issues with /etc/pve access from node1/node2, node3 isolated, etc.
I must say I am a bit disappointed that something as basic as adding a node ends up in such a messy situation, without any notification in the GUI that something is going wrong.

Perhaps a command to run a pre-flight check before trying to add a node would be nice?

David
 
Quick follow-up that could help some of you.
I think the issue with adding a node may occur if node 1 already has a key for the new node in its SSH known_hosts,
for instance if you connected to the new node from node 1 via SSH at least once BEFORE joining the new node to the cluster.
It is possible that when joining a cluster the new node's SSH key gets renewed, and then SSH from node 1 to the new node fails because of the previously existing key in known_hosts.

I think the join procedure could anticipate that and force the key renewal in known_hosts.
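
A sketch of how I would clear a stale entry before (re)trying the join. The host name and IP are placeholders, and on PVE the cluster-wide file usually lives in /etc/pve/priv/known_hosts:

# on the existing node (node 1)
ssh-keygen -R <new-node-name>
ssh-keygen -R <new-node-ip>
ssh-keygen -R <new-node-name> -f /etc/pve/priv/known_hosts
# refresh SSH keys / certificates across the cluster
pvecm updatecerts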
 
