Proxmox GUI dead, VMs running, quorum ok, cannot write anything to /etc/pve

Hello,

so one of my PVE clusters got ugly :) and the common error on all nodes is that nothing can write to /etc/pve. I can read it, but neither root nor the PVE services can write to it:

Jun 1 18:25:53 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Device or resource busy
Jun 1 18:50:20 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Transport endpoint is not connected
Jun 1 19:25:46 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Permission denied

Jun 1 18:51:27 node1 pmxcfs[22934]: [main] crit: fuse_mount error: Transport endpoint is not connected

and so on.

quorum looks ok:

Cluster information
-------------------
Name: clustername
Config Version: 5
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Jun 1 19:33:24 2021
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000001
Ring ID: 1.246a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.11.198.11 (local)
0x00000002 1 10.11.198.12
0x00000003 1 10.11.198.13
0x00000004 1 10.11.198.14
0x00000005 1 10.11.198.15
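
The combination described here (corosync reports quorum, yet /etc/pve rejects writes) can be checked on each node with a small sketch like the one below. The function names is_quorate and can_write and the probe file name are made up for illustration; only pvecm status and the /etc/pve path come from the post.

```shell
# Hypothetical helpers, not part of Proxmox.

# Succeeds if the "pvecm status" text on stdin contains "Quorate: Yes".
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Tries to create and then remove a probe file in the given directory.
# Succeeds only if the file could be created.
can_write() {
  probe="$1/.write_probe.$$"
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}

# Possible use on a node:
# if pvecm status | is_quorate && ! can_write /etc/pve; then
#   echo "quorate, but /etc/pve is not writable"
# fi
```

The probe runs in a subshell so that a failed redirection cannot abort the calling shell.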

I found a possible solution on the commitandquit forum, where they suggest (if the quorum is OK) to:

  • On every node do
    systemctl stop pve-cluster
    This may take a while
  • On every node do
    sudo rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
  • On each node – one by one do
    systemctl start pve-cluster
Is it safe to stop this service on all nodes? Could this help with making the /etc/pve writable?

Thank you. Will add any information needed.
 
My suspicion is a corrupted cluster database, since quorum is OK but /etc/pve is still read-only.

Also, on all nodes I can see this error:

Jun 1 23:00:32 node3 pmxcfs[23003]: [dcdb] crit: ignore sync request from wrong member 3/23003
Jun 1 23:00:32 node3 pmxcfs[23003]: [status] crit: ignore sync request from wrong member 3/23003


Also the culprit seems to be this:

May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 5 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 4 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 2 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] link: host: 1 link: 0 is down
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 5 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 4 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 2 has no active links
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 30 00:25:43 node3 corosync[2898]: [KNET ] host: host: 1 has no active links
May 30 00:25:44 node3 corosync[2898]: [TOTEM ] Token has not been received in 2212 ms

The link eventually came back:

May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] notice: start cluster connection
May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] crit: cpg_join failed: 14
May 30 01:02:16 node3 pmxcfs[2793]: [dcdb] crit: can't initialize service
May 30 01:02:18 node3 corosync[2898]: [TOTEM ] A new membership (3.1b8c) was formed. Members
May 30 01:02:18 node3 corosync[2898]: [QUORUM] Members[1]: 3
May 30 01:02:18 node3 corosync[2898]: [MAIN ] Completed service synchronization, ready to provide service.
May 30 01:02:18 node3 corosync[2898]: [TOTEM ] A new membership (1.1b90) was formed. Members joined: 1 2 4 5
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: members: 1/2990, 2/2523, 3/2793, 4/2885, 5/2973
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: starting data syncronisation
May 30 01:02:18 node3 corosync[2898]: [QUORUM] This node is within the primary component and will provide service.
May 30 01:02:18 node3 corosync[2898]: [QUORUM] Members[5]: 1 2 3 4 5
May 30 01:02:18 node3 corosync[2898]: [MAIN ] Completed service synchronization, ready to provide service.
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: node has quorum
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received sync request (epoch 1/2990/000005FD)
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received sync request (epoch 1/2990/000005FE)
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: received all states
May 30 01:02:18 node3 pmxcfs[2793]: [status] notice: all data is up to date


And since then there are errors like this:

Jun 1 23:25:52 node3 pve-ha-lrm[3185]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node3/lrm_status.tmp.3185' - Device or resource busy
Jun 1 23:25:53 node3 pvesr[15007]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
Jun 1 23:25:53 node3 pmxcfs[23003]: [dcdb] crit: cpg_send_message failed: 9
 
As written above, I found out that the source of the problem is faulty hardware: the 10GbE on the motherboard.

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface enp216s0f0 inet manual

iface enp216s0f1 inet manual

auto bond0
iface bond0 inet static
address 10.11.198.13
netmask 255.255.255.0
gateway 10.11.198.1
slaves eno1 enp216s0f0
bond-primary eno1
bond_miimon 100
bond_mode active-backup
bond_downdelay 200
bond_updelay 200
auto bond1
iface bond1 inet manual
slaves eno2 enp216s0f1
bond-primary eno2
bond-miimon 100
bond-mode active-backup
bond_downdelay 200
bond_updelay 200

auto vmbr0
iface vmbr0 inet manual
bridge_ports bond1
bridge_stp off
bridge_fd 0


I tried plugging the eno1 cable (the primary port for the cluster network) into a different port on the switch, and since then I have not seen any errors on that interface. But suddenly the eno2 port (primary for the VM network) started to flap from time to time.

Well, I hope to fix this by replacing the motherboard with the integrated NIC. But first I have to move the VMs to other nodes.

The problem is that the web GUI is still unavailable and I cannot write to /etc/pve.
Sadly, I have 3 versions of /var/lib/pve-cluster/config.db:
- node1 = node2
- node4 = node5
- node3
All three are different.

Is there any way to know which version is the newest or the correct one?
What is the suggested method of distributing the right db across the cluster?


Thank you.
 
Hello to all of you, who are checking this thread,

I was able to solve this issue after opening a support ticket (buy those standard subscriptions, guys :) ) and the problem was found in corosync. The procedure we followed was quite simple in the end. We stopped the pve-cluster service on all nodes, unmounted /etc/pve (mounted by FUSE) and restarted corosync. Then, after starting pve-cluster and a few minor bumps to pveproxy and pvestatd, everything was up and running.
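
For readers who want the steps spelled out, here is a rough sketch of that procedure. The post names only the actions, so the exact commands, the function names and the DRY_RUN switch are assumptions, not what support ran. By default the sketch only prints the commands.

```shell
# Sketch only. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1, on every node: stop pmxcfs and release the FUSE mount.
# umount may report "not mounted" if stopping the service already released it.
stop_pmxcfs() {
  run systemctl stop pve-cluster
  run umount /etc/pve
}

# Step 2, on every node, once step 1 is done everywhere: restart corosync.
restart_corosync() {
  run systemctl restart corosync
}

# Step 3: start pmxcfs again and bump the daemons that depend on it.
start_pmxcfs() {
  run systemctl start pve-cluster
  run systemctl restart pveproxy pvestatd
}
```

Calling stop_pmxcfs, restart_corosync and start_pmxcfs in that order with DRY_RUN=1 prints the planned commands, which makes it easy to review them before running anything on a live node.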


Watch out!! To do this you cannot have HA configured.
With HA, the nodes that were not quorate and had no access to /etc/pve would have fenced themselves a long time ago.

It was confusing since pvecm status showed everything as good, but all is OK in the end.
 
Thanks, that saved my life!
I was just adding a node to an existing 2-node cluster (v7.x), and ended up with the GUI not accessible (root account unavailable), corosync error messages, qm list on node 1/2 not responding, pvecm commands on node 1/2 taking a long time, issues with /etc/pve access from node 1/2, node 3 isolated, etc.
I must say I am a bit disappointed that something as basic as adding a node ends up in such a messy situation, without any notification in the GUI that something is going wrong.

Perhaps a command to run a pre-flight check before trying to add a node would be nice?

David
 
Quick follow-up that could help some of you.
I think the issue with adding a node may occur if node 1 already has a key for the new node in its SSH known_hosts.
For instance, if you connected to the new node from node 1 with ssh at least once BEFORE joining the new node to the cluster.
It's possible that the new node's SSH key gets renewed when joining a cluster, and then ssh from node 1 to the new node will fail because of the old key still in known_hosts.

I think the join procedure could anticipate that and force the key renewal in known_hosts.
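
Until the join procedure does that, a stale entry can be dropped by hand before joining. A minimal sketch, assuming the old key sits in a regular known_hosts file; the wrapper name forget_host_key and the host name newnode are placeholders, and which file actually holds the stale entry on a given node needs to be checked first.

```shell
# Hypothetical wrapper around "ssh-keygen -R", which removes all keys
# belonging to a host from a known_hosts file.
forget_host_key() {
  host="$1"
  file="${2:-$HOME/.ssh/known_hosts}"
  ssh-keygen -R "$host" -f "$file"
}

# Possible use on node 1 before joining (placeholder host name):
# forget_host_key newnode
```

ssh-keygen keeps a backup of the previous file with an .old suffix, so the removal can be undone.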
 
