Node configuration not updating in cluster

Hello, I have a 4-node Proxmox 6 cluster that I am having issues with. I recently had to change the IP addresses and hostnames of two of the nodes in my cluster, and have managed to get that done. However, one node (not one of those that was changed) is now giving me trouble. Nothing can be saved in its /etc/pve directory, and the directory is also not receiving updates through Corosync. Opening a file from that directory produces an [ Error reading lock file ./.corosync.conf.swp: Not enough data read ] error, or at other times an [ Error reading lock file ./.corosync.conf.swp: Unable to write swap file ] error.

1596410975653.png

If I stop the pve-cluster and corosync services and then use the local pve filesystem, I get the same errors. When it occasionally does load a file's contents, the configuration has not been updated to match the new one that all the other nodes have.

This is the corosync.conf file from a healthy node:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.4
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.5
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.7
  }
  node {
    name: pve5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.8
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster-01
  config_version: 12
  interface {
    bindnetaddr: 10.0.0.4
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

I can confirm that the configuration file is the same across the other three nodes. Additionally, the node with this issue also has trouble connecting to other nodes in the cluster. All nodes can access the shell of the other nodes; however, the problematic node can only access the summary and other node settings pages for itself and one other node, as can be seen in this screenshot of a table I put together. The node with the issue is pve2.
1596411669533.png
 
What does the /etc/corosync/corosync.conf file look like on the node that has problems? Is it the same or different?

If it is different, try replacing it with the file from one of the good nodes and then restart the pve-cluster and corosync services.
 
Currently the file cannot even be opened due to the error in the screenshot above. However, I have already copied the corosync file from the working nodes to the non-working node and restarted pve-cluster, and also did a full reboot of the node. In the GUI the node appears to be fully functional, and it even has a running LXC container. However, attempting to migrate the container, delete it, or create a new container or VM results in an I/O lock error on the container's configuration file, just as trying to access the corosync file, the Ceph config file, or any other file in the PVE directory does. I can still start and stop the container.
 
I mean the /etc/corosync/corosync.conf file, not in /etc/pve/.
 
Sorry, I was not clear. I have verified that the corosync files in the /etc/corosync/ and /etc/pve/ directories match. When the node starts, the one in the pve directory reverts to its previous version, while the one in the corosync directory remains updated. Even after manually updating the one in the pve directory, it reverts to the old file after a restart. Additionally, the lockups that prevent files from being modified seem to occur at random: I have not been able to find a correlation, or anything in the logs I have inspected, that would suggest a cause.
 
Additionally, on the occasions where I have been able to get the corosync file in the pve directory to match the one in the corosync directory and on the other nodes, Proxmox does not appear to follow it, as none of the issues are resolved.
 
Did you restart the corosync and pve-cluster services?

What is the output of pvecm status on one of the working nodes and on the problematic node?
 
Did you restart the corosync and pve-cluster services?
I did, as well as rebooting the node when simply restarting the services did not work.

What is the output of pvecm status on one of the working nodes and on the problematic node?

Working node:
Code:
root@pve1:~# pvecm status
Cluster information
-------------------
Name:             Cluster-01
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug  3 04:59:24 2020
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.1271cb
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.4 (local)
0x00000002          1 10.0.0.5
0x00000003          1 10.0.0.7
0x00000004          1 10.0.0.8

Non-working node:
Code:
root@pve2:~# pvecm status
Cluster information
-------------------
Name:             Cluster-01
Config Version:   8
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug  3 05:01:25 2020
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.1271cb
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.4
0x00000002          1 10.0.0.5 (local)
0x00000003          1 10.0.0.7
0x00000004          1 10.0.0.8
 
For reference, here is the /etc/corosync/corosync.conf file from the problematic node:
Code:
root@pve2:/etc/corosync# cat corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.4
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.5
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.7
  }
  node {
    name: pve5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.8
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster-01
  config_version: 12
  interface {
    bindnetaddr: 10.0.0.4
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

And here is the /etc/corosync/corosync.conf file from a working node:
Code:
root@pve1:/etc/corosync# cat corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.4
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.5
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.7
  }
  node {
    name: pve5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.8
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster-01
  config_version: 12
  interface {
    bindnetaddr: 10.0.0.4
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

And finally this is the /etc/pve/corosync.conf file from that same working node:
Code:
root@pve1:/etc/pve# cat corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.4
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.5
  }
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.7
  }
  node {
    name: pve5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.8
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster-01
  config_version: 12
  interface {
    bindnetaddr: 10.0.0.4
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
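Since the listings above are meant to be identical, comparing them by eye is error-prone. A minimal sketch of a line-by-line comparison, assuming the file contents have already been collected from each node as plain text (the function names `normalize` and `conf_diff` are made up for this illustration and are not part of any Proxmox tooling):

```python
import difflib


def normalize(text):
    """Strip trailing whitespace per line and drop trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and not lines[-1]:
        lines.pop()
    return lines


def conf_diff(reference, candidate,
              reference_name="reference", candidate_name="candidate"):
    """Return unified-diff lines between two corosync.conf texts.

    An empty list means the two files match after normalization.
    """
    return list(difflib.unified_diff(
        normalize(reference), normalize(candidate),
        fromfile=reference_name, tofile=candidate_name, lineterm=""))


if __name__ == "__main__":
    # Hypothetical excerpts; in practice, read the real files from each node.
    good = "totem {\n  config_version: 12\n}\n"
    stale = "totem {\n  config_version: 8\n}\n"
    for line in conf_diff(good, stale, "pve1", "pve2"):
        print(line)
```

An empty result only shows that the files on disk agree; it says nothing about which configuration the running services have actually loaded.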
 
Hmm. The problematic node shows a lower config version than is set in the config file.
Code:
root@pve2:~# pvecm status
Cluster information
-------------------
Name:             Cluster-01
Config Version:   8

This should be 12 as well.
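The mismatch described here (running version 8, file version 12) can be checked mechanically. The following is only a sketch under the assumption that the `pvecm status` output and the corosync.conf text look like the samples posted in this thread; the function names are hypothetical helpers, not an existing API:

```python
import re


def running_config_version(pvecm_status_output):
    """Extract the 'Config Version' field from `pvecm status` output."""
    match = re.search(r"^Config Version:[ \t]*(\d+)",
                      pvecm_status_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Config Version' line found")
    return int(match.group(1))


def file_config_version(corosync_conf_text):
    """Extract totem.config_version from the text of a corosync.conf."""
    match = re.search(r"^[ \t]*config_version:[ \t]*(\d+)",
                      corosync_conf_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'config_version' line found")
    return int(match.group(1))


def version_mismatch(pvecm_status_output, corosync_conf_text):
    """True if the running version differs from the one in the file."""
    return (running_config_version(pvecm_status_output)
            != file_config_version(corosync_conf_text))


# Excerpts copied from the pve2 output and config posted above.
PVE2_STATUS = """\
Cluster information
-------------------
Name:             Cluster-01
Config Version:   8
"""
PVE2_CONF = """\
totem {
  cluster_name: Cluster-01
  config_version: 12
}
"""

if __name__ == "__main__":
    print(version_mismatch(PVE2_STATUS, PVE2_CONF))  # True: 8 != 12
```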

As a test you can try to increase the config version to 13 throughout the cluster. It should be enough to do it in the /etc/pve directory on one working node. The file will be copied to /etc/corosync automatically. On the problematic node you might have to do it manually.

If that helps, so that node pve2 also recognizes the same config version as the other nodes, you can try to restart the pve-cluster service.
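For illustration, bumping the version amounts to incrementing the single `config_version` value in the totem section. A rough sketch that works on the file's text, assuming exactly one `config_version` line as in the configs posted above (`bump_config_version` is a made-up helper name; writing the result back to /etc/pve/corosync.conf is left to the admin):

```python
import re

# Matches the indentation and key in group 1, the number in group 2.
_VERSION_RE = re.compile(r"^([ \t]*config_version:[ \t]*)(\d+)", re.MULTILINE)


def bump_config_version(conf_text):
    """Return conf_text with config_version incremented by one."""
    if len(_VERSION_RE.findall(conf_text)) != 1:
        raise ValueError("expected exactly one config_version line")
    return _VERSION_RE.sub(
        lambda m: f"{m.group(1)}{int(m.group(2)) + 1}", conf_text)


if __name__ == "__main__":
    # Hypothetical excerpt of the totem section.
    sample = "totem {\n  config_version: 12\n  version: 2\n}\n"
    print(bump_config_version(sample))
```

Refusing to proceed when the line is missing or duplicated is a deliberate choice here, so that a malformed file is not silently rewritten.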
 
I have tried that already when I bumped the version to 12. It changed across all nodes, including the one that is not working, in both the corosync and pve directories. I have rebooted the entire cluster since then and the issue with the pve2 node is still present. That is why I am at such a loss. According to the file in the corosync directory it should be fine, yet what Proxmox is actually reading is an old, out-of-date one in which the IP addresses would be incorrect.
 
Hey, what you can try is the following:

On the broken node:
Stop pve-cluster and corosync.
Increment the config version in the /etc/pve/corosync.conf on one of the working nodes.
Start pve-cluster in local mode: pmxcfs -l
Copy the corosync.conf from one of the good nodes to /etc/corosync/corosync.conf and /etc/pve/corosync.conf
Stop pmxcfs either with Ctrl+C or by issuing the kill command for that process.
Start corosync and pve-cluster services.

Then check the output of pvecm status again. It should be the same config version across the cluster, even on the broken node.
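The broken-node part of the steps above can be written down as a dry-run plan. This sketch only builds and prints the command list and executes nothing. It assumes a copy of the good corosync.conf has already been transferred to the broken node (the path `/root/corosync.conf.good` is hypothetical), and it uses `killall pmxcfs` as one possible way of stopping the local-mode process. The version bump on a working node is not covered.

```python
import shlex


def recovery_plan(good_conf_copy):
    """Commands for the broken node, in order, as argument lists."""
    return [
        ["systemctl", "stop", "pve-cluster", "corosync"],
        ["pmxcfs", "-l"],  # start the cluster filesystem in local mode
        ["cp", good_conf_copy, "/etc/corosync/corosync.conf"],
        ["cp", good_conf_copy, "/etc/pve/corosync.conf"],
        ["killall", "pmxcfs"],  # assumption: one way to stop local mode
        ["systemctl", "start", "corosync", "pve-cluster"],
    ]


def render(plan):
    """Turn the plan into shell-quoted command lines for review."""
    return [shlex.join(command) for command in plan]


if __name__ == "__main__":
    # Hypothetical location of the config copied from a good node.
    for line in render(recovery_plan("/root/corosync.conf.good")):
        print(line)
```

Printing the plan first makes it easier to review each step before running anything by hand on a node that is already misbehaving.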
 
That is exactly what I did already. The cluster was originally at version 11 and I followed those exact steps to bump it to version 12.

At this point I have simply wiped the node and reinstalled it as I needed it back online ASAP and am no longer having issues. Thank you for your time.
 
