Corosync Problem

GoatMaster

Active Member
Oct 4, 2017
I used the official instructions to upgrade Proxmox: https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0

In my two test clusters everything went fine. But now on the production cluster I have a problem.

I first updated the corosync version to 3.
I stopped the HA services (systemctl stop pve-ha-lrm and systemctl stop pve-ha-crm).

Then I updated corosync and restarted the two services.
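In detail, this is roughly what I ran on each node (the apt part from memory, following the corosync 3 repository steps from the wiki):
Code:
# stop the HA services first so nothing gets fenced during the upgrade
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# then pull in corosync 3 (repository set up as described in the wiki)
apt update
apt install corosync

# and start the HA services again afterwards
systemctl start pve-ha-crm
systemctl start pve-ha-lrm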
But it seems like there is a problem now with Corosync.

The nodes cannot find the cluster and are running as standalone nodes.

Here is the output of systemctl status corosync

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 10:17:27 CET; 2h 9min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1933 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1933 (code=exited, status=8)

Dec 13 10:17:21 KI-S0162 systemd[1]: Starting Corosync Cluster Engine...
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlco
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Failed with result 'exit-code'.
Dec 13 10:17:27 KI-S0162 systemd[1]: Failed to start Corosync Cluster Engine.

I then did the upgrade from 5.4 to Proxmox VE 6.1 on all three nodes.

After the reboot it's still the same problem.
 
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'

That seems to be the error causing this.
Can you post cat /etc/corosync/corosync.conf from two nodes?

Either it's not resolvable for all nodes (which could be OK with corosync 2, as it could fall back to the multicast group to find the other nodes) or there's something else wrong with the config format. You could try to change it to a "real" resolved address.
I.e., edit /etc/corosync/corosync.conf, ensure the config_version is bumped, copy it to the other nodes, then restart corosync. If all works, ensure the updated copy also ends up at /etc/pve/corosync.conf.
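Roughly like this (node names/paths are just placeholders for your setup):
Code:
# edit the local config: set the addresses and increment config_version
nano /etc/corosync/corosync.conf

# copy the edited file to the other nodes
scp /etc/corosync/corosync.conf root@<other-node>:/etc/corosync/corosync.conf

# restart corosync on all nodes
systemctl restart corosync

# once the cluster is healthy again, also update the clustered copy
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf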
 
Can you post cat /etc/corosync/corosync.conf from two nodes?

This is the output; it is the same on all nodes:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: KI-S0161
    nodeid: 3
    quorum_votes: 1
    ring0_addr: KI-S0161
  }
  node {
    name: KI-S0162
    nodeid: 2
    quorum_votes: 1
    ring0_addr: KI-S0162
  }
  node {
    name: KI-S0163
    nodeid: 1
    quorum_votes: 1
    ring0_addr: KI-S0163
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmox-Cluster
  config_version: 7
  interface {
    bindnetaddr: 172.16.4.163
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}


So you mean I should change the ring0_addr to the IP of the node?
 
And KI-S0161, KI-S0162, KI-S0163 are resolvable by all nodes? I.e., you can ping them from every node?
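For example, a quick check on every node (hostnames taken from your config):
Code:
for h in KI-S0161 KI-S0162 KI-S0163; do
    getent hosts "$h"   # does the name resolve at all (e.g. via /etc/hosts)?
    ping -c1 "$h"       # and does the host answer?
done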

So you mean I should change the ring0_addr to the IP of the node?

Yeah. And in the totem section you could change the interface subsection to:
Code:
  interface {
    linknumber: 0
  }
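And for the nodelist, a fixed entry would look something like this (the 172.16.4.x address is only an example guessed from your bindnetaddr, use the node's real IP, and remember to bump config_version):
Code:
  node {
    name: KI-S0161
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.4.161
  }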
 
It worked. Thank you very much.

But I now have another problem with the first node of the cluster.

It seems like it cannot mount the pve filesystem anymore.

Output of systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 13:44:40 CET; 3min 25s ago
Process: 1860 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Dec 13 13:44:40 KI-S0161 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 13 13:44:40 KI-S0161 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
I tried to restart it several times, but it stays like this.

It's not even possible to access the web interface.
 
Also, check that you don't have a /usr/bin/pmxcfs process already running/stuck:

ps -aux | grep pmxcfs (this is the process that the pve-cluster service tries to launch)

You can also try to launch it manually to see if something goes wrong.
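Something like this (only kill it if it is really hanging, and use the PID from the ps output):
Code:
ps -aux | grep '[p]mxcfs'     # is a pmxcfs process still around?
# kill <pid>                  # only if it is stuck
systemctl start pve-cluster   # then try the service again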
 
Also, check that you don't have a /usr/bin/pmxcfs process already running/stuck:

ps -aux | grep pmxcfs (this is the process that the pve-cluster service tries to launch)

You can also try to launch it manually to see if something goes wrong.

Output of pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

How can I start pmxcfs manually?
 
Check ps -aux|grep pmxcfs

You can try to start it manually in the foreground with:
Code:
pmxcfs -f

ps -aux|grep pmxcfs
root 3328 0.0 0.0 6072 828 pts/0 S+ 15:01 0:00 grep pmxcfs

pmxcfs -f
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
[main] crit: fuse_mount error: File exists
[main] notice: exit proxmox configuration filesystem (-1)
 
>> fuse: mountpoint is not empty

So, if pmxcfs is not running, do you have something in /etc/pve?
Verify with df that it is not mounted, and if it isn't, delete everything in /etc/pve.

(Did you maybe try to copy something into /etc/pve manually while it was not mounted?)
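For example (only remove things when you are sure /etc/pve is really not mounted):
Code:
df -h /etc/pve      # when pmxcfs runs, this shows a fuse filesystem mounted on /etc/pve
ls -la /etc/pve     # if it is not mounted, this shows the plain directory underneath
# rm -r /etc/pve/*  # remove the leftovers only if they are stray manual copies
pmxcfs -f           # then try to start it again (or: systemctl start pve-cluster)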
 
It's not mounted.

I also tried to copy the files manually but it didn't help.

I reinstalled the node and used the recovery.
For now it worked, but now I have a problem mounting the network share again.

When I figure out how to fix this, I will report it here.