Corosync Problem

GoatMaster

I used the official instructions to upgrade Proxmox: https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0

In my two test clusters everything went fine. But now on the production cluster I have a problem.

The first step was to update Corosync to version 3.
I stopped the HA services (systemctl stop pve-ha-lrm and systemctl stop pve-ha-crm).

Then I did the Corosync update and restarted the two services.
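Roughly the sequence I ran on each node (simplified; the corosync-3 repository setup from the wiki is omitted here):
Code:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
apt update
apt dist-upgrade          # pulls in corosync 3.x once the corosync-3 repo is enabled
systemctl start pve-ha-lrm
systemctl start pve-ha-crm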
But it seems like there is a problem now with Corosync.

The nodes cannot find the cluster and are working as standalone nodes.

Here is the output of systemctl status corosync

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 10:17:27 CET; 2h 9min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1933 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1933 (code=exited, status=8)

Dec 13 10:17:21 KI-S0162 systemd[1]: Starting Corosync Cluster Engine...
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlco
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Failed with result 'exit-code'.
Dec 13 10:17:27 KI-S0162 systemd[1]: Failed to start Corosync Cluster Engine.

I then did the upgrade from Proxmox VE 5.4 to 6.1 on all three nodes.

After the reboot it's still the same problem.
 
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'

That seems to be the error causing this.
Can you post cat /etc/corosync/corosync.conf from two nodes?

Either it's not resolvable for all nodes (which could be OK with Corosync 2, as it could fall back to the multicast group to find the other nodes), or there's something else wrong with the config format. You could try changing it to a "real" resolved address.
I.e., edit /etc/corosync/corosync.conf, ensure the config_version is bumped, copy it to the other nodes, then restart corosync. If all works, ensure the real copy is now also in /etc/pve/corosync.conf.
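Something along these lines (just a sketch of the steps above; adapt the host names, or use IPs if the names don't resolve):
Code:
# edit the local config: set the ring0_addr entries to resolvable addresses
# and bump config_version
nano /etc/corosync/corosync.conf
# copy it to the other nodes
scp /etc/corosync/corosync.conf root@KI-S0162:/etc/corosync/corosync.conf
scp /etc/corosync/corosync.conf root@KI-S0163:/etc/corosync/corosync.conf
# restart corosync on every node
systemctl restart corosync
# once the cluster is healthy again, make the cluster-wide copy match
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf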
 
Can you post cat /etc/corosync/corosync.conf from two nodes?

This is the output; it is the same on all nodes:
Code:
  logging {
    debug: off
    to_syslog: yes
  }

  nodelist {
    node {
      name: KI-S0161
      nodeid: 3
      quorum_votes: 1
      ring0_addr: KI-S0161
    }
    node {
      name: KI-S0162
      nodeid: 2
      quorum_votes: 1
      ring0_addr: KI-S0162
    }
    node {
      name: KI-S0163
      nodeid: 1
      quorum_votes: 1
      ring0_addr: KI-S0163
    }
  }

  quorum {
    provider: corosync_votequorum
  }

  totem {
    cluster_name: Proxmox-Cluster
    config_version: 7
    interface {
      bindnetaddr: 172.16.4.163
      ringnumber: 0
    }
    ip_version: ipv4
    secauth: on
    version: 2
  }


So you mean I should change the ring0_addr to the IP of the node?
 
And KI-S0161, KI-S0162 and KI-S0163 are resolvable by all nodes? I.e., you can ping them from every node?
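A quick way to check that, from every node (host names taken from your config):
Code:
getent hosts KI-S0161 KI-S0162 KI-S0163
ping -c 1 KI-S0161
ping -c 1 KI-S0162
ping -c 1 KI-S0163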

So you mean I should change the ring0_addr to the IP of the node?

Yeah. And in the totem section you could change the interface subsection to:
Code:
  interface {
    linknumber: 0
  }
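For example, the nodelist and totem sections could then look roughly like this (the .161 and .162 addresses are placeholders based on your .163 address; use the real node IPs, and note the bumped config_version):
Code:
  nodelist {
    node {
      name: KI-S0161
      nodeid: 3
      quorum_votes: 1
      ring0_addr: 172.16.4.161
    }
    node {
      name: KI-S0162
      nodeid: 2
      quorum_votes: 1
      ring0_addr: 172.16.4.162
    }
    node {
      name: KI-S0163
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 172.16.4.163
    }
  }

  totem {
    cluster_name: Proxmox-Cluster
    config_version: 8
    interface {
      linknumber: 0
    }
    ip_version: ipv4
    secauth: on
    version: 2
  }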
 
It worked. Thank you very much.

But I now have another problem with the first node of the cluster.

It seems like it cannot mount the pve filesystem anymore.

Output of systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 13:44:40 CET; 3min 25s ago
Process: 1860 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Dec 13 13:44:40 KI-S0161 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 13 13:44:40 KI-S0161 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
I tried restarting it several times, but it stays like this.

It's not even possible to access the web interface.
 
also, check that you don't have a /usr/bin/pmxcfs process already running/stuck

ps -aux | grep pmxcfs (this is the process that the pve-cluster service tries to launch)

you can also try to launch it manually to see if something goes wrong.
 

Output of pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

How can I start pmxcfs manually?
 
Check ps -aux|grep pmxcfs

You can try to start it manually in the foreground with:
Code:
pmxcfs -f

ps -aux|grep pmxcfs
root 3328 0.0 0.0 6072 828 pts/0 S+ 15:01 0:00 grep pmxcfs

pmxcfs -f
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
[main] crit: fuse_mount error: File exists
[main] notice: exit proxmox configuration filesystem (-1)
 
>>fuse: mountpoint is not empty

so, if pmxcfs is not running, do you have something in /etc/pve?
verify with df that it's not mounted, and if it isn't, delete anything in /etc/pve.

(maybe you tried to copy something manually into /etc/pve while it was not mounted?)
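For example (a sketch; double-check the df output before removing anything, and the backup path is just an example):
Code:
# is /etc/pve currently mounted? (pmxcfs provides it as a fuse mount)
df -h /etc/pve
# if it is NOT mounted, anything listed here is stray local files blocking the mount
ls -A /etc/pve
# move them aside (or delete them), then start the service again
mkdir -p /root/etc-pve-leftover
mv /etc/pve/* /root/etc-pve-leftover/
systemctl start pve-cluster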
 
It's not mounted.

I also tried to copy the files manually, but it didn't help.

I reinstalled the node and used the recovery.
For now it works, but now I have a problem mounting the network share again.

Once I have figured out how to fix this, I will report it here.
 
