Corosync Problem

GoatMaster

Active Member
Oct 4, 2017
I used the official instructions to upgrade Proxmox: https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0

In my two test clusters everything went fine. But now on the production cluster I have a problem.

I first updated the corosync version to 3.
I stopped the HA services (systemctl stop pve-ha-lrm and systemctl stop pve-ha-crm).

Then I updated corosync and restarted the two services.
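In detail, this is roughly what I ran on each node (the apt part from memory, following the corosync 3 repository steps from the wiki):
Code:
# stop the HA services first so nothing gets fenced during the upgrade
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# then pull in corosync 3 (repository set up as described in the wiki)
apt update
apt install corosync

# and start the HA services again afterwards
systemctl start pve-ha-crm
systemctl start pve-ha-lrm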
But it seems like there is a problem now with Corosync.

The nodes cannot find the cluster and are running as standalone nodes.

Here is the output of systemctl status corosync

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 10:17:27 CET; 2h 9min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1933 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1933 (code=exited, status=8)

Dec 13 10:17:21 KI-S0162 systemd[1]: Starting Corosync Cluster Engine...
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Dec 13 10:17:22 KI-S0162 corosync[1933]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlco
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Dec 13 10:17:27 KI-S0162 systemd[1]: corosync.service: Failed with result 'exit-code'.
Dec 13 10:17:27 KI-S0162 systemd[1]: Failed to start Corosync Cluster Engine.

I then did the upgrade from 5.4 to Proxmox VE 6.1 on all three nodes.

After the reboot it's still the same problem.
 
Dec 13 10:17:27 KI-S0162 corosync[1933]: [MAIN ] failed to parse node address 'KI-S0161'

That seems to be the error causing this.
Can you post cat /etc/corosync/corosync.conf from two nodes?

Either it's not resolvable for all nodes (which could be OK with corosync 2, as it could fall back to the multicast group to find the other nodes) or there's something else wrong with the config format. You could try to change it to a "real" resolved address.
I.e., edit /etc/corosync/corosync.conf, ensure the config_version is bumped, copy it to the other nodes, then restart corosync. If all works, ensure the updated copy also ends up at /etc/pve/corosync.conf.
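Roughly like this (node names/paths are just placeholders for your setup):
Code:
# edit the local config: set the addresses and increment config_version
nano /etc/corosync/corosync.conf

# copy the edited file to the other nodes
scp /etc/corosync/corosync.conf root@<other-node>:/etc/corosync/corosync.conf

# restart corosync on all nodes
systemctl restart corosync

# once the cluster is healthy again, also update the clustered copy
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf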
 
Can you post cat /etc/corosync/corosync.conf from two nodes?

This is the output; it is the same on all nodes:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: KI-S0161
    nodeid: 3
    quorum_votes: 1
    ring0_addr: KI-S0161
  }
  node {
    name: KI-S0162
    nodeid: 2
    quorum_votes: 1
    ring0_addr: KI-S0162
  }
  node {
    name: KI-S0163
    nodeid: 1
    quorum_votes: 1
    ring0_addr: KI-S0163
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmox-Cluster
  config_version: 7
  interface {
    bindnetaddr: 172.16.4.163
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}


So you mean I should change the ring0_addr to the IP of the node?
 
And KI-S0161, KI-S0162, KI-S0163 are resolvable by all nodes? I.e., you can ping them from every node?
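For example, a quick check on every node (hostnames taken from your config):
Code:
for h in KI-S0161 KI-S0162 KI-S0163; do
    getent hosts "$h"   # does the name resolve at all (e.g. via /etc/hosts)?
    ping -c1 "$h"       # and does the host answer?
done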

So you mean I should change the ring0_addr to the IP of the node?

Yeah. And in the totem section you could change the interface subsection to:
Code:
  interface {
    linknumber: 0
  }
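And for the nodelist, a fixed entry would look something like this (the 172.16.4.x address is only an example guessed from your bindnetaddr, use the node's real IP, and remember to bump config_version):
Code:
  node {
    name: KI-S0161
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.4.161
  }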
 
It worked. Thank you very much.

But I now have another problem with the first node of the cluster.

It seems like it cannot mount the pve filesystem anymore.

Output of systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-12-13 13:44:40 CET; 3min 25s ago
Process: 1860 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Dec 13 13:44:40 KI-S0161 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Dec 13 13:44:40 KI-S0161 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 13 13:44:40 KI-S0161 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
I tried to restart it several times, but it stays like this.

It's not even possible to access the web interface.
 
Also, check that you don't have a /usr/bin/pmxcfs process already running/stuck:

ps -aux | grep pmxcfs (this is the process that the pve-cluster service tries to launch)

You can also try to launch it manually to see if something goes wrong.
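Something like this (only kill it if it is really hanging, and use the PID from the ps output):
Code:
ps -aux | grep '[p]mxcfs'     # is a pmxcfs process still around?
# kill <pid>                  # only if it is stuck
systemctl start pve-cluster   # then try the service again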
 
Also, check that you don't have a /usr/bin/pmxcfs process already running/stuck:

ps -aux | grep pmxcfs (this is the process that the pve-cluster service tries to launch)

You can also try to launch it manually to see if something goes wrong.

Output of pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

How can I start pmxcfs manually?
 
Check ps -aux|grep pmxcfs

You can try to start it manually in the foreground with:
Code:
pmxcfs -f

ps -aux|grep pmxcfs
root 3328 0.0 0.0 6072 828 pts/0 S+ 15:01 0:00 grep pmxcfs

pmxcfs -f
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
[main] crit: fuse_mount error: File exists
[main] notice: exit proxmox configuration filesystem (-1)
 
>> fuse: mountpoint is not empty

So, if pmxcfs is not running, do you have something in /etc/pve?
Verify with df that it is not mounted, and if it isn't, delete everything in /etc/pve.

(Did you maybe try to copy something into /etc/pve manually while it was not mounted?)
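For example (only remove things when you are sure /etc/pve is really not mounted):
Code:
df -h /etc/pve      # when pmxcfs runs, this shows a fuse filesystem mounted on /etc/pve
ls -la /etc/pve     # if it is not mounted, this shows the plain directory underneath
# rm -r /etc/pve/*  # remove the leftovers only if they are stray manual copies
pmxcfs -f           # then try to start it again (or: systemctl start pve-cluster)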
 
It's not mounted.

I also tried to copy the files manually but it didn't help.

I reinstalled the node and used the recovery.
For now it worked, but now I have a problem mounting the network share again.

When I figure out how to fix this, I will report it here.