Hi,
We have a cluster of 4 PVE servers; all were working fine.
Corosync runs over the main management interface (vmbr0).
We stopped one node, and after it booted up again corosync failed to work on it (no updates were performed before the shutdown).
The other nodes work, and tcpdump shows UDP packets going to/from port 5405 between all nodes.
The other 3 nodes see each other; this one is not joining the quorum.
I tried restarting corosync and pve-cluster on it, and restarting the whole server, but it is still not working.
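For reference, the restarts were done roughly like this on the failed node (from memory, so treat it as approximate rather than exact):
Code:
# restart corosync and the cluster filesystem on the failed node
systemctl restart corosync
systemctl restart pve-cluster

# then re-check membership and the logs since boot
pvecm status
journalctl --unit=corosync --unit=pve-cluster -b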
journalctl --unit=pve-cluster shows:
Code:
Feb 04 03:05:10 ndi-srv-024 pmxcfs[5065]: [status] crit: cpg_send_message failed: 6
Feb 04 03:05:12 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 10
Feb 04 03:05:13 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 20
Feb 04 03:05:14 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 30
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 40
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: members: 2/1455, 4/5065
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: starting data syncronisation
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: members: 2/1455, 4/5065
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: starting data syncronisation
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retried 41 times
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: members: 4/5065
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: all data is up to date
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [status] notice: members: 4/5065
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [status] notice: all data is up to date
Feb 04 03:05:22 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 10
Feb 04 03:05:23 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 20
Feb 04 03:05:24 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 30
Feb 04 03:05:25 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 40
Feb 04 03:05:26 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 50
The corosync log on the same node shows:
Code:
Feb 04 03:06:23 ndi-srv-024 corosync[5071]: [QUORUM] Members[1]: 4
Feb 04 03:06:23 ndi-srv-024 corosync[5071]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 04 03:06:24 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9db5e) was formed. Members joined: 1 2 3
Feb 04 03:06:26 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (4.9dbca) was formed. Members left: 1 2 3
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [QUORUM] Members[1]: 4
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9dbce) was formed. Members joined: 1 2 3
Feb 04 03:06:35 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (4.9dc32) was formed. Members left: 1 2 3
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9dc36) was formed. Members joined: 1 2 3
Feb 04 03:06:43 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
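One thing I am considering, to get more detail on this FAILED TO RECEIVE loop, is enabling corosync debug logging. If I read the docs correctly, that would mean editing the logging section of corosync.conf from a quorate node (and bumping config_version so it propagates), roughly like this; please correct me if this is not the right way:
Code:
# in /etc/pve/corosync.conf, edited on a quorate node,
# with config_version in the totem section incremented
logging {
  debug: on
  to_syslog: yes
}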
pvecm status on the failed node:
Code:
Cluster information
-------------------
Name: proxmox
Config Version: 6
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Feb 4 02:58:31 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 1.9c68e
Quorate: No
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000004 1 10.10.10.58 (local)
pvecm status on a working node:
Code:
Quorum information
------------------
Date: Thu Feb 4 03:03:41 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1/644422
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.52
0x00000002 1 10.10.10.54 (local)
0x00000003 1 10.10.10.56
This cluster is composed of nodes running different PVE versions.
Non-working node (originally had 6.1 and worked until this restart; I upgraded it to see if it helps, but no change):
Code:
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
Working nodes:
Code:
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
pve-manager/6.0-6/c71f879f (running kernel: 5.0.21-1-pve)
Now, I am aware that we should upgrade these machines to the same version (although they worked just fine, and this specific node had already been restarted a few times), but I have a few major concerns here:
- I cannot say why this just happened. This is pretty important: I need some clue how to deal with these issues and, at the end of the day, to provide a report of what went wrong. The network seems fine, no updates were performed, and packets (UDP and TCP) travel between the hosts, yet quorum is not working on this particular node. How can I debug this? Obviously packets are sent but discarded for some reason (maybe some are missing?), and corosync-related issues are quite elusive. (My current debugging plan is sketched after this list.)
- If we start upgrading a cluster with older nodes, will we face this issue again? We have this cluster with 6.0 versions, and another 2 clusters with 6.1 (those are all at the same level). What will happen if I start upgrading to 6.3?
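To make the first question more concrete, this is roughly the checklist I am planning to run on the failed node next; it is my own sketch, so please point out anything that is wrong or missing:
Code:
# knet link / ring and quorum status as corosync sees it
corosync-cfgtool -s
corosync-quorumtool -s

# confirm the corosync config and authkey match the working nodes
grep config_version /etc/corosync/corosync.conf
cksum /etc/corosync/corosync.conf /etc/corosync/authkey

# rule out an MTU problem towards a working node (1472 = 1500 minus IP/ICMP headers)
ping -M do -s 1472 -c 4 10.10.10.52

# capture corosync traffic again while the node tries to join
tcpdump -ni vmbr0 udp port 5405

If the config_version or the authkey differs from the working nodes, I assume that could explain why the traffic is visible on the wire but still gets discarded.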
PS We have a Community subscription for all our clusters.