Issue after upgrading Proxmox cluster: disconnected nodes and pve-cluster service restart

Vladimir.root

New Member
Sep 1, 2023
Hello, esteemed community members!

I have encountered an issue after upgrading our Proxmox cluster from version 7.4 to version 8.1. Half of the 11 cluster nodes stopped functioning properly, which has caused significant problems for our infrastructure.

Specifically, these nodes kept disconnecting and reconnecting until we restarted the pve-cluster service. Only after restarting the service did all nodes become accessible again.
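
For readers who land here with the same symptom, the check-then-restart workaround described above could be sketched roughly like this. `pvecm status` and `systemctl restart pve-cluster` are standard Proxmox VE commands; the helper name `is_quorate` is made up for illustration, and it assumes the `Quorate:` line that `pvecm status` prints in its "Quorum information" section.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Proxmox): reads `pvecm status` output
# on stdin and succeeds only if it contains a "Quorate: Yes" line.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes[[:space:]]*$'
}

# Possible usage on an affected node, as root (not executed here):
#   pvecm status | is_quorate || systemctl restart pve-cluster
```

Restarting pve-cluster only restarts pmxcfs on that node; whether this is enough in a given cluster is an assumption based on the report above, not a general fix.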

I would like to know if anyone else has experienced a similar issue after upgrading their Proxmox cluster or if there are any recommendations for this situation. Perhaps someone can share their experience or suggest a possible solution.

I would be grateful for any assistance or advice!

Thank you in advance!
 
Hi,

I had a similar problem today.
Our 16-node cluster was upgraded to version 7.4.17 without problems.

The next step was upgrading one node to 8.1.4. During this upgrade, 6 nodes were restarted.

Our cluster has been in production use for almost 2 years and has been upgraded many times without problems. We have 2 separate network rings for corosync.

I am afraid to proceed with upgrading the other nodes.

The logs on the restarted nodes contain these messages, after which the nodes were restarted:
Code:
Feb 14 12:03:29 pve1-prg1a corosync[2412]:   [TOTEM ] Token has not been received in 21221 ms
Feb 14 12:03:41 pve1-prg1a corosync[2412]:   [TOTEM ] Token has not been received in 33326 ms
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [QUORUM] Sync members[9]: 2 3 4 6 8 10 11 12 15
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [QUORUM] Sync left[7]: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [TOTEM ] A new membership (2.2b7) was formed. Members left: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [TOTEM ] Failed to receive the leave message. failed: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [dcdb] notice: members: 2/2270, 3/2260, 4/2273, 6/2259, 8/2338, 10/2340, 11/2312, 12/2330, 15/2340
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [dcdb] notice: starting data syncronisation
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [status] notice: members: 2/2270, 3/2260, 4/2273, 6/2259, 8/2338, 10/2340, 11/2312, 12/2330, 15/2340
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [status] notice: starting data syncronisation
Feb 14 12:03:44 pve1-prg1a pvedaemon[3213573]: <root@pam> successful auth for user 'pve-exporter@pve'
Feb 14 12:03:45 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 10
Feb 14 12:03:46 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 20
Feb 14 12:03:47 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 30
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
 
Hello,

I had a similar case, but it occurred randomly on certain hosts. While upgrading Proxmox, several hosts left the Corosync cluster, but after a restart they returned to synchronization.

Code:
Oct 24 00:18:07 pve01 corosync[2624]:   [TOTEM ] Retransmit List: d04
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] link: host: 25 link: 0 is down
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 (passive) best link: 1 (pri: 1)
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] link: host: 25 link: 1 is down
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 (passive) best link: 1 (pri: 1)
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 has no active links
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 24 00:18:31 pve01 pveproxy[1115819]: proxy detected vanished client connection
Oct 24 00:18:55 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
Oct 24 00:18:56 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 20
Oct 24 00:18:57 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 30
Oct 24 00:18:58 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 40
Oct 24 00:18:59 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 50
Oct 24 00:19:00 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 60
Oct 24 00:19:01 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 70
Oct 24 00:19:02 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 80
Oct 24 00:19:02 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 15a
Oct 24 00:19:03 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 90
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 100
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retried 100 times
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] crit: cpg_send_message failed: 6
Oct 24 00:19:05 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
Oct 24 00:19:06 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 20
Oct 24 00:19:07 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 30
Oct 24 00:19:07 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 277 27f
Oct 24 00:19:08 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 40
Oct 24 00:19:09 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 50
Oct 24 00:19:10 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 60
Oct 24 00:19:11 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 70
Oct 24 00:19:12 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 80
Oct 24 00:19:13 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 90
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 100
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retried 100 times
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] crit: cpg_send_message failed: 6
Oct 24 00:19:15 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 3b4
Oct 24 00:19:15 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
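
To compare nodes more easily, an excerpt like the one above can be reduced to a few counters. This is only a sketch: the function name `summarize_cluster_log` is hypothetical, and the patterns are taken from the message texts shown in this thread, so they may need adjusting for other versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a syslog/journal excerpt on stdin and count
# the corosync/pmxcfs messages that appear in this thread.
summarize_cluster_log() {
    awk '
        /cpg_send_message failed/          { failed++ }   # pmxcfs gave up after 100 retries
        /\[KNET *\].*has no active links/  { nolinks++ }  # a peer lost every corosync link
        /\[TOTEM *\] Retransmit List/      { retrans++ }  # totem had to resend messages
        END {
            printf "cpg_send_message failures: %d\n", failed
            printf "no-active-links events: %d\n", nolinks
            printf "retransmit list lines: %d\n", retrans
        }
    '
}

# Possible usage on a node (not executed here):
#   journalctl -u corosync -u pve-cluster --since "1 hour ago" | summarize_cluster_log
```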
This is very critical because the Proxmox GUI becomes unresponsive (it cannot be accessed). I also found that I can't write to the /etc/pve directory:
Code:
root@pve01: /etc/pve# for i in test{1..20};do echo "aa" >> test; sleep 1s; done
-bash: test: Device or resource busy
-bash: test: Device or resource busy
-bash: test: Device or resource busy
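
A slightly tidier version of the write test above, as a minimal sketch. The function name `probe_writable` and the probe file name are hypothetical. Writes to /etc/pve go through pmxcfs, so a failing probe most likely means the cluster filesystem cannot sync with corosync at that moment; that reading is an assumption based on the logs in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to create and then remove a small probe file
# in the given directory (default: /etc/pve) and report the result.
probe_writable() {
    local dir=${1:-/etc/pve}
    local probe="$dir/.write-probe.$$"
    # The subshell keeps the shell's own redirection error off the terminal.
    if ( echo probe > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "$dir: writable"
    else
        echo "$dir: NOT writable"
        return 1
    fi
}

# Possible usage on a node, once per second for 20 seconds (not executed here):
#   for n in $(seq 1 20); do probe_writable /etc/pve; sleep 1; done
```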

Can anyone provide a solution regarding this?

Thanks
 
