Issue after upgrading Proxmox cluster: disconnected nodes and pve-cluster service restart

Vladimir.root · Feb 14, 2024

Hello, esteemed community members!

I have encountered an issue after upgrading our Proxmox cluster from version 7.4 to version 8.1. Half of the 11 cluster nodes stopped functioning properly, which has caused significant problems for our infrastructure.

Specifically, these nodes kept disconnecting and reconnecting until we restarted the pve-cluster service. Only after restarting the service did all nodes become accessible again.

I would like to know if anyone else has experienced a similar issue after upgrading their Proxmox cluster or if there are any recommendations for this situation. Perhaps someone can share their experience or suggest a possible solution.

I would be grateful for any assistance or advice!

Thank you in advance!

Hybo · Feb 14, 2024

Hi,

I had a similar problem today.
Our 16 node cluster was upgraded to version 7.4.17 without problems.

The next step was upgrading one node to 8.1.4. During this upgrade, 6 nodes were restarted.

Our cluster has been in production for almost 2 years and has been upgraded many times without problems. We have 2 separate network rings for corosync.

I am afraid to proceed with the upgrade of the other nodes.

The logs on the restarted nodes contain these messages just before the restart:

Code:

Feb 14 12:03:29 pve1-prg1a corosync[2412]:   [TOTEM ] Token has not been received in 21221 ms
Feb 14 12:03:41 pve1-prg1a corosync[2412]:   [TOTEM ] Token has not been received in 33326 ms
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [QUORUM] Sync members[9]: 2 3 4 6 8 10 11 12 15
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [QUORUM] Sync left[7]: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [TOTEM ] A new membership (2.2b7) was formed. Members left: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a corosync[2412]:   [TOTEM ] Failed to receive the leave message. failed: 1 5 7 9 13 14 16
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [dcdb] notice: members: 2/2270, 3/2260, 4/2273, 6/2259, 8/2338, 10/2340, 11/2312, 12/2330, 15/2340
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [dcdb] notice: starting data syncronisation
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [status] notice: members: 2/2270, 3/2260, 4/2273, 6/2259, 8/2338, 10/2340, 11/2312, 12/2330, 15/2340
Feb 14 12:03:44 pve1-prg1a pmxcfs[2260]: [status] notice: starting data syncronisation
Feb 14 12:03:44 pve1-prg1a pvedaemon[3213573]: <root@pam> successful auth for user 'pve-exporter@pve'
Feb 14 12:03:45 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 10
Feb 14 12:03:46 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 20
Feb 14 12:03:47 pve1-prg1a pmxcfs[2260]: [status] notice: cpg_send_message retry 30
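
For anyone comparing logs like these across several nodes, here is a minimal sketch that pulls out the longest "Token has not been received" gap and the node IDs reported as having left. The function name `summarize_totem` is made up for illustration, and the regular expressions only assume the message wording shown in the excerpt above; other corosync versions may phrase these lines differently.

```python
import re

# Matches lines such as "[TOTEM ] Token has not been received in 21221 ms"
TOKEN_RE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")
# Matches "[TOTEM ] A new membership (2.2b7) was formed. Members left: 1 5 7"
LEFT_RE = re.compile(
    r"\[TOTEM\s*\] A new membership \(.*?\) was formed\. Members left: ([\d ]+)"
)


def summarize_totem(lines):
    """Return (longest token gap in ms, sorted node IDs reported as left)."""
    longest = 0
    left = set()
    for line in lines:
        m = TOKEN_RE.search(line)
        if m:
            longest = max(longest, int(m.group(1)))
        m = LEFT_RE.search(line)
        if m:
            left.update(int(n) for n in m.group(1).split())
    return longest, sorted(left)
```

It could be fed with the output of `journalctl -u corosync` read line by line, for example `summarize_totem(open("corosync.log"))`, where `corosync.log` is a hypothetical file holding that output.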

Hybo · Feb 15, 2024

Same issue after upgrading the second node to 8.1.4.

Vladimir.root · Feb 15, 2024

Hybo said:
Same issue after upgrading the second node to 8.1.4.

We observed a certain pattern: the nodes that fail are the ones that received the VMs migrated from the node being upgraded.

Hybo · Feb 15, 2024

Vladimir.root said:
We observed a certain pattern: the nodes that fail are the ones that received the VMs migrated from the node being upgraded.

Not in my case... the second upgraded node had no running VMs (it had two VMs, but both were stopped).

rozaq · Oct 24, 2024

Hello,

I had a similar case, but it affected random hosts. While upgrading Proxmox, several hosts left the Corosync cluster, but after restarting, the hosts returned to synchronization.

Code:

Oct 24 00:18:07 pve01 corosync[2624]:   [TOTEM ] Retransmit List: d04
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] link: host: 25 link: 0 is down
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 (passive) best link: 1 (pri: 1)
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] link: host: 25 link: 1 is down
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 (passive) best link: 1 (pri: 1)
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] host: host: 25 has no active links
Oct 24 00:18:19 pve01 corosync[2624]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 24 00:18:31 pve01 pveproxy[1115819]: proxy detected vanished client connection
Oct 24 00:18:55 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
Oct 24 00:18:56 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 20
Oct 24 00:18:57 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 30
Oct 24 00:18:58 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 40
Oct 24 00:18:59 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 50
Oct 24 00:19:00 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 60
Oct 24 00:19:01 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 70
Oct 24 00:19:02 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 80
Oct 24 00:19:02 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 15a
Oct 24 00:19:03 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 90
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 100
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retried 100 times
Oct 24 00:19:04 pve01 pmxcfs[2519]: [status] crit: cpg_send_message failed: 6
Oct 24 00:19:05 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
Oct 24 00:19:06 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 20
Oct 24 00:19:07 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 30
Oct 24 00:19:07 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 277 27f
Oct 24 00:19:08 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 40
Oct 24 00:19:09 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 50
Oct 24 00:19:10 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 60
Oct 24 00:19:11 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 70
Oct 24 00:19:12 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 80
Oct 24 00:19:13 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 90
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 100
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retried 100 times
Oct 24 00:19:14 pve01 pmxcfs[2519]: [status] crit: cpg_send_message failed: 6
Oct 24 00:19:15 pve01 corosync[2624]:   [TOTEM ] Retransmit List: 3b4
Oct 24 00:19:15 pve01 pmxcfs[2519]: [status] notice: cpg_send_message retry 10
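
To see at a glance which peers lost their links and how often pmxcfs gave up sending, a log like the one above can be condensed with a short script. This is only a sketch: `summarize_knet` is a hypothetical name, and the patterns assume the exact KNET and pmxcfs wording from this excerpt. As far as I know, the code 6 in `cpg_send_message failed: 6` corresponds to `CS_ERR_TRY_AGAIN` in corosync's `cs_error_t`, but treat that as an assumption and check it against your corosync headers.

```python
import re
from collections import defaultdict

LINK_DOWN_RE = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
NO_LINKS_RE = re.compile(r"\[KNET\s*\] host: host: (\d+) has no active links")
CPG_FAIL_RE = re.compile(r"cpg_send_message failed: (\d+)")


def summarize_knet(lines):
    """Collect link-down events per host, fully isolated hosts, and CPG failures."""
    links_down = defaultdict(set)  # host ID -> set of link numbers reported down
    isolated = set()               # host IDs reported with no active links
    cpg_failures = 0               # number of "cpg_send_message failed" lines
    for line in lines:
        m = LINK_DOWN_RE.search(line)
        if m:
            links_down[int(m.group(1))].add(int(m.group(2)))
            continue
        m = NO_LINKS_RE.search(line)
        if m:
            isolated.add(int(m.group(1)))
            continue
        if CPG_FAIL_RE.search(line):
            cpg_failures += 1
    return {
        "links_down": {host: sorted(links) for host, links in links_down.items()},
        "isolated": sorted(isolated),
        "cpg_failures": cpg_failures,
    }
```

A host that appears under `isolated` lost every configured link at once, which points toward the network or the peer itself rather than a single ring.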

This is very critical because the Proxmox GUI becomes unresponsive (it cannot be accessed). I also found that I cannot write to the /etc/pve directory:

Code:

root@pve01: /etc/pve# for i in test{1..20};do echo "aa" >> test; sleep 1s; done
-bash: test: Device or resource busy
-bash: test: Device or resource busy
-bash: test: Device or resource busy
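
The same write test can be scripted so that it reports the error name instead of a shell message, which makes it easier to run from monitoring. This is a rough sketch only: `probe_write` and the file name `write-probe.tmp` are made up for illustration, and it assumes that creating and deleting a small scratch file in the target directory is acceptable on your system.

```python
import errno
import os


def probe_write(directory, name="write-probe.tmp"):
    """Try to append to a scratch file; return None on success or the errno name."""
    path = os.path.join(directory, name)
    try:
        with open(path, "a") as fh:
            fh.write("aa\n")
        os.remove(path)  # clean up the scratch file after a successful write
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 16 -> "EBUSY" on Linux
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Calling `probe_write("/etc/pve")` on a node in the state shown above would presumably return `"EBUSY"`, matching the "Device or resource busy" message, and `None` once writes succeed again.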

Can anyone provide a solution regarding this?

Thanks
