Proxmox cluster version 6.4-6 having unexpected issues

powersupport

Hi support,

Recently we faced some critical issues with our Proxmox cluster. The cluster is set up with 3 nodes.



The issues are:

*) Migrating VMs between nodes takes a long time, and sometimes the Proxmox GUI goes down during the migration. VMs also remain locked even after the migration has completed; please refer to the attachment.

*) Upgrading Proxmox has the same issue: after the update, some nodes show a "Red X" mark and the GUI is not accessible. SSH still works, but we cannot restart services because they hang, so we need to reboot the nodes to fix the issue with the VMs. This creates huge downtime for us, since migration is also affected.

*) Adding additional nodes is also affected. We removed an existing node from the cluster and tried to re-add it after changing some hardware. The GUI went down immediately after the node was added to the cluster, and some nodes are only accessible intermittently.

*) Sometimes after the GUI goes down, we can access each node independently, but the other nodes show "Red X" marks. The behaviour is strange and hard to describe fully.



We are facing these issues with the latest version, 6.4-6. We are not sure about earlier versions, because we first updated all nodes and only afterwards tried to migrate VMs in order to reboot the nodes for the kernel update. The migrations mostly cause the cluster to go down.



Regarding the issue of adding an additional node mentioned above: earlier there were two Intel CPU nodes and one AMD node. After we saw some forum posts saying that Proxmox has issues with AMD, we removed Node 1 from the cluster and re-added it with an Intel CPU. We did this to resolve the issue, but the issue still exists.

Cluster details
--------------------------

Proxmox version: Virtual Environment 6.4-6

The cluster is set up with 7 OSDs on each node

Ceph version: 14.2.20
The cluster is set up in a private network



Below are some logs we got while adding a node to the cluster:

May 25 12:50:20 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f4
May 25 12:50:29 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f5
May 25 12:50:29 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f3
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] link: host: 3 link: 0 is down
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] host: host: 3 has no active links
May 25 12:50:59 px-n1 corosync[1659]: [KNET ] rx: host: 3 link: 0 is up
May 25 12:50:59 px-n1 corosync[1659]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 25 12:51:00 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f9
May 25 12:51:10 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fa
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] link: host: 2 link: 0 is down
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] host: host: 2 has no active links
May 25 12:51:19 px-n1 corosync[1659]: [KNET ] rx: host: 2 link: 0 is up
May 25 12:51:19 px-n1 corosync[1659]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 25 12:51:21 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fb
May 25 12:51:28 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fc
May 25 12:51:37 px-n1 sshd[2960]: Connection closed by 192.168.33.1 port 59618 [preauth]


root@px-n1:~# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-05-25 12:40:39 +08; 2min 36s ago
Process: 1663 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 1666 (pmxcfs)
Tasks: 6 (limit: 4915)
Memory: 24.5M
CGroup: /system.slice/pve-cluster.service
└─1666 /usr/bin/pmxcfs
May 25 12:40:38 px-n1 pmxcfs[1666]: [status] crit: can't initialize service
May 25 12:40:39 px-n1 systemd[1]: Started The Proxmox VE cluster filesystem.
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: update cluster info (cluster name tkpx-cluster, version = 5)
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: node has quorum
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: members: 1/1666, 2/1817, 3/1822
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: starting data syncronisation
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: received sync request (epoch 1/1666/00000001)
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: members: 1/1666, 2/1817, 3/1822
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: starting data synchronisation


May 25 12:52:31 px-n2 kernel: R13: 000055b3dd4feaf0 R14: 000055b3dd32fd60 R15: 00000000000001ff
May 25 12:52:31 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:31 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 54 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 53 54 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:35 px-n2 corosync[1971]: [TOTEM ] Token has not been received in 2737 ms
May 25 12:52:36 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
 

Attachments

  • px-ss1.png
Hi,

Have you already been able to solve this? Do you have separate networks for Corosync/cluster traffic, Ceph, administration, etc.? It sounds like the migrations saturate your network connection, so the Corosync packets do not arrive in time
[quote]
May 25 12:52:35 px-n2 corosync[1971]: [TOTEM ] Token has not been received in 2737 ms
[/quote]
and then all sorts of bad stuff can happen (like a node shutdown).

If you still have the problem, could you please post your Corosync and Ceph configuration?
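
To check whether the Corosync trouble lines up with your migrations, it can help to count the relevant messages per time window. Below is a minimal sketch in Python, not an official tool: the function name `summarize_corosync` and the regular expressions are my own assumptions, written only against the log lines quoted in this thread, so they may need adjusting for other message variants.

```python
import re
from collections import Counter

# Patterns modelled on the corosync messages quoted in this thread (assumed format).
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
TOKEN_LATE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")
RETRANSMIT = re.compile(r"\[TOTEM\s*\] Retransmit List:")


def summarize_corosync(lines):
    """Count link-down events per host, retransmit lines, and late-token delays."""
    link_down = Counter()
    retransmits = 0
    token_delays_ms = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            link_down[int(match.group(1))] += 1
            continue
        if RETRANSMIT.search(line):
            retransmits += 1
            continue
        match = TOKEN_LATE.search(line)
        if match:
            token_delays_ms.append(int(match.group(1)))
    return {
        "link_down": dict(link_down),
        "retransmits": retransmits,
        "token_delays_ms": token_delays_ms,
    }


# A few lines copied from the logs posted above.
SAMPLE = """\
May 25 12:50:20 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f4
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] link: host: 3 link: 0 is down
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] link: host: 2 link: 0 is down
May 25 12:52:35 px-n2 corosync[1971]: [TOTEM ] Token has not been received in 2737 ms
""".splitlines()

if __name__ == "__main__":
    print(summarize_corosync(SAMPLE))
```

You could feed it the saved output of `journalctl -u corosync` from each node and compare the counts during a migration with the counts while the cluster is idle. If link-down events and late tokens only show up while a migration or Ceph recovery is running, that would support the network saturation theory.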
 
