Hi support,
Recently we faced some critical issues with our Proxmox cluster, which is set up with 3 nodes.
The issues are:
*) Migrating VMs between nodes takes a long time, and sometimes the Proxmox GUI goes down during the migration. VMs also remain locked even after the migration has completed. Please refer to the attachment.
*) Upgrading Proxmox shows the same issue. After the update, some nodes show a "Red X" mark and the GUI is not accessible. SSH still works, but we cannot restart services because the restart hangs, so we have to reboot the nodes to fix the issue with the VMs. This causes huge downtime for us, since migration is also affected.
*) Adding nodes is also affected. We removed an existing node from the cluster and tried to add it back after changing some hardware. The GUI went down immediately after the node was added to the cluster, and some nodes are only accessible intermittently.
*) Sometimes after the GUI goes down, we can access each node independently, but the other nodes show "Red X" marks. It is a strange issue and hard to explain fully.
We are facing these issues with the latest version, 6.4-6. We are not sure whether earlier versions were affected, because we first updated all nodes and only then tried to migrate VMs in order to reboot the nodes for the kernel update. The migrations mostly cause the cluster to go down.
Regarding the issue with adding a node mentioned above: earlier there were two Intel CPU nodes and one AMD node. After reading forum posts saying that Proxmox has some issues with AMD, we removed Node 1 from the cluster and re-added it with an Intel CPU to resolve the issue, but the issue still exists.
Cluster details
--------------------------
Proxmox version: Virtual Environment 6.4-6
Each node is set up with 7 OSDs
Ceph version: 14.2.20
The cluster is set up in a private network
Below are some logs we got while adding a node to the cluster:
May 25 12:50:20 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f4
May 25 12:50:29 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f5
May 25 12:50:29 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f3
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] link: host: 3 link: 0 is down
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 25 12:50:57 px-n1 corosync[1659]: [KNET ] host: host: 3 has no active links
May 25 12:50:59 px-n1 corosync[1659]: [KNET ] rx: host: 3 link: 0 is up
May 25 12:50:59 px-n1 corosync[1659]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 25 12:51:00 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f9
May 25 12:51:10 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fa
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] link: host: 2 link: 0 is down
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 25 12:51:17 px-n1 corosync[1659]: [KNET ] host: host: 2 has no active links
May 25 12:51:19 px-n1 corosync[1659]: [KNET ] rx: host: 2 link: 0 is up
May 25 12:51:19 px-n1 corosync[1659]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 25 12:51:21 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fb
May 25 12:51:28 px-n1 corosync[1659]: [TOTEM ] Retransmit List: fc
May 25 12:51:37 px-n1 sshd[2960]: Connection closed by 192.168.33.1 port 59618 [preauth]
root@px-n1:~# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-05-25 12:40:39 +08; 2min 36s ago
Process: 1663 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 1666 (pmxcfs)
Tasks: 6 (limit: 4915)
Memory: 24.5M
CGroup: /system.slice/pve-cluster.service
└─1666 /usr/bin/pmxcfs
May 25 12:40:38 px-n1 pmxcfs[1666]: [status] crit: can't initialize service
May 25 12:40:39 px-n1 systemd[1]: Started The Proxmox VE cluster filesystem.
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: update cluster info (cluster name tkpx-cluster, version = 5)
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: node has quorum
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: members: 1/1666, 2/1817, 3/1822
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: starting data syncronisation
May 25 12:40:44 px-n1 pmxcfs[1666]: [dcdb] notice: received sync request (epoch 1/1666/00000001)
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: members: 1/1666, 2/1817, 3/1822
May 25 12:40:44 px-n1 pmxcfs[1666]: [status] notice: starting data synchronisation
May 25 12:52:31 px-n2 kernel: R13: 000055b3dd4feaf0 R14: 000055b3dd32fd60 R15: 00000000000001ff
May 25 12:52:31 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:31 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 54 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
May 25 12:52:32 px-n2 corosync[1971]: [TOTEM ] Retransmit List: 53 54 57 5c 5d 5e 5f 34 3b 3c 3d 3e 3f
May 25 12:52:35 px-n2 corosync[1971]: [TOTEM ] Token has not been received in 2737 ms
May 25 12:52:36 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f
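
To make the corosync output above easier to compare across nodes, the events can be tallied per node. The following is only a minimal sketch, assuming the syslog line format shown in this post; the function name `summarize` and the `SAMPLE` list are made up for illustration, and the sample lines are copied from the logs above. It only counts events and does not diagnose the cause.

```python
import re
from collections import Counter

# Patterns assume the exact syslog layout pasted in this post.
HOST_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+corosync\[")
LINK_DOWN_RE = re.compile(r"\[KNET\s*\] link: host: (\d+) link: \d+ is down")
RETRANS_RE = re.compile(r"\[TOTEM\s*\] Retransmit List:\s*(.*)")
TOKEN_RE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")


def summarize(lines):
    """Tally corosync link-down, retransmit and token-timeout events per node."""
    link_down = Counter()         # (reporting node, peer host id) -> count
    retransmit_lines = Counter()  # reporting node -> number of Retransmit lines
    retransmit_seqs = Counter()   # reporting node -> total sequence ids listed
    token_timeouts = []           # (reporting node, milliseconds)

    for line in lines:
        host = HOST_RE.match(line)
        if not host:
            continue  # skip anything that is not a corosync syslog line
        node = host.group(1)

        m = LINK_DOWN_RE.search(line)
        if m:
            link_down[(node, m.group(1))] += 1
            continue
        m = RETRANS_RE.search(line)
        if m:
            retransmit_lines[node] += 1
            retransmit_seqs[node] += len(m.group(1).split())
            continue
        m = TOKEN_RE.search(line)
        if m:
            token_timeouts.append((node, int(m.group(1))))

    return {
        "link_down": link_down,
        "retransmit_lines": retransmit_lines,
        "retransmit_seqs": retransmit_seqs,
        "token_timeouts": token_timeouts,
    }


# A few lines copied from the logs in this post.
SAMPLE = [
    "May 25 12:50:20 px-n1 corosync[1659]: [TOTEM ] Retransmit List: f4",
    "May 25 12:50:57 px-n1 corosync[1659]: [KNET ] link: host: 3 link: 0 is down",
    "May 25 12:52:31 px-n2 corosync[1971]: [TOTEM ] Retransmit List:34 3b 3c 3d 3e 3f",
    "May 25 12:52:35 px-n2 corosync[1971]: [TOTEM ] Token has not been received in 2737 ms",
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, dict(value) if isinstance(value, Counter) else value)
```

To run it over a full log instead of the sample, the lines of a saved log file can be passed to `summarize` in place of `SAMPLE`.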