PVE upgrade 6 to 7 issue

powersupport

Hi,

Recently we performed an upgrade on our Proxmox cluster and faced an outage. We have a 4-node cluster; after completing the upgrade on the 3rd node, we tried to reboot node 3, but the whole cluster, meaning all nodes, rebooted. We suffered a lot due to the downtime; it lasted around 15 minutes, after which the nodes came back up and were fine. Has anyone seen such a scenario? What could be the reason for this issue?


What we did is (rough command sketch below):

1) Upgraded Ceph Nautilus to Pacific
2) Upgraded PVE 6 to 7
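Roughly, this followed the usual upgrade flow (a sketch only; the exact repository changes and package versions are what the official Ceph Nautilus-to-Pacific and PVE 6-to-7 upgrade guides describe):

Code:
# check readiness for the PVE 7 upgrade on every node
pve6to7 --full

# keep Ceph from rebalancing while OSDs restart during the upgrade
ceph osd set noout

# after switching the PVE/Ceph repositories to bullseye/pacific, upgrade the packages
apt update
apt full-upgrade

# once all nodes and all Ceph daemons are upgraded
ceph osd unset noout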
 
could you post the journal of the third node and one of the other nodes that got rebooted, starting before and ending after the reboots (a few minutes in both directions should be enough).
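something like this should capture the relevant window from each node (the timestamps are placeholders, adjust them to a few minutes around the reboot):

Code:
# run on node 3 and on one of the other nodes that got rebooted
journalctl --since "2021-10-14 03:00" --until "2021-10-14 03:30" > journal-$(hostname).txt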
 
We performed an upgrade on our Proxmox cluster again and faced the same issue mentioned above. Please see the attached log below.

https://easyupload.io/lxzmx6

Below is the error we are getting from the front end.

Connection failure. Network error or Proxmox VE services not running?
cluster not ready - no quorum? (500)
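For reference, the quorum state can also be checked from a node's shell while the GUI shows this error (a sketch; it assumes the shell is still reachable during the outage):

Code:
# membership/quorum as corosync sees it
pvecm status
# state of the cluster filesystem and corosync services
systemctl status pve-cluster corosync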

Thank you,
 
that log file does not contain anything out of the ordinary (I see one node going down at 3:15 and coming back up at 3:16, with no visible ill effects). that being said, we are investigating what seems to be a bug in corosync/kronosnet that can trigger on node reboots/restarts of the corosync service in rare circumstances.

if you can (somewhat) reliably reproduce a full-cluster fence with logs containing lots of cpg_join (on the restarting/rebooted node) / cpg_send_message (on the other nodes) retries in the log, followed by the HA watchdog expiring and nodes fencing themselves, we'd be very interested in more details about your systems and the network setup you use.
 
Hi,

In the cluster network configuration, we have assigned private IPs for Ceph and public IPs on a VLAN. The Ceph network speed is 10 Gbps, while the public network is 1 Gbps. The private network for Ceph is bonded with the balance-rr algorithm.
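For reference, a balance-rr bond of that kind is typically defined in /etc/network/interfaces along these lines (interface names and addresses below are placeholders, not our exact values):

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-rr
# 10G bonded Ceph private/cluster network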

Also, for the cpg_join and cpg_send_message messages, may I know how we can find them in the logs? I have checked both journalctl and syslog and couldn't see any such entries.

Thank you.
 
in your first logs you can see them, e.g. n2.txt:

Code:
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [status] notice: cpg_send_message retry 80
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retry 100
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retried 100 times
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] crit: cpg_send_message failed: 6
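to find them directly on a node, something along these lines works (pmxcfs logs under the pve-cluster unit; adjust the time window):

Code:
# search the journal for the retry messages
journalctl -u pve-cluster --since "2021-10-14 05:30" | grep -E 'cpg_(join|send_message)'
# the same lines also end up in the plain syslog files
grep -E 'cpg_(join|send_message)' /var/log/syslog /var/log/syslog.1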
 
