PVE upgrade 6 to 7 issue

Jan 18, 2020
183
1
23
27
Hi,

We recently upgraded our Proxmox cluster and suffered an outage. We have a 4-node cluster. After completing the upgrade on the third node, we rebooted node 3, but instead the whole cluster went down: all nodes rebooted. The downtime lasted around 15 minutes, after which the nodes came back up and were fine. Has anyone seen such a scenario? What could be the reason for this issue?


What we did:

1) Upgraded Ceph Nautilus to Pacific
2) Upgraded PVE 6 to 7
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
7,873
1,508
164
could you post the journal of the third node and one of the other nodes that got rebooted, starting before and ending after the reboots (a few minutes in both directions should be enough).
 
Jan 18, 2020
We performed an upgrade on our proxmox cluster again and we faced the same issue mentioned above. Please see the attached log below.

https://easyupload.io/lxzmx6

Below is the error we get in the web interface.

Connection failure. Network error or Proxmox VE services not running?
[12:51 AM] cluster not ready - no quorum? (500)

Thank you,
 

fabian

that log file does not contain anything out of the ordinary (I see one node going down at 3:15 and coming back up at 3:16, with no visible ill effects). that being said, we are investigating what seems to be a bug in corosync/kronosnet that can trigger on node reboots/restarts of the corosync service in rare circumstances.

if you can (somewhat) reliably reproduce a full-cluster fence with logs containing lots of cpg_join (on the restarting/rebooted node) / cpg_send_message (on the other nodes) retries in the log, followed by the HA watchdog expiring and nodes fencing themselves, we'd be very interested in more details about your systems and the network setup you use.
 
Jan 18, 2020
Hi,

In the cluster network configuration, we have assigned private IPs for Ceph and public IPs for the VLAN. The Ceph network runs at 10 Gbps and the public network at 1 Gbps. The private Ceph network is bonded using the balance-rr algorithm.

Also, how can we find the cpg_join and cpg_send_message entries in the logs? I checked both journalctl and syslog and couldn't find any such entries.

Thank you.
 

fabian

in your first logs you can see them, e.g. n2.txt:

Code:
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [status] notice: cpg_send_message retry 80
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retry 100
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retried 100 times
Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] crit: cpg_send_message failed: 6
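
If it helps to search a saved journal for these entries, here is a rough Python sketch that filters and counts them. It is only an illustration: the regular expression is modelled on the pmxcfs lines quoted above, the assumption that cpg_join lines share the same layout is mine, and the function name `summarize_cpg` is made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the pmxcfs lines quoted above. That cpg_join lines
# use the same layout is an assumption, not something confirmed in this thread.
CPG_LINE = re.compile(
    r"(?P<host>\S+) pmxcfs\[\d+\]: \[(?P<group>\w+)\] "
    r"(?P<level>\w+): (?P<msg>cpg_(?:join|send_message) .*)$"
)


def summarize_cpg(lines):
    """Count cpg_* messages per (host, group) and collect the 'crit' lines."""
    counts = Counter()
    failures = []
    for line in lines:
        match = CPG_LINE.search(line.rstrip("\n"))
        if match is None:
            continue  # not a pmxcfs cpg_* line
        counts[(match["host"], match["group"])] += 1
        if match["level"] == "crit":
            failures.append(line.strip())
    return counts, failures


# The four lines quoted above, used here as sample input.
sample = [
    "Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [status] notice: cpg_send_message retry 80",
    "Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retry 100",
    "Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] notice: cpg_send_message retried 100 times",
    "Oct 14 06:02:41 inpx-sg1-n2 pmxcfs[1772]: [dcdb] crit: cpg_send_message failed: 6",
]

counts, failures = summarize_cpg(sample)
for (host, group), n in sorted(counts.items()):
    print(f"{host} [{group}]: {n} cpg message(s)")
print(f"{len(failures)} critical failure(s)")
```

To run it against a real log, pass the lines of a text file containing the saved journal output of each node instead of `sample`.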
 
