[SOLVED] Cluster Fails after one Day - PVE 6.0.4

It didn't help. But: the corosync.conf is changed back after a restart... I don't know why the token config disappears.
 

How did you change it? If your cluster is currently quorate, putting the changed corosync.conf in /etc/pve/corosync.conf should sync it to all nodes. Make sure to bump the "config_version" value in addition to any configuration changes you make. We don't touch this value at all, so the only way I can see that happening is if you only edited the local version and it got re-synced from pmxcfs.
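
For illustration, a rough sketch of what the totem section in /etc/pve/corosync.conf could look like after such a change (the cluster name, token value and config_version below are placeholders, not values from this thread):
Code:
totem {
  cluster_name: yourcluster   # placeholder
  config_version: 4           # bump this on every edit, otherwise the change is not applied
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000                # example of a raised token timeout in milliseconds
}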
 
I can't edit the corosync.conf in the pve folder because of missing rights. I edited the one in /etc/corosync/corosync.conf instead. Was that wrong?
I also forgot to increase the version number; that's a good point.
 

If your cluster is already not quorate, you need to fix that first, or very carefully inject an updated corosync.conf into pmxcfs in forced-local mode (disable HA and replication first if enabled, and don't do any other operations for the duration of the procedure!):

Step by step, in parallel on all nodes:
Code:
# stop the cluster filesystem and corosync on every node
systemctl stop corosync pve-cluster
# start pmxcfs in forced-local mode so /etc/pve is writable without quorum
pmxcfs -l
# put the corrected config into pmxcfs and into the local corosync directory
cp working-corosync.conf /etc/pve/corosync.conf
cp working-corosync.conf /etc/corosync/corosync.conf
# stop the local-mode pmxcfs and bring the services back up normally
killall pmxcfs
systemctl start pve-cluster corosync
 
@Zumpel: could you try installing the following test build on all nodes, and restarting corosync afterwards?

http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb
Code:
$ sha256sum libknet1_1.10-pve2~test1_amd64.deb
64521083486b6b2683826cc95f6d869ab3fde01e9b6d1ae90fae49fcac305478  libknet1_1.10-pve2~test1_amd64.deb
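
A minimal sketch of the install steps, assuming the .deb is downloaded into the current directory (adjust paths as needed):
Code:
wget http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb
sha256sum libknet1_1.10-pve2~test1_amd64.deb   # compare against the hash above
dpkg -i libknet1_1.10-pve2~test1_amd64.deb
systemctl restart corosync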

Please provide logs again afterwards; we are trying to triage a PMTUd flapping issue.
 
There are now also test packages for another PMTUd-related issue available via the pvetest repository:
Code:
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet-dev_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet-doc_1.10-pve2_all.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet1-dbgsym_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet1_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle-dev_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle1-dbgsym_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle1_1.10-pve2_amd64.deb

As always, restart corosync after installing the packages.

NOTE: these packages are different from the ones in the previous post (different patches were applied for different issues). Please test both separately if possible!
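
As an alternative to fetching the .deb files directly, one could enable the pvetest repository and install from there (a sketch; the repository line assumes PVE 6 on Debian buster):
Code:
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
apt install libknet1 libnozzle1
systemctl restart corosync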
 
The nodes still lose quorum. I needed to delete some lines from the logs because they were too big to upload.
I'll now restart both nodes, install the other packages and restart them again. I'll report later whether that changes anything.
 


With the second set of packages the nodes are unstable. This is the third time one of the nodes isn't pingable anymore, and the only way to get it back is a hard reset.
 

Just to make sure - these are the -2 packages from pvetest? Are you testing with your modified corosync.conf with the bumped token timeout, or with the default values? Is it always the same node? Can you enable persistent journaling ("mkdir /var/log/journal; systemctl restart systemd-journald") so that you can retrieve logs after the loss of connectivity and hard reset?
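
For convenience, the journaling setup as commands, plus one possible way to pull the previous boot's logs after the next hard reset (the journalctl filter is only a suggestion):
Code:
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next incident and reboot, for example:
journalctl -b -1 -u corosync -u pve-cluster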
 
Hello,

we still have problems with corosync:

Code:
192.168.131.20 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.21 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.22 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.23 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.27 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.28 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.29 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
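
For reference, one way such an overview could be collected (a sketch; it assumes root SSH access to the node IPs listed above):
Code:
for host in 192.168.131.20 192.168.131.21 192.168.131.22 192.168.131.23 192.168.131.27 192.168.131.28 192.168.131.29; do
    echo -n "$host "; ssh root@$host "dpkg -l libknet1 | grep '^ii'"
done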

The PMTUd messages have disappeared since the update.

regards
 


Yes, they were the -2 packages, but there was a RAM problem in node2. This is fixed now; the problem with losing quorum is still present.
 
...and can you remove the solved tag? This problem is not solved as long as many people are facing this issue...
 
I think I have a similar / same issue here.

Two-node cluster (a new R740 and an R340) over one switch, losing quorum within a day or so. Once I restarted corosync.service on one node; today I had to restart it on both.

Ask me anything.
 

Same again today; I restarted both corosync services to connect the nodes again.
What should I post, configuration-wise or regarding the status of these "events"?
 

I got to the office and found the same situation again. I can report that whenever the daily backup by PVE gets executed (with about 12 VMs) at 00:00, one of the replication jobs gets a timeout at exactly that time. I don't know if that correlates with the quorum loss. Further replications and the backup itself continue to work afterwards.
 
Are you up to date on this system? We pushed out a kronosnet (libknet) and kernel update a few days ago, currently still only available through the no-subscription repository as it's quite recent. It would be great if some people experiencing those issues could test those packages.
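
A quick sketch of how the installed versions could be checked and those updates pulled in (assuming the no-subscription repository is already configured):
Code:
pveversion -v | grep -E 'libknet|kernel'
apt update && apt full-upgrade
systemctl restart corosync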
 

We updated both nodes yesterday, and this morning we have the same separation again.

Attached is the possible start of the issue on node 1 (called vmhost02; node 2 is called vmhost03), at about 02:00 I guess.

We back up to our tape system at 02:00 every day. node1 also replicates VMs to node2 every 15 minutes. The backup on node2 gets completed, but the two nodes seem to diverge.

Code:
Aug 30 02:16:32 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:16:32 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:16:33 vmhost02 systemd[1]: pvesr.service: Succeeded.
Aug 30 02:16:33 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] link: host: 2 link: 0 is down
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 has no active links
Aug 30 02:16:57 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 36 ms
Aug 30 02:16:57 vmhost02 corosync[3959]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 30 02:16:57 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:17:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:17:01 vmhost02 systemd[1]: pvesr.service: Succeeded.
Aug 30 02:17:01 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:17:01 vmhost02 CRON[15092]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 30 02:17:01 vmhost02 CRON[15093]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 30 02:17:01 vmhost02 CRON[15092]: pam_unix(cron:session): session closed for user root
Aug 30 02:17:44 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 750 ms
Aug 30 02:17:44 vmhost02 corosync[3959]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 02:17:45 vmhost02 corosync[3959]:   [TOTEM ] A new membership (1:389760) was formed. Members
Aug 30 02:17:45 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 0 received
Aug 30 02:17:45 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 0 received
Aug 30 02:17:45 vmhost02 corosync[3959]:   [QUORUM] Members[2]: 1 2
Aug 30 02:17:45 vmhost02 corosync[3959]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 30 02:17:48 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 750 ms
Aug 30 02:17:49 vmhost02 corosync[3959]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 02:17:50 vmhost02 corosync[3959]:   [TOTEM ] A new membership (1:389764) was formed. Members left: 2
Aug 30 02:17:50 vmhost02 corosync[3959]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 02:17:50 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 1 received
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [dcdb] notice: members: 1/13984
Aug 30 02:17:50 vmhost02 corosync[3959]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 30 02:17:50 vmhost02 corosync[3959]:   [QUORUM] Members[1]: 1
Aug 30 02:17:50 vmhost02 corosync[3959]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [status] notice: node lost quorum
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [status] notice: members: 1/13984
Aug 30 02:18:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:18:01 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:02 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] link: host: 2 link: 0 is down
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 has no active links
Aug 30 02:18:03 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:03 vmhost02 corosync[3959]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 30 02:18:03 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:18:04 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:05 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:06 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:07 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:08 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:09 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:10 vmhost02 pvesr[18172]: error with cfs lock 'file-replication_cfg': no quorum!
Aug 30 02:18:10 vmhost02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Aug 30 02:18:10 vmhost02 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 30 02:18:10 vmhost02 systemd[1]: Failed to start Proxmox VE replication runner.
Aug 30 02:19:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:19:01 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:02 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:03 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:04 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:05 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:06 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:07 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:08 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:09 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:10 vmhost02 pvesr[25696]: error with cfs lock 'file-replication_cfg': no quorum!
 
