[SOLVED] Cluster Fails after one Day - PVE 6.0.4

It didn't help. But: the corosync.conf is changed back after a restart... I don't know why the token config disappears.
 

How did you change it? If your cluster is currently quorate, putting the changed corosync.conf in /etc/pve/corosync.conf should sync it to all nodes. Make sure to bump the "config_version" value in addition to any configuration changes you make. We don't touch this value at all, so the only way I can see that happening is if you only edited the local version and it got re-synced from pmxcfs.
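
For illustration, a rough sketch of what the totem section in /etc/pve/corosync.conf could look like after such a change (the cluster name, token value and config_version below are placeholders, not values from this thread):
Code:
totem {
  cluster_name: yourcluster   # placeholder
  config_version: 4           # bump this on every edit, otherwise the change is not applied
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000                # example of a raised token timeout in milliseconds
}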
 
I can't edit the corosync.conf in the pve folder because of missing rights. I edited the one in /etc/corosync/corosync.conf instead. Was that wrong?
I also forgot to increase the version number; that's a good point.
 

If your cluster is already not quorate, you need to fix that first, or very carefully inject an updated corosync.conf into pmxcfs in forced-local mode (disable HA and replication first if enabled, and don't do any other operations for the duration of the procedure!):

Step by step, in parallel on all nodes:
Code:
# stop the cluster filesystem and corosync on every node
systemctl stop corosync pve-cluster
# start pmxcfs in forced-local mode so /etc/pve is writable without quorum
pmxcfs -l
# put the corrected config into pmxcfs and into the local corosync directory
cp working-corosync.conf /etc/pve/corosync.conf
cp working-corosync.conf /etc/corosync/corosync.conf
# stop the local-mode pmxcfs and bring the services back up normally
killall pmxcfs
systemctl start pve-cluster corosync
 
@Zumpel: could you try installing the following test build on all nodes, and restarting corosync afterwards?

http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb
Code:
$ sha256sum libknet1_1.10-pve2~test1_amd64.deb
64521083486b6b2683826cc95f6d869ab3fde01e9b6d1ae90fae49fcac305478  libknet1_1.10-pve2~test1_amd64.deb
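
A minimal sketch of the install steps, assuming the .deb is downloaded into the current directory (adjust paths as needed):
Code:
wget http://download.proxmox.com/temp/libknet1_1.10-pve2~test1_amd64.deb
sha256sum libknet1_1.10-pve2~test1_amd64.deb   # compare against the hash above
dpkg -i libknet1_1.10-pve2~test1_amd64.deb
systemctl restart corosync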

Please provide logs again afterwards; we are trying to triage a PMTUd flapping issue.
 
There are now also test packages for another PMTUd-related issue available via the pvetest repository:
Code:
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet-dev_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet-doc_1.10-pve2_all.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet1-dbgsym_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libknet1_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle-dev_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle1-dbgsym_1.10-pve2_amd64.deb
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/libnozzle1_1.10-pve2_amd64.deb

As always, restart corosync after installing the packages.

NOTE: these packages are different from the ones in the previous post (different patches were applied for different issues). Please test both separately if possible!
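
As an alternative to fetching the .deb files directly, one could enable the pvetest repository and install from there (a sketch; the repository line assumes PVE 6 on Debian buster):
Code:
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
apt install libknet1 libnozzle1
systemctl restart corosync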
 
The nodes still lose quorum. I needed to delete some lines from the logs because they were too big to upload.
I'll now restart both nodes, install the other packages and restart them again. I'll report later whether that changes anything.
 


With the second set of packages the nodes are unstable. This is the third time one of the nodes isn't pingable anymore, and the only way to get it back is a hard reset.
 

Just to make sure - these are the -2 packages from pvetest? Are you testing with your modified corosync.conf with the bumped token timeout, or with the default values? Is it always the same node? Can you enable persistent journaling ("mkdir /var/log/journal; systemctl restart systemd-journald") so that you can retrieve logs after the loss of connectivity and hard reset?
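
For convenience, the journaling setup as commands, plus one possible way to pull the previous boot's logs after the next hard reset (the journalctl filter is only a suggestion):
Code:
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next incident and reboot, for example:
journalctl -b -1 -u corosync -u pve-cluster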
 
Hello,

we still have problems with corosync:

Code:
192.168.131.20 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.21 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.22 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.23 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.27 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.28 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
192.168.131.29 ii  libknet1:amd64                       1.10-pve2~test1                 amd64        kronosnet core switching implementation
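
For reference, one way such an overview could be collected (a sketch; it assumes root SSH access to the node IPs listed above):
Code:
for host in 192.168.131.20 192.168.131.21 192.168.131.22 192.168.131.23 192.168.131.27 192.168.131.28 192.168.131.29; do
    echo -n "$host "; ssh root@$host "dpkg -l libknet1 | grep '^ii'"
done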

The PMTUd messages have disappeared since the update.

regards
 


Yes, they were the -2 packages, but there was a RAM problem in node2. This is fixed now; the problem with losing quorum is still present.
 
...and can you remove the solved tag? This problem is not solved as long as many people are facing this issue...
 
I think I have a similar / same issue here.

Two-node cluster (a new R740 and an R340) over one switch, losing quorum within a day or so. Once I restarted corosync.service on one node; today I had to restart it on both.

Ask me anything.
 

Same again today; I restarted both corosync services to connect the nodes again.
What should I post, configuration-wise or regarding the status of these "events"?
 

I got to the office and found the same situation again. I can report that whenever the daily backup by PVE gets executed (with about 12 VMs) at 00:00, one of the replication jobs gets a timeout at exactly that time. I don't know if that correlates with the quorum loss. Further replications and the backup itself continue to work afterwards.
 
Are you up to date on this system? We pushed out a kronosnet (libknet) and kernel update a few days ago, currently still only available through the no-subscription repository as it's quite recent. It would be great if some people experiencing those issues could test those packages.
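
A quick sketch of how the installed versions could be checked and those updates pulled in (assuming the no-subscription repository is already configured):
Code:
pveversion -v | grep -E 'libknet|kernel'
apt update && apt full-upgrade
systemctl restart corosync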
 

We updated both nodes yesterday, and this morning we have the same separation again.

Attached is the possible start of the issue on node 1 (called vmhost02; node 2 is called vmhost03), at about 02:00 I guess.

We back up to our tape system at 02:00 every day. node1 also replicates VMs to node2 every 15 minutes. The backup on node2 gets completed, but the two nodes seem to diverge.

Code:
Aug 30 02:16:32 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:16:32 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:16:33 vmhost02 systemd[1]: pvesr.service: Succeeded.
Aug 30 02:16:33 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] link: host: 2 link: 0 is down
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:16:56 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 has no active links
Aug 30 02:16:57 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 36 ms
Aug 30 02:16:57 vmhost02 corosync[3959]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 30 02:16:57 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:17:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:17:01 vmhost02 systemd[1]: pvesr.service: Succeeded.
Aug 30 02:17:01 vmhost02 systemd[1]: Started Proxmox VE replication runner.
Aug 30 02:17:01 vmhost02 CRON[15092]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 30 02:17:01 vmhost02 CRON[15093]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 30 02:17:01 vmhost02 CRON[15092]: pam_unix(cron:session): session closed for user root
Aug 30 02:17:44 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 750 ms
Aug 30 02:17:44 vmhost02 corosync[3959]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 02:17:45 vmhost02 corosync[3959]:   [TOTEM ] A new membership (1:389760) was formed. Members
Aug 30 02:17:45 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 0 received
Aug 30 02:17:45 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 0 received
Aug 30 02:17:45 vmhost02 corosync[3959]:   [QUORUM] Members[2]: 1 2
Aug 30 02:17:45 vmhost02 corosync[3959]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 30 02:17:48 vmhost02 corosync[3959]:   [TOTEM ] Token has not been received in 750 ms
Aug 30 02:17:49 vmhost02 corosync[3959]:   [TOTEM ] A processor failed, forming new configuration.
Aug 30 02:17:50 vmhost02 corosync[3959]:   [TOTEM ] A new membership (1:389764) was formed. Members left: 2
Aug 30 02:17:50 vmhost02 corosync[3959]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 30 02:17:50 vmhost02 corosync[3959]:   [CPG   ] downlist left_list: 1 received
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [dcdb] notice: members: 1/13984
Aug 30 02:17:50 vmhost02 corosync[3959]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 30 02:17:50 vmhost02 corosync[3959]:   [QUORUM] Members[1]: 1
Aug 30 02:17:50 vmhost02 corosync[3959]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [status] notice: node lost quorum
Aug 30 02:17:50 vmhost02 pmxcfs[13984]: [status] notice: members: 1/13984
Aug 30 02:18:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:18:01 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:02 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] link: host: 2 link: 0 is down
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:18:02 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 has no active links
Aug 30 02:18:03 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:03 vmhost02 corosync[3959]:   [KNET  ] rx: host: 2 link: 0 is up
Aug 30 02:18:03 vmhost02 corosync[3959]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 30 02:18:04 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:05 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:06 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:07 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:08 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:09 vmhost02 pvesr[18172]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:18:10 vmhost02 pvesr[18172]: error with cfs lock 'file-replication_cfg': no quorum!
Aug 30 02:18:10 vmhost02 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Aug 30 02:18:10 vmhost02 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 30 02:18:10 vmhost02 systemd[1]: Failed to start Proxmox VE replication runner.
Aug 30 02:19:00 vmhost02 systemd[1]: Starting Proxmox VE replication runner...
Aug 30 02:19:01 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:02 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:03 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:04 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:05 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:06 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:07 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:08 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:09 vmhost02 pvesr[25696]: trying to acquire cfs lock 'file-replication_cfg' ...
Aug 30 02:19:10 vmhost02 pvesr[25696]: error with cfs lock 'file-replication_cfg': no quorum!
 
