Hi Tom. Is there a specific thread that is tracking this issue? I looked at the links and there are a bunch of threads that could cover this. Which one should I be reading?
You're right, it did not really help; the whole cluster crashed and I'm now working late to recover it. I think I might have some problems with the bnx2 driver, since the Ethernet ports keep going down and up.
I've done some searching and added intremap=off to the kernel command line; after rebooting the nodes I'll see how it goes. I'm also considering adding the bnx2 module parameter disable_msi=1.
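For reference, this is roughly how I'm applying both settings on each node (standard Debian/GRUB layout assumed; adjust if your nodes boot via systemd-boot):
Code:
# /etc/default/grub: append intremap=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet intremap=off"
# then run: update-grub && reboot

# /etc/modprobe.d/bnx2.conf: disable MSI for the bnx2 driver
options bnx2 disable_msi=1
# then run: update-initramfs -u && reboot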
We had this issue with a disintegrating cluster right after upgrading to PVE 6. It also seemed to be related to things happening in other parts of the network (we had a few VLANs transported over the management links that span the whole network). But PVE 5 did not have this issue even with its pain-in-the-behind multicast.
It also appears that there were two separate problems:
- one involving a corosync crash that did not seem to affect cluster integrity because corosync was restarted automatically (I see that this should not have happened, but our logs say it was restarted automatically); it did, however, randomly disable HA on VMs (the workaround was to manually remove them from HA and add them back, see the sketch after this list)
- another related to corosync failing to maintain cluster integrity. In this case corosync did not crash as far as I know, but the cluster somehow got fragmented and nodes rebooted. The workaround was stopping corosync on all nodes and then starting it again (also sketched below), but good luck with that if you have HA set up.
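Roughly what we ran for the two workarounds (the VM ID is just an example):
Code:
# Workaround 1: kick a VM out of HA and add it back
ha-manager remove vm:100
ha-manager add vm:100

# Workaround 2: stop corosync on every node first, then start it again everywhere
systemctl stop corosync     # run on each node
systemctl start corosync    # run on each node once all have stopped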
We moved the corosync traffic to a separate network (split the 4-link management LACP into 2x2) and have had no issues since.
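For what it's worth, the split looks roughly like this in /etc/network/interfaces (interface names and the address are examples; corosync then uses the address on bond1):
Code:
# bond0: two links for management/VM traffic (was a 4-link LACP bond)
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100

# bond1: two links dedicated to corosync
auto bond1
iface bond1 inet static
    address 10.10.10.11/24
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    bond-miimon 100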
(I know - I should have a separate cluster fabric and 10Gb when I'm also playing with Ceph, but it used to be rock solid with corosync 2.)
Next step would be to rule out any issues with switches and cables.
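Roughly what I plan to check first on each node before blaming the switches (the interface name is just an example):
Code:
dmesg -T | grep -i bnx2                      # driver resets / link flaps
ethtool eno1                                 # negotiated speed and link state
ethtool -S eno1 | grep -iE 'err|drop|crc'    # NIC error counters
ip -s link show eno1                         # RX/TX errors and drops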
I have no Ethernet problems in my logs, so I don't think that is the main problem, but you should fix that anyway.
I also mix corosync with other traffic. Trying to break up the bonds and set up a new network gave me other problems, but that is for another thread in the future.
Today I installed the new pve-cluster 6.0-7; I hope this fixes some problems.
EDIT:
Got these right now:
Code:
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] link: host: 6 link: 0 is down
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Sep 4 10:23:29 server1 corosync[2290116]: [KNET ] host: host: 6 has no active links
Sep 4 10:23:31 server1 corosync[2290116]: [KNET ] rx: host: 6 link: 0 is up
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] link: host: 8 link: 0 is down
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Sep 4 10:25:03 server1 corosync[2290116]: [KNET ] host: host: 8 has no active links
Sep 4 10:25:05 server1 corosync[2290116]: [KNET ] rx: host: 8 link: 0 is up
Sep 4 10:25:05 server1 corosync[2290116]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Both host 6 and host 8 are Ceph nodes, but luckily Ceph has not crashed at all to my knowledge, and I have not lost any data (except for the servers going down when the cluster fails, at least once or twice a week). Minor cluster problems every day.
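When the links flap like this I now check corosync's own view of them from a couple of nodes, e.g.:
Code:
corosync-cfgtool -s    # knet link status per host
pvecm status           # quorum / membership as PVE sees it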
Hehe, right. Guess I was a bit quick with the zpool upgrade command. Just a test machine on 4.15 so far, with nothing in the pool, but yes, the machine came up and the zpool import failed as expected.
Code:
root@osl108pve:~# /sbin/zpool import -aN -d /dev/disk/by-id -o cachefile=none
This pool uses the following feature(s) not supported by this system:
org.zfsonlinux:project_quota (space/object accounting based on project ID.)
com.delphix:spacemap_v2 (Space maps representing large segments are more efficient.)
All unsupported features are only required for writing to the pool.
The pool can be imported using '-o readonly=on'.
cannot import 'zfs-mirror-disk0': unsupported version or feature
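If I needed the data before booting a kernel with the newer ZFS features, a read-only import as the message suggests should work, something like:
Code:
/sbin/zpool import -aN -d /dev/disk/by-id -o cachefile=none -o readonly=on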
I have a problem that seems related. I updated libknet on all my hosts (a mix of 6.0 and 5.4 with upgraded corosync), but I still have problems with my cluster. The cluster is lost each night. I found that if I disable the scheduled vzdump of my VMs, the cluster survives.
I have a single gigabit LAN, so it can be somewhat loaded at backup time, but not more than before.
Also, I haven't checked since I upgraded libknet, but before that I found corosync was consuming a good amount of RAM when the cluster was broken. Everything goes back to normal once I restart corosync on all hosts.
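In the meantime I'm thinking of capping the backup bandwidth so corosync isn't starved on the shared gigabit link; if I read the docs right that would be a bwlimit entry in /etc/vzdump.conf (value in KiB/s, number picked arbitrarily):
Code:
# /etc/vzdump.conf
# ~50 MiB/s, leaves headroom on the 1 Gbit link
bwlimit: 51200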
I know we are waiting to upgrade to PVE 6 until we see a resolution on this, as I haven't seen anyone state that a certain combination of hardware or configuration has been determined to cause this.
My understanding of this thread is that there are/were 3 issues:
- Corosync issues resolved with the updated libknet
- Corosync crashing (segmentation fault)
- Intermittent pauses, perhaps specific to bnx2
We manage 7 PVE 6 clusters, which are now all stable except for one that uses bnx2 network interfaces. Nothing is logged indicating an issue with the bnx2 network cards, unlike what someone else here was experiencing.
Our problematic cluster was perfect on PVE 4 and 5 and started having problems the moment we upgraded to 6.