Hi,
I have two separate PVE clusters: one hosts my Ceph storage, while the other hosts only the guests. The PVE nodes have two 1 GbE and two 10 GbE interfaces, with the 10 GbE ones configured as an LACP bond. I had all the communication running over different VLANs on that bond, and this led to performance/stability issues with corosync whenever the backup took up too much of the bond's bandwidth. The result was that some nodes got fenced while the backup was running.
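For reference, the layout in /etc/network/interfaces looks roughly like this (interface names, VLAN IDs and addresses are just placeholders):

    auto bond0
    iface bond0 inet manual
            bond-slaves eno3 eno4
            bond-miimon 100
            bond-mode 802.3ad
            bond-xmit-hash-policy layer3+4

    # storage (Ceph/RBD) VLAN
    auto bond0.20
    iface bond0.20 inet static
            address 192.168.20.11
            netmask 255.255.255.0

    # corosync ring0 on its own VLAN, but on the same physical bond
    # as the storage and backup traffic
    auto bond0.30
    iface bond0.30 inet static
            address 192.168.30.11
            netmask 255.255.255.0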
However, what's interesting is that these fences always occurred at around the same time, which surprised me: I would have expected them to happen at varying times, depending on how saturated the LACP link was, but no, it's almost always around 02:09 in the morning.
I then added a second ring to the corosync config to provide some redundancy. The second ring runs over a 1 GbE active/backup bond, since the switches in the blade chassis don't support vPC towards our Cisco Nexus switches. Since then the fencing has stopped, but when the last backup ran, two of my PVE nodes still rebooted at around 02:10 in the morning, this time without being actively fenced by corosync. There is simply nothing in the debug log or syslog.
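The second ring looks roughly like this (shown here in corosync 3 / knet syntax; node names and addresses are placeholders):

    totem {
      ...
      interface {
        linknumber: 0
        knet_link_priority: 20   # 10 GbE LACP bond, preferred link
      }
      interface {
        linknumber: 1
        knet_link_priority: 10   # 1 GbE active/backup bond, fallback
      }
    }

    nodelist {
      node {
        name: pve01
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 192.168.30.11
        ring1_addr: 192.168.40.11
      }
      ...
    }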
Any idea how this can be diagnosed? Since these PVE nodes are only Ceph clients (RBD), Ceph itself seems to be out of the picture, but if neither Ceph nor corosync causes the reboots, what does?
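The only things I can think of trying so far are along these lines (standard Debian/PVE tooling):

    # keep the journal across reboots so the last messages before the reset aren't lost
    mkdir -p /var/log/journal
    systemctl restart systemd-journald

    # after the next occurrence, look at the tail of the previous boot
    journalctl --list-boots
    journalctl -b -1 -e

    # capture a crash dump in case the kernel actually panics
    apt install kdump-tools

Is that the right direction, or is there something PVE-specific I should be looking at?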
Thanks,
budy