[SOLVED] Can a network loop reboot all cluster nodes?

mailinglists

Renowned Member
Mar 14, 2012
Hi,

A coworker created a network loop for about 60 seconds, and it turned out that a PVE cluster which uses its WAN side for clustering rebooted itself - all nodes.

It made me wonder: can a network loop cause PVE to reboot itself?
Will it reboot even if no HA resources are defined on the cluster?
Under what circumstances would PVE reboot itself?
Surely a loss of quorum should not trigger a reboot when there are no HA resources defined, right?

Here are some relevant logs I got from the admins of this cluster.
Code:
Time42 serverXYZ kernel: [89005319.786954] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005319.837135] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005319.887327] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005319.937518] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005319.987698] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005320.037872] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time42 serverXYZ kernel: [89005320.088052] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time43 serverXYZ kernel: [89005320.138132] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time43 serverXYZ kernel: [89005320.188331] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time43 serverXYZ kernel: [89005320.238454] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time43 serverXYZ corosync[32844]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time43 serverXYZ corosync[32844]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time44 serverXYZ corosync[32844]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time44 serverXYZ corosync[32844]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time46 serverXYZ corosync[32844]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time46 serverXYZ corosync[32844]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time47 serverXYZ kernel: [89005324.838662] net_ratelimit: 91 callbacks suppressed
Time47 serverXYZ kernel: [89005324.838699] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time47 serverXYZ kernel: [89005324.888853] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time47 serverXYZ kernel: [89005324.939026] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time47 serverXYZ kernel: [89005324.981935] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time47 serverXYZ kernel: [89005325.032129] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time47 serverXYZ corosync[32844]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time47 serverXYZ corosync[32844]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Time47 serverXYZ kernel: [89005325.082298] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time48 serverXYZ kernel: [89005325.132463] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time48 serverXYZ kernel: [89005325.182608] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time48 serverXYZ kernel: [89005325.232796] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time48 serverXYZ kernel: [89005325.282962] vmbr0: received packet on eno1 with own address as source address (addr:MA:C:AD:DD:ES, vlan:0)
Time32 serverXYZ systemd-modules-load[634]: Inserted module 'iscsi_tcp'
Time32 serverXYZ kernel: [ 0.000000] Linux version 4.15.18-30-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-58 (Fri, 12 Jun 2020 13:53:01 +0200) ()
Time32 serverXYZ systemd-modules-load[634]: Inserted module 'ib_iser'
Time32 serverXYZ kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-30-pve root=/dev/mapper/pve-root ro quiet
 
Hi,

A coworker created a network loop for about 60 seconds, and it turned out that a PVE cluster which uses its WAN side for clustering rebooted itself - all nodes.

It made me wonder: can a network loop cause PVE to reboot itself?
On its own it shouldn't, but of course it CAN cause a loss of quorum, which CAN trigger a reboot.
Will it reboot even if no HA resources are defined on the cluster?
If HA was never active, a node should not get fenced even when quorum is lost.
Under what circumstances would PVE reboot itself?
Surely a loss of quorum should not trigger a reboot when there are no HA resources defined, right?

If HA was ever active on a node, a watchdog will be running. If that watchdog expires, the node will fence itself (and reboot, if it still manages to do that). The usual reason for the watchdog expiring is that pmxcfs is not writable, usually because of a loss of quorum.
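
To see whether that mechanism is even in play on a given node, you can check quorum and the HA services directly. A rough sketch with standard PVE/systemd commands (output omitted; as far as I know watchdog-mux is the PVE service that feeds the actual watchdog device):
Code:
# is the node part of a quorate cluster right now?
pvecm status                     # look for "Quorate: Yes" in the quorum section
# is the HA stack armed on this node?
ha-manager status                # "lrm <node> (active, ...)" means the watchdog is armed
systemctl status watchdog-mux    # service feeding the actual watchdog device
systemctl status pve-ha-lrm pve-ha-crm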

Here are some relevant logs I got from the admins of this cluster.
(log excerpt quoted in full above)
That just documents that quorum was lost, followed by a fresh start - we'd need more information to verify whether it was HA that triggered a fence or some other error condition that caused a reboot/crash.
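
If you want to dig into that, the journal from the boot before the reboot is where I would start. A rough sketch, assuming persistent journald logging is enabled (otherwise the pre-reboot journal is simply gone):
Code:
# logs from the previous boot (-b -1), limited to the relevant services
journalctl -b -1 -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux
# kernel messages from the previous boot (a crash or panic trace would show up here)
journalctl -b -1 -k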
 
Thank you for all your answers.

So if HA was ever defined, even if the HA resources have since been deleted, there is a watchdog running which will reboot the node on loss of quorum. Do I understand that correctly?

If that is so, how can I check whether such a watchdog is running, and how can I disable it (after deleting all HA resources)?
 
Thank you for all your answers.

So if HA was ever defined, even if the HA resources have since been deleted, there is a watchdog running which will reboot the node on loss of quorum. Do I understand that correctly?
yes
If that is so, how can I check whether such a watchdog is running, and how can I disable it (after deleting all HA resources)?
Check ha-manager status on each node. If (and only if) no HA resources are configured, restarting the pve-ha-lrm service should disable the watchdog.
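
Roughly, on each node that could look like this (a sketch; double-check that the HA config really is empty before restarting):
Code:
# confirm no HA resources are configured anywhere in the cluster
ha-manager config
# check whether the local LRM is still "active" (i.e. the watchdog is armed)
ha-manager status
# with an empty HA config, restarting the LRM lets it drop back to "idle"
systemctl restart pve-ha-lrm
ha-manager status                # lrm should now report idle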
 
Thank you for your answer.

Will restarting pve-ha-lrm result in a restart of any virtual instances or of the host itself?

So, after the restart it should look like this:
Code:
root@p35:~# ha-manager status
quorum OK
root@p35:~# ha-manager config
root@p35:~#

And it should not have any lrm entries or resources of any type listed, correct?
 
Thank you for your answer.

Will restarting pve-ha-lrm result in a restart of any virtual instances or of the host itself?
No (this happens on every upgrade of the HA packages anyway!).
So, after the restart it should look like this:
Code:
root@p35:~# ha-manager status
quorum OK
root@p35:~# ha-manager config
root@p35:~#

And it should not have any lrm entries or resources of any type listed, correct?

HA active with an active resource:
Code:
$ ha-manager status
quorum OK
master nora (active, Tue Mar 23 09:02:40 2021)
lrm nora (active, Tue Mar 23 09:02:40 2021)
service ct:100023 (nora, starting)

HA (and watchdog!) active but all resources removed:
Code:
$ ha-manager status
quorum OK
master nora (active, Tue Mar 23 09:02:50 2021)
lrm nora (active, Tue Mar 23 09:02:50 2021)

HA idle after LRM restart:
Code:
$ ha-manager status
quorum OK
master nora (active, Tue Mar 23 09:05:00 2021)
lrm nora (idle, Tue Mar 23 09:05:02 2021)
 
There is no issue to be fixed. If you enable HA on a node (by having an HA-enabled resource active there), it will stay "armed" until the node is rebooted or the HA services are restarted.
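
To make that arm/disarm cycle concrete (vm:100 is just a placeholder ID here):
Code:
# adding any HA resource arms the watchdog on the node that runs the guest
ha-manager add vm:100
# removing it again does NOT disarm the node by itself
ha-manager remove vm:100
# the node stays "armed" until the LRM is restarted (or the node reboots)
systemctl restart pve-ha-lrm
ha-manager status                # lrm should report idle again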
 
