[SOLVED] cluster nodes reboot when one node fails

IIEP_IT · Aug 26, 2024

Greetings,

Not sure what is missing from our setup. We have a 3 node cluster and it has been working well.

Setting maintenance mode on a node allows us to restart it for updates or hardware changes and
the cluster hums along and pvecm indicates it remains quorate with 2 votes.

However, when one node crashed due to a RAM hardware issue, the other 2 nodes also rebooted.
I was expecting the cluster to continue as is as it would've remained quorate in spite of the other
node being dead.

Our cluster setup:
- 2 standalone 1Gbps switches
- each node has two dedicated network cards for cluster communications, so isolated from management, storage and all the other network traffic.
- corosync uses knet with these redundant links, one with priority 5 and the other with 10

Any ideas what I could be missing?

Regards,
Franz STREBEL

UdoB · Aug 26, 2024

IIEP_IT said:
Any ideas what I could be missing?

High Availability is enabled, correct? The usual suspicion is losing Quorum and Nodes fence themself by rebooting...

Probably all three nodes are up now? What is the output of pvecm status and corosync-cfgtool -n ?

esi_y · Aug 26, 2024

IIEP_IT said:
However, when one node crashed due to a RAM hardware issue, the other 2 nodes also rebooted.

Instantly? Can you provide e.g. journalctl -n 1000 -b -1 (i.e. last 1000 entries of the last boot-to-crash log, change -1 accordingly, or check journalctl --list-boots), preferably from all nodes from the same time?

IIEP_IT said:
Our cluster setup:
- 2 standalone 1Gbps switches
- each node has two dedicated network cards for cluster communications, so isolated from management, storage and all the other network traffic.
- corosync uses knet with these redundant links, one with priority 5 and the other with 10

Are the redundant links each over the separate NIC and switch? Do you have any MLAG + LACP set on those links?

IIEP_IT · Aug 26, 2024

Thanks for your messages. HA is indeed enabled on the cluster.

Code:

pvecm status

Cluster information
-------------------
Name:             paris
Config Version:   8
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug 26 15:18:15 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          2.a6
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Just ignore expected votes 4 as that was the old server that died and in the meantime we added a new server.

Here is the log from when that old server crashed.

Otherwise the redundant links are on two separate NICs, each going to a separate switch and the switches have no MLAG nor LACP. Each node has 2 x 1Gbps network ports for the cluster communications.

Code:

Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] link: host: 1 link: 0 is down
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] link: host: 1 link: 1 is down
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 10)
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 has no active links
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 10)
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 has no active links
Aug 23 18:02:46 par-pve-02 corosync[3936]:   [TOTEM ] Token has not been received in 2737 ms
Aug 23 18:02:47 par-pve-02 corosync[3936]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for c>
Aug 23 18:02:48 par-pve-02 audit[3938590]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3938590 subj=unconfined comm="ebtables-r>
Aug 23 18:02:48 par-pve-02 audit[3938590]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=63d0d8ab0e60 items=0 ppid=3959 pi>
Aug 23 18:02:48 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Sync members[2]: 2 3
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Sync left[1]: 1
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] A new membership (2.80) was formed. Members left: 1
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] Failed to receive the leave message. failed: 1
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: members: 2/3303, 3/3613
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: starting data syncronisation
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Members[2]: 2 3
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: cpg_send_message retried 1 times
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: members: 2/3303, 3/3613
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: starting data syncronisation
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: received sync request (epoch 2/3303/00000006)
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: starting data syncronisation
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: received sync request (epoch 2/3303/00000006)
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: received sync request (epoch 2/3303/00000006)
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 has no active links
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 10)
Aug 23 18:02:45 par-pve-02 corosync[3936]:   [KNET  ] host: host: 1 has no active links
Aug 23 18:02:46 par-pve-02 corosync[3936]:   [TOTEM ] Token has not been received in 2737 ms
Aug 23 18:02:47 par-pve-02 corosync[3936]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for c>
Aug 23 18:02:48 par-pve-02 audit[3938590]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3938590 subj=unconfined comm="ebtables-r>
Aug 23 18:02:48 par-pve-02 audit[3938590]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=63d0d8ab0e60 items=0 ppid=3959 pi>
Aug 23 18:02:48 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Sync members[2]: 2 3
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Sync left[1]: 1
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] A new membership (2.80) was formed. Members left: 1
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] Failed to receive the leave message. failed: 1
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: members: 2/3303, 3/3613
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: starting data syncronisation
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [QUORUM] Members[2]: 2 3
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: cpg_send_message retried 1 times
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: members: 2/3303, 3/3613
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: starting data syncronisation
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [dcdb] notice: received sync request (epoch 2/3303/00000006)
Aug 23 18:02:51 par-pve-02 pmxcfs[3303]: [status] notice: received sync request (epoch 2/3303/00000006)
Aug 23 18:02:58 par-pve-02 audit[3938819]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3938819 subj=unconfined comm="ebtables-r>
Aug 23 18:02:58 par-pve-02 audit[3938819]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=581dc3384e60 items=0 ppid=3959 pi>
Aug 23 18:02:58 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: D2E48221862: from=<root@par-pve-02.iiep.unesco.org>, size=8506, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: 61F78221740: from=<root@par-pve-02.iiep.unesco.org>, size=8886, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: A75F9221738: from=<root@par-pve-02.iiep.unesco.org>, size=8567, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: 05C632217B3: from=<root@par-pve-02.iiep.unesco.org>, size=8506, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: 8A8F92217A5: from=<root@par-pve-02.iiep.unesco.org>, size=8570, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: 336B42218A2: from=<root@par-pve-02.iiep.unesco.org>, size=8481, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/qmgr[2018]: 3A9C12217FD: from=<root@par-pve-02.iiep.unesco.org>, size=8293, nrcpt=1 (queue active)
Aug 23 18:03:02 par-pve-02 postfix/error[3938852]: D2E48221862: to=<it-unit@iiep.unesco.org>, relay=none, delay=157470, delays=157470/0.04/0/0.02, >
Aug 23 18:03:02 par-pve-02 postfix/error[3938853]: 61F78221740: to=<it-unit@iiep.unesco.org>, relay=none, delay=409472, delays=409471/0.04/0/0, dsn>
Aug 23 18:03:02 par-pve-02 postfix/error[3938852]: A75F9221738: to=<it-unit@iiep.unesco.org>, relay=none, delay=359071, delays=359071/0.04/0/0, dsn>
Aug 23 18:03:02 par-pve-02 postfix/error[3938853]: 05C632217B3: to=<it-unit@iiep.unesco.org>, relay=none, delay=258270, delays=258270/0.04/0/0, dsn>
Aug 23 18:03:02 par-pve-02 postfix/error[3938852]: 8A8F92217A5: to=<it-unit@iiep.unesco.org>, relay=none, delay=308670, delays=308670/0.04/0/0, dsn>
Aug 23 18:03:02 par-pve-02 postfix/error[3938853]: 336B42218A2: to=<it-unit@iiep.unesco.org>, relay=none, delay=56671, delays=56671/0.04/0/0, dsn=4>
Aug 23 18:03:02 par-pve-02 postfix/error[3938852]: 3A9C12217FD: to=<it-unit@iiep.unesco.org>, relay=none, delay=207871, delays=207871/0.04/0/0, dsn>
Aug 23 18:03:08 par-pve-02 audit[3939057]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3939057 subj=unconfined comm="ebtables-r>
Aug 23 18:03:08 par-pve-02 audit[3939057]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=613aed439e60 items=0 ppid=3959 pi>
Aug 23 18:03:08 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:03:12 par-pve-02 sshd[3939084]: Connection closed by 172.25.255.233 port 40134 [preauth]
Aug 23 18:03:18 par-pve-02 audit[3939299]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3939299 subj=unconfined comm="ebtables-r>
Aug 23 18:03:18 par-pve-02 audit[3939299]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=55d497fbae60 items=0 ppid=3959 pi>
Aug 23 18:03:18 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:03:21 par-pve-02 monit[1744]: Cannot connect to [mmonit.iiep.unesco.org]:443 -- Connection timed out
Aug 23 18:03:21 par-pve-02 monit[1744]: M/Monit: cannot open a connection to https://[mmonit.iiep.unesco.org]:443/collector
Aug 23 18:03:24 par-pve-02 corosync[3936]:   [KNET  ] link: host: 3 link: 1 is down
Aug 23 18:03:24 par-pve-02 corosync[3936]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 5)
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] rx: host: 3 link: 1 is up
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 10)
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 23 18:03:28 par-pve-02 audit[3939523]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3939523 subj=unconfined comm="ebtables-r>
Aug 23 18:03:28 par-pve-02 audit[3939523]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=6397bdd48e60 items=0 ppid=3959 pi>
Aug 23 18:03:28 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"
Aug 23 18:03:35 par-pve-02 watchdog-mux[1501]: client watchdog expired - disable watchdog updates
Aug 23 18:03:38 par-pve-02 audit[3939755]: NETFILTER_CFG table=filter family=7 entries=0 op=xt_replace pid=3939755 subj=unconfined comm="ebtables-r>
Aug 23 18:03:38 par-pve-02 audit[3939755]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=61a0cafbde60 items=0 ppid=3959 pi>
Aug 23 18:03:38 par-pve-02 audit: PROCTITLE proctitle="ebtables-restore"

esi_y · Aug 26, 2024

IIEP_IT said:
Code:

Expected votes: 4 Highest expected: 4 Total votes: 3 Quorum: 3 Flags: Quorate

Just ignore expected votes 4 as that was the old server that died and in the meantime we added a new server.

But that's likely your issue, you need 3 (out of expected 4) to have quorum at all times. So if you actually only have 3 at most...

IIEP_IT said:
Not sure what is missing from our setup. We have a 3 node cluster and it has been working well.

Any single one down causes every other to lose quorum, exactly as you observed ...

Whenever you see, at the end of a rebooted node ..

IIEP_IT said:

This was the enforced reboot due to HA is on.

You need to fix your /etc/pve/corosync.conf to only contain nodes that you actually have.

esi_y · Aug 26, 2024

BTW ... if you see that a node is in "non-primary component" it basically thinks it is in a minority of nodes that cannot have quorum. So a node that can see only one other node knowing / thinking it is in the total of 4 ... is completely correctly assuming it does not have quorum with that one other node and shall act accordingly.

IIEP_IT · Aug 26, 2024

When the crash happened last Friday, 23 August, we only had three nodes. pvecm is showing 4 right now because I just finished adding a new server to the cluster. I am looking how to delete the dead node to re-establish only 3 member nodes.

UdoB · Aug 26, 2024

IIEP_IT said:
I am looking how to delete the dead node

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node

IIEP_IT · Aug 26, 2024

UdoB said:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node

Thanks, worked well and even cleaned up the GUI. I continue to look for the problem with the quorum now, when a node crashes.

esi_y · Aug 26, 2024

IIEP_IT said:
When the crash happened last Friday, 23 August, we only had three nodes. pvecm is showing 4 right now because I just finished adding a new server to the cluster. I am looking how to delete the dead node to re-establish only 3 member nodes.

You can use --since and --until switches on the journalctl or export full boot log (you only included trimmed <100 lines), something like journalctl --since YYYY-MM-DD --until YYYY-MM-DD > nodeX.log and attach it here but I would need to see it from ALL the nodes from the same period, so three logs.

That could help having a second look at the issue.

IIEP_IT · Aug 26, 2024

esi_y said:
You can use --since and --until switches on the journalctl or export full boot log (you only included trimmed <100 lines), something like journalctl --since YYYY-MM-DD --until YYYY-MM-DD > nodeX.log and attach it here but I would need to see it from ALL the nodes from the same period, so three logs.

That could help having a second look at the issue.

Thank you. Once we receive the replacement RAM modules, I will be able to bring up the crashed node. It will be out of the cluster but I should be able to recover the logs and we can look.

Thanks again to everyone who chimed in. Will post here when I get the info we need.

IIEP_IT · Aug 26, 2024

OK, here are at least the logs from the 2 nodes. Node 01 crashed due to the memory error and I see that the other nodes see that around 18:02:44-45. Then it seems to be continuing fine and then I see on node 2:

Code:

watchdog-mux[1501]: client watchdog expired - disable watchdog updates

I assume the watchdog expiration caused node 2 to fence itself.

And after that happened, node 3 was all alone and had to fence itself as clearly seen in its logs:

Code:

Aug 23 18:03:52 par-pve-03 pmxcfs[3613]: [status] notice: node lost quorum

esi_y · Aug 27, 2024

IIEP_IT said:
OK, here are at least the logs from the 2 nodes. Node 01 crashed due to the memory error and I see that the other nodes see that around 18:02:44-45. Then it seems to be continuing fine and then I see on node 2:

Code:

watchdog-mux[1501]: client watchdog expired - disable watchdog updates

I assume the watchdog expiration caused node 2 to fence itself.

And after that happened, node 3 was all alone and had to fence itself as clearly seen in its logs:

Code:

Aug 23 18:03:52 par-pve-03 pmxcfs[3613]: [status] notice: node lost quorum

I kind of hoped to see much more of the corosync logs (there's no question it then goes on to emergency reboot upon lost quorum), not just the few minutes. I understand it might be massive for the entire bootup-till-crash period, but you can e.g. include at least (where $BOOTID is identifying the boot in which it crashed):

Code:

journalctl -b $BOOTID -u pveproxy -u pvedaemon -u pmxcfs -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux > n2.bootnumber.log

If you do limit it with --since --until further yet, but please include at least full 24 hour period prior. In fact I prefer to use either full boot log, or multi-day since/until to see what changed after a particular reboot.

BTW The part of n2 where:

Code:

Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] A new membership (2.80) was formed. Members left: 1
Aug 23 18:02:51 par-pve-02 corosync[3936]:   [TOTEM ] Failed to receive the leave message. failed: 1

... is understandable based on the background you provided, what is strange is ...

Code:

Aug 23 18:03:24 par-pve-02 corosync[3936]:   [KNET  ] link: host: 3 link: 1 is down
Aug 23 18:03:24 par-pve-02 corosync[3936]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 5)
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] rx: host: 3 link: 1 is up
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 10)
Aug 23 18:03:26 par-pve-02 corosync[3936]:   [KNET  ] pmtud: Global data MTU changed to: 1397

IIEP_IT · Aug 27, 2024

Thanks for your reply and suggestion. I've now attached multiday logs for those services for nodes 2 and 3.

Thanks for your help.

esi_y · Aug 27, 2024

Interesting, can you add also outputs (from any single one of the good nodes now):

Code:

corosync-cfgtool -sb
corosync-cmapctl

IIEP_IT · Aug 27, 2024

Here they are. Thanks.

esi_y · Aug 27, 2024

IIEP_IT said:
Otherwise the redundant links are on two separate NICs, each going to a separate switch and the switches have no MLAG nor LACP. Each node has 2 x 1Gbps network ports for the cluster communications.

And what is this?

Code:

Local node ID 2, transport knet
LINK ID 0 udp
    addr    = 172.25.251.33
    status    = n33
LINK ID 1 udp
    addr    = 172.25.251.233
    status    = n33

So you have same subnet on each of the two NICs?

IIEP_IT · Aug 27, 2024

esi_y said:
And what is this?

Code:

Local node ID 2, transport knet LINK ID 0 udp addr = 172.25.251.33 status = n33 LINK ID 1 udp addr = 172.25.251.233 status = n33

So you have same subnet on each of the two NICs?

The network interfaces are using a mask of /25 for those. They are two subnets, passing through two separate switches.

Should I be putting the mask as well in the corosync.conf file?

esi_y · Aug 27, 2024

IIEP_IT said:
The network interfaces are using a mask of /25 for those. They are two subnets, passing through two separate switches.

~~172.25.251.{33,233}/25 it's one network.~~

IIEP_IT said:
Should I be putting the mask as well in the corosync.conf file?

No no, I am literally thinking now what consequences this has in terms of corosync and individual NIC failures. The standard setup would be to have different networks for each. E.g. 10.20.30.0/24 for one and 10.40.50.0/24 for another. What does your routing table even looks like on such node?

IIEP_IT · Aug 27, 2024

esi_y said:
172.25.251.{33,233}/25 it's one network.

No no, I am literally thinking now what consequences this has in terms of corosync and individual NIC failures. The standard setup would be to have different networks for each. E.g. 10.20.30.0/24 for one and 10.40.50.0/24 for another. What does your routing table even looks like on such node?

It is not one network, the /25 mask splits that into two subnets:

172.25.251.33/25 means the network is 172.25.251.0/25 and the hosts go from 251.1-126 and broadcast is 251.127

172.25.251.233/25 means the network is 172.25.251.128/25 and the hosts go from 251.129-254 and broadcast is 251.255

And of course the routing table is fine as you can see on the 4th and 5th lines of the output below:

Code:

~ # ip route
default via 172.25.252.1 dev vmbr0 proto kernel onlink 
172.25.112.0/22 dev bond0.112 proto kernel scope link src 172.25.112.233 
172.25.128.0/22 dev bond0.128 proto kernel scope link src 172.25.128.233 
172.25.251.0/25 dev eno3 proto kernel scope link src 172.25.251.33 
172.25.251.128/25 dev enx9cebe84b45c6 proto kernel scope link src 172.25.251.233 
172.25.252.0/22 dev vmbr0 proto kernel scope link src 172.25.255.233

From the corosync help pages, I have never seen it put the mask in the IP. You simply tell it which IP to use for communications as as configured on the interfaces.

Not sure if anything is amiss with the corosync.conf file but here it is:

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: par-pve-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.25.251.33
    ring1_addr: 172.25.251.233
  }
  node {
    name: par-pve-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.25.251.35
    ring1_addr: 172.25.251.235
  }
  node {
    name: par-pve-04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 172.25.251.37
    ring1_addr: 172.25.251.237
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: paris
  config_version: 9
  interface {
    knet_link_priority: 5
    linknumber: 0
  }
  interface {
    knet_link_priority: 10
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

[SOLVED] cluster nodes reboot when one node fails

New Member

Distinguished Member

Renowned Member

New Member

Renowned Member

Renowned Member

New Member

Distinguished Member

New Member

Renowned Member

New Member

New Member

Attachments

Renowned Member

New Member

Attachments

Renowned Member

New Member

Attachments

Renowned Member

New Member

Renowned Member

New Member

We value your privacy