Hi everyone,
I'm experiencing an issue where NODE1 in my Proxmox VE cluster is restarting when both NODE2 and NODE3 are powered off, even after I've made several adjustments to the corosync.conf configuration. Below is a summary of the setup and the steps I've taken:
Configuration Overview:
- NODE1: Primary node with quorum_votes: 2.
- NODE2: Backup node with quorum_votes: 1, intended to take over if NODE1 fails.
- NODE3: Acts as a quorum arbiter with quorum_votes: 1.
- no_quorum_policy is set to ignore to allow NODE1 to continue operating even if other nodes are down.
- expected_votes is set to 4 in the quorum section. (The quorum check I run is shown below.)
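For reference, this is how I check the vote/quorum state on NODE1 while testing (standard pvecm and corosync tooling; a sketch of what I run, not the full output):
Code:
# Proxmox's view of cluster membership and quorum
pvecm status

# corosync's own votequorum view (expected votes, total votes, flags)
corosync-quorumtool -s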
Key Configuration Snippets:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: NODE1
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
  node {
    name: NODE2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
  node {
    name: NODE3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 4
  last_man_standing: 1
  last_man_standing_window: 10000
  no_quorum_policy: ignore
}

totem {
  cluster_name: CLUSTER_NAME
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token: 5000
  consensus: 120000
}
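In case it matters, this is roughly how I apply each change (the usual Proxmox workflow, as far as I understand it): edit the cluster-wide file, bump config_version, and reload:
Code:
# edit the cluster-wide copy (pmxcfs syncs it to /etc/corosync/corosync.conf)
nano /etc/pve/corosync.conf        # bump config_version on every edit

# force the running corosync to reload, then confirm the active version
corosync-cfgtool -R
corosync-cmapctl | grep config_version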
Logs Prior to Restart:
Code:
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] link: host: 3 link: 0 is down
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] link: host: 3 link: 1 is down
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] host: host: 3 has no active links
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 12:09:59 NODE1 corosync[PID]: [KNET ] host: host: 3 has no active links
Aug 25 12:10:00 NODE1 corosync[PID]: [TOTEM ] Token has not been received in 4237 ms
Aug 25 12:10:01 NODE1 corosync[PID]: [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 120000ms for consensus.
Aug 25 12:10:55 NODE1 watchdog-mux[PID]: client watchdog expired - disable watchdog updates
Aug 25 12:10:55 NODE1 watchdog-mux[PID]: client watchdog expired - disable watchdog updates
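These lines were pulled with journalctl; in case more context helps, this is the query I use around the failure window (times are placeholders for the actual incident window):
Code:
# corosync, the HA services, and watchdog-mux around the incident
journalctl -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
    --since "12:05" --until "12:15"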
Issue:
Despite these settings, NODE1 still restarts when both NODE2 and NODE3 are powered off. My goal is for NODE1 to remain operational even when it is the only active node, but it appears to be reset anyway, most likely by a watchdog timeout.
My Questions:
- What could be causing NODE1 to restart when NODE2 and NODE3 are powered off, despite this configuration?
- Are there any additional settings or logs I should investigate to better understand why the node is restarting?
- Could there be any KNET or TOTEM issues I'm overlooking here? (The link-state check I run is below.)
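For completeness, this is the link-state check mentioned above, run from NODE1 while powering the other nodes off (corosync-cfgtool ships with stock corosync; a sketch, not full output):
Code:
# per-link connectivity as knet sees it, for both rings
corosync-cfgtool -s

# watchdog-mux and the HA services that act as its clients
systemctl status watchdog-mux pve-ha-lrm pve-ha-crm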
Thanks in advance for your help.