Node1 Unexpected Restarts Despite Configuration Adjustments - HA

adriano1995

Member
Feb 25, 2021
Hi everyone,

NODE1 in my Proxmox VE cluster restarts whenever both NODE2 and NODE3 are powered off, even after several adjustments to corosync.conf. Below is a summary of the setup and the steps I've taken:

Configuration Overview:

  • NODE1: Primary node with quorum_votes: 2.
  • NODE2: Backup node with quorum_votes: 1, intended to take over if NODE1 fails.
  • NODE3: Acts as a quorum arbiter with quorum_votes: 1.
  • no_quorum_policy is set to ignore to allow NODE1 to continue operating even if other nodes are down.
  • expected_votes is set to 4 in the quorum section.
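
To sanity-check the vote arithmetic in the list above, here is a minimal sketch. It assumes votequorum uses the usual strict-majority rule (expected_votes // 2 + 1) and it deliberately ignores last_man_standing, so it only describes the static case. The function names are my own and are not part of corosync.

```python
from itertools import combinations

# Vote weights and expected_votes copied from my corosync.conf.
VOTES = {"NODE1": 2, "NODE2": 1, "NODE3": 1}
EXPECTED_VOTES = 4


def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(online, votes=VOTES, expected_votes=EXPECTED_VOTES) -> bool:
    """True if the online nodes together reach the static threshold."""
    return sum(votes[name] for name in online) >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    print("threshold:", quorum_threshold(EXPECTED_VOTES))
    # Enumerate every combination of surviving nodes.
    for size in range(1, len(VOTES) + 1):
        for online in combinations(VOTES, size):
            state = "quorate" if is_quorate(online) else "NOT quorate"
            print(", ".join(online), "->", state)
```

If that majority rule is right, the threshold is 3, so NODE1 alone (2 votes) would not be quorate, and neither would NODE2 and NODE3 together.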

Key Configuration Snippets:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: NODE1
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
  node {
    name: NODE2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
  node {
    name: NODE3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.XX
    ring1_addr: 192.168.0.XX
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 4
  last_man_standing: 1
  last_man_standing_window: 10000
  no_quorum_policy: ignore
}

totem {
  cluster_name: CLUSTER_NAME
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token: 5000
  consensus: 120000
}

Logs Prior to Restart:

Code:
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] link: host: 3 link: 0 is down
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] link: host: 3 link: 1 is down
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] host: host: 3 has no active links
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 12:09:59 NODE1 corosync[PID]:   [KNET  ] host: host: 3 has no active links
Aug 25 12:10:00 NODE1 corosync[PID]:   [TOTEM ] Token has not been received in 4237 ms
Aug 25 12:10:01 NODE1 corosync[PID]:   [TOTEM ] A processor failed, forming new configuration: token timed out (5650ms), waiting 120000ms for consensus.
Aug 25 12:10:55 NODE1 watchdog-mux[PID]: client watchdog expired - disable watchdog updates
Aug 25 12:10:55 NODE1 watchdog-mux[PID]: client watchdog expired - disable watchdog updates
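
To put the timestamps above side by side, here is a small sketch that compares the consensus wait with the watchdog expiry. The two times and the 120000 ms value are taken from the log; the variable names are my own, and whether the consensus wait is what actually delays the new membership is an assumption on my part, not something the log confirms.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"  # both events are on the same day, so the time of day is enough

# "forming new configuration ... waiting 120000ms for consensus"
consensus_wait_started = datetime.strptime("12:10:01", FMT)
consensus_ms = 120000  # consensus value from the totem section

# "client watchdog expired - disable watchdog updates"
watchdog_expired = datetime.strptime("12:10:55", FMT)

# Earliest moment the consensus wait could have finished.
consensus_wait_ends = consensus_wait_started + timedelta(milliseconds=consensus_ms)

# Positive margin means the watchdog fired before the wait was over.
margin = consensus_wait_ends - watchdog_expired

if __name__ == "__main__":
    print("consensus wait ends at:", consensus_wait_ends.strftime(FMT))
    print("watchdog expired at:   ", watchdog_expired.strftime(FMT))
    print("watchdog fired", int(margin.total_seconds()), "s before the wait ended")
```

By this arithmetic the watchdog client expired at 12:10:55, about 66 seconds before the 120000 ms consensus wait would have ended at 12:12:01.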

Issue:

Despite this configuration, NODE1 still restarts when both NODE2 and NODE3 are powered off. I want NODE1 to remain operational even when it is the only active node, but it is still being reset, most likely by a watchdog timeout (see the watchdog-mux lines at the end of the log above).

My Questions:

  1. What could cause NODE1 to restart when NODE2 and NODE3 are powered off, despite the current configuration?
  2. Are there additional settings or logs I should check to understand why the node is restarting?
  3. Could there be KNET or TOTEM issues that I'm overlooking here?

Any insights or suggestions would be greatly appreciated!

Thanks in advance for your help.
 
