Challenge: delete a disconnected node from a 3-node cluster

proxman4

Hi

I need some help with a problem regarding a cluster (hosted at a server provider).

First, I need to tell you the story.

I have a cluster with two PVE 6 nodes and one recently added PVE 7 node.

10.17.254.5 OP1 (PVE 6, hardware NIC failure)
10.17.254.6 OP0 (recently upgraded PVE 6 to 7)
10.17.254.7 OPX (healthy PVE 7)

I planned to upgrade the two nodes (OP0 and OP1) from PVE 6 to 7 because of the end of support of Debian/Proxmox.

I did everything described in the very clear documentation, and the upgrade went well for the first node (OP1).

I waited until the next day (night) to upgrade the last PVE 6 node. At the same time there was some kind of hardware failure (NIC or SFP module, the provider is not clear) on the network card of node OP0, and the provider took a very long time to fix it.


So I decided to temporarily break the cluster to be able to continue operations. Breaking it is not a big deal, because it's only used for crossed backups via NFS; it's not an HA setup.

So what I wanted was to reduce the cluster to only two nodes: the two on PVE 7 with no hardware failure (OPX and OP0). Seems logical.

I tried:

Code:
pvecm expected 2

But there is strange behaviour: after some time, one of the nodes (OP0) cannot communicate properly within the cluster and stops operating.

It loses the vmbr0 bond, which is the public IP interface.

vmbr1, which is the dedicated cluster interface, continues to work.
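
When vmbr0 drops, a quick state check looks roughly like this (a minimal sketch, assuming the usual iproute2 tools on the node; adjust interface names to your setup):

Code:
# Link and address state of the bridges
ip -br link show
ip -br addr show vmbr0

# Recent kernel/networking messages around the time of the drop
journalctl -k --since "1 hour ago" | grep -i -e vmbr0 -e bond -e link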

Here are the log entries of the culprit on the failing node OP0:

Code:
août 22 21:38:08 OP0 pmxcfs[1792]: [quorum] crit: quorum_initialize failed: 2                                                                                                             
août 22 21:38:08 OP0 pmxcfs[1792]: [quorum] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [confdb] crit: cmap_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [confdb] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [dcdb] crit: cpg_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [dcdb] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [status] crit: cpg_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [status] crit: can't initialize service


If I reboot, it works for approximately half an hour and then loses vmbr0 again.

Here is pvecm status, showing the nodes:

Code:
Cluster information
-------------------
Name:             CLUSTER-OVH
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug 22 22:33:36 2022
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000004
Ring ID:          1.18f59
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.17.254.7
0x00000004          1 10.17.254.6 (local)

We can see that the expected votes are at 2.

But the corosync logs keep showing this message:


Code:
août 22 22:35:10 OP0 corosync[1870]:   [KNET  ] udp: Received ICMP error from 10.17.254.6: No route to host 10.17.254.5

and after some time vmbr0 cannot communicate.
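
To see how corosync/knet itself rates the links while OP1 is unreachable, the corosync CLI tools can help (a sketch; both tools ship with the corosync package):

Code:
# Local node ID and per-link status towards the other nodes
corosync-cfgtool -s

# Runtime quorum/votequorum view
corosync-quorumtool -s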

At some point I checked the /etc/pve/corosync.conf file and noticed that the bindnetaddr was 10.17.254.5 (OP1, the one with the hardware NIC failure).


So I decided (maybe wrongly) to change that IP address, replacing it with the OPX address (10.17.254.7), the healthiest node:

Code:
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: OP0
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.17.254.6
  }
  node {
    name: OP1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.17.254.5
  }
  node {
    name: OPX
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.17.254.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: CLUSTER-OVH
  config_version: 19
  interface {
    bindnetaddr: 10.17.254.7
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
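
For reference, the way I understand the documented procedure for editing /etc/pve/corosync.conf: work on a copy, increment config_version inside totem { }, then move the copy into place so pmxcfs distributes it. A rough sketch (file names are just what I use):

Code:
# Never edit the live file directly; keep a backup
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak

# Make the change and bump config_version in the copy
nano /etc/pve/corosync.conf.new

# Activate it; pmxcfs propagates the new config to the quorate nodes
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf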


I restarted the services, and at some point on both nodes I got logs like this:

Code:
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemip_parse: IPv4 address of 10.17.254.5 resolved as 10.17.254.5
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemip_parse: IPv4 address of 10.17.254.7 resolved as 10.17.254.7
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Configuring link 0 params
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Please migrate config file to nodelist.
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] waiting_trans_ack changed to 1
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Token Timeout (3650 ms) retransmit timeout (869 ms)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Token warning every 2737 ms (75% of Token Timeout)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] token hold (685 ms) retransmits before loss (4 retrans)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] join (50 ms) send_join (0 ms) consensus (4380 ms) merge (200 ms)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] downcheck (1000 ms) fail to recv const (2500 msgs)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 65446
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] missed count const (5 messages)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] send threads (0 threads)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] heartbeat_failures_allowed (0)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] max_network_delay (50 ms)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Initializing transport (Kronosnet).
août 22 21:38:09 OP0 kernel: sctp: Hash tables configured (bind 1024/1024)
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] knet_enable access list: 1
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Configuring crypto nss/aes256/sha256 on index 1
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Knet pMTU change: 421
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] totemknet initialized
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] sctp: Size of struct sctp_event_subscribe is 14 in kernel, 14 in user space
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread SCTP_LISTEN to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread SCTP_CONN to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread SCTP_LISTEN to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread PMTUD to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread SCTP_CONN to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread DST_LINK to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread PMTUD to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread TX to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread DST_LINK to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread RX to registered
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread TX to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Updated status for thread HB to started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: TX status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: RX status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: HB status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: PMTUD status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: DST_LINK status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: SCTP_LISTEN status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Checking thread: SCTP_CONN status: started req: started
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Links access lists are enabled
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: PMTUd interval set to: 30 seconds
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: dst_host_filter_fn enabled
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: sock_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] host: host_status_change_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: pmtud_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] crypto: Initializing crypto module [nss/aes256/sha256]
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] nsscrypto: Initializing nss crypto module [aes256/sha256]
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] crypto: PMTUd has been reset to default
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] crypto: Notifying PMTUd to rerun
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] crypto: Only crypto traffic allowed for RX
août 22 21:38:09 OP0 corosync[1870]:   [KNET  ] handle: Data forwarding is enabled
août 22 21:38:09 OP0 corosync[1870]:   [TOTEM ] Created or loaded sequence id 4.18f4a for this ring.
août 22 21:38:09 OP0 corosync[1870]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Initializing IPC on cmap [0]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]:   [QB    ] server name: cmap
août 22 21:38:09 OP0 corosync[1870]:   [SERV  ] Service engine loaded: corosync configuration service [1]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Initializing IPC on cfg [1]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]:   [QB    ] server name: cfg
août 22 21:38:09 OP0 corosync[1870]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] Initializing IPC on cpg [2]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]:   [QB    ] server name: cpg
août 22 21:38:09 OP0 corosync[1870]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
août 22 21:38:09 OP0 corosync[1870]:   [MAIN  ] NOT Initializing IPC on pload [4]
août 22 21:38:09 OP0 corosync[1870]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]

The documentation says to run pvecm delnode OP1,

but the node is not detected.
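
As far as I understand the docs, delnode wants the exact node name the cluster reports, and the remaining nodes need to be quorate when you run it. Roughly (a sketch):

Code:
# List the nodes as the cluster currently knows them
pvecm nodes

# Remove the dead node, using the exact name from the list above
pvecm delnode OP1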

Now I'm stuck with a dysfunctional cluster that I want to break, and I need to ask you good fellows for help.
My goal is:

either destroying the cluster to end up with 3 independent PVE nodes, or reducing the cluster to only the two nodes OP0 and OPX.

This community and forum are a goldmine of information and solutions; I'm pretty sure I missed something obvious and another pair of eyes will see the solution.

Thanks in advance.
 
Hi,

the whole thing is a bit confusing to me. You write:
So what i wanted was to reduce the cluster to only two nodes. the two on pve7 with no hardware failure (OPX and OP0)
But at the top you write:
10.17.254.5 OP1 (recently pve6to7)
10.17.254.6 OP0 (pve6) (hardware nic failure)
10.17.254.7 OPX (healthy pve7)
I'm guessing you switched something in the numbering OP0/1? I'm not exactly sure which one.
 
So I got some news. The provider finally fixed the hardware failure and my cluster network can finally communicate.

Now I get this from OP1:

Code:
août 23 17:01:20 OP1 corosync[1583]:   [KNET  ] udp: Received ICMP error from 10.17.254.6: Connection refused 10.17.254.6

This from OP0:

Code:
août 23 17:00:24 OP0 pveproxy[2217]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1943.

and this from OPX:

Code:
root@OPX:~# pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

journalctl

Code:
Aug 23 15:03:00 OPX pveproxy[1291140]: ipcc_send_rec[1] failed: Connection refused
Aug 23 15:03:00 OPX pveproxy[1291140]: ipcc_send_rec[2] failed: Connection refused
Aug 23 15:03:00 OPX pveproxy[1291140]: ipcc_send_rec[3] failed: Connection refused
Aug 23 15:03:00 OPX pvescheduler[1302357]: replication: Connection refused
Aug 23 15:03:00 OPX pvescheduler[1302358]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 23 15:03:01 OPX pveproxy[664226]: ipcc_send_rec[1] failed: Connection refused
Aug 23 15:03:01 OPX pveproxy[664226]: ipcc_send_rec[2] failed: Connection refused
Aug 23 15:03:01 OPX pveproxy[664226]: ipcc_send_rec[3] failed: Connection refused
Aug 23 15:03:01 OPX cron[1670]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Aug 23 15:03:02 OPX pve-ha-lrm[1780]: updating service status from manager failed: Connection refused
Aug 23 15:03:06 OPX pvestatd[1687]: ipcc_send_rec[1] failed: Connection refused
Aug 23 15:03:06 OPX pvestatd[1687]: ipcc_send_rec[2] failed: Connection refused
Aug 23 15:03:06 OPX pvestatd[1687]: ipcc_send_rec[3] failed: Connection refused
Aug 23 15:03:06 OPX pvestatd[1687]: ipcc_send_rec[4] failed: Connection refused
Aug 23 15:03:06 OPX pvestatd[1687]: status update error: Connection refused
Aug 23 15:03:07 OPX pve-ha-lrm[1780]: updating service status from manager failed: Connection refused

It seems there are problems communicating within the quorum (SSH), because I can't SSH from OPX or OP1 to OP0.
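
In case it matters, this is how I check whether the cluster filesystem is actually up and quorate on a node (a simple sketch):

Code:
# Are pmxcfs (pve-cluster) and corosync running at all?
systemctl status pve-cluster corosync --no-pager

# Is /etc/pve populated? It goes read-only when the node has no quorum.
ls -l /etc/pve/local/

# Current quorum view
pvecm status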

Is it OK to rebuild the cluster, from your point of view?
 
OK, I messed up a little more, so now the cluster looks to be in a totally failed state. OP1 and OPX still continue operating, but OP0 stops working after approx. 30 minutes. I decided to break the whole cluster; I just need to do it properly.
 
So given all the failures, I rented a new dedicated server and need to completely destroy the cluster.

The plan is to migrate OP0 and OP1 to a new OPY.


The new situation is:

OP0 (all backed up)
OP1 (deleted from the cluster and fully reinstalled)
OPX (cannot access the web UI or run basic operations such as vzdump, but the VMs/LXCs are working)
OPY (brand new Proxmox)

The problem is that OPX still has big issues.

I need some help with this last healthy-turned-unhealthy node.

I cannot access the web UI (not a big deal), but I also cannot run vzdump to move the VMs to another healthy node.

How can I delete this node from a now non-existent cluster (I broke/deleted the two other nodes from the cluster) without breaking everything?

I tried the official documentation's procedure:

https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
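
On the same wiki page there is also the "Separate a Node Without Reinstalling" section; as far as I understand it, the idea is to stop the cluster stack, start pmxcfs in local mode, and remove the corosync configuration. A sketch (please double-check against the current wiki, and only run it on a node you really want out of the cluster):

Code:
# Stop the cluster stack
systemctl stop pve-cluster corosync

# Start the cluster filesystem in local (standalone) mode
pmxcfs -l

# Remove the corosync configuration from the cluster FS and from disk
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*

# Stop the local-mode pmxcfs and start the normal service again
killall pmxcfs
systemctl start pve-cluster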

I cannot restart corosync anymore. After systemctl restart corosync, journalctl -xe shows:

Code:
-- The unit corosync.service has entered the 'failed' state with result 'exit-code'.
Aug 30 22:14:29 OPX systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: A start job for unit corosync.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit corosync.service has finished with a failure.
--
-- The job identifier is 13223 and the job result is failed.
Aug 30 22:14:35 OPX pmxcfs[1787893]: [quorum] crit: quorum_initialize failed: 2
Aug 30 22:14:35 OPX pmxcfs[1787893]: [confdb] crit: cmap_initialize failed: 2
Aug 30 22:14:35 OPX pmxcfs[1787893]: [dcdb] crit: cpg_initialize failed: 2
Aug 30 22:14:35 OPX pmxcfs[1787893]: [status] crit: cpg_initialize failed: 2


pvecm status

Code:
Cluster information
-------------------
Name:             CLUSTER-OVH
Config Version:   19
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service


I've got critical services running on this node that are not backed up properly (because I can't use vzdump anymore). What can I do to fix this, please?
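
If I manage to get /etc/pve writable again (for example with the local-mode procedure mentioned above), I assume a manual backup would look roughly like this (the VMID and storage name are placeholders for my setup):

Code:
# Back up one guest to a given storage; snapshot mode keeps downtime low
vzdump 100 --storage local --mode snapshot --compress zstd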
 
I managed to edit /etc/pve/corosync.conf, break up the whole cluster, and reinstall all the servers.


Thank you backups.
 
