Hi
I need some help with a problem regarding a cluster (the servers are hosted at a provider).
First I need to tell you the story.
I have a cluster with two PVE 6 nodes and one recently added PVE 7 node:
10.17.254.5  OP1 (PVE 6, hardware NIC failure)
10.17.254.6  OP0 (recently upgraded from PVE 6 to 7)
10.17.254.7  OPX (healthy PVE 7)
I planned to upgrade the two PVE 6 nodes (OP0 and OP1) to PVE 7 because of the end of support of the Debian/Proxmox release.
I did everything described in the very clear documentation, and the upgrade went fine for the first node (OP0).
I waited until the next day (night) to upgrade the last PVE 6 node. At that point some kind of hardware failure (NIC or SFP module, the provider is not sure) hit the network card of node OP1, and the provider is taking a very long time to fix it.
So I decided to temporarily break the cluster to be able to keep operating. Breaking it is not a big deal, because the cluster is only used for crossed backups over NFS; it is not an HA setup.
What I wanted was to reduce the cluster to only two nodes: the two on PVE 7 with no hardware failure (OPX and OP0). Seems logical.
So I tried:
Code:
pvecm expected 2
But there is some strange behavior: after a while, one of the nodes (OP0) can no longer communicate properly within the cluster and stops operating.
It loses the vmbr0 bond, which is the public IP link interface.
vmbr1, which is the dedicated cluster interface, keeps working.
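For what it's worth, here is roughly how I check the bridge and bond state when vmbr0 drops (I'm assuming here that vmbr0 sits on a bond called bond0; if it bridges a plain NIC the /proc check does not apply):
Code:
# link and address state of the public bridge
ip -br link show vmbr0
ip -br addr show vmbr0
# state of the bond slaves (bond name is an assumption on my side)
cat /proc/net/bonding/bond0
# kernel messages about the NIC/bond around the time it drops
journalctl -k -b | grep -iE 'vmbr0|bond0'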
Here is the log entry from the culprit on the failing node OP0:
Code:
août 22 21:38:08 OP0 pmxcfs[1792]: [quorum] crit: quorum_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [quorum] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [confdb] crit: cmap_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [confdb] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [dcdb] crit: cpg_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [dcdb] crit: can't initialize service
août 22 21:38:08 OP0 pmxcfs[1792]: [status] crit: cpg_initialize failed: 2
août 22 21:38:08 OP0 pmxcfs[1792]: [status] crit: can't initialize service
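When that happens, this is roughly what I run on OP0 to check the cluster stack (standard tools only, nothing exotic):
Code:
# state of corosync and of the pmxcfs cluster filesystem
systemctl status corosync pve-cluster
# their logs since the last boot
journalctl -b -u corosync -u pve-cluster
# quorum view from this node
pvecm status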
If I reboot, it works for approximately half an hour and then loses vmbr0 again.
Here is pvecm status on the nodes:
Code:
Cluster information
-------------------
Name:             CLUSTER-OVH
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug 22 22:33:36 2022
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000004
Ring ID:          1.18f59
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.17.254.7
0x00000004          1 10.17.254.6 (local)
We can see that the expected votes are at 2.
But the corosync logs keep showing this message:
Code:
août 22 22:35:10 OP0 corosync[1870]: [KNET ] udp: Received ICMP error from 10.17.254.6: No route to host 10.17.254.5
and after some time vmbr0 cannot communicate.
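That message makes sense to me, since 10.17.254.5 is the node with the dead NIC but it is still listed in the corosync nodelist, so knet keeps trying to reach it. This is how I confirm it from OP0 (plain connectivity checks, nothing more):
Code:
# is the failed node reachable on the cluster network?
ping -c 3 10.17.254.5
# which route / interface would be used to reach it?
ip route get 10.17.254.5
# per-link view from corosync itself
corosync-cfgtool -s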
At some point I checked the /etc/pve/corosync.conf file and noticed that the bindnetaddr was 10.17.254.5 (OP1, the node with the hardware NIC failure).
So I decided (maybe wrongly) to change that IP address, replacing it with the OPX address (10.17.254.7), the healthiest node. Here is the resulting config:
Code:
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: OP0
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.17.254.6
  }
  node {
    name: OP1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.17.254.5
  }
  node {
    name: OPX
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.17.254.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: CLUSTER-OVH
  config_version: 19
  interface {
    bindnetaddr: 10.17.254.7
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
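For completeness, this is roughly how I applied the change, based on my reading of the pvecm documentation (edit a copy, bump config_version, then move it back into /etc/pve); please correct me if this is not the proper way:
Code:
# work on a copy, never edit the live file directly
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new        # change bindnetaddr and increment config_version by one
# keep a backup of the old config, then put the new one in place
cp /etc/pve/corosync.conf /root/corosync.conf.bak
mv /root/corosync.conf.new /etc/pve/corosync.conf
# pmxcfs then propagates it to /etc/corosync/corosync.conf on the quorate nodes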
I restarted the services, and at some point both nodes showed the following on startup (this excerpt is from OP0):
Code:
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemip_parse: IPv4 address of 10.17.254.5 resolved as 10.17.254.5
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemip_parse: IPv4 address of 10.17.254.7 resolved as 10.17.254.7
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Configuring link 0 params
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemip_parse: IPv4 address of 10.17.254.6 resolved as 10.17.254.6
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Please migrate config file to nodelist.
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] waiting_trans_ack changed to 1
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Token Timeout (3650 ms) retransmit timeout (869 ms)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Token warning every 2737 ms (75% of Token Timeout)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] token hold (685 ms) retransmits before loss (4 retrans)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] join (50 ms) send_join (0 ms) consensus (4380 ms) merge (200 ms)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] downcheck (1000 ms) fail to recv const (2500 msgs)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 65446
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (17 messages)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] missed count const (5 messages)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] send threads (0 threads)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] heartbeat_failures_allowed (0)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] max_network_delay (50 ms)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Initializing transport (Kronosnet).
août 22 21:38:09 OP0 kernel: sctp: Hash tables configured (bind 1024/1024)
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] knet_enable access list: 1
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Configuring crypto nss/aes256/sha256 on index 1
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Knet pMTU change: 421
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] totemknet initialized
août 22 21:38:09 OP0 corosync[1870]: [KNET ] sctp: Size of struct sctp_event_subscribe is 14 in kernel, 14 in user space
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread SCTP_LISTEN to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread SCTP_CONN to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread SCTP_LISTEN to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread PMTUD to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread SCTP_CONN to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread DST_LINK to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread PMTUD to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread TX to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread DST_LINK to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread RX to registered
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread TX to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Updated status for thread HB to started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: TX status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: RX status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: HB status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: PMTUD status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: DST_LINK status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: SCTP_LISTEN status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Checking thread: SCTP_CONN status: started req: started
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Links access lists are enabled
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: PMTUd interval set to: 30 seconds
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: dst_host_filter_fn enabled
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: sock_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]: [KNET ] host: host_status_change_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: pmtud_notify_fn enabled
août 22 21:38:09 OP0 corosync[1870]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
août 22 21:38:09 OP0 corosync[1870]: [KNET ] crypto: Initializing crypto module [nss/aes256/sha256]
août 22 21:38:09 OP0 corosync[1870]: [KNET ] nsscrypto: Initializing nss crypto module [aes256/sha256]
août 22 21:38:09 OP0 corosync[1870]: [KNET ] crypto: PMTUd has been reset to default
août 22 21:38:09 OP0 corosync[1870]: [KNET ] crypto: Notifying PMTUd to rerun
août 22 21:38:09 OP0 corosync[1870]: [KNET ] crypto: Only crypto traffic allowed for RX
août 22 21:38:09 OP0 corosync[1870]: [KNET ] handle: Data forwarding is enabled
août 22 21:38:09 OP0 corosync[1870]: [TOTEM ] Created or loaded sequence id 4.18f4a for this ring.
août 22 21:38:09 OP0 corosync[1870]: [SERV ] Service engine loaded: corosync configuration map access [0]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Initializing IPC on cmap [0]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]: [QB ] server name: cmap
août 22 21:38:09 OP0 corosync[1870]: [SERV ] Service engine loaded: corosync configuration service [1]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Initializing IPC on cfg [1]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]: [QB ] server name: cfg
août 22 21:38:09 OP0 corosync[1870]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] Initializing IPC on cpg [2]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] No configured system.qb_ipc_type. Using native ipc
août 22 21:38:09 OP0 corosync[1870]: [QB ] server name: cpg
août 22 21:38:09 OP0 corosync[1870]: [SERV ] Service engine loaded: corosync profile loading service [4]
août 22 21:38:09 OP0 corosync[1870]: [MAIN ] NOT Initializing IPC on pload [4]
août 22 21:38:09 OP0 corosync[1870]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
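One thing I notice in that startup log is the warning that the interface section's bindnetaddr is used together with the nodelist, and that the config should be migrated to the nodelist. If I read that correctly, the totem section could be reduced to something like this, with the node addresses coming only from the nodelist (this is just my understanding, not something I have applied yet):
Code:
totem {
  cluster_name: CLUSTER-OVH
  config_version: 20
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}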
The documentation says to run pvecm delnode op1, but the node is not detected.
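For reference, this is what I am trying (I also wonder whether the name needs to match corosync.conf exactly, i.e. uppercase OP1):
Code:
# list the nodes the cluster still knows about
pvecm nodes
# remove the dead node; name written as in corosync.conf (the casing is my assumption)
pvecm delnode OP1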
Now I'm stuck with a dysfunctional cluster that I want to break, and I need to ask you good fellows for help.
My goal is one of the following:
either destroy the cluster entirely so that I end up with three independent PVE hosts, or shrink the cluster down to just the two nodes OP0 and OPX.
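For the "destroy the cluster" option, my understanding of the "Separate a node without reinstalling" section of the admin guide is roughly the following, run on each node that should become standalone again (please tell me if this is the wrong approach, especially given the current state of the cluster):
Code:
# stop the cluster stack
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l
# remove the cluster configuration
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
# stop the local-mode pmxcfs and start the normal service again
killall pmxcfs
systemctl start pve-cluster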
This community and forum are a goldmine of information and solutions. I'm pretty sure I missed something obvious and that another pair of eyes will spot the solution.
Thanks in advance.