Cluster malfunction

franconovik

Member
Dec 12, 2024
Hello, I have a Proxmox cluster with all nodes updated to 8.3.3.
Today, from the dashboard, I noticed that one node had a red cross while the other four were green.
So I started investigating.
On all the green nodes I get the following output from pvecm status:
root@pve03:/etc# pvecm status
Cluster information
-------------------
Name: US01
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Apr 17 17:24:38 2025
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000004
Ring ID: 1.161f5
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: 2Node Quorate WaitForAll

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 182.xx.y.227
0x00000002 1 182.xx.y.228
0x00000003 1 182.xx.y.229
0x00000004 1 182.xx.y.230 (local)
root@pve03:/etc#


systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Thu 2025-02-27 16:00:17 CET; 1 month 18 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 52581 (corosync)
Tasks: 9 (limit: 154510)
Memory: 146.4M
CPU: 1d 21h 2min 26.287s
CGroup: /system.slice/corosync.service
└─52581 /usr/sbin/corosync -f

Apr 17 07:23:25 pve03 corosync[52581]: [KNET ] link: host: 5 link: 0 is down
Apr 17 07:23:25 pve03 corosync[52581]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Apr 17 07:23:25 pve03 corosync[52581]: [KNET ] host: host: 5 has no active links <===== how do I make this link active?
Apr 17 07:23:27 pve03 corosync[52581]: [TOTEM ] Token has not been received in 3712 ms
Apr 17 07:23:34 pve03 corosync[52581]: [QUORUM] Sync members[4]: 1 2 3 4
Apr 17 07:23:34 pve03 corosync[52581]: [QUORUM] Sync left[1]: 5
Apr 17 07:23:34 pve03 corosync[52581]: [TOTEM ] A new membership (1.161f5) was formed. Members left: 5
Apr 17 07:23:34 pve03 corosync[52581]: [TOTEM ] Failed to receive the leave message. failed: 5
Apr 17 07:23:34 pve03 corosync[52581]: [QUORUM] Members[4]: 1 2 3 4
Apr 17 07:23:34 pve03 corosync[52581]: [MAIN ] Completed service synchronization, ready to provide service.
root@pve03:~#
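
To see why host 5 has no active links, the knet link state can be checked from any node; a quick sketch (182.xx.y.231 is pve04's ring0 address from the nodelist):

corosync-cfgtool -s        # shows the state of every knet link to each peer
ping -c 3 182.xx.y.231     # basic reachability of pve04's corosync address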


root@pve04:/etc/pve# pvecm status
Can't use an undefined value as a HASH reference at /usr/share/perl5/PVE/CLI/pvecm.pm line 496, <DATA> line 960.



root@pve04:/etc/pve# systemctl status corosync.service
× corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2025-04-17 16:17:50 CEST; 40min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1316 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1316 (code=exited, status=8)
CPU: 10ms

Apr 17 16:17:50 pve04 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Apr 17 16:17:50 pve04 corosync[1316]: parser error: /etc/corosync/corosync.conf:54: Missing closing brace
Apr 17 16:17:50 pve04 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Apr 17 16:17:50 pve04 systemd[1]: corosync.service: Failed with result 'exit-code'.
Apr 17 16:17:50 pve04 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.


On the red-crossed node I noticed the corosync file is missing a brace, but I am not able to update it.
The file to update is:
/etc/pve/corosync.conf


How can I update it?

Thanks
/Franco
 
Hello, I know where to add the brace, but when I edit the file and try to save it, the system tells me the file system is full.
I tried to apply the procedure from https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf, but when I execute
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.old I get:
cp: cannot create regular file 'corosync.conf.old': Permission denied

So how can I write to and update the file?
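
For reference, a rough sketch of the documented edit procedure, assuming the node is quorate and the root file system actually has free space (/etc/pve is the pmxcfs cluster file system, which goes read-only when the node has no quorum; the local mode below is documented in the pmxcfs man page as a last resort):

df -h /                                   # rule out a genuinely full disk first

# on a quorate node: edit a copy, then move it into place atomically
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new           # add the brace, bump config_version
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

# on a non-quorate node /etc/pve is read-only; last-resort local mode:
systemctl stop pve-cluster
pmxcfs -l                                 # start pmxcfs in local read-write mode
# ... edit /etc/pve/corosync.conf, then:
killall pmxcfs
systemctl start pve-cluster

Here is the current file: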

root@pve04:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 182.xx.y.227
  }
  node {
    name: pve01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 182.xx.y.228
  }
  node {
    name: pve02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 182.xx.y.229
  }
  node {
    name: pve03
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 182.xx.y.230
  }
  node {
    name: pve04
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 182.xx.y.231
  }

quorum {
  provider: corosync_votequorum
  two_node: 1
}

totem {
  cluster_name: US01
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

I have to add a closing brace after the last node block (the one with ring0_addr: 182.xx.y.231), so that the nodelist section itself is closed.
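
For clarity, the end of the nodelist section should then look like this, with the new brace closing nodelist itself (and config_version in the totem section bumped so the other nodes accept the change):

  node {
    name: pve04
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 182.xx.y.231
  }
}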

/thanks
 
Hello, I restarted corosync after updating the file and increasing config_version on all nodes.
On the four healthy nodes it is OK; I get:

root@pve00:/etc# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-04-18 09:06:22 CEST; 5min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3434012 (corosync)
Tasks: 9 (limit: 48156)
Memory: 136.5M
CPU: 8.860s
CGroup: /system.slice/corosync.service
└─3434012 /usr/sbin/corosync -f

Apr 18 09:06:26 pve00 corosync[3434012]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 18 09:06:26 pve00 corosync[3434012]: [QUORUM] Sync members[4]: 1 2 3 4
Apr 18 09:06:26 pve00 corosync[3434012]: [QUORUM] Sync joined[3]: 2 3 4
Apr 18 09:06:26 pve00 corosync[3434012]: [TOTEM ] A new membership (1.16219) was formed. Members joined: 2 3 4
Apr 18 09:06:26 pve00 corosync[3434012]: [VOTEQ ] Waiting for all cluster members. Current votes: 2 expected_votes: 5
Apr 18 09:06:26 pve00 corosync[3434012]: [QUORUM] This node is within the primary component and will provide service.
Apr 18 09:06:26 pve00 corosync[3434012]: [QUORUM] Members[4]: 1 2 3 4
Apr 18 09:06:26 pve00 corosync[3434012]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 18 09:06:26 pve00 corosync[3434012]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Apr 18 09:06:26 pve00 corosync[3434012]: [KNET ] pmtud: Global data MTU changed to: 1397

This is strange, as the 5th node is not counted among the members (it cannot rejoin until corosync starts on pve04).
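
One way to verify the fixed file really propagated: pve-cluster copies /etc/pve/corosync.conf to the local /etc/corosync/corosync.conf on every node, so on each of the four members the two files should match and carry the new version:

diff /etc/pve/corosync.conf /etc/corosync/corosync.conf   # no output = in sync
grep config_version /etc/corosync/corosync.conf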




On the red-crossed node I get:
root@pve04:/etc# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.


root@pve04:/etc# systemctl status corosync
× corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Fri 2025-04-18 09:10:36 CEST; 18s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1411168 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1411168 (code=exited, status=8)
CPU: 21ms

Apr 18 09:10:36 pve04 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Apr 18 09:10:36 pve04 corosync[1411168]: [MAIN ] Corosync Cluster Engine starting up
Apr 18 09:10:36 pve04 corosync[1411168]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle >
Apr 18 09:10:36 pve04 corosync[1411168]: [MAIN ] Could not open /etc/corosync/authkey: No such file or directory
Apr 18 09:10:36 pve04 corosync[1411168]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1428.
Apr 18 09:10:36 pve04 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Apr 18 09:10:36 pve04 systemd[1]: corosync.service: Failed with result 'exit-code'.
Apr 18 09:10:36 pve04 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.


So how can I fix this?
/thanks
 
Copy the key from another node. Is the disk still full? The key should not just disappear.
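
A rough sketch of that, run on pve04 and assuming root SSH to a healthy node (182.xx.y.230 is pve03's ring0 address); since pmxcfs is not quorate on pve04, its local /etc/corosync copies have to be fixed by hand:

df -h /                                            # first rule out the full disk again
scp root@182.xx.y.230:/etc/corosync/authkey /etc/corosync/authkey
scp root@182.xx.y.230:/etc/corosync/corosync.conf /etc/corosync/corosync.conf
chmod 400 /etc/corosync/authkey                    # authkey must be readable by root only
systemctl start corosync
systemctl status corosync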