Disband and recreate a cluster that will never fully reconnect

Aug 20, 2021
I had a cluster of 3 Proxmox servers and added a 4th; however, each time I try, I can barely log in to the GUI and the nodes barely connect to each other.
3 of them have existing VMs I cannot simply delete. Is there a way to delete the cluster and start again without losing the VMs? I remember that in order to join a cluster the host needs to have no VMs; is there a way around this?
The cluster is very unstable for some reason, with hosts dropping in and out. The hardware is fine and SSH works fine, but Proxmox cluster stability is so bad that there's no way for me to move the VMs to one host.

I noticed Proxmox has a limit of 1000 hosts, so how is it failing so easily with just a few hosts? I thought Proxmox was used in production/commercial environments as well?

I use the cluster to reduce configuration work with network drives, and I use different networking/VLANs for VMs. Reinstalling is time-consuming and not an option, especially since I have work and cannot delete the VMs. As it is, I can't even use the cluster or any host to do anything.
 
The instability you're experiencing is not normal for Proxmox, even with just a few nodes. This suggests there may be underlying network, configuration, or resource issues.

I'm not sure the right answer is to tear the cluster apart and start anew, but if you absolutely must, I would back up the VMs on the three nodes that hold critical VMs to the network drives you mentioned. Alternatively, you can back up and restore/replicate to the 4th, empty node and keep them there as placeholders while you rebuild the cluster.

Any errors in syslog? What does pveperf report? You can also look through the kern.log file or search the PVE task folders like below.

cat /var/log/kern.log
find /var/log/pve/tasks -type f -print0 | xargs -0 grep -i error

I use the below for some of my own tasks, feel free to use it if it helps bring something to light.

Bash:
for log in /var/log/pve-cluster/corosync.log /var/log/pveproxy/access.log /var/log/pveproxy/error.log /var/log/syslog /var/log/kern.log /var/log/auth.log /var/log/pvedaemon.log; do
    [ -f "$log" ] || continue  # not every log exists on every node/version
    echo "=== $log ===" >> proxmox_diagnostics.txt
    tail -n 1000 "$log" >> proxmox_diagnostics.txt
    echo >> proxmox_diagnostics.txt
done

echo "=== Network Interfaces ===" >> proxmox_diagnostics.txt
ip a >> proxmox_diagnostics.txt

echo "=== Firewall Rules ===" >> proxmox_diagnostics.txt
iptables -L -v -n >> proxmox_diagnostics.txt

echo "=== Cluster Status ===" >> proxmox_diagnostics.txt
pvecm status >> proxmox_diagnostics.txt
pvecm nodes >> proxmox_diagnostics.txt

echo "=== Resource Usage ===" >> proxmox_diagnostics.txt
top -b -n 1 >> proxmox_diagnostics.txt
df -h >> proxmox_diagnostics.txt

echo "=== Configuration Files ===" >> proxmox_diagnostics.txt
cat /etc/pve/corosync.conf >> proxmox_diagnostics.txt
cat /etc/pve/storage.cfg >> proxmox_diagnostics.txt

echo "=== Hardware Info ===" >> proxmox_diagnostics.txt
lscpu >> proxmox_diagnostics.txt
free -m >> proxmox_diagnostics.txt

echo "=== Service Status ===" >> proxmox_diagnostics.txt
systemctl status pve-cluster >> proxmox_diagnostics.txt
systemctl status corosync >> proxmox_diagnostics.txt
systemctl status pvedaemon >> proxmox_diagnostics.txt

echo "=== VM and Container List ===" >> proxmox_diagnostics.txt
qm list >> proxmox_diagnostics.txt
pct list >> proxmox_diagnostics.txt
 
Thanks! kern.log from node 2:
Code:
Aug 16 13:22:26 proxe2 kernel: [ 4713.878022] Call Trace:
Aug 16 13:22:26 proxe2 kernel: [ 4713.878038]  __schedule+0x2e6/0x700
Aug 16 13:22:26 proxe2 kernel: [ 4713.878046]  ? filename_parentat.isra.55.part.56+0xf7/0x180
Aug 16 13:22:26 proxe2 kernel: [ 4713.878050]  schedule+0x33/0xa0
Aug 16 13:22:26 proxe2 kernel: [ 4713.878055]  rwsem_down_write_slowpath+0x2ed/0x4a0
Aug 16 13:22:26 proxe2 kernel: [ 4713.878060]  down_write+0x3d/0x40
Aug 16 13:22:26 proxe2 kernel: [ 4713.878063]  filename_create+0x8e/0x180
Aug 16 13:22:26 proxe2 kernel: [ 4713.878069]  do_mkdirat+0x59/0x110
Aug 16 13:22:26 proxe2 kernel: [ 4713.878073]  __x64_sys_mkdir+0x1b/0x20
Aug 16 13:22:26 proxe2 kernel: [ 4713.878080]  do_syscall_64+0x57/0x190
Aug 16 13:22:26 proxe2 kernel: [ 4713.878088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 16 13:22:26 proxe2 kernel: [ 4713.878092] RIP: 0033:0x7f57638e7f87
Aug 16 13:22:26 proxe2 kernel: [ 4713.878104] Code: Bad RIP value.
Aug 16 13:22:26 proxe2 kernel: [ 4713.878107] RSP: 002b:00007ffdacf64118 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
Aug 16 13:22:26 proxe2 kernel: [ 4713.878112] RAX: ffffffffffffffda RBX: 000055a23becd260 RCX: 00007f57638e7f87
Aug 16 13:22:26 proxe2 kernel: [ 4713.878114] RDX: 000055a23af483d4 RSI: 00000000000001ff RDI: 000055a2400ba2b0
Aug 16 13:22:26 proxe2 kernel: [ 4713.878116] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
Aug 16 13:22:26 proxe2 kernel: [ 4713.878118] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a23d315638
Aug 16 13:22:26 proxe2 kernel: [ 4713.878119] R13: 000055a2400ba2b0 R14: 000055a2400309c0 R15: 00000000000001ff
Aug 16 13:29:26 proxe2 kernel: [ 5134.477946] fwbr101i0: port 2(tap101i0) entered disabled state
Aug 16 13:29:26 proxe2 kernel: [ 5134.500757] fwbr101i0: port 1(fwln101i0) entered disabled state
Aug 16 13:29:26 proxe2 kernel: [ 5134.500964] vmbr0: port 2(fwpr101p0) entered disabled state
Aug 16 13:29:26 proxe2 kernel: [ 5134.502223] device fwln101i0 left promiscuous mode
Aug 16 13:29:26 proxe2 kernel: [ 5134.502235] fwbr101i0: port 1(fwln101i0) entered disabled state
Aug 16 13:29:26 proxe2 kernel: [ 5134.521153] device fwpr101p0 left promiscuous mode
Aug 16 13:29:26 proxe2 kernel: [ 5134.521162] vmbr0: port 2(fwpr101p0) entered disabled state
Aug 16 14:05:50 proxe2 kernel: [ 7318.775104] reconnect tcon failed rc = -11
Aug 16 14:07:14 proxe2 kernel: [ 7402.548333] Status code returned 0xc000006d STATUS_LOGON_FAILURE
Aug 16 14:07:14 proxe2 kernel: [ 7402.548354] CIFS VFS: \\192.168.87.1 Send error in SessSetup = -13

Task error on node 4:

Code:
/var/log/pve/tasks/C/UPID:pve4:000092C8:0002312F:66BBAEDC:clusterjoin::root@pam::waiting for quorum...TASK ERROR: received interrupt
/var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::140130710029632:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: ANY PRIVATE KEY
/var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::TASK ERROR: unable to generate pve certificate request: command 'openssl req -batch -new -config /tmp/pvesslconf-1804.tmp -key /etc/pve/nodes/pve4/pve-ssl.key -out /tmp/pvecertreq-1804.tmp' failed: exit code 1
Task error on node 1:
Code:
/var/log/pve/tasks/C/UPID:pve4:000092C8:0002312F:66BBAEDC:clusterjoin::root@pam::waiting for quorum...TASK ERROR: received interrupt
-bash: /var/log/pve/tasks/C/UPID:pve4:000092C8:0002312F:66BBAEDC:clusterjoin::root@pam::waiting: No such file or directory
root@proxe1:~# /var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::140130710029632:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: ANY PRIVATE KEY
-bash: /var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::140130710029632:error:0909006C:PEM: No such file or directory
root@proxe1:~# /var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::TASK ERROR: unable to generate pve certificate request: command 'openssl req -batch -new -config /tmp/pvesslconf-1804.tmp -key /etc/pve/nodes/pve4/pve-ssl.key -out /tmp/pvecertreq-1804.tmp' failed: exit code 1
-bash: /var/log/pve/tasks/F/UPID:pve4:0000070C:00006179:66BC488F:clusterjoin::root@pam::TASK: No such file or directory

Other errors are just SMB and apt errors, the usual stuff for Proxmox. I can fix the SMB/network-filesystem errors very easily, and that share only stores ISOs, backups, and templates, nothing tied to the VMs; the VMs themselves are stored locally on each node. The task error, I think, was because I tried running updatecerts after failing to log in via the GUI. The GUI sometimes works, but takes so long to log in that by the time I can do anything I get connection errors.
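As a side note on the "Expecting: ANY PRIVATE KEY" line in the node 4 task error: that usually means the key file OpenSSL was given is empty or corrupt. A minimal sanity check, assuming the key path from the error above is still in place:

```shell
# Check whether the key referenced in the task error parses at all.
# If this fails, the key was likely never written correctly during the
# join; regenerating the node certificates (pvecm updatecerts, which
# needs quorum) is the usual next step.
openssl rsa -in /etc/pve/nodes/pve4/pve-ssl.key -check -noout \
    && echo "pve-ssl.key parses OK" \
    || echo "pve-ssl.key is missing, empty, or damaged"
```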
 
Something is interfering with the pve-cluster service, which provides /etc/pve on each PVE host. It would be interesting to run the following on each node:

- pveversion -v
- pvecm status
- corosync-cfgtool -n

Is there any kind of firewall running on the PVE hosts? Also, fix that storage issue, as it clutters the logs unnecessarily (you can simply disable the storage under Datacenter > Storage if you don't use it at the moment).

@sva take a look at the pvereport command, as it includes the output of many of the commands you use to gather info. Also remember that PVE 8 by default doesn't come with rsyslog enabled and uses journald, so files like /var/log/syslog no longer exist and you have to use journalctl instead.
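For example, a sketch of the journald equivalent on a PVE 8 node (unit names as used elsewhere in this thread):

```shell
# Pull the last two hours of cluster-related messages instead of
# grepping /var/log/syslog, which no longer exists under journald
journalctl -u corosync -u pve-cluster -u pvedaemon \
    --since "2 hours ago" --no-pager
```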
 

Thanks for the feedback and information! I had no idea that pvereport existed. I suspect I'm doing a lot of things "the hard way" and need to take a moment and sift through admin docs and learn all the things that PVE facilitates.
 
You are correct that the services aren't running properly. I figured this out earlier, but I couldn't work out how to fix it, so dismantling the cluster and remaking it seemed like the only option.
I can't log in to them: logins either fail outright, or, if I try updatecerts, they take forever and then time out in the GUI. I can't solve the storage issues without logging in; they're easy to fix in the GUI. Yes, they clutter the logs, but I did browse through them.
No firewall on the hosts.

Some basic network info: 2 of the nodes are connected to the switch via bonding, which works fine. The default VLAN/network is for management, and there are 2 other VLANs and IP networks for VMs (on a separate bridge, not the management one). The network is fine except for the router periodically crashing; the router also hosts an AdGuard container and the SMB share with ISOs, templates, and backups. I am raising the instability issue with MikroTik, but that only affects DHCP and apt. That's the summary of my homelab/network.
node 1 - .16
node 2 - .15 (VM disk full)
node 3 - .20 (VM error due to missing local disk if VMs auto-launch)
node 4 - .17 (new, empty)
node 5 - .14 (standalone, not clustered)
Node 5 and future nodes won't be clustered, since they aren't low-powered like the first 4; those I keep on 24/7, given the cluster isn't disaster-resistant. In 2 months I'm adding another low-powered PC to the cluster. Except for node 2, which has the best of Intel's Atom CPUs, all the nodes are much faster, with some of the higher-end mobile CPUs and plenty of RAM.
All VMs are stored locally, with node 3's disk available but not auto-mounted by default because of the hot-swap chassis. I get ghost "VM started" states when they should all be stopped, even though I made sure to disable auto-start for all the VMs.

Node 1:
Code:
root@proxe1:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-1
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
root@proxe1:~# pvecm status
Cluster information
-------------------
Name:             me-prox-cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 16 20:21:39 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.31cf6
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.87.16 (local)
0x00000002          1 192.168.87.15
0x00000003          1 192.168.87.20
root@proxe1:~# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 (192.168.87.16->192.168.87.15) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 (192.168.87.16->192.168.87.20) enabled connected mtu: 1397

Node 2 (slow output):
Code:
root@proxe2:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+deb10u1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-1
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
root@proxe2:~# pvecm status
Cluster information
-------------------
Name:             me-prox-cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 16 20:39:11 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.32042
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.87.16
0x00000002          1 192.168.87.15 (local)
0x00000003          1 192.168.87.20
root@proxe2:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 (192.168.87.15->192.168.87.16) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 (192.168.87.15->192.168.87.20) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 (192.168.87.15->192.168.87.17) enabled connected mtu: 1397

Node 3:
Code:
root@pve3:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
root@pve3:~# pvecm status
Cluster information
-------------------
Name:             me-prox-cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 16 20:41:30 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.320ca
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.87.16
0x00000002          1 192.168.87.15
0x00000003          1 192.168.87.20 (local)
root@pve3:~# corosync-cfgtool -n
Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.87.20->192.168.87.16) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (192.168.87.20->192.168.87.15) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (192.168.87.20->192.168.87.17) enabled connected mtu: 1397

Node 4:
Code:
root@pve4:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
root@pve4:~# pvecm status
Cluster information
-------------------
Name:             me-prox-cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 16 20:43:30 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.3214f
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.87.17 (local)
root@pve4:~# corosync-cfgtool -n
Local node ID 4, transport knet
nodeid: 2 reachable
   LINK: 0 udp (192.168.87.17->192.168.87.15) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (192.168.87.17->192.168.87.20) enabled connected mtu: 1397

All Proxmox nodes are connected to the same fully managed switch, so I can use different network configurations, but I think networking is fine: they all have static IPs configured and are all reachable. They all communicate fine with my PC over SSH, but the web GUI barely works.
 
Nodes 1 and 2 run PVE 6.3, while nodes 3 and 4 run PVE 7.4. That is unsupported, in the sense that it is untested and there may be interoperability issues.

Until PVE 7.2, IIRC, you could not log into the web UI if the node had no quorum.

The pvecm status and corosync-cfgtool -n output looks weird: all nodes know the cluster has 4 members, but pvecm status on some nodes does not show some node(s) even though corosync-cfgtool -n shows them as connected. Make absolutely sure that you can reach every node from every node on their 192.168.87.x addresses. Also use tcpdump to check that corosync packets reach every server (udp/5405-5412).
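A sketch of both checks, using the node addresses already posted in this thread (adjust to your own); the UDP port range is the knet default:

```shell
# 1) Full-mesh reachability: run this on every node in turn
for ip in 192.168.87.15 192.168.87.16 192.168.87.17 192.168.87.20; do
    ping -c 2 -W 1 "$ip" >/dev/null 2>&1 \
        && echo "$ip reachable" || echo "$ip UNREACHABLE"
done

# 2) Watch for corosync/knet traffic actually arriving on this node
#    (stops after 20 packets)
tcpdump -ni any udp portrange 5405-5412 -c 20
```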

Syslog should be showing corosync-related events.

Also check that /etc/corosync/corosync.conf and /etc/pve/corosync.conf are exactly the same on all hosts. Post them if possible.
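One quick way to compare them, assuming root SSH between the nodes works (it does per this thread) and using the node IPs already posted:

```shell
# Every node should print the same checksum for the same file name;
# a mismatch means the cluster config has diverged on that node
for ip in 192.168.87.16 192.168.87.15 192.168.87.20 192.168.87.17; do
    echo "--- $ip"
    ssh "root@$ip" md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf
done
```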
 
The stability issues started after adding node 4; I'd been running that configuration for more than a year. I wanted to upgrade from 6 but couldn't find a guide for it. I find version 7 good, and it lets me switch to different apt repositories instead of the commercial subscription one.

Sorry for the delay. I checked both files on all 4 nodes and they are exactly the same, but I also found that nodes 1 and 4 can't ping each other, which is weird. Both my PC and switch can ping all nodes fine, but the router can't ping node 1. The switch pings from its CPU, but the configuration is all done by the switch chip, so the switch appears like any other device on the network.
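Since only nodes 1 and 4 fail to ping each other while everything else works, the usual suspect is layer 2: ARP, a stale neighbor entry, or a VLAN/bonding filter on one switch port. A hedged check to run on node 1 against node 4 (and mirrored on node 4), with the bridge name taken from the logs earlier in the thread:

```shell
# Does node 1 resolve node 4's MAC at all? A FAILED or INCOMPLETE entry
# here means ARP never completes, i.e. the problem is below IP.
ip neigh show 192.168.87.17

# Look for VLAN filtering or an odd MTU on the management bridge
ip -d link show vmbr0
```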
 