[SOLVED] Proxmox 8.1.4 - Adding 3rd node to cluster causes havoc.

hymsan

Hi,

I've had an existing cluster running with 2 nodes for quite some time without much issue. Today I tried to add a freshly installed 3rd node, and every time I add it to the cluster, all hell breaks loose.

The addition log says:
Task viewer: Join Cluster


Establishing API connection with host '192.XX.XX.XX'
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '108.XX.XX.XX.XX'
Request addition of this node

It shows up in the GUI of the cluster, but for some reason the pve-ssl.key and .pem for DAL2 (the new node) are not synced to the rest of the cluster.

2024-02-05T07:29:53.856860+00:00 QC3 pvedaemon[1488]: <root@pam> adding node DAL2 to cluster
2024-02-05T07:40:40.830839+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:41.009924+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.697662+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.704920+00:00 QC3 pveproxy[49180]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012

If I manually move them over to the master/QC3 node, it will sync up, and I can basically use the new node from the GUI, but it demands a password login each time.
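
For completeness, the cert commands I'm talking about are roughly these (just a sketch; pvecm updatecerts needs a writable, quorate /etc/pve to do anything useful):

Code:
# run on the node whose certificates are missing or out of date
pvecm updatecerts          # regenerate/redistribute pve-ssl.key/.pem and known_hosts entries
pvecm updatecerts --force  # force regeneration of the node certificates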

But running pvecm updatecerts breaks it, and without manually copying over the .key and .pem for the new node, the master/QC3 node starts spamming the syslog with:

Feb 05 08:35:08 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:10 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]

And the entire cluster begins to have degraded performance, to the point where basically nothing can be accessed (I can't even SSH in or restart any Proxmox services).

But the moment I disable DAL2 (the new node), cluster performance returns to normal and everything operates properly.

I have reinstalled DAL2 (the new node) fresh from the ISO five times; the only things I've run after installing Proxmox are apt update and apt upgrade, to make sure it was on the same version as my other nodes.
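
Something like this is enough to spot version drift between nodes (just a sketch, using the hostnames from this thread):

Code:
# on the new node, after apt update && apt upgrade
pveversion -v > /tmp/dal2.versions
# pull the same listing from an existing node and compare
ssh root@QC3 pveversion -v > /tmp/qc3.versions
diff /tmp/qc3.versions /tmp/dal2.versions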

At one point, I let the degraded performance continue for a bit to see if it would ever recover. Eventually it just seemed to crash and I had to delnode the new node again. (log of that here)

Code:
Feb 05 07:32:59 QC3 kernel: INFO: task pveproxy worker:1572 blocked for more than 120 seconds.
Feb 05 07:32:59 QC3 kernel:       Tainted: P           O       6.5.11-8-pve #1
Feb 05 07:32:59 QC3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 07:32:59 QC3 kernel: task:pveproxy worker state:D stack:0     pid:1572  ppid:1569   flags:0x00004002
Feb 05 07:32:59 QC3 kernel: Call Trace:
Feb 05 07:32:59 QC3 kernel:  <TASK>
Feb 05 07:32:59 QC3 kernel:  __schedule+0x3fd/0x1450
Feb 05 07:32:59 QC3 kernel:  schedule+0x63/0x110
Feb 05 07:32:59 QC3 kernel:  schedule_preempt_disabled+0x15/0x30
Feb 05 07:32:59 QC3 kernel:  rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 07:32:59 QC3 kernel:  down_read+0x48/0xc0
Feb 05 07:32:59 QC3 kernel:  walk_component+0x108/0x190
Feb 05 07:32:59 QC3 kernel:  path_lookupat+0x67/0x1a0
Feb 05 07:32:59 QC3 kernel:  filename_lookup+0xe4/0x200
Feb 05 07:32:59 QC3 kernel:  vfs_statx+0xa1/0x180
Feb 05 07:32:59 QC3 kernel:  vfs_fstatat+0x58/0x80
Feb 05 07:32:59 QC3 kernel:  __do_sys_newfstatat+0x44/0x90
Feb 05 07:32:59 QC3 kernel:  __x64_sys_newfstatat+0x1c/0x30
Feb 05 07:32:59 QC3 kernel:  do_syscall_64+0x5b/0x90
Feb 05 07:32:59 QC3 kernel:  ? exit_to_user_mode_prepare+0xa5/0x190
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 05 07:32:59 QC3 kernel: RIP: 0033:0x7f2b9805475a
Feb 05 07:32:59 QC3 kernel: RSP: 002b:00007ffd8150e098 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
Feb 05 07:32:59 QC3 kernel: RAX: ffffffffffffffda RBX: 000055a8a46e92a0 RCX: 00007f2b9805475a
Feb 05 07:32:59 QC3 kernel: RDX: 000055a8a46e94a8 RSI: 000055a8ac65fe60 RDI: 00000000ffffff9c
Feb 05 07:32:59 QC3 kernel: RBP: 000055a8ac750180 R08: 0000000000000000 R09: 000055a8ac67dcf0
Feb 05 07:32:59 QC3 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055a8ac65fe60
Feb 05 07:32:59 QC3 kernel: R13: 000055a8a2ba223b R14: 0000000000000000 R15: 00007f2b98259020
Feb 05 07:32:59 QC3 kernel:  </TASK>

There are no hardware issues: when the new node is separated from the cluster, it performs perfectly, as does the cluster. When the new node joins the cluster, everything breaks down.

QC3 ("Master")

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
LA1 (Existing node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
DAL2 (New node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
Use a different name, one not yet used in this cluster, when you add a node. There are some issues that need to be sorted out manually if you reuse a hostname [1].

[1] https://forum.proxmox.com/threads/s...s-how-to-bypass-ssh-known_hosts-bug-s.137809/

I'm aware of that bug with naming; that's why my node is currently named DAL2 instead of DAL1. I just tried again with DAL3. Same issue: the cluster join seems to fail because the keys aren't being shared.

QC3 ("master") shows:

Feb 05 15:03:50 QC3 pveproxy[589910]: '/etc/pve/nodes/DAL3/pve-ssl.pem' does not exist!
Feb 05 15:03:50 QC3 pveproxy[589912]: '/etc/pve/nodes/DAL3/pve-ssl.pem' does not exist!

I keep googling and finding more and more forum threads for this, but either nobody responded to the thread, or the responses just aren't useful.

I don't understand why these files wouldn't be automatically transferred during the cluster join.
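
For anyone else hitting this: the per-node directories under /etc/pve are replicated by pmxcfs, so a quick way to see whether the join synced anything at all is something like the following (a sketch, using the node names from this thread):

Code:
# on one of the existing cluster nodes
pvecm status                        # quorum/membership right after the join attempt
ls -l /etc/pve/nodes/               # one directory per node, replicated by pmxcfs
ls -l /etc/pve/nodes/DAL3/          # pve-ssl.key/.pem should show up here once synced
systemctl status pve-cluster corosync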



And then QC3 starts spamming:

Feb 05 15:26:58 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:01 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:02 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:02 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:03 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41

I must have run into some sort of recurring bug on this version or something. The number of times I've reinstalled this new node from a fresh copy, only for it to keep failing to join regardless of hostname, DNS, or IP address, is baffling.
 
Hi,

I've had an existing cluster running with 2 nodes for quite some time without much issue. Today I tried to add a freshly installed 3rd node, and every time I add it to the cluster, all hell breaks loose.

Can you show pvecm status from each node AFTER you attempt to add the new node?

The addition log says:


It shows up in the GUI of the cluster, but for some reason the pve-ssl.key and .pem for DAL2 (the new node) are not synced to the rest of the cluster.

Are just the files missing, or is the whole directory missing?

If I manually move them over to the master/QC3 node, it will sync up, and I can basically use the new node from the GUI, but it demands a password login each time.

Do NOT move these certs manually.

And the entire cluster begins to have degraded performance, to the point where basically nothing can be accessed (I can't even SSH in or restart any Proxmox services).

You would need to be more specific, e.g. from where you try to connect and to which nodes. Ideally, for the troubleshooting, do not use the GUI, but SSH in from a separate (non-cluster) machine; that could be your workstation.

But the moment I disable DAL2 (the new node), cluster performance returns to normal and everything operates properly.

How do you "disable" it?

I have reinstalled DAL2 (the new node) fresh from the ISO five times; the only things I've run after installing Proxmox are apt update and apt upgrade, to make sure it was on the same version as my other nodes.

At one point, I let the degraded performance continue for a bit to see if it would ever recover. Eventually it just seemed to crash and I had to delnode the new node again. (log of that here)

If it's easily reproducible, journalctl -u pveproxy could be of help on the said node, perhaps.
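
For example, something along these lines (a sketch; adjust the time window to the join attempt):

Code:
journalctl -b -u pveproxy --no-pager
# corosync and pmxcfs logs from the same window are usually more telling:
journalctl -u corosync -u pve-cluster --since "1 hour ago" --no-pager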

There are no hardware issues: when the new node is separated from the cluster, it performs perfectly, as does the cluster. When the new node joins the cluster, everything breaks down.

QC3 ("Master")

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
LA1 (Existing node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
DAL2 (New node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

All these nodes are on one local network?
 
Can you show pvecm status from each node AFTER you attempt to add the new node?

When the new node (DAL) is added, the entire cluster goes unresponsive. You cannot run any pve commands until you disable the network on the new node.



Are just the files missing, or is the whole directory missing?

The files exist on the new node, but they do not seem to get synced to the other nodes in the cluster during the join.


Do NOT move these certs manually.

I only moved those files as a test to see if it would bring the node up without issue. Sometimes it works, sometimes it doesn't.


I would want to see the retransmit list, actually. If you want to rename or readdress your nodes before posting publicly, feel free to use e.g. 1.1.1.1, 2.2.2.2, etc.

It doesn't vary; the spam is just the same line repeating with the same numeric values after it.


You would need to be more specific, e.g. from where you try to connect and to which nodes. Ideally, for the troubleshooting, do not use the GUI, but SSH in from a separate (non-cluster) machine; that could be your workstation.

I've tried the GUI join, I've tried the CLI pvecm add, and a force add; it all results in the same behavior.


How do you "disable" it?

Shutting down the new node or simply disabling its network.


If it's easily reproducible, journalctl -u pveproxy could be of help on the said node, perhaps.

Here's the journal for the most recent attempt:

Feb 05 08:57:58 DAL5 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Feb 05 08:57:58 DAL5 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Feb 05 08:57:58 DAL5 pmxcfs[2103]: [main] notice: resolved node name 'DAL5' to '108.xxx.xxx.xxx' for default node IP address
Feb 05 08:57:58 DAL5 pmxcfs[2103]: [main] notice: resolved node name 'DAL5' to '108.xxx.xxx.xxx' for default node IP address
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Corosync Cluster Engine starting up
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [quorum] crit: quorum_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [quorum] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [confdb] crit: cmap_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [confdb] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [dcdb] crit: cpg_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [dcdb] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [status] crit: cpg_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [status] crit: can't initialize service
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Initializing transport (Kronosnet).
Feb 05 08:57:58 DAL5 kernel: sctp: Hash tables configured (bind 256/256)
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] totemknet initialized
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] pmtud: MTU manually set to: 0
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cmap
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cfg
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cpg
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] Watchdog not enabled by configuration
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] resource load_15min missing a recovery key.
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] resource memory_used missing a recovery key.
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] no resources configured.
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync watchdog service [7]
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Using quorum provider corosync_votequorum
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: votequorum
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: quorum
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Configuring link 0
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Configured link number 0: local addr: 108.xxx.xxx.xxx, port=5405
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Sync members[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Sync joined[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] A new membership (3.5) was formed. Members joined: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Members[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 05 08:57:58 DAL5 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Feb 05 08:57:59 DAL5 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 05 08:58:01 DAL5 pvescheduler[2142]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 05 08:58:01 DAL5 pvescheduler[2141]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Sync members[3]: 1 2 3
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Sync joined[2]: 1 2
Feb 05 08:58:01 DAL5 corosync[2102]: [TOTEM ] A new membership (1.39e) was formed. Members joined: 1 2
Feb 05 08:58:01 DAL5 cron[1230]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] This node is within the primary component and will provide service.
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Members[3]: 1 2 3
Feb 05 08:58:01 DAL5 corosync[2102]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 05 08:58:04 DAL5 pmxcfs[2104]: [status] notice: update cluster info (cluster name Quebec, version = 21)
Feb 05 08:58:04 DAL5 pmxcfs[2104]: [status] notice: node has quorum
Feb 05 08:58:04 DAL5 pve-ha-lrm[1291]: unable to write lrm status file - unable to open file '/etc/pve/nodes/DAL5/lrm_status.tmp.1291' - No such file or directory
Feb 05 08:58:57 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35
Feb 05 08:58:58 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 33 34
Feb 05 08:59:06 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 36 37 38 39
Feb 05 08:59:07 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 31 32 34 35
Feb 05 08:59:08 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 36 37 38
Feb 05 08:59:08 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 37 38 3a
Feb 05 08:59:16 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b 3c
Feb 05 08:59:17 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 2c 2e 2f 31 32 34
Feb 05 08:59:19 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b 3c
Feb 05 08:59:19 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 32 34
Feb 05 08:59:20 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3b 3c 3d
Feb 05 08:59:20 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a
Feb 05 08:59:22 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d
Feb 05 08:59:27 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d 3e 3f
Feb 05 08:59:36 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d 3e 3f 40
Feb 05 08:59:37 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 31 32 34 35 36 37
Feb 05 08:59:38 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 34 35 36 37
Feb 05 08:59:39 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:40 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:41 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:42 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:43 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:44 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:45 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:46 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:47 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 37 3a 3b
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3f 40 41
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3b 3d 3e
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 40 41
Feb 05 09:01:53 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:54 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:55 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:55 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:56 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:56 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:57 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:57 DAL5 kernel: INFO: task cron:1230 blocked for more than 120 seconds.
Feb 05 09:01:57 DAL5 kernel: Tainted: P O 6.5.11-8-pve #1
Feb 05 09:01:57 DAL5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 09:01:57 DAL5 kernel: task:cron state:D stack:0 pid:1230 ppid:1 flags:0x00000002
Feb 05 09:01:57 DAL5 kernel: Call Trace:
Feb 05 09:01:57 DAL5 kernel: <TASK>
Feb 05 09:01:57 DAL5 kernel: __schedule+0x3fd/0x1450
Feb 05 09:01:57 DAL5 kernel: ? mutex_lock+0x12/0x50
Feb 05 09:01:57 DAL5 kernel: ? rrw_exit+0x72/0x170 [zfs]
Feb 05 09:01:57 DAL5 kernel: schedule+0x63/0x110
Feb 05 09:01:57 DAL5 kernel: schedule_preempt_disabled+0x15/0x30
Feb 05 09:01:57 DAL5 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 09:01:57 DAL5 kernel: down_read+0x48/0xc0
Feb 05 09:01:57 DAL5 kernel: walk_component+0x108/0x190
Feb 05 09:01:57 DAL5 kernel: path_lookupat+0x67/0x1a0
Feb 05 09:01:57 DAL5 kernel: ? rrm_exit+0x4c/0xa0 [zfs]
Feb 05 09:01:57 DAL5 kernel: filename_lookup+0xe4/0x200
Feb 05 09:01:57 DAL5 kernel: ? __pfx_zpl_put_link+0x10/0x10 [zfs]
Feb 05 09:01:57 DAL5 kernel: ? strncpy_from_user+0x50/0x170
Feb 05 09:01:57 DAL5 kernel: vfs_statx+0xa1/0x180
Feb 05 09:01:57 DAL5 kernel: vfs_fstatat+0x58/0x80
Feb 05 09:01:57 DAL5 kernel: __do_sys_newfstatat+0x44/0x90
Feb 05 09:01:57 DAL5 kernel: __x64_sys_newfstatat+0x1c/0x30
Feb 05 09:01:57 DAL5 kernel: do_syscall_64+0x5b/0x90
Feb 05 09:01:57 DAL5 kernel: ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 09:01:57 DAL5 kernel: ? do_syscall_64+0x67/0x90
Feb 05 09:01:57 DAL5 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 05 09:01:57 DAL5 kernel: RIP: 0033:0x7f18e87ae75a
Feb 05 09:01:57 DAL5 kernel: RSP: 002b:00007ffddebd7a48 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
Feb 05 09:01:57 DAL5 kernel: RAX: ffffffffffffffda RBX: 0000564bfbde1186 RCX: 00007f18e87ae75a
Feb 05 09:01:57 DAL5 kernel: RDX: 00007ffddebd7c50 RSI: 00007ffddebd7de0 RDI: 00000000ffffff9c
Feb 05 09:01:57 DAL5 kernel: RBP: 00007ffddebd7de0 R08: 0000000000000000 R09: 0000000000000073
Feb 05 09:01:57 DAL5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000564bfbde1251
Feb 05 09:01:57 DAL5 kernel: R13: 0000564bfcbeaa30 R14: 0000564bfcbed420 R15: 00007ffddebd9e40
Feb 05 09:01:57 DAL5 kernel: </TASK>
Feb 05 09:01:57 DAL5 kernel: INFO: task pvestatd:1245 blocked for more than 120 seconds.
Feb 05 09:01:57 DAL5 kernel: Tainted: P O 6.5.11-8-pve #1
Feb 05 09:01:57 DAL5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 09:01:57 DAL5 kernel: task:pvestatd state:D stack:0 pid:1245 ppid:1 flags:0x00000002
Feb 05 09:01:57 DAL5 kernel: Call Trace:
Feb 05 09:01:57 DAL5 kernel: <TASK>
Feb 05 09:01:57 DAL5 kernel: __schedule+0x3fd/0x1450
Feb 05 09:01:57 DAL5 kernel: schedule+0x63/0x110
Feb 05 09:01:57 DAL5 kernel: schedule_preempt_disabled+0x15/0x30
Feb 05 09:01:57 DAL5 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 09:01:57 DAL5 kernel: down_read+0x48/0xc0
Feb 05 09:01:57 DAL5 kernel: walk_component+0x108/0x190

All these nodes are on one local network?

No local network. Joining is being done over the internet. (The existing nodes have a firewall ipset to whitelist all traffic from each node's IP. The new node has no firewall rules at all.) SSH connections work fine between nodes, etc.
 
No local network. Joining is being done over the internet. (The existing nodes have a firewall ipset to whitelist all traffic from each node's IP. The new node has no firewall rules at all.) SSH connections work fine between nodes, etc.

I suspected this from the node names already. :) I am not sure whether this already applies to your two existing nodes, but with the third node (in yet another location?) you got enough to make it all manifest.

Corosync is not really built for anything other than low-latency local networking. You could have some additional issue going on (e.g. a port blocked despite your best knowledge) somewhere in between. It's definitely not meant to go over the public internet, and even if you had e.g. a VPN there, the latency/jitter and each node being in a different location make it exactly the wrong kind of setup for a corosync-based cluster.

If you could spin up a new node on the same network to test this first, it could be of some help. If that works with no issues, you know it's solely the network.
 
I suspected this from the node names already. :) I am not sure whether this already applies to your two existing nodes, but with the third node (in yet another location?) you got enough to make it all manifest.

Corosync is not really built for anything other than low-latency local networking. You could have some additional issue going on (e.g. a port blocked despite your best knowledge) somewhere in between. It's definitely not meant to go over the public internet, and even if you had e.g. a VPN there, the latency/jitter and each node being in a different location make it exactly the wrong kind of setup for a corosync-based cluster.

If you could spin up a new node on the same network to test this first, it could be of some help. If that works with no issues, you know it's solely the network.

I should have provided a bit more detail. I own the IP space, the networking equipment, and the AS number of the network at each location. There's zero chance this is related to a port blocking/network issue.

Something is busted; I don't yet know what it is. But I'm certain it's not the network causing the issues.
 
I should have provided a bit more detail. I own the IP space, the networking equipment, and the AS number of the network at each location. There's zero chance this is related to a port blocking/network issue.

Something is busted; I don't yet know what it is. But I'm certain it's not the network causing the issues.

Understood, but even if it's one AS, as you say these are different locations. Ideally this would all be in one datacentre. These are separate sites, so I assume there's routing going on at the least and it's not exactly a homogeneous network. Corosync needs a very low-latency network. This is all completely unrelated to firewall issues.

What's your latency node to node currently, for example?
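
Plain ping already gives you both the average and the jitter (the mdev figure), which corosync cares about just as much as the average; a quick sketch:

Code:
# run from each node towards each of the others; watch mdev (jitter), not just avg
ping -c 100 -i 0.2 <other-node-ip>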
 
Understood, but even if it's one AS, as you say these are different locations. Ideally this would all be in one datacentre. These are separate sites, so I assume there's routing going on at the least and it's not exactly a homogeneous network. Corosync needs a very low-latency network. This is all completely unrelated to firewall issues.

What's your latency node to node currently, for example?

Just ran an MTR:

QC3 -> LA1: 63ms
QC3 -> DAL: 45ms
LA1 -> DAL: 29ms

QC3 and LA1 have been running together for around 2 years now, since Proxmox 7.2, and DAL has a lower-latency connection to both endpoints.
 
Have a look at the PVE recommendation [1] (they probably ran some tests and do not wish to have to support anything above that out of empirical experience):

Network Requirements

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

I happen to be the last one who likes to take arbitrary PVE "specs" at face value, but consider that corosync itself expects at most 50 ms by default [2].

The reason it worked so far is basically that there were just 2 nodes and the network was symmetrical by definition. Now you have introduced a third, which is disrupting it. Out of curiosity, you could start up two extra nodes at one of the two existing sites; I suspect they would join just fine.

NB: If this is production, do NOT do the tests with the actual nodes; make a test setup.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
[2] https://manpages.debian.org/testing/corosync/corosync.conf.5.en.html#max_network_delay
 
When the new node (DAL) is added, the entire cluster goes unresponsive. You cannot run any pve commands until you disable the network on the new node.

I will just add: if you want to test what's happening, get yourself SSH-connected (do not use the GUI tty) to all three nodes; then you can have a go at each node with:

Code:
corosync-cfgtool -s
corosync-cmapctl
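
If your corosync version has them, these two are also worth a look (a sketch): corosync-cfgtool -n shows the knet link state per configured node, and corosync-quorumtool -s the vote/quorum situation.

Code:
corosync-cfgtool -n
corosync-quorumtool -s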
 
Have a look at the PVE recommendation [1] (they probably ran some tests and do not wish to have to support anything above that out of empirical experience):



I happen to be the last one who likes to take arbitrary PVE "specs" at face value, but consider that corosync itself expects at most 50 ms by default [2].

The reason it worked so far is basically that there were just 2 nodes and the network was symmetrical by definition. Now you have introduced a third, which is disrupting it. Out of curiosity, you could start up two extra nodes at one of the two existing sites; I suspect they would join just fine.

NB: If this is production, do NOT do the tests with the actual nodes; make a test setup.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
[2] https://manpages.debian.org/testing/corosync/corosync.conf.5.en.html#max_network_delay

Thank you, these links led me down the corosync rabbit hole. I was able to bring up the third node with minor adjustments to corosync.conf.

totem {
  cluster_name: Quebec
  config_version: 49
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmit: 300
  token: 40000
}
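
(For anyone following along: as I understand the documented procedure, changes like this go into /etc/pve/corosync.conf via an edited copy with a bumped config_version, roughly:)

Code:
# sketch of the documented edit procedure (see the Cluster Manager wiki page)
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new                        # adjust totem settings, increment config_version
cp /root/corosync.conf.new /etc/pve/corosync.conf   # pmxcfs distributes it and corosync reloads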
 
Thank you, these links led me down the corosync rabbit hole. I was able to bring up the third node with minor adjustments to corosync.conf.

totem {
  cluster_name: Quebec
  config_version: 49
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmit: 300
  token: 40000
}
Token 40000 ... you are NOT using any of the HA features, correct?
 
Nope, no HA, just have them clustered so I can migrate between hosts easily.

Did it really require 40 seconds, or did you just choose that arbitrarily? I mean, it's math at the end of the day, and even the default was increased not that long ago [1], but I am rather sure someone from the PVE team would have something to say about it. :) The issue (from my point of view) is that lots of things in the PVE code depend on some of the expected values, so ... I would keep an eye on what's happening. Or not run a cluster.
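
For reference, the corosync.conf man page says the effective token timeout with a nodelist is token + (number_of_nodes - 2) * token_coefficient (coefficient defaults to 650 ms), so for a 3-node cluster (assuming I have the formula right):

Code:
# e.g. with token: 3000 :   3000 + (3 - 2) * 650 =  3650 ms
# with your token: 40000 : 40000 + (3 - 2) * 650 = 40650 ms effective token timeout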

You might want to have a look at qm remote-migrate [2] instead; there's pct remote-migrate as well. I think it's still considered a "technology preview", though.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1870449
[2] https://pve.proxmox.com/pve-docs/qm.1.html
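
From memory, the qm remote-migrate invocation shape is roughly the following; double-check the exact option names against qm(1), and treat the token, host, and fingerprint values as placeholders:

Code:
qm remote-migrate <vmid> <target-vmid> \
  'apitoken=PVEAPIToken=root@pam!migrate=<secret>,host=<target-host>,fingerprint=<target-cert-fingerprint>' \
  --target-bridge vmbr0 --target-storage local-zfs --online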
 
It didn't require 40 seconds, but I'll be adding another node to this cluster in the near future. I'm sure ~10000 will probably work. I'll circle back to it at some point.
 
It didn't require 40 seconds, but I'll be adding another node to this cluster in the near future. I'm sure ~10000 will probably work. I'll circle back to it at some point.
Ideally, you should not need to modify both token and the token_retransmit. Just saying. ;)
 
