[SOLVED] Proxmox 8.1.4 - Adding 3rd node to cluster causes havoc.

hymsan

Hi,

I've had an existing cluster running with 2 nodes for quite some time without much issue. Today I tried to add a freshly installed 3rd node, and every time I add it to the cluster, all hell breaks loose.

The addition log says:
Task viewer: Join Cluster


Establishing API connection with host '192.XX.XX.XX'
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '108.XX.XX.XX.XX'
Request addition of this node

It shows up in the GUI of the cluster, but for some reason the pve-ssl.key and .pem for DAL2 (the new node) are not synced to the rest of the cluster.

2024-02-05T07:29:53.856860+00:00 QC3 pvedaemon[1488]: <root@pam> adding node DAL2 to cluster
2024-02-05T07:40:40.830839+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:41.009924+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.697662+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.704920+00:00 QC3 pveproxy[49180]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012

If I manually move them over to the master/QC3 node, it will sync up, and I can basically use the new node from the GUI, but it demands a password login each time.
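
For completeness, the cert commands I'm talking about are roughly these (just a sketch; pvecm updatecerts needs a writable, quorate /etc/pve to do anything useful):

Code:
# run on the node whose certificates are missing or out of date
pvecm updatecerts          # regenerate/redistribute pve-ssl.key/.pem and known_hosts entries
pvecm updatecerts --force  # force regeneration of the node certificates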

But running pvecm updatecerts breaks it, and without manually copying over the .key and .pem for the new node, the master/QC3 node starts spamming the syslog with:

Feb 05 08:35:08 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:10 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]

And the entire cluster begins to have degraded performance, to the point where basically nothing can be accessed (I can't even SSH in or restart any Proxmox services).

But the moment I disable DAL2 (the new node), cluster performance returns to normal and everything operates properly.

I have reinstalled DAL2 (the new node) fresh from the ISO five times; the only things I've run after installing Proxmox are apt update and apt upgrade, to make sure it was on the same version as my other nodes.
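
Something like this is enough to spot version drift between nodes (just a sketch, using the hostnames from this thread):

Code:
# on the new node, after apt update && apt upgrade
pveversion -v > /tmp/dal2.versions
# pull the same listing from an existing node and compare
ssh root@QC3 pveversion -v > /tmp/qc3.versions
diff /tmp/qc3.versions /tmp/dal2.versions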

At one point, I let the degraded performance continue for a bit to see if it would ever recover. Eventually it just seemed to crash and I had to delnode the new node again. (log of that here)

Code:
Feb 05 07:32:59 QC3 kernel: INFO: task pveproxy worker:1572 blocked for more than 120 seconds.
Feb 05 07:32:59 QC3 kernel:       Tainted: P           O       6.5.11-8-pve #1
Feb 05 07:32:59 QC3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 07:32:59 QC3 kernel: task:pveproxy worker state:D stack:0     pid:1572  ppid:1569   flags:0x00004002
Feb 05 07:32:59 QC3 kernel: Call Trace:
Feb 05 07:32:59 QC3 kernel:  <TASK>
Feb 05 07:32:59 QC3 kernel:  __schedule+0x3fd/0x1450
Feb 05 07:32:59 QC3 kernel:  schedule+0x63/0x110
Feb 05 07:32:59 QC3 kernel:  schedule_preempt_disabled+0x15/0x30
Feb 05 07:32:59 QC3 kernel:  rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 07:32:59 QC3 kernel:  down_read+0x48/0xc0
Feb 05 07:32:59 QC3 kernel:  walk_component+0x108/0x190
Feb 05 07:32:59 QC3 kernel:  path_lookupat+0x67/0x1a0
Feb 05 07:32:59 QC3 kernel:  filename_lookup+0xe4/0x200
Feb 05 07:32:59 QC3 kernel:  vfs_statx+0xa1/0x180
Feb 05 07:32:59 QC3 kernel:  vfs_fstatat+0x58/0x80
Feb 05 07:32:59 QC3 kernel:  __do_sys_newfstatat+0x44/0x90
Feb 05 07:32:59 QC3 kernel:  __x64_sys_newfstatat+0x1c/0x30
Feb 05 07:32:59 QC3 kernel:  do_syscall_64+0x5b/0x90
Feb 05 07:32:59 QC3 kernel:  ? exit_to_user_mode_prepare+0xa5/0x190
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 05 07:32:59 QC3 kernel: RIP: 0033:0x7f2b9805475a
Feb 05 07:32:59 QC3 kernel: RSP: 002b:00007ffd8150e098 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
Feb 05 07:32:59 QC3 kernel: RAX: ffffffffffffffda RBX: 000055a8a46e92a0 RCX: 00007f2b9805475a
Feb 05 07:32:59 QC3 kernel: RDX: 000055a8a46e94a8 RSI: 000055a8ac65fe60 RDI: 00000000ffffff9c
Feb 05 07:32:59 QC3 kernel: RBP: 000055a8ac750180 R08: 0000000000000000 R09: 000055a8ac67dcf0
Feb 05 07:32:59 QC3 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055a8ac65fe60
Feb 05 07:32:59 QC3 kernel: R13: 000055a8a2ba223b R14: 0000000000000000 R15: 00007f2b98259020
Feb 05 07:32:59 QC3 kernel:  </TASK>

There are no hardware issues: when the new node is separated from the cluster, it performs perfectly, as does the cluster. When the new node joins the cluster, everything breaks down.

QC3 ("Master")

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
LA1 (Existing node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
DAL2 (New node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
Use a different name, one not yet used in this cluster, when you add a node. There are some issues that need to be sorted out manually if you reuse a hostname [1].

[1] https://forum.proxmox.com/threads/s...s-how-to-bypass-ssh-known_hosts-bug-s.137809/

I'm aware of that bug with naming; that's why my node is currently named DAL2 instead of DAL1. I just tried again with DAL3. Same issue: the cluster join seems to fail because the keys aren't being shared.

QC3 ("master") shows:

Feb 05 15:03:50 QC3 pveproxy[589910]: '/etc/pve/nodes/DAL3/pve-ssl.pem' does not exist!
Feb 05 15:03:50 QC3 pveproxy[589912]: '/etc/pve/nodes/DAL3/pve-ssl.pem' does not exist!

I keep googling and finding more and more forum threads for this, but either nobody responded to the thread, or the responses just aren't useful.

I don't understand why these files wouldn't be automatically transferred during the cluster join.
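
For anyone else hitting this: the per-node directories under /etc/pve are replicated by pmxcfs, so a quick way to see whether the join synced anything at all is something like the following (a sketch, using the node names from this thread):

Code:
# on one of the existing cluster nodes
pvecm status                        # quorum/membership right after the join attempt
ls -l /etc/pve/nodes/               # one directory per node, replicated by pmxcfs
ls -l /etc/pve/nodes/DAL3/          # pve-ssl.key/.pem should show up here once synced
systemctl status pve-cluster corosync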



And then QC3 starts spamming:

Feb 05 15:26:58 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:26:59 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:00 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:01 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:02 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:02 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41
Feb 05 15:27:03 QC3 corosync[613252]: [TOTEM ] Retransmit List: 32 33 34 35 36 37 39 3a 3b 3c 3d 3f 41

I must have run into some sort of recurring bug on this version or something. The number of times I've reinstalled this new node from a fresh copy, only for it to keep failing to join regardless of hostname, DNS, or IP address, is baffling.
 
Hi,

I've had an existing cluster running with 2 nodes for quite some time without much issue. Today I tried to add a freshly installed 3rd node, and every time I add it to the cluster, all hell breaks loose.

Can you show pvecm status from each node AFTER you attempt to add the new node?

The addition log says:


It shows up in the GUI of the cluster, but for some reason the pve-ssl.key and .pem for DAL2 (the new node) are not synced to the rest of the cluster.

Are just the files missing, or is the whole directory missing?

If I manually move them over to the master/QC3 node, it will sync up, and I can basically use the new node from the GUI, but it demands a password login each time.

Do NOT move these certs manually.

And the entire cluster begins to have degraded performance, to the point where basically nothing can be accessed (I can't even SSH in or restart any Proxmox services).

You would need to be more specific, e.g. from where you try to connect and to which nodes. Ideally, for the troubleshooting, do not use the GUI, but SSH in from a separate (non-cluster) machine; that could be your workstation.

But the moment I disable DAL2 (the new node), cluster performance returns to normal and everything operates properly.

How do you "disable" it?

I have reinstalled DAL2 (the new node) fresh from the ISO five times; the only things I've run after installing Proxmox are apt update and apt upgrade, to make sure it was on the same version as my other nodes.

At one point, I let the degraded performance continue for a bit to see if it would ever recover. Eventually it just seemed to crash and I had to delnode the new node again. (log of that here)

If it's easily reproducible, journalctl -u pveproxy could be of help on the said node, perhaps.
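
For example, something along these lines (a sketch; adjust the time window to the join attempt):

Code:
journalctl -b -u pveproxy --no-pager
# corosync and pmxcfs logs from the same window are usually more telling:
journalctl -u corosync -u pve-cluster --since "1 hour ago" --no-pager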

There are no hardware issues: when the new node is separated from the cluster, it performs perfectly, as does the cluster. When the new node joins the cluster, everything breaks down.

QC3 ("Master")

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
LA1 (Existing node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
DAL2 (New node)

proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

All these nodes are on one local network?
 
Can you show pvecm status from each node AFTER you attempt to add the new node?

When the new node (DAL) is added, the entire cluster goes unresponsive. You cannot run any pve commands until you disable the network on the new node.



Are just the files missing, or is the whole directory missing?

The files exist on the new node, but they do not seem to get synced to the other nodes in the cluster during the join.


Do NOT move these certs manually.

I only moved those files as a test to see if it would bring the node up without issue. Sometimes it works, sometimes it doesn't.


I would want to see the retransmit list, actually. If you want to rename or readdress your nodes before posting publicly, feel free to use e.g. 1.1.1.1, 2.2.2.2, etc.

It doesn't vary; the spam is just the same line repeating with the same numeric values after it.


You would need to be more specific, e.g. from where you try to connect and to which nodes. Ideally, for the troubleshooting, do not use the GUI, but SSH in from a separate (non-cluster) machine; that could be your workstation.

I've tried the GUI join, I've tried the CLI pvecm add, and a force add; it all results in the same behavior.


How do you "disable" it?

Shutting down the new node or simply disabling its network.


If it's easily reproducible, journalctl -u pveproxy could be of help on the said node, perhaps.

Here's the journal for the most recent attempt:

Feb 05 08:57:58 DAL5 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Feb 05 08:57:58 DAL5 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Feb 05 08:57:58 DAL5 pmxcfs[2103]: [main] notice: resolved node name 'DAL5' to '108.xxx.xxx.xxx' for default node IP address
Feb 05 08:57:58 DAL5 pmxcfs[2103]: [main] notice: resolved node name 'DAL5' to '108.xxx.xxx.xxx' for default node IP address
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Corosync Cluster Engine starting up
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [quorum] crit: quorum_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [quorum] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [confdb] crit: cmap_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [confdb] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [dcdb] crit: cpg_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [dcdb] crit: can't initialize service
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [status] crit: cpg_initialize failed: 2
Feb 05 08:57:58 DAL5 pmxcfs[2104]: [status] crit: can't initialize service
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Initializing transport (Kronosnet).
Feb 05 08:57:58 DAL5 kernel: sctp: Hash tables configured (bind 256/256)
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] totemknet initialized
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] pmtud: MTU manually set to: 0
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync configuration map access [0]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cmap
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync configuration service [1]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cfg
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: cpg
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync profile loading service [4]
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] Watchdog not enabled by configuration
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] resource load_15min missing a recovery key.
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] resource memory_used missing a recovery key.
Feb 05 08:57:58 DAL5 corosync[2102]: [WD ] no resources configured.
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync watchdog service [7]
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Using quorum provider corosync_votequorum
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: votequorum
Feb 05 08:57:58 DAL5 corosync[2102]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 05 08:57:58 DAL5 corosync[2102]: [QB ] server name: quorum
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Configuring link 0
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] Configured link number 0: local addr: 108.xxx.xxx.xxx, port=5405
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 1 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:57:58 DAL5 corosync[2102]: [KNET ] host: host: 2 has no active links
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Sync members[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Sync joined[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [TOTEM ] A new membership (3.5) was formed. Members joined: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [QUORUM] Members[1]: 3
Feb 05 08:57:58 DAL5 corosync[2102]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 05 08:57:58 DAL5 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Feb 05 08:57:59 DAL5 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 05 08:58:00 DAL5 corosync[2102]: [KNET ] pmtud: Global data MTU changed to: 1397
Feb 05 08:58:01 DAL5 pvescheduler[2142]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 05 08:58:01 DAL5 pvescheduler[2141]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Sync members[3]: 1 2 3
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Sync joined[2]: 1 2
Feb 05 08:58:01 DAL5 corosync[2102]: [TOTEM ] A new membership (1.39e) was formed. Members joined: 1 2
Feb 05 08:58:01 DAL5 cron[1230]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] This node is within the primary component and will provide service.
Feb 05 08:58:01 DAL5 corosync[2102]: [QUORUM] Members[3]: 1 2 3
Feb 05 08:58:01 DAL5 corosync[2102]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 05 08:58:04 DAL5 pmxcfs[2104]: [status] notice: update cluster info (cluster name Quebec, version = 21)
Feb 05 08:58:04 DAL5 pmxcfs[2104]: [status] notice: node has quorum
Feb 05 08:58:04 DAL5 pve-ha-lrm[1291]: unable to write lrm status file - unable to open file '/etc/pve/nodes/DAL5/lrm_status.tmp.1291' - No such file or directory
Feb 05 08:58:57 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35
Feb 05 08:58:58 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 33 34
Feb 05 08:59:06 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 36 37 38 39
Feb 05 08:59:07 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 31 32 34 35
Feb 05 08:59:08 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 36 37 38
Feb 05 08:59:08 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 37 38 3a
Feb 05 08:59:16 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b 3c
Feb 05 08:59:17 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 2c 2e 2f 31 32 34
Feb 05 08:59:19 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b 3c
Feb 05 08:59:19 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 32 34
Feb 05 08:59:20 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3b 3c 3d
Feb 05 08:59:20 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a
Feb 05 08:59:22 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d
Feb 05 08:59:27 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d 3e 3f
Feb 05 08:59:36 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3a 3b 3d 3e 3f 40
Feb 05 08:59:37 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 31 32 34 35 36 37
Feb 05 08:59:38 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 34 35 36 37
Feb 05 08:59:39 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:40 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:41 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:42 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:43 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:44 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:45 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:46 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 35 36 37 3a 3b
Feb 05 08:59:47 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 37 3a 3b
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3f 40 41
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 3b 3d 3e
Feb 05 08:59:48 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 40 41
Feb 05 09:01:53 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:54 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:55 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:55 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:56 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:56 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 48
Feb 05 09:01:57 DAL5 corosync[2102]: [TOTEM ] Retransmit List: 49
Feb 05 09:01:57 DAL5 kernel: INFO: task cron:1230 blocked for more than 120 seconds.
Feb 05 09:01:57 DAL5 kernel: Tainted: P O 6.5.11-8-pve #1
Feb 05 09:01:57 DAL5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 09:01:57 DAL5 kernel: task:cron state:D stack:0 pid:1230 ppid:1 flags:0x00000002
Feb 05 09:01:57 DAL5 kernel: Call Trace:
Feb 05 09:01:57 DAL5 kernel: <TASK>
Feb 05 09:01:57 DAL5 kernel: __schedule+0x3fd/0x1450
Feb 05 09:01:57 DAL5 kernel: ? mutex_lock+0x12/0x50
Feb 05 09:01:57 DAL5 kernel: ? rrw_exit+0x72/0x170 [zfs]
Feb 05 09:01:57 DAL5 kernel: schedule+0x63/0x110
Feb 05 09:01:57 DAL5 kernel: schedule_preempt_disabled+0x15/0x30
Feb 05 09:01:57 DAL5 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 09:01:57 DAL5 kernel: down_read+0x48/0xc0
Feb 05 09:01:57 DAL5 kernel: walk_component+0x108/0x190
Feb 05 09:01:57 DAL5 kernel: path_lookupat+0x67/0x1a0
Feb 05 09:01:57 DAL5 kernel: ? rrm_exit+0x4c/0xa0 [zfs]
Feb 05 09:01:57 DAL5 kernel: filename_lookup+0xe4/0x200
Feb 05 09:01:57 DAL5 kernel: ? __pfx_zpl_put_link+0x10/0x10 [zfs]
Feb 05 09:01:57 DAL5 kernel: ? strncpy_from_user+0x50/0x170
Feb 05 09:01:57 DAL5 kernel: vfs_statx+0xa1/0x180
Feb 05 09:01:57 DAL5 kernel: vfs_fstatat+0x58/0x80
Feb 05 09:01:57 DAL5 kernel: __do_sys_newfstatat+0x44/0x90
Feb 05 09:01:57 DAL5 kernel: __x64_sys_newfstatat+0x1c/0x30
Feb 05 09:01:57 DAL5 kernel: do_syscall_64+0x5b/0x90
Feb 05 09:01:57 DAL5 kernel: ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 09:01:57 DAL5 kernel: ? do_syscall_64+0x67/0x90
Feb 05 09:01:57 DAL5 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 05 09:01:57 DAL5 kernel: RIP: 0033:0x7f18e87ae75a
Feb 05 09:01:57 DAL5 kernel: RSP: 002b:00007ffddebd7a48 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
Feb 05 09:01:57 DAL5 kernel: RAX: ffffffffffffffda RBX: 0000564bfbde1186 RCX: 00007f18e87ae75a
Feb 05 09:01:57 DAL5 kernel: RDX: 00007ffddebd7c50 RSI: 00007ffddebd7de0 RDI: 00000000ffffff9c
Feb 05 09:01:57 DAL5 kernel: RBP: 00007ffddebd7de0 R08: 0000000000000000 R09: 0000000000000073
Feb 05 09:01:57 DAL5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000564bfbde1251
Feb 05 09:01:57 DAL5 kernel: R13: 0000564bfcbeaa30 R14: 0000564bfcbed420 R15: 00007ffddebd9e40
Feb 05 09:01:57 DAL5 kernel: </TASK>
Feb 05 09:01:57 DAL5 kernel: INFO: task pvestatd:1245 blocked for more than 120 seconds.
Feb 05 09:01:57 DAL5 kernel: Tainted: P O 6.5.11-8-pve #1
Feb 05 09:01:57 DAL5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 09:01:57 DAL5 kernel: task:pvestatd state:D stack:0 pid:1245 ppid:1 flags:0x00000002
Feb 05 09:01:57 DAL5 kernel: Call Trace:
Feb 05 09:01:57 DAL5 kernel: <TASK>
Feb 05 09:01:57 DAL5 kernel: __schedule+0x3fd/0x1450
Feb 05 09:01:57 DAL5 kernel: schedule+0x63/0x110
Feb 05 09:01:57 DAL5 kernel: schedule_preempt_disabled+0x15/0x30
Feb 05 09:01:57 DAL5 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 09:01:57 DAL5 kernel: down_read+0x48/0xc0
Feb 05 09:01:57 DAL5 kernel: walk_component+0x108/0x190

All these nodes are on one local network?

No local network. Joining is being done over the internet. (The existing nodes have a firewall ipset to whitelist all traffic from each node's IP. The new node has no firewall rules at all.) SSH connections work fine between nodes, etc.
 
No local network. Joining is being done over the internet. (The existing nodes have a firewall ipset to whitelist all traffic from each node's IP. The new node has no firewall rules at all.) SSH connections work fine between nodes, etc.

I suspected this from the node names already. :) I am not sure whether this already applies to your two existing nodes, but with the third node (in yet another location?) you got enough to make it all manifest.

Corosync is not really built for anything other than low-latency local networking. You could have some additional issue going on (e.g. a port blocked despite your best knowledge) somewhere in between. It's definitely not meant to go over the public internet, and even if you had e.g. a VPN there, the latency/jitter and each node being in a different location make it exactly the wrong kind of setup for a corosync-based cluster.

If you could spin up a new node on the same network to test this first, it could be of some help. If that works with no issues, you know it's solely the network.
 
I suspected this from the node names already. :) I am not sure whether this already applies to your two existing nodes, but with the third node (in yet another location?) you got enough to make it all manifest.

Corosync is not really built for anything other than low-latency local networking. You could have some additional issue going on (e.g. a port blocked despite your best knowledge) somewhere in between. It's definitely not meant to go over the public internet, and even if you had e.g. a VPN there, the latency/jitter and each node being in a different location make it exactly the wrong kind of setup for a corosync-based cluster.

If you could spin up a new node on the same network to test this first, it could be of some help. If that works with no issues, you know it's solely the network.

I should have provided a bit more detail. I own the IP space, the networking equipment, and the AS number of the network at each location. There's zero chance this is related to a port blocking/network issue.

Something is busted; I don't yet know what it is. But I'm certain it's not the network causing the issues.
 
I should have provided a bit more detail. I own the IP space, the networking equipment, and the AS number of the network at each location. There's zero chance this is related to a port blocking/network issue.

Something is busted; I don't yet know what it is. But I'm certain it's not the network causing the issues.

Understood, but even if it's one AS, as you say these are different locations. Ideally this would all be in one datacentre. These are separate sites, so I assume there's routing going on at the least and it's not exactly a homogeneous network. Corosync needs a very low-latency network. This is all completely unrelated to firewall issues.

What's your latency node to node currently, for example?
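
Plain ping already gives you both the average and the jitter (the mdev figure), which corosync cares about just as much as the average; a quick sketch:

Code:
# run from each node towards each of the others; watch mdev (jitter), not just avg
ping -c 100 -i 0.2 <other-node-ip>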
 
Understood, but even if it's one AS, as you say these are different locations. Ideally this would all be in one datacentre. These are separate sites, so I assume there's routing going on at the least and it's not exactly a homogeneous network. Corosync needs a very low-latency network. This is all completely unrelated to firewall issues.

What's your latency node to node currently, for example?

Just ran an MTR:

QC3 -> LA1: 63ms
QC3 -> DAL: 45ms
LA1 -> DAL: 29ms

QC3 and LA1 have been running together for around 2 years now, since Proxmox 7.2, and DAL has a lower-latency connection to both endpoints.
 
Have a look at the PVE recommendation [1] (they probably ran some tests and do not wish to have to support anything above that out of empirical experience):

Network Requirements

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

I happen to be the last one who likes to take arbitrary PVE "specs" at face value, but consider that corosync itself expects at most 50 ms by default [2].

The reason it worked so far is basically that there were just 2 nodes and the network was symmetrical by definition. Now you have introduced a third, which is disrupting it. Out of curiosity, you could start up two extra nodes at one of the two existing sites; I suspect they would join just fine.

NB: If this is production, do NOT do the tests with the actual nodes; make a test setup.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
[2] https://manpages.debian.org/testing/corosync/corosync.conf.5.en.html#max_network_delay
 
When the new node (DAL) is added, the entire cluster goes unresponsive. You cannot run any pve commands until you disable the network on the new node.

I will just add: if you want to test what's happening, get yourself SSH-connected (do not use the GUI tty) to all three nodes; then you can have a go at each node with:

Code:
corosync-cfgtool -s
corosync-cmapctl
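
If your corosync version has them, these two are also worth a look (a sketch): corosync-cfgtool -n shows the knet link state per configured node, and corosync-quorumtool -s the vote/quorum situation.

Code:
corosync-cfgtool -n
corosync-quorumtool -s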
 
Have a look at the PVE recommendation [1] (they probably ran some tests and do not wish to have to support anything above that out of empirical experience):



I happen to be the last one who likes to take arbitrary PVE "specs" at face value, but consider that corosync itself expects at most 50 ms by default [2].

The reason it worked so far is basically that there were just 2 nodes and the network was symmetrical by definition. Now you have introduced a third, which is disrupting it. Out of curiosity, you could start up two extra nodes at one of the two existing sites; I suspect they would join just fine.

NB: If this is production, do NOT do the tests with the actual nodes; make a test setup.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
[2] https://manpages.debian.org/testing/corosync/corosync.conf.5.en.html#max_network_delay

Thank you, these links led me down the corosync rabbit hole. I was able to bring up the third node with minor adjustments to corosync.conf.

totem {
  cluster_name: Quebec
  config_version: 49
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmit: 300
  token: 40000
}
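
(For anyone following along: as I understand the documented procedure, changes like this go into /etc/pve/corosync.conf via an edited copy with a bumped config_version, roughly:)

Code:
# sketch of the documented edit procedure (see the Cluster Manager wiki page)
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new                        # adjust totem settings, increment config_version
cp /root/corosync.conf.new /etc/pve/corosync.conf   # pmxcfs distributes it and corosync reloads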
 
Thank you, these links led me down the corosync rabbit hole. I was able to bring up the third node with minor adjustments to corosync.conf.

totem {
  cluster_name: Quebec
  config_version: 49
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmit: 300
  token: 40000
}
Token 40000 ... you are NOT using any of the HA features, correct?
 
Nope, no HA, just have them clustered so I can migrate between hosts easily.

Did it really require 40 seconds, or did you just choose that arbitrarily? I mean, it's math at the end of the day, and even the default was increased not that long ago [1], but I am rather sure someone from the PVE team would have something to say about it. :) The issue (from my point of view) is that lots of things in the PVE code depend on some of the expected values, so ... I would keep an eye on what's happening. Or not run a cluster.
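
For reference, the corosync.conf man page says the effective token timeout with a nodelist is token + (number_of_nodes - 2) * token_coefficient (coefficient defaults to 650 ms), so for a 3-node cluster (assuming I have the formula right):

Code:
# e.g. with token: 3000 :   3000 + (3 - 2) * 650 =  3650 ms
# with your token: 40000 : 40000 + (3 - 2) * 650 = 40650 ms effective token timeout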

You might want to have a look at qm remote-migrate [2] instead; there's pct remote-migrate as well. I think it's still considered a "technology preview", though.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1870449
[2] https://pve.proxmox.com/pve-docs/qm.1.html
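
From memory, the qm remote-migrate invocation shape is roughly the following; double-check the exact option names against qm(1), and treat the token, host, and fingerprint values as placeholders:

Code:
qm remote-migrate <vmid> <target-vmid> \
  'apitoken=PVEAPIToken=root@pam!migrate=<secret>,host=<target-host>,fingerprint=<target-cert-fingerprint>' \
  --target-bridge vmbr0 --target-storage local-zfs --online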
 
It didn't require 40 seconds, but I'll be adding another node to this cluster in the near future. I'm sure ~10000 will probably work. I'll circle back to it at some point.
 
It didn't require 40 seconds, but I'll be adding another node to this cluster in the near future. I'm sure ~10000 will probably work. I'll circle back to it at some point.
Ideally, you should not need to modify both token and the token_retransmit. Just saying. ;)
 
