Hi,
I've had an existing cluster running with 2 nodes for quite some time without much issue. Today I tried to add a freshly installed 3rd node, and every time I add it to the cluster, all hell breaks loose.
The addition log says:
Task viewer: Join Cluster
Establishing API connection with host '192.XX.XX.XX'
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '108.XX.XX.XX.XX'
Request addition of this node
DAL2 (the new node) shows up in the cluster GUI, but for some reason its pve-ssl.key and pve-ssl.pem are not synced to the rest of the cluster. The syslog on QC3 shows:
2024-02-05T07:29:53.856860+00:00 QC3 pvedaemon[1488]: <root@pam> adding node DAL2 to cluster
2024-02-05T07:40:40.830839+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:41.009924+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.697662+00:00 QC3 pveproxy[1572]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
2024-02-05T07:40:42.704920+00:00 QC3 pveproxy[49180]: '/etc/pve/nodes/DAL2/pve-ssl.pem' does not exist!#012
If I manually copy them over to the master node (QC3), the cluster syncs up and I can basically use the new node from the GUI, but it demands a password login each time.
But running updatecerts (pvecm updatecerts) breaks it, and unless I manually copy over the .key and .pem for the new node, the master node (QC3) starts spamming the syslog with:
Feb 05 08:35:08 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:10 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
The entire cluster then degrades to the point where basically nothing can be accessed (I can't even SSH in or restart any Proxmox services).
But the moment I disable DAL2 (the new node), cluster performance returns to normal and everything operates properly.
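To put a number on the Retransmit List spam above, here is a minimal sketch that counts those corosync lines per second from saved journal text. The regular expression and the function name retransmits_per_second are my own and only assume the line format shown in the excerpt; the sample lines are the six from the log above.

```python
import re
from collections import Counter

# Matches journal lines shaped like the excerpt above, e.g.
# "Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: ..."
# The pattern is an assumption based only on that excerpt.
RETRANSMIT = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ corosync\[\d+\]: "
    r"\[TOTEM\s*\] Retransmit List:"
)


def retransmits_per_second(lines):
    """Count 'Retransmit List' messages per timestamp (one-second buckets)."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
Feb 05 08:35:08 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:09 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
Feb 05 08:35:10 QC3 corosync[1442]: [TOTEM ] Retransmit List: [redacted]
""".splitlines()

counts = retransmits_per_second(sample)
for second, n in sorted(counts.items()):
    print(second, n)
```

Feeding it a longer journal export with and without DAL2 joined would show whether the retransmits only appear while the new node is a member.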
I have reinstalled DAL2 (the new node) fresh from the ISO 5 times. The only thing I've run after installing Proxmox is apt update and apt upgrade, to make sure it is on the same version as my other nodes.
At one point, I let the degraded performance continue for a bit to see if it would ever recover. Eventually it just seemed to crash and I had to delnode the new node again. The log from that is below:
Feb 05 07:32:59 QC3 kernel: INFO: task pveproxy worker:1572 blocked for more than 120 seconds.
Feb 05 07:32:59 QC3 kernel: Tainted: P O 6.5.11-8-pve #1
Feb 05 07:32:59 QC3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 07:32:59 QC3 kernel: task:pveproxy worker state:D stack:0 pid:1572 ppid:1569 flags:0x00004002
Feb 05 07:32:59 QC3 kernel: Call Trace:
Feb 05 07:32:59 QC3 kernel: <TASK>
Feb 05 07:32:59 QC3 kernel: __schedule+0x3fd/0x1450
Feb 05 07:32:59 QC3 kernel: schedule+0x63/0x110
Feb 05 07:32:59 QC3 kernel: schedule_preempt_disabled+0x15/0x30
Feb 05 07:32:59 QC3 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Feb 05 07:32:59 QC3 kernel: down_read+0x48/0xc0
Feb 05 07:32:59 QC3 kernel: walk_component+0x108/0x190
Feb 05 07:32:59 QC3 kernel: path_lookupat+0x67/0x1a0
Feb 05 07:32:59 QC3 kernel: filename_lookup+0xe4/0x200
Feb 05 07:32:59 QC3 kernel: vfs_statx+0xa1/0x180
Feb 05 07:32:59 QC3 kernel: vfs_fstatat+0x58/0x80
Feb 05 07:32:59 QC3 kernel: __do_sys_newfstatat+0x44/0x90
Feb 05 07:32:59 QC3 kernel: __x64_sys_newfstatat+0x1c/0x30
Feb 05 07:32:59 QC3 kernel: do_syscall_64+0x5b/0x90
Feb 05 07:32:59 QC3 kernel: ? exit_to_user_mode_prepare+0xa5/0x190
Feb 05 07:32:59 QC3 kernel: ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: ? syscall_exit_to_user_mode+0x37/0x60
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: ? do_syscall_64+0x67/0x90
Feb 05 07:32:59 QC3 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 05 07:32:59 QC3 kernel: RIP: 0033:0x7f2b9805475a
Feb 05 07:32:59 QC3 kernel: RSP: 002b:00007ffd8150e098 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
Feb 05 07:32:59 QC3 kernel: RAX: ffffffffffffffda RBX: 000055a8a46e92a0 RCX: 00007f2b9805475a
Feb 05 07:32:59 QC3 kernel: RDX: 000055a8a46e94a8 RSI: 000055a8ac65fe60 RDI: 00000000ffffff9c
Feb 05 07:32:59 QC3 kernel: RBP: 000055a8ac750180 R08: 0000000000000000 R09: 000055a8ac67dcf0
Feb 05 07:32:59 QC3 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055a8ac65fe60
Feb 05 07:32:59 QC3 kernel: R13: 000055a8a2ba223b R14: 0000000000000000 R15: 00007f2b98259020
Feb 05 07:32:59 QC3 kernel: </TASK>
There are no hardware issues. When the new node is separated from the cluster, it performs perfectly, as does the cluster. When the new node joins the cluster, everything breaks down.
QC3 ("Master")
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-6
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

LA1 (Existing node)
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

DAL2 (New node)
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.3-1
proxmox-backup-file-restore: 3.1.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
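Since three long version lists are hard to compare by eye, here is a rough sketch of how they could be diffed. It assumes each node's list has been saved as text with one "name: version" pair per line; the helper names parse_versions and diff_versions are my own, and the sample data is only a four-package excerpt of the lists above, not the full output.

```python
def parse_versions(text: str) -> dict[str, str]:
    """Parse 'name: version' lines into a dict; the first occurrence of a name wins."""
    versions: dict[str, str] = {}
    for line in text.strip().splitlines():
        name, sep, version = line.partition(": ")
        if sep and name.strip() not in versions:
            versions[name.strip()] = version.strip()
    return versions


def diff_versions(nodes: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return only the packages whose version is not identical on every node."""
    all_names = set().union(*nodes.values())
    diffs: dict[str, dict[str, str]] = {}
    for name in sorted(all_names):
        row = {node: pkgs.get(name, "(missing)") for node, pkgs in nodes.items()}
        if len(set(row.values())) > 1:
            diffs[name] = row
    return diffs


# Excerpt of the lists above (four packages per node, for illustration only).
QC3 = """
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
libknet1: 1.28-pve1
openvswitch-switch: 3.1.0-2
"""
LA1 = """
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
libknet1: 1.28-pve1
openvswitch-switch: 3.1.0-2
"""
DAL2 = """
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
libknet1: 1.28-pve1
"""

nodes = {
    "QC3": parse_versions(QC3),
    "LA1": parse_versions(LA1),
    "DAL2": parse_versions(DAL2),
}
diffs = diff_versions(nodes)
for name, row in diffs.items():
    print(name, row)
```

On this excerpt, corosync and libknet1 match on all three nodes, while ceph-fuse differs everywhere and openvswitch-switch is absent on DAL2.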