[SOLVED] kernel timeouts joining cluster on fresh proxmox 8.4 installs

daanw

New Member
May 5, 2025
Fresh installation of two proxmox 8.4 nodes (both on zfs raid1).
pve1: vmbr1: 192.168.0.200
pve2: vmbr1: 192.168.0.201
Both nodes can ping each other by both hostname and IP; round-trip time averages 0.250 ms.
Both nodes can ssh to each other.
Both nodes can ping the outside world.
timedatectl shows exactly the same output on both nodes.
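
For reference, the checks on both nodes were roughly the following (swapping hostname/IP for the other node):

Code:
ping -c 4 pve2                  # by hostname and by IP
ping -c 4 192.168.0.201
ssh root@192.168.0.201 true     # ssh works in both directions
timedatectl                     # identical output on both nodes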

Creating the cluster pxc on pve1 works fine (GUI).
Using the join info from pve1 on pve2 (GUI) to join the cluster initially seems to work, but then throws a "permission denied - invalid PVE ticket (401)" error.
GUI on pve2 is not accessible anymore from this point onwards.
I can still ssh into pve2.
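
(For completeness: the CLI equivalent of what the GUI does should be roughly the following, although I used the GUI for both steps.)

Code:
# on pve1
pvecm create pxc
# on pve2, join using pve1's address
pvecm add 192.168.0.200
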
The journal on pve2 shows kernel hung-task timeouts:

Code:
May 05 14:19:41 pve2 kernel: INFO: task cron:1214 blocked for more than 122 seconds.
May 05 14:19:41 pve2 kernel:       Tainted: P           O       6.8.12-9-pve #1
May 05 14:19:41 pve2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 05 14:19:41 pve2 kernel: task:cron            state:D stack:0     pid:1214  tgid:1214  ppid:1      flags:0x00000002
May 05 14:19:41 pve2 kernel: Call Trace:
May 05 14:19:41 pve2 kernel:  <TASK>
May 05 14:19:41 pve2 kernel:  __schedule+0x42b/0x1500
May 05 14:19:41 pve2 kernel:  ? mutex_lock+0x12/0x50
May 05 14:19:41 pve2 kernel:  ? rrw_exit+0x72/0x170 [zfs]
May 05 14:19:41 pve2 kernel:  ? xa_load+0x87/0xf0
May 05 14:19:41 pve2 kernel:  schedule+0x33/0x110
May 05 14:19:41 pve2 kernel:  schedule_preempt_disabled+0x15/0x30
May 05 14:19:41 pve2 kernel:  rwsem_down_read_slowpath+0x284/0x4d0
May 05 14:19:41 pve2 kernel:  ? dput+0xf2/0x1b0
May 05 14:19:41 pve2 kernel:  down_read+0x48/0xc0
May 05 14:19:41 pve2 kernel:  walk_component+0x108/0x190
May 05 14:19:41 pve2 kernel:  path_lookupat+0x67/0x1a0
May 05 14:19:41 pve2 kernel:  filename_lookup+0xe4/0x200
May 05 14:19:41 pve2 kernel:  ? __pfx_zpl_put_link+0x10/0x10 [zfs]
May 05 14:19:41 pve2 kernel:  ? strncpy_from_user+0x25/0x120
May 05 14:19:41 pve2 kernel:  vfs_statx+0x95/0x1d0
May 05 14:19:41 pve2 kernel:  vfs_fstatat+0xaa/0xe0
May 05 14:19:41 pve2 kernel:  __do_sys_newfstatat+0x44/0x90
May 05 14:19:41 pve2 kernel:  __x64_sys_newfstatat+0x1c/0x30
May 05 14:19:41 pve2 kernel:  x64_sys_call+0x18bd/0x2480
May 05 14:19:41 pve2 kernel:  do_syscall_64+0x81/0x170
May 05 14:19:41 pve2 kernel:  ? __do_sys_newfstatat+0x53/0x90
May 05 14:19:41 pve2 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? do_syscall_64+0x8d/0x170
May 05 14:19:41 pve2 kernel:  ? irqentry_exit+0x43/0x50
May 05 14:19:41 pve2 kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
May 05 14:19:41 pve2 kernel: RIP: 0033:0x72e89452b81a
May 05 14:19:41 pve2 kernel: RSP: 002b:00007ffd0ddcb1d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
May 05 14:19:41 pve2 kernel: RAX: ffffffffffffffda RBX: 00005fa9970fb186 RCX: 000072e89452b81a
May 05 14:19:41 pve2 kernel: RDX: 00007ffd0ddcb3e0 RSI: 00007ffd0ddcb570 RDI: 00000000ffffff9c
May 05 14:19:41 pve2 kernel: RBP: 00007ffd0ddcb570 R08: 0000000000000000 R09: 0000000000000073
May 05 14:19:41 pve2 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005fa9970fb251
May 05 14:19:41 pve2 kernel: R13: 00005fa9b4b34d40 R14: 00005fa9b4b34f70 R15: 00007ffd0ddcd5d0
May 05 14:19:41 pve2 kernel:  </TASK>

Subsequently, similar timeouts occur with the exact same call trace:

Code:
May 05 14:19:41 pve2 kernel: task:cron            state:D stack:0     pid:1214  tgid:1214  ppid:1      flags:0x00000002
May 05 14:19:41 pve2 kernel: task:pve-firewall    state:D stack:0     pid:1225  tgid:1225  ppid:1      flags:0x00004002
May 05 14:19:41 pve2 kernel: task:pvestatd        state:D stack:0     pid:1232  tgid:1232  ppid:1      flags:0x00000002
May 05 14:19:41 pve2 kernel: task:pveproxy worker state:D stack:0     pid:1281  tgid:1281  ppid:1280   flags:0x00000002
May 05 14:19:41 pve2 kernel: task:pveproxy worker state:D stack:0     pid:1282  tgid:1282  ppid:1280   flags:0x00004002
May 05 14:19:41 pve2 kernel: task:pveproxy worker state:D stack:0     pid:1283  tgid:1283  ppid:1280   flags:0x00004002
May 05 14:19:41 pve2 kernel: task:pve-ha-lrm      state:D stack:0     pid:1291  tgid:1291  ppid:1      flags:0x00000002
May 05 14:19:41 pve2 kernel: task:pvescheduler    state:D stack:0     pid:6961  tgid:6961  ppid:1300   flags:0x00000002
May 05 14:19:41 pve2 kernel: task:pvescheduler    state:D stack:0     pid:6962  tgid:6962  ppid:1300   flags:0x00004002
May 05 14:21:44 pve2 kernel: task:cron            state:D stack:0     pid:1214  tgid:1214  ppid:1      flags:0x00000002

One of the processes hitting these timeouts is the certificate generation that pveproxy runs at startup (pvecm updatecerts). (This snippet is from after a reboot, so the PIDs differ.)

Code:
root        1292  0.4  0.1  88124 55552 ?        Ss   14:52   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root        1293  0.0  0.1  88124 48184 ?        D    14:52   0:00  \_ /usr/bin/perl /usr/bin/pvecm updatecerts --silent

May 05 14:52:23 pve2 systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
May 05 14:53:53 pve2 systemd[1]: pveproxy.service: start-pre operation timed out. Terminating.
May 05 14:55:24 pve2 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
May 05 14:55:24 pve2 systemd[1]: pveproxy.service: Killing process 1292 (pvecm) with signal SIGKILL.
May 05 14:55:24 pve2 systemd[1]: pveproxy.service: Killing process 1293 (pvecm) with signal SIGKILL.
May 05 14:56:54 pve2 systemd[1]: pveproxy.service: State 'final-sigterm' timed out. Killing.
May 05 14:56:54 pve2 systemd[1]: pveproxy.service: Killing process 1293 (pvecm) with signal SIGKILL.
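
For anyone hitting the same thing: the start-pre step is just pvecm updatecerts, so it can be run by hand to see where it blocks, along with a check of the cluster filesystem that backs /etc/pve:

Code:
pvecm updatecerts              # same command systemd runs as start-pre for pveproxy
systemctl status pve-cluster   # pve-cluster runs pmxcfs, which provides /etc/pve
ls /etc/pve/nodes              # also hangs while the tasks above are stuck in D state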

GUI on pve1:
- error for pve2 on Join Cluster: Error: unable to create directory '/etc/pve/nodes' - Permission denied.
- '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist! (500)
- error:0A000086:SSL routines::certificate verify failed (596)

Journal on pve1:

Code:
May 05 14:32:44 pve1 pveproxy[1527]: '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist!
May 05 14:32:44 pve1 pveproxy[1527]: '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist!
May 05 14:32:54 pve1 pveproxy[9517]: '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist!

pve1 always has quorum. pve2 initially has it too, but fails to output pvecm status once its processes are hung.

Code:
Cluster information
-------------------
Name:             pxc
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon May  5 14:58:44 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.23
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.200 (local)
0x00000002          1 192.168.0.201

From what I can see, the creation/saving of the certificates somehow fails on pve2 after joining the cluster.
From there things go south, as that disables GUI access and any further communication within the cluster.
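
A direct look at the corosync/knet link state might have narrowed this down sooner; I did not capture that output, but the checks would be:

Code:
corosync-cfgtool -s            # link status per node as corosync sees it
journalctl -u corosync -b      # corosync messages for the current boot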

Updating to 8.4.1 (and kernel 6.8.12-10-pve) prior to creating/joining the cluster makes no difference.
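
(For reference, versions can be checked with:)

Code:
pveversion -v | head -n 5
uname -r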

Any pointers on how to get this cluster going are very welcome!
 
Update: believe it or not, the above was all the result of a bad UTP cable. :mad:
Replaced the UTP cable for pve2 and the cluster went up smooth as silk.
Still, it's quite remarkable that a bad UTP cable can apparently lead to kernel timeouts.
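
In hindsight, the NIC counters would probably have pointed at the cable much earlier. Something like this shows link speed and error/drop counters (eno1 is just an example here, use the physical NIC behind vmbr1):

Code:
ethtool eno1 | grep -E 'Speed|Duplex|Link detected'
ip -s link show eno1                       # RX/TX errors and drops
ethtool -S eno1 | grep -iE 'err|drop|crc'  # per-NIC error statistics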
 
Hello, I have the same problem with a fresh install.

I used a Ventoy USB disk and the 8.4.3 version image, downloaded on 05/05/2025.

My configuration uses 3 links between my 2 nodes; the last link is the "console network".


[screenshot: cluster network link configuration between the two nodes]

I used the latest post-install script from: https://community-scripts.github.io/ProxmoxVE/scripts?id=post-pve-install
I selected "HA Enable".

I updated the OS.
I rebooted all nodes before cluster creation.

I added all IP addresses to /etc/hosts, for example:
10.30.0.246 pve3.home.arpa pve3
10.128.0.246 pve3.home.arpa pve3
10.200.0.246 pve3.home.arpa pve3
10.254.0.246 pve3.home.arpa pve3

All directories in /etc/pve/ were deleted.
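
(When /etc/pve appears empty, my understanding is that the cluster filesystem (pmxcfs) is probably no longer mounted; something like this should show it:)

Code:
systemctl status pve-cluster
mount | grep /etc/pve
journalctl -u pve-cluster -b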


From the pve4 console, I can open SSH to pve3 on all IPs used in the cluster link config.

Do you have a solution?

Best regards
 