Cannot add a node to a cluster

Aug 23, 2023
Hello,

For the last two days I have been trying to create a cluster of two nodes.
I already had one node up and running for almost a year.

Now I wanted to set up a second node and create a cluster.

The old node (pve1) is running
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-6-pve)

The new node (pve2) is running
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)

On the existing node (pve1) I created the cluster.

On the new node (pve2) I ran
root@pve2:~# pvecm add <IP-pve1>
Please enter superuser (root) password for '<IP-pve1>':
Establishing API connection with host '<IP-pve1>'
The authenticity of host '<IP>' can't be established.
X509 SHA256 key fingerprint is <FINGERPRINT>.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '<IP-pve2>'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1692791514.sql.gz'
waiting for quorum...OK
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/nodes' - Permission denied

I have reinstalled the pve2 node multiple times and tried adding it via the GUI, but everything failed.

Both nodes have the Proxmox VE Community Subscription (1 CPU).

I no longer have any idea what I'm doing wrong.
Any help would be appreciated.

Thanks,
Martin
 
Try using `pvecm add --force` instead of just `pvecm add`.
MM
 
Thanks for the tip; unfortunately, this resulted in the same error:
No cluster network links passed explicitly, fallback to local node IP '192.168.178.231'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1692796749.sql.gz'
waiting for quorum...OK
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/nodes' - Permission denied

This time I still had the monitor connected from the restart and saw a message there.
So I took a look at dmesg:

This was the message I also saw on the console:
Code:
[  243.277831] INFO: task pvestatd:1249 blocked for more than 120 seconds.
[  243.277863]       Tainted: P           O       6.2.16-8-pve #1
[  243.277886] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

But this looks like the real cause:
Code:
[  243.277916] task:pvestatd        state:D stack:0     pid:1249  ppid:1      flags:0x00000002
[  243.277920] Call Trace:
[  243.277921]  <TASK>
[  243.277924]  __schedule+0x402/0x1510
[  243.277930]  schedule+0x63/0x110
[  243.277932]  schedule_preempt_disabled+0x15/0x30
[  243.277934]  rwsem_down_read_slowpath+0x284/0x4d0
[  243.277937]  down_read+0x48/0xc0
[  243.277938]  walk_component+0x108/0x190
[  243.277941]  path_lookupat+0x67/0x1a0
[  243.277943]  filename_lookup+0xe4/0x200
[  243.277946]  ? strncpy_from_user+0x44/0x160
[  243.277949]  vfs_statx+0xa1/0x180
[  243.277952]  vfs_fstatat+0x58/0x80
[  243.277953]  __do_sys_newfstatat+0x44/0x90
[  243.277956]  __x64_sys_newfstatat+0x1c/0x30
[  243.277957]  do_syscall_64+0x5b/0x90
[  243.277959]  ? exc_page_fault+0x91/0x1b0
[  243.277961]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  243.277964] RIP: 0033:0x7f5e3e87c63a
[  243.277966] RSP: 002b:00007ffd3cfbb8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
[  243.277968] RAX: ffffffffffffffda RBX: 0000559ec913f2a0 RCX: 00007f5e3e87c63a
[  243.277969] RDX: 0000559ec913f4a8 RSI: 0000559ec95d2610 RDI: 00000000ffffff9c
[  243.277969] RBP: 0000559ecded1ce0 R08: 0000000000000003 R09: 0000559ecded1ce0
[  243.277970] R10: 0000000000000000 R11: 0000000000000246 R12: 0000559ecb2022f0
[  243.277971] R13: 0000559ec95d2610 R14: 0000559ec83902af R15: 0000000000000000
[  243.277973]  </TASK>
Looks like pvestatd is hanging...
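For what it's worth, a quick way to list tasks stuck in uninterruptible sleep (state D, like pvestatd in the trace above) would be something like this, assuming standard procps tools:
Code:
# print the header plus any process whose state starts with D (uninterruptible sleep),
# typically blocked on I/O or on a FUSE mount such as /etc/pve
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'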

Now I'm completely lost...

Martin
 
Try to upgrade both nodes to the same PVE version and let me know.

On the main node, check whether there are any "zombie" nodes in /etc/pve/nodes.
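For example, something like this on pve1 (just a sketch; pve2 here stands for the node that failed to join):
Code:
# list the node directories known to the cluster filesystem
ls -l /etc/pve/nodes
# a stale directory for the failed node could then be removed, e.g.:
# rm -rf /etc/pve/nodes/pve2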
 
Hi,

Since I had to reinstall Proxmox on the new node (because after the attempt to add the node to the cluster, the /etc/pve/nodes directory was gone),
I upgraded the system to the latest version.
I also removed the nodes/pve2 directory on the pve1 node, so there is only the pve1 directory left in /etc/pve/nodes -> no zombies.

root@pve1:/etc/pve/nodes# pveversion
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-6-pve)

root@pve2:~# pveversion
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-8-pve)

Thanks
Martin
 
Just to be sure, please post pvecm status on pve1.
Since you have tried to create the cluster several times, we need to be sure that no "cluster trash" is left on pve1.

After that, I would attempt to create the cluster from scratch.
 
Hi,

I already "cleaned" the cluster from pve1 using this instructions:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
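For reference, the node-separation cleanup on that wiki page boils down to roughly the following on the old node (I am paraphrasing; double-check against the current wiki before running it, since it wipes the local corosync configuration):
Code:
systemctl stop pve-cluster corosync
pmxcfs -l                     # start the cluster filesystem in local mode
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster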

root@pve1:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?


Should I create the cluster via the command line this time?

Thanks for your help,
Martin

Perfect.

I would try to create the cluster using the GUI. This process should complete smoothly.
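If you prefer the command line instead, it would be something along these lines (the cluster name is just an example):
Code:
# on pve1: create a new cluster and check its state
pvecm create mangari
pvecm status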
 
Hi,
I created a new cluster on pve1:
Code:
root@pve1:~# pvecm status
Cluster information
-------------------
Name:             mangari
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Aug 23 16:20:25 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.2c2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 <pve1-ip> (local)

Then I ran the add-node command on pve2:
pvecm add 192.168.178.230 --force
Unfortunately, I got the same result:
Code:
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1692800466.sql.gz'
waiting for quorum...OK
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/nodes' - Permission denied

And again a lot of stack traces from the kernel:
Code:
[ 1692.969070] INFO: task pvestatd:1249 blocked for more than 120 seconds.
[ 1692.969117]       Tainted: P           O       6.2.16-8-pve #1
[ 1692.969140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.969180] task:pvestatd        state:D stack:0     pid:1249  ppid:1      flags:0x00000002
[ 1692.969183] Call Trace:
[ 1692.969184]  <TASK>
[ 1692.969187]  __schedule+0x402/0x1510
[ 1692.969193]  ? xa_load+0x87/0xf0
[ 1692.969197]  schedule+0x63/0x110
[ 1692.969199]  schedule_preempt_disabled+0x15/0x30
[ 1692.969201]  rwsem_down_read_slowpath+0x284/0x4d0
[ 1692.969203]  down_read+0x48/0xc0
[ 1692.969205]  walk_component+0x108/0x190
[ 1692.969208]  path_lookupat+0x67/0x1a0
[ 1692.969210]  ? try_to_unlazy+0x60/0xe0
[ 1692.969211]  filename_lookup+0xe4/0x200
[ 1692.969214]  vfs_statx+0xa1/0x180
[ 1692.969217]  vfs_fstatat+0x58/0x80
[ 1692.969218]  __do_sys_newfstatat+0x44/0x90
[ 1692.969221]  __x64_sys_newfstatat+0x1c/0x30
[ 1692.969222]  do_syscall_64+0x5b/0x90
[ 1692.969225]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 1692.969228]  ? irqentry_exit+0x43/0x50
[ 1692.969229]  ? exc_page_fault+0x91/0x1b0
[ 1692.969231]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1692.969234] RIP: 0033:0x7fcea3d9963a
[ 1692.969236] RSP: 002b:00007ffce72f6368 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
[ 1692.969238] RAX: ffffffffffffffda RBX: 000055f3043412a0 RCX: 00007fcea3d9963a
[ 1692.969239] RDX: 000055f3043414a8 RSI: 000055f3047dea90 RDI: 00000000ffffff9c
[ 1692.969239] RBP: 000055f3090d9930 R08: 0000000000000003 R09: 000055f3090d9930
[ 1692.969240] R10: 0000000000000000 R11: 0000000000000246 R12: 000055f306406a30
[ 1692.969241] R13: 000055f3047dea90 R14: 000055f302fae2af R15: 0000000000000000
[ 1692.969243]  </TASK>
[ 1692.969244] INFO: task pveproxy worker:1285 blocked for more than 120 seconds.
[ 1692.969273]       Tainted: P           O       6.2.16-8-pve #1
[ 1692.969296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.969326] task:pveproxy worker state:D stack:0     pid:1285  ppid:1284   flags:0x00004002
[ 1692.969327] Call Trace:
[ 1692.969328]  <TASK>
[ 1692.969329]  __schedule+0x402/0x1510
[ 1692.969331]  ? __pfx_fuse_get_inode_acl+0x10/0x10
[ 1692.969334]  ? __get_acl.part.0+0xef/0x180
[ 1692.969337]  schedule+0x63/0x110
[ 1692.969339]  schedule_preempt_disabled+0x15/0x30
[ 1692.969341]  rwsem_down_read_slowpath+0x284/0x4d0
[ 1692.969343]  down_read+0x48/0xc0
[ 1692.969344]  path_openat+0xaff/0x12a0
[ 1692.969346]  do_filp_open+0xaf/0x170
[ 1692.969349]  do_sys_openat2+0xbf/0x180
[ 1692.969352]  __x64_sys_openat+0x6c/0xa0
[ 1692.969354]  do_syscall_64+0x5b/0x90
[ 1692.969355]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 1692.969357]  ? irqentry_exit+0x43/0x50
[ 1692.969358]  ? exc_page_fault+0x91/0x1b0
[ 1692.969360]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1692.969362] RIP: 0033:0x7f13e8bcade1
[ 1692.969363] RSP: 002b:00007ffe88f07a30 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 1692.969364] RAX: ffffffffffffffda RBX: 0000000000080000 RCX: 00007f13e8bcade1
[ 1692.969365] RDX: 0000000000080000 RSI: 0000557f8e56d500 RDI: 00000000ffffff9c
[ 1692.969366] RBP: 0000557f8e56d500 R08: 00007ffe88f07cd0 R09: 00000000ffffffff
[ 1692.969366] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 1692.969367] R13: 0000000000000000 R14: 0000557f8e56d500 R15: 0000557f8c33fa10
[ 1692.969368]  </TASK>
[ 1692.969369] INFO: task pveproxy worker:1286 blocked for more than 120 seconds.
[ 1692.969397]       Tainted: P           O       6.2.16-8-pve #1
[ 1692.969419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.969449] task:pveproxy worker state:D stack:0     pid:1286  ppid:1284   flags:0x00000002
[ 1692.969450] Call Trace:
[ 1692.969451]  <TASK>
[ 1692.969452]  __schedule+0x402/0x1510
[ 1692.969454]  ? update_load_avg+0x82/0x810
[ 1692.969458]  schedule+0x63/0x110
[ 1692.969460]  schedule_preempt_disabled+0x15/0x30
[ 1692.969462]  rwsem_down_read_slowpath+0x284/0x4d0
[ 1692.969463]  ? try_to_unlazy+0x60/0xe0
[ 1692.969465]  down_read+0x48/0xc0
[ 1692.969466]  walk_component+0x108/0x190
[ 1692.969468]  path_lookupat+0x67/0x1a0
[ 1692.969470]  filename_lookup+0xe4/0x200
[ 1692.969473]  vfs_statx+0xa1/0x180
[ 1692.969474]  vfs_fstatat+0x58/0x80
[ 1692.969476]  __do_sys_newfstatat+0x44/0x90
[ 1692.969478]  __x64_sys_newfstatat+0x1c/0x30
[ 1692.969480]  do_syscall_64+0x5b/0x90
[ 1692.969481]  ? exit_to_user_mode_prepare+0x39/0x190
[ 1692.969483]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 1692.969485]  ? irqentry_exit+0x43/0x50
[ 1692.969486]  ? exc_page_fault+0x91/0x1b0
[ 1692.969488]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1692.969490] RIP: 0033:0x7f13e8bca63a
[ 1692.969490] RSP: 002b:00007ffe88f07d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
[ 1692.969491] RAX: ffffffffffffffda RBX: 0000557f8c31e2a0 RCX: 00007f13e8bca63a
[ 1692.969492] RDX: 0000557f8c31e4a8 RSI: 0000557f93b04040 RDI: 00000000ffffff9c
[ 1692.969493] RBP: 0000557f93b26718 R08: 0000000000000003 R09: 0000557f8c31e2a0
[ 1692.969493] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557f8e56d500
[ 1692.969494] R13: 0000557f93b04040 R14: 0000557f8b2122af R15: 0000000000000000
[ 1692.969495]  </TASK>
[ 1692.969496] INFO: task pveproxy worker:1287 blocked for more than 120 seconds.
[ 1692.969523]       Tainted: P           O       6.2.16-8-pve #1
[ 1692.969546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.969576] task:pveproxy worker state:D stack:0     pid:1287  ppid:1284   flags:0x00004002
[ 1692.969577] Call Trace:
[ 1692.969577]  <TASK>
[ 1692.969578]  __schedule+0x402/0x1510
[ 1692.969580]  ? __pfx_fuse_get_inode_acl+0x10/0x10
[ 1692.969581]  ? __get_acl.part.0+0xef/0x180
[ 1692.969583]  schedule+0x63/0x110
[ 1692.969585]  schedule_preempt_disabled+0x15/0x30
[ 1692.969587]  rwsem_down_read_slowpath+0x284/0x4d0
[ 1692.969589]  down_read+0x48/0xc0
[ 1692.969591]  path_openat+0xaff/0x12a0
[ 1692.969592]  ? hrtimer_cancel+0x15/0x30
[ 1692.969595]  do_filp_open+0xaf/0x170
[ 1692.969598]  do_sys_openat2+0xbf/0x180
[ 1692.969600]  __x64_sys_openat+0x6c/0xa0
[ 1692.969602]  do_syscall_64+0x5b/0x90
[ 1692.969603]  ? syscall_exit_to_user_mode+0x29/0x50
[ 1692.969605]  ? do_syscall_64+0x67/0x90
[ 1692.969607]  ? exit_to_user_mode_prepare+0x39/0x190
[ 1692.969608]  ? syscall_exit_to_user_mode+0x29/0x50
[ 1692.969609]  ? do_syscall_64+0x67/0x90
[ 1692.969611]  ? exc_page_fault+0x91/0x1b0
[ 1692.969612]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1692.969614] RIP: 0033:0x7f13e8bcade1
[ 1692.969615] RSP: 002b:00007ffe88f07a30 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 1692.969616] RAX: ffffffffffffffda RBX: 0000000000080000 RCX: 00007f13e8bcade1
[ 1692.969616] RDX: 0000000000080000 RSI: 0000557f8e56d500 RDI: 00000000ffffff9c
[ 1692.969617] RBP: 0000557f8e56d500 R08: 00007ffe88f07cd0 R09: 00000000ffffffff
[ 1692.969618] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 1692.969618] R13: 0000000000000000 R14: 0000557f8e56d500 R15: 0000557f8c33fa10
[ 1692.969620]  </TASK>
[ 1692.969621] INFO: task pve-ha-lrm:1293 blocked for more than 120 seconds.
[ 1692.969647]       Tainted: P           O       6.2.16-8-pve #1
[ 1692.969669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.969699] task:pve-ha-lrm      state:D stack:0     pid:1293  ppid:1      flags:0x00000002
[ 1692.969700] Call Trace:
[ 1692.969701]  <TASK>
[ 1692.969701]  __schedule+0x402/0x1510
[ 1692.969704]  ? fuse_dentry_revalidate+0x34a/0x3c0
[ 1692.969706]  schedule+0x63/0x110
[ 1692.969708]  schedule_preempt_disabled+0x15/0x30
[ 1692.969710]  rwsem_down_read_slowpath+0x284/0x4d0
[ 1692.969712]  down_read+0x48/0xc0
[ 1692.969713]  walk_component+0x108/0x190
[ 1692.969715]  ? inode_permission+0x74/0x200
[ 1692.969716]  link_path_walk.part.0.constprop.0+0x28a/0x3f0
[ 1692.969718]  ? path_init+0x28f/0x3c0
[ 1692.969720]  path_openat+0xab/0x12a0
[ 1692.969722]  do_filp_open+0xaf/0x170
[ 1692.969725]  do_sys_openat2+0xbf/0x180
[ 1692.969727]  __x64_sys_openat+0x6c/0xa0
[ 1692.969729]  do_syscall_64+0x5b/0x90
[ 1692.969730]  ? exit_to_user_mode_prepare+0x39/0x190
[ 1692.969732]  ? syscall_exit_to_user_mode+0x29/0x50
[ 1692.969733]  ? do_syscall_64+0x67/0x90
[ 1692.969735]  ? irqentry_exit+0x43/0x50
[ 1692.969736]  ? sysvec_apic_timer_interrupt+0x4b/0xd0
[ 1692.969738]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1692.969740] RIP: 0033:0x7fbc04eb8de1
[ 1692.969740] RSP: 002b:00007ffec6770580 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 1692.969741] RAX: ffffffffffffffda RBX: 00000000000800c1 RCX: 00007fbc04eb8de1
[ 1692.969742] RDX: 00000000000800c1 RSI: 000056457bf84250 RDI: 00000000ffffff9c
[ 1692.969742] RBP: 000056457bf84250 R08: 00007ffec67707f0 R09: 00000000ffffffff
[ 1692.969743] R10: 00000000000001a4 R11: 0000000000000202 R12: 0000000000000000
[ 1692.969744] R13: 0000000000000000 R14: 000056457bf84250 R15: 00005645760b19d0
[ 1692.969745]  </TASK>

I guess I will run a memtest now, just to be sure.

Thanks for your support,
Martin
 
Hello, could you share with us the contents of /etc/network/interfaces of both nodes, whether they can ping each other, and what the ping time is?

It could also be helpful to restart both pvestatd and pmxcfs on both nodes (systemctl restart pvestatd.service; systemctl restart pmxcfs.service).

Do you have syslogs (journalctl) from pve2 containing the moment when it tries to join the cluster?
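Something like this on pve2, run after a join attempt, should capture the relevant window (the time range is only an example):
Code:
# collect logs from the relevant services around the join attempt
journalctl -u pve-cluster -u corosync -u pvestatd --since "2023-08-23 16:00" --until "2023-08-23 16:30" > pve2-join.log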
 
Hi,

pve1
Code:
root@pve1:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto enp7s0
#iface enp7s0 inet static
#       address 192.168.178.230/16
#       gateway 192.168.178.1
#       dns-nameservers 192.168.178.1 8.8.8.8
# dns-* options are implemented by the resolvconf package, if installed

#iface enp7s0 inet6 auto
#       dns-nameservers 192.168.178.1 8.8.8.8
#       dns-search neugasse.net

iface enp7s0 inet manual
auto vmbr0
iface vmbr0 inet static
        address 192.168.178.230/24
        gateway 192.168.178.1
        bridge-ports enp7s0
        bridge-stp off
        bridge-fd 0
root@pve1:~# ping pve1
PING pve1.fritz.box (192.168.178.230) 56(84) bytes of data.
64 bytes from pve1.fritz.box (192.168.178.230): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from pve1.fritz.box (192.168.178.230): icmp_seq=2 ttl=64 time=0.027 ms
^C
--- pve1.fritz.box ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.017/0.022/0.027/0.005 ms

pve2:
Code:
root@pve2:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enp34s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.178.231/16
        gateway 192.168.178.1
        bridge-ports enp34s0
        bridge-stp off
        bridge-fd 0
root@pve2:~# ping pve1
PING pve1.fritz.box (192.168.178.230) 56(84) bytes of data.
64 bytes from pve1.fritz.box (192.168.178.230): icmp_seq=1 ttl=64 time=0.596 ms
64 bytes from pve1.fritz.box (192.168.178.230): icmp_seq=2 ttl=64 time=0.763 ms
^C
--- pve1.fritz.box ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1011ms
rtt min/avg/max/mdev = 0.596/0.679/0.763/0.083 ms

I then created a new cluster on pve1 and tried to restart the services as you described, but I got the same error message on both nodes:
Code:
root@pve2:~# systemctl restart pvestatd.service; systemctl restart pmxcfs.service
Failed to restart pmxcfs.service: Unit pmxcfs.service not found.

I'm not sure if I should try to add the node now...

Thanks for your help,
Martin
 
My bad, the service is called `pve-cluster.service`. Do you have system logs from when it tries to connect to the cluster?
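So the restart would be something like:
Code:
systemctl restart pvestatd.service
systemctl restart pve-cluster.service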
 
Hi Maximiliano,

I now have the logs for you, but they are a bit long.

PVE2 Log

The important parts, I think, start around Aug 28 10:25:31 (line 109); this is where the crash happens.

I also have the log from PVE1:

PVE1 Log

As you can see, here a task also crashes, at Aug 28 10:25:13 (line 340).

Hope that helps you a bit,

Thanks
Martin
 
So in the logs I see

Code:
Aug 28 10:22:13 pve2 pmxcfs[2824]: [quorum] crit: quorum_initialize failed: 2

so it seems pve2 actually joined the cluster. Could you please share the contents of `/etc/pve/corosync.conf`? It is possible that they cannot reach each other over the corosync network.
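One way to check this, assuming corosync is running on both nodes, is to look at the knet link state and the configured ring addresses:
Code:
# show the local node id and the state of each corosync link
corosync-cfgtool -s
# cross-check the ring0 addresses corosync is configured to use
grep ring0_addr /etc/pve/corosync.conf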
 
Hi,

Here is the content from pve2 (on pve1 it looks exactly the same).

Code:
root@pve2:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.178.230
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.178.231
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Mangari
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
it is possible that they cannot ping each other over the corosync network

I'm not sure how to test that.
A normal ping works on both sides.

Thanks,
Martin
 
I am not entirely sure what the issue is; the different subnets are something I would strongly advise against, but they shouldn't be the cause here. I would suggest removing pve2 from the cluster on pve1 and adding it again from pve2; if the web UI does not allow this, the commands are `pvecm delnode` and `pvecm add`. Do note that `add` has a `--force` flag.
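Concretely, that would look something like this (using the node name and IP from earlier in the thread):
Code:
# on pve1: drop the stale membership entry for pve2
pvecm delnode pve2
# on pve2: join again, pointing at pve1
pvecm add 192.168.178.230 --force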
 
