[BUG] PVE 6: adding a new node, all other PVE nodes stopped working && GUI shows no cluster while listing cluster nodes!

mailinglists

Hi,

I did what I have done many times before: added a node to an existing cluster. After adding it, the whole cluster went down (the VMs kept running, but PVE stopped working).

Here is how it looked on one node:
Code:
[Fri Jan 29 14:46:36 2021] INFO: task pvesr:42198 blocked for more than 120 seconds.
[Fri Jan 29 14:46:36 2021]       Tainted: P           O      5.4.65-1-pve #1
[Fri Jan 29 14:46:36 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 29 14:46:36 2021] pvesr           D    0 42198  42181 0x00000000
[Fri Jan 29 14:46:36 2021] Call Trace:
[Fri Jan 29 14:46:36 2021]  __schedule+0x2e6/0x6f0
[Fri Jan 29 14:46:36 2021]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[Fri Jan 29 14:46:36 2021]  schedule+0x33/0xa0
[Fri Jan 29 14:46:36 2021]  rwsem_down_write_slowpath+0x2ed/0x4a0
[Fri Jan 29 14:46:36 2021]  down_write+0x3d/0x40
[Fri Jan 29 14:46:36 2021]  filename_create+0x8e/0x180
[Fri Jan 29 14:46:36 2021]  do_mkdirat+0x59/0x110
[Fri Jan 29 14:46:36 2021]  __x64_sys_mkdir+0x1b/0x20
[Fri Jan 29 14:46:36 2021]  do_syscall_64+0x57/0x190
[Fri Jan 29 14:46:36 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Jan 29 14:46:36 2021] RIP: 0033:0x7f6f1cae10d7
[Fri Jan 29 14:46:36 2021] Code: Bad RIP value.
[Fri Jan 29 14:46:36 2021] RSP: 002b:00007ffd3fde42e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 29 14:46:36 2021] RAX: ffffffffffffffda RBX: 000055cf08ca3260 RCX: 00007f6f1cae10d7
[Fri Jan 29 14:46:36 2021] RDX: 000055cf07d073d4 RSI: 00000000000001ff RDI: 000055cf0ce5bde0
[Fri Jan 29 14:46:36 2021] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
[Fri Jan 29 14:46:36 2021] R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf0a10b7f8
[Fri Jan 29 14:46:36 2021] R13: 000055cf0ce5bde0 R14: 000055cf0ca0bc80 R15: 00000000000001ff
[Fri Jan 29 14:46:36 2021] INFO: task pvesr:42931 blocked for more than 120 seconds.
[Fri Jan 29 14:46:36 2021]       Tainted: P           O      5.4.65-1-pve #1
[Fri Jan 29 14:46:36 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 29 14:46:36 2021] pvesr           D    0 42931      1 0x00000000
[Fri Jan 29 14:46:36 2021] Call Trace:
[Fri Jan 29 14:46:36 2021]  __schedule+0x2e6/0x6f0
[Fri Jan 29 14:46:36 2021]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[Fri Jan 29 14:46:36 2021]  schedule+0x33/0xa0
[Fri Jan 29 14:46:36 2021]  rwsem_down_write_slowpath+0x2ed/0x4a0
[Fri Jan 29 14:46:36 2021]  down_write+0x3d/0x40
[Fri Jan 29 14:46:36 2021]  filename_create+0x8e/0x180
[Fri Jan 29 14:46:36 2021]  do_mkdirat+0x59/0x110
[Fri Jan 29 14:46:36 2021]  __x64_sys_mkdir+0x1b/0x20
[Fri Jan 29 14:46:36 2021]  do_syscall_64+0x57/0x190
[Fri Jan 29 14:46:36 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Jan 29 14:46:36 2021] RIP: 0033:0x7f0942b020d7
[Fri Jan 29 14:46:36 2021] Code: Bad RIP value.
[Fri Jan 29 14:46:36 2021] RSP: 002b:00007ffeea576748 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 29 14:46:36 2021] RAX: ffffffffffffffda RBX: 0000557237cd3260 RCX: 00007f0942b020d7
[Fri Jan 29 14:46:36 2021] RDX: 0000557236ca83d4 RSI: 00000000000001ff RDI: 000055723bdb3520
[Fri Jan 29 14:46:36 2021] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
[Fri Jan 29 14:46:36 2021] R10: 0000000000000000 R11: 0000000000000246 R12: 000055723913acc8
[Fri Jan 29 14:46:36 2021] R13: 000055723bdb3520 R14: 000055723ba38958 R15: 00000000000001ff

The HTTP GUI was not working on the old cluster nodes.
I SSHed back to the newly added (updated) node.
I ran pvecm status and saw the expected old cluster name, but with only one node.
Once I shut down the newly added node, the old cluster nodes started responding again.
The VMs did not suffer as they would have if I had had HA enabled. :)

Anyway, I wonder how I should go about debugging this without causing problems on my production cluster?
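For anyone who wants to reproduce or look into this: the commands I used on the old nodes while the new node was online are just the standard, read-only status and log commands, nothing here should change cluster state:
Code:
# quorum / membership as corosync sees it
pvecm status

# knet link state
corosync-cfgtool -s

# follow corosync and pmxcfs (pve-cluster) while the new node joins
journalctl -f -u corosync -u pve-cluster

# kernel hung-task messages like the pvesr trace above
dmesg -T | grep -i "blocked for more"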
 
I started it again to collect some more errors.
Once it was up, the cluster on the old nodes stopped working again. pvecm status did not return anything for as long as the new node was online, and this was logged:
Code:
Jan 29 15:06:19 p35 corosync[6204]:   [TOTEM ] A new membership (1.55d) was formed. Members
Jan 29 15:06:20 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 60
Jan 29 15:06:21 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 70
Jan 29 15:06:22 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 80
Jan 29 15:06:23 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 90
Jan 29 15:06:24 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 100
Jan 29 15:06:24 p35 pmxcfs[6037]: [status] notice: cpg_send_message retried 100 times
Jan 29 15:06:24 p35 pmxcfs[6037]: [status] crit: cpg_send_message failed: 6
Jan 29 15:06:24 p35 pveproxy[47058]: proxy detected vanished client connection
Jan 29 15:06:25 p35 corosync[6204]:   [TOTEM ] A new membership (1.571) was formed. Members
Jan 29 15:06:25 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 10
Jan 29 15:06:26 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 20
Jan 29 15:06:27 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 30
Jan 29 15:06:28 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 40
Jan 29 15:06:29 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 50
Jan 29 15:06:30 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 60
Jan 29 15:06:30 p35 corosync[6204]:   [TOTEM ] A new membership (1.585) was formed. Members
Jan 29 15:06:31 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 70
Jan 29 15:06:32 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 80
Jan 29 15:06:33 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 90
Jan 29 15:06:34 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 100
Jan 29 15:06:34 p35 pmxcfs[6037]: [status] notice: cpg_send_message retried 100 times

On the new node, pvecm status worked but returned just one node:
Code:
root@p38:~# pvecm status
Cluster information
-------------------
Name:             OBC
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Jan 29 15:07:28 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.64d
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:            

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.31.1.38 (local)
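If I read the votequorum numbers correctly, the new node simply cannot reach quorum on its own, which would explain the blocked pvesr runs and the "no quorum!" errors below:
Code:
quorum      = floor(expected_votes / 2) + 1 = floor(4 / 2) + 1 = 3
total_votes = 1  ->  1 < 3  ->  not quorate, activity blocked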

And finally I found some errors in the log files which might give us a hint at the problem:
Code:
Jan 29 15:06:25 p38 systemd-timesyncd[1830]: Synchronized to time server for the first time 193.2.4.2:123 (2.debian.pool.ntp.org).
Jan 29 15:06:25 p38 corosync[2267]:   [TOTEM ] A new membership (4.571) was formed. Members
Jan 29 15:06:25 p38 corosync[2267]:   [QUORUM] Members[1]: 4
Jan 29 15:06:25 p38 corosync[2267]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 15:06:29 p38 pveproxy[2849]: worker exit
Jan 29 15:06:29 p38 pveproxy[2850]: worker exit
Jan 29 15:06:29 p38 pveproxy[2851]: worker exit
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2849 finished
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2851 finished
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2850 finished
Jan 29 15:06:29 p38 pveproxy[2440]: starting 3 worker(s)
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2949 started
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2950 started
Jan 29 15:06:29 p38 pveproxy[2440]: worker 2951 started
Jan 29 15:06:29 p38 pveproxy[2949]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:06:29 p38 pveproxy[2950]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:06:29 p38 pveproxy[2951]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:06:30 p38 corosync[2267]:   [TOTEM ] A new membership (4.585) was formed. Members
Jan 29 15:06:30 p38 corosync[2267]:   [QUORUM] Members[1]: 4
Jan 29 15:06:30 p38 corosync[2267]:   [MAIN  ] Completed service synchronization, ready to provide service.
......
Jan 29 15:06:59 p38 pveproxy[3393]: worker exit
Jan 29 15:06:59 p38 pveproxy[3392]: worker exit
Jan 29 15:06:59 p38 pveproxy[3394]: worker exit
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3394 finished
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3392 finished
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3393 finished
Jan 29 15:06:59 p38 pveproxy[2440]: starting 3 worker(s)
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3482 started
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3483 started
Jan 29 15:06:59 p38 pveproxy[2440]: worker 3484 started
Jan 29 15:06:59 p38 pveproxy[3482]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:06:59 p38 pveproxy[3483]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:06:59 p38 pveproxy[3484]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:00 p38 systemd[1]: Starting Proxmox VE replication runner...
Jan 29 15:07:00 p38 systemd[1]: Started Session 3 of user root.
Jan 29 15:07:00 p38 pvesr[3485]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Jan 29 15:07:00 p38 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 29 15:07:00 p38 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 29 15:07:00 p38 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 29 15:07:01 p38 cron[2269]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Jan 29 15:07:02 p38 corosync[2267]:   [TOTEM ] A new membership (4.5fd) was formed. Members
Jan 29 15:07:02 p38 corosync[2267]:   [QUORUM] Members[1]: 4
Jan 29 15:07:02 p38 corosync[2267]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 15:07:04 p38 pveproxy[3482]: worker exit
Jan 29 15:07:04 p38 pveproxy[3483]: worker exit
Jan 29 15:07:04 p38 pveproxy[3484]: worker exit
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3482 finished
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3484 finished
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3483 finished
Jan 29 15:07:05 p38 pveproxy[2440]: starting 3 worker(s)
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3592 started
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3593 started
Jan 29 15:07:05 p38 pveproxy[2440]: worker 3594 started
Jan 29 15:07:05 p38 pveproxy[3592]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:05 p38 pveproxy[3593]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:05 p38 pveproxy[3594]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:08 p38 corosync[2267]:   [TOTEM ] A new membership (4.611) was formed. Members
Jan 29 15:07:08 p38 corosync[2267]:   [QUORUM] Members[1]: 4
Jan 29 15:07:08 p38 corosync[2267]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 29 15:07:10 p38 pveproxy[3592]: worker exit
Jan 29 15:07:10 p38 pveproxy[3593]: worker exit
Jan 29 15:07:10 p38 pveproxy[3594]: worker exit
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3594 finished
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3592 finished
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3593 finished
Jan 29 15:07:10 p38 pveproxy[2440]: starting 3 worker(s)
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3697 started
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3698 started
Jan 29 15:07:10 p38 pveproxy[2440]: worker 3699 started
Jan 29 15:07:10 p38 pveproxy[3698]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:10 p38 pveproxy[3697]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.
Jan 29 15:07:10 p38 pveproxy[3699]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1775.

Now I will focus on the missing /etc/pve/local/pve-ssl.key.
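As far as I understand it, that key is created on the clustered /etc/pve during the join, which can only happen once the node is quorate and pmxcfs is writable. Once that is the case, something like this should regenerate the node certificates (I have not verified it on this node yet):
Code:
# on the affected node, once it has quorum again
pvecm updatecerts --force
systemctl restart pveproxy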
 
Looking at the logs on the primary cluster node, where the new node was joined, I see these errors:
Code:
Jan 29 14:44:38 p35 pmxcfs[6037]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 4)
Jan 29 14:44:38 p35 corosync[6204]:   [CFG   ] Config reload requested by node 1
Jan 29 14:44:38 p35 corosync[6204]:   [TOTEM ] Configuring link 0
Jan 29 14:44:38 p35 corosync[6204]:   [TOTEM ] Configured link number 0: local addr: 10.31.1.35, port=5405
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 has no active links
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 has no active links
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 29 14:44:38 p35 corosync[6204]:   [KNET  ] host: host: 4 has no active links
Jan 29 14:44:38 p35 pmxcfs[6037]: [status] notice: update cluster info (cluster name  ABC, version = 4)
..
Jan 29 14:44:50 p35 corosync[6204]:   [KNET  ] rx: host: 4 link: 0 is up
Jan 29 14:44:50 p35 corosync[6204]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 29 14:44:50 p35 corosync[6204]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
..
Jan 29 14:44:53 p35 pveproxy[43003]: '/etc/pve/nodes/p38/pve-ssl.pem' does not exist!#012
Jan 29 14:44:53 p35 pveproxy[43003]: Could not verify remote node certificate '15:0A:90:C7:5...ABC..4:87' with list of pinned certificates, refreshing cache
Jan 29 14:44:54 p35 pveproxy[43003]: '/etc/pve/nodes/p38/pve-ssl.pem' does not exist!#012
Jan 29 14:44:56 p35 corosync[6204]:   [TOTEM ] A new membership (1.fd) was formed. Members
..
Jan 29 14:44:58 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 10
Jan 29 14:44:59 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 20
Jan 29 14:45:00 p35 systemd[1]: Starting Proxmox VE replication runner...
Jan 29 14:45:00 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 30
Jan 29 14:45:01 p35 CRON[42940]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 29 14:45:01 p35 corosync[6204]:   [TOTEM ] A new membership (1.111) was formed. Members
Jan 29 14:45:01 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 40
Jan 29 14:45:02 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 50
Jan 29 14:45:03 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 60
Jan 29 14:45:04 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 70
Jan 29 14:45:05 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 80
Jan 29 14:45:06 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 90
Jan 29 14:45:07 p35 corosync[6204]:   [TOTEM ] A new membership (1.125) was formed. Members
Jan 29 14:45:07 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 100
Jan 29 14:45:07 p35 pmxcfs[6037]: [status] notice: cpg_send_message retried 100 times
Jan 29 14:45:07 p35 pmxcfs[6037]: [status] crit: cpg_send_message failed: 6
Jan 29 14:45:07 p35 pveproxy[46541]: '/etc/pve/nodes/p38/pve-ssl.pem' does not exist!#012
Jan 29 14:45:08 p35 pveproxy[46541]: Clearing outdated entries from certificate cache
..
Jan 29 14:45:48 p35 pveproxy[16106]: '/etc/pve/nodes/p38/pve-ssl.pem' does not exist!#012
Jan 29 14:45:48 p35 pveproxy[16106]: proxy detected vanished client connection
Jan 29 14:45:48 p35 pveproxy[16106]: Could not verify remote node certificate '15:0A:9ABC:1D:84:87' with list of pinned certificates, refreshing cache
Jan 29 14:45:48 p35 pveproxy[43003]: proxy detected vanished client connection
Jan 29 14:45:49 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 10
Jan 29 14:45:50 p35 pmxcfs[6037]: [status] notice: cpg_send_message retry 20
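One more detail from the join log that caught my eye: knet PMTUD for the new host changed from 469 to 1397 on link 0. Assuming the link between the nodes should carry a normal 1500-byte MTU, next time I would first verify that full-size frames reach the new node:
Code:
# 1472 = 1500 - 28 bytes of IP/ICMP overhead; -M do forbids fragmentation
ping -M do -s 1472 -c 3 10.31.1.38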
 
I ran out of time to continue debugging.
I will provide more info from the new node, which is now running separately from this cluster network, so I can look at its files.

However, I think something went horribly wrong when joining the cluster, and it might even be a bug.
 
The first step: upgrade to the current version.
 
Hi @tom,

thank you for your suggestion. I was just in the process of updating.
I want to add a new node to the cluster, so I can live-migrate the VMs to it and then update and reboot an old node.
Then repeat the process with all the nodes.
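For reference, the per-node workflow I have in mind is roughly this (node names and the VMID are just placeholders from my setup):
Code:
# on the new node, after installation: join the existing cluster
pvecm add 10.31.1.35

# on an old node: live-migrate its guests to the new node
qm migrate <vmid> p38 --online

# then update and reboot the emptied old node
apt update && apt dist-upgrade
reboot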

But I cannot do that, because I cannot add a new node to migrate the VMs to.

Is adding a new node with a newer minor version to the cluster not supported?

(I have never had a problem with adding new nodes of the same major version, migrating VMs to them, and then updating the old nodes to the same minor version.)


I will try to add another node. If it fails, I will reinstall both new nodes, upgrade the old cluster with the VMs running on it, and then remove the new nodes and add them again. Hopefully there is a bug that will go away when the old nodes are upgraded without rebooting them.

I appreciate all suggestions and directions.
 
I wanted to follow the procedure described above, but noticed I do not have a cluster anymore, while the cluster still works?!?
The GUI shows no cluster, yet it still lists the cluster nodes, and pvecm reports everything as usual.

(Screenshot: WDFcluster.png)
Code:
root@p35:~# pvecm status
Cluster information
-------------------
Name:             XYZ
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Feb 1 15:37:16 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.692
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 X.35 (local)
0x00000002          1 X.37
0x00000003          1 X.36
root@p35:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 p35 (local)
         2          1 p37
         3          1 p36

Now I have no idea how to proceed to get the cluster working again.
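From the command line the cluster itself still looks healthy, so I first checked the services behind the GUI and what the API reports (read-only commands only):
Code:
systemctl status pve-cluster corosync pveproxy pvedaemon
pvesh get /cluster/status --output-format json-pretty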
 
Maybe I can get it back by removing p38:
Code:
 pvesh get /cluster/config/join --output-format json-pretty
'/etc/pve/nodes/p38/pve-ssl.pem' does not exist!
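If I remember the docs correctly, removing the dead node (with p38 already shut down for good) boils down to this:
Code:
# on a quorate node, with p38 powered off and not coming back
pvecm delnode p38

# clean up the stale node directory that pveproxy keeps complaining about
rm -rf /etc/pve/nodes/p38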
 
I removed node p38 and now the GUI shows the cluster again.
Now back to the original plan.

Hopefully someone will find this information useful in the future; that's why I'm writing all this.
 
I re-added p38 and it worked this time.
Now I will move the VMs over, then update and reboot the old nodes.

Somehow something went wrong when adding it the first time.
I guess the HTTP GUI process for adding nodes could be improved to address issues like mine.
 
It might have worked because I had not updated the node before joining.
I will now add another one and do a dist-upgrade before joining, to see if it happens again.
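To have something to compare against afterwards, I will record the package versions on the node right before joining:
Code:
# on the node to be joined, after the dist-upgrade and before pvecm add
pveversion -v > /root/pveversion-before-join.txt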
 
I added a second node that was updated to the latest version.
It also joined as expected.

So it had nothing to do with joining different versions.
 
I added another updated node to the cluster, just to see the same cluster lockup and failure. :)
I guess there is a bug somewhere. I have no time to debug it.
I will recover as before: kill the new node, reinstall it, remove it and add it again. Hopefully this time it works. :)
 
Adding the node again fails again. Both nodes were updated prior to adding.
I will now have a coworker gather all the logs and open a bug report.
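For the report we will collect roughly the following from the joining node and from one of the old nodes:
Code:
pveversion -v
pvecm status
cat /etc/pve/corosync.conf /etc/corosync/corosync.conf
journalctl -u corosync -u pve-cluster -u pveproxy --since "2021-01-29" > "$(hostname)-join.log"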
 
