Join to existing cluster after node rename

kristian.kirilov

Well-Known Member
Nov 17, 2016
I tried to rename two of the nodes in my lab and, as I expected, bricked the cluster configuration.
I then removed the cluster configs and created a new one, but unfortunately I can't join it. What could be the reason for that?

Code:
root@sofx1010pve3302.home.lan:~# pvecm add 192.168.30.7 -use_ssh
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!
root@sofx1010pve3302.home.lan:~#

Code:
root@sofx1010pve3302.home.lan:~# ls -al /etc/pve/lxc/
total 0
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 .
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 ..
root@sofx1010pve3302.home.lan:~# ls -la /etc/pve/qemu-server/
total 0
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 .
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 ..
root@sofx1010pve3302.home.lan:~#
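For what it's worth, /etc/pve/qemu-server and /etc/pve/lxc are only links into the local node's own subtree, and as far as I understand the join check also looks at guest configs left under other node names in /etc/pve/nodes. A quick sketch (nothing official) to look for such leftovers:

Code:
# list every node directory known to pmxcfs and search them for leftover guest configs
ls -la /etc/pve/nodes/
find /etc/pve/nodes -name '*.conf' 2>/dev/null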

Code:
root@sofx1010pve3302.home.lan:~# ping `hostname -s`
PING sofx1010pve3302.home.lan (192.168.30.2) 56(84) bytes of data.
64 bytes from sofx1010pve3302.home.lan (192.168.30.2): icmp_seq=1 ttl=64 time=0.020 ms
^C
--- sofx1010pve3302.home.lan ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.020/0.020/0.020/0.000 ms
root@sofx1010pve3302.home.lan:~#

Code:
root@sofx1010pve3302.home.lan:~# ls -la /etc/corosync/
total 20
drwxr-xr-x   3 root root  4096 Oct 25 09:22 .
drwxr-xr-x 146 root root 12288 Oct 25 08:45 ..
drwxr-xr-x   2 root root  4096 Jun  8  2017 uidgid.d
root@sofx1010pve3302.home.lan:~# ls -la /etc/corosync/uidgid.d/
total 8
drwxr-xr-x 2 root root 4096 Jun  8  2017 .
drwxr-xr-x 3 root root 4096 Oct 25 09:22 ..
root@sofx1010pve3302.home.lan:~#

Although I'm in stand-alone mode, I can't get the Proxmox WebUI working...

Code:
root@sofx1010pve3302.home.lan:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 09:50:27 EEST; 10min ago
    Process: 2182 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 2196 (pmxcfs)
      Tasks: 6 (limit: 38288)
     Memory: 37.9M
     CGroup: /system.slice/pve-cluster.service
             └─2196 /usr/bin/pmxcfs

Oct 25 09:50:26 sofx1010pve3302.home.lan systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Oct 25 09:50:26 sofx1010pve3302.home.lan pmxcfs[2182]: [main] notice: resolved node name 'sofx1010pve3302' to '192.168.30.2' for default node IP address
Oct 25 09:50:26 sofx1010pve3302.home.lan pmxcfs[2182]: [main] notice: resolved node name 'sofx1010pve3302' to '192.168.30.2' for default node IP address
Oct 25 09:50:27 sofx1010pve3302.home.lan systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
root@sofx1010pve3302.home.lan:~#
 
Hi,
Please check and post the output of pvecm status and cat /etc/corosync/corosync.conf for all nodes. Also, what is the status reported by systemctl status corosync.service?

Although I'm in stand-alone mode, I can't get the Proxmox WebUI working...
Well, then you might have an issue with the pveproxy.service. Check the output of systemctl status pveproxy.service pvedaemon.service and look through the systemd journal for errors; journalctl -b -r gives you a paginated view of the journal since boot in reverse order.
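For example (just a sketch, adjust the unit list to taste), the journal can be narrowed down to the relevant units:

Code:
# journal since boot, newest entries first, limited to the PVE and corosync units
journalctl -b -r -u pveproxy.service -u pvedaemon.service -u pve-cluster.service -u corosync.service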
 
The more interesting part is that although I have re-created the cluster, in the WebUI of the "master" node I still see the old nodes with their VMs:

Code:
root@sofx1010pve3307:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 sofx1010pve3307 (local)
root@sofx1010pve3307:~# pvecm status
Cluster information
-------------------
Name:             Proxmox
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct 25 10:10:45 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.30.7 (local)
root@sofx1010pve3307:~#

[Screenshot: the WebUI on the master node still listing the old nodes and their VMs]
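I assume these stale entries come from leftover directories under /etc/pve/nodes on this node; a rough sketch for clearing one of them would be something like the following (OLDNODE is just a placeholder, and anything under it should be copied away first):

Code:
# OLDNODE is a hypothetical placeholder for a stale node directory
cp -a /etc/pve/nodes/OLDNODE /root/OLDNODE-config-backup
rm -r /etc/pve/nodes/OLDNODE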
 
Thanks for the fast reply, here is all the info you require.
(Because the post got so long, I also put it on pastebin: https://pastebin.com/eaM4jxND)

Code:
root@sofx1010pve3302.home.lan:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
root@sofx1010pve3302.home.lan:~#

root@sofx1010pve3303.home.lan:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
root@sofx1010pve3303.home.lan:~#

root@sofx1010pve3307:~# pvecm status
Cluster information
-------------------
Name:             Proxmox
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct 25 10:14:52 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.30.7 (local)
root@sofx1010pve3307:~#

Code:
root@sofx1010pve3302.home.lan:~# cat /etc/corosync/corosync.conf
cat: /etc/corosync/corosync.conf: No such file or directory
root@sofx1010pve3302.home.lan:~#

root@sofx1010pve3303.home.lan:~# cat /etc/corosync/corosync.conf
cat: /etc/corosync/corosync.conf: No such file or directory
root@sofx1010pve3303.home.lan:~#

root@sofx1010pve3307:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: sofx1010pve3307
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.30.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmox
  config_version: 1
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@sofx1010pve3307:~#

Code:
root@sofx1010pve3302.home.lan:~# systemctl status corosync.service
○ corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Wed 2023-10-25 10:16:31 EEST; 249ms ago
             └─ ConditionPathExists=/etc/corosync/corosync.conf was not met
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview

Oct 25 10:14:05 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:14:21 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:16:31 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
root@sofx1010pve3302.home.lan:~#

root@sofx1010pve3303.home.lan:~# systemctl status corosync.service
○ corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Wed 2023-10-25 10:16:31 EEST; 5s ago
             └─ ConditionPathExists=/etc/corosync/corosync.conf was not met
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview

Oct 25 10:14:05 sofx1010pve3303.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:14:21 sofx1010pve3303.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
root@sofx1010pve3303.home.lan:~#

root@sofx1010pve3307:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 09:37:49 EEST; 38min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 2445 (corosync)
      Tasks: 9 (limit: 76866)
     Memory: 157.1M
     CGroup: /system.slice/corosync.service
             └─2445 /usr/sbin/corosync -f

Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [QB    ] server name: quorum
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [TOTEM ] Configuring link 0
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [TOTEM ] Configured link number 0: local addr: 192.168.30.7, port=5405
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [QUORUM] Sync members[1]: 1
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [QUORUM] Sync joined[1]: 1
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [TOTEM ] A new membership (1.a) was formed. Members joined: 1
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [QUORUM] Members[1]: 1
Oct 25 09:37:49 sofx1010pve3307 corosync[2445]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 25 09:37:49 sofx1010pve3307 systemd[1]: Started corosync.service - Corosync Cluster Engine.
root@sofx1010pve3307:~#

Continued in the next post, as there is a limit on the maximum number of characters per post.
 
Here are the outputs, only from the nodes that are failing to start the UI:

Code:
root@sofx1010pve3302.home.lan:~# systemctl status pveproxy.service pvedaemon.service
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 10:03:50 EEST; 13min ago
    Process: 2467 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
    Process: 2480 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
   Main PID: 2529 (pveproxy)
      Tasks: 4 (limit: 38288)
     Memory: 141.3M
     CGroup: /system.slice/pveproxy.service
             ├─2529 pveproxy
             ├─2530 "pveproxy worker"
             ├─2531 "pveproxy worker"
             └─2532 "pveproxy worker"

Oct 25 10:03:49 sofx1010pve3302.home.lan systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
Oct 25 10:03:50 sofx1010pve3302.home.lan pveproxy[2529]: starting server
Oct 25 10:03:50 sofx1010pve3302.home.lan pveproxy[2529]: starting 3 worker(s)
Oct 25 10:03:50 sofx1010pve3302.home.lan pveproxy[2529]: worker 2530 started
Oct 25 10:03:50 sofx1010pve3302.home.lan pveproxy[2529]: worker 2531 started
Oct 25 10:03:50 sofx1010pve3302.home.lan pveproxy[2529]: worker 2532 started
Oct 25 10:03:50 sofx1010pve3302.home.lan systemd[1]: Started pveproxy.service - PVE API Proxy Server.

● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 10:03:49 EEST; 13min ago
    Process: 2313 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
   Main PID: 2462 (pvedaemon)
      Tasks: 4 (limit: 38288)
     Memory: 209.0M
     CGroup: /system.slice/pvedaemon.service
             ├─2462 pvedaemon
             ├─2463 "pvedaemon worker"
             ├─2464 "pvedaemon worker"
             └─2465 "pvedaemon worker"

Oct 25 10:03:48 sofx1010pve3302.home.lan systemd[1]: Starting pvedaemon.service - PVE API Daemon...
Oct 25 10:03:49 sofx1010pve3302.home.lan pvedaemon[2462]: starting server
Oct 25 10:03:49 sofx1010pve3302.home.lan pvedaemon[2462]: starting 3 worker(s)
Oct 25 10:03:49 sofx1010pve3302.home.lan pvedaemon[2462]: worker 2463 started
Oct 25 10:03:49 sofx1010pve3302.home.lan pvedaemon[2462]: worker 2464 started
Oct 25 10:03:49 sofx1010pve3302.home.lan pvedaemon[2462]: worker 2465 started
Oct 25 10:03:49 sofx1010pve3302.home.lan systemd[1]: Started pvedaemon.service - PVE API Daemon.
root@sofx1010pve3302.home.lan:~#

root@sofx1010pve3303.home.lan:~# systemctl status pveproxy.service pvedaemon.service
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 10:03:50 EEST; 14min ago
    Process: 1402 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
    Process: 1408 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
   Main PID: 1416 (pveproxy)
      Tasks: 4 (limit: 38288)
     Memory: 141.5M
     CGroup: /system.slice/pveproxy.service
             ├─1416 pveproxy
             ├─1417 "pveproxy worker"
             ├─1418 "pveproxy worker"
             └─1419 "pveproxy worker"

Oct 25 10:03:48 sofx1010pve3303.home.lan systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
Oct 25 10:03:50 sofx1010pve3303.home.lan pveproxy[1416]: starting server
Oct 25 10:03:50 sofx1010pve3303.home.lan pveproxy[1416]: starting 3 worker(s)
Oct 25 10:03:50 sofx1010pve3303.home.lan pveproxy[1416]: worker 1417 started
Oct 25 10:03:50 sofx1010pve3303.home.lan pveproxy[1416]: worker 1418 started
Oct 25 10:03:50 sofx1010pve3303.home.lan pveproxy[1416]: worker 1419 started
Oct 25 10:03:50 sofx1010pve3303.home.lan systemd[1]: Started pveproxy.service - PVE API Proxy Server.

● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 10:03:48 EEST; 14min ago
    Process: 1262 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
   Main PID: 1397 (pvedaemon)
      Tasks: 4 (limit: 38288)
     Memory: 208.8M
     CGroup: /system.slice/pvedaemon.service
             ├─1397 pvedaemon
             ├─1398 "pvedaemon worker"
             ├─1399 "pvedaemon worker"
             └─1400 "pvedaemon worker"

Oct 25 10:03:48 sofx1010pve3303.home.lan systemd[1]: Starting pvedaemon.service - PVE API Daemon...
Oct 25 10:03:48 sofx1010pve3303.home.lan pvedaemon[1397]: starting server
Oct 25 10:03:48 sofx1010pve3303.home.lan pvedaemon[1397]: starting 3 worker(s)
Oct 25 10:03:48 sofx1010pve3303.home.lan pvedaemon[1397]: worker 1398 started
Oct 25 10:03:48 sofx1010pve3303.home.lan pvedaemon[1397]: worker 1399 started
Oct 25 10:03:48 sofx1010pve3303.home.lan pvedaemon[1397]: worker 1400 started
Oct 25 10:03:48 sofx1010pve3303.home.lan systemd[1]: Started pvedaemon.service - PVE API Daemon.
root@sofx1010pve3303.home.lan:~#

Again, only from the nodes that are failing:

Code:
Oct 25 10:18:57 sofx1010pve3302.home.lan pacemakerd[6277]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Oct 25 10:18:57 sofx1010pve3302.home.lan systemd[1]: Started pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:18:57 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:18:57 sofx1010pve3302.home.lan systemd[1]: Stopped pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:18:57 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 56.
Oct 25 10:18:56 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Failed with result 'exit-code'.
Oct 25 10:18:56 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Main process exited, code=exited, status=69/UNAVAILABLE
Oct 25 10:18:56 sofx1010pve3302.home.lan pacemakerd[6229]:  crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY
Oct 25 10:18:55 sofx1010pve3302.home.lan pvestatd[2408]: status update time (10.183 seconds)
Oct 25 10:18:55 sofx1010pve3302.home.lan pvestatd[2408]: storage 'truenas-nfs' is not online
Oct 25 10:18:45 sofx1010pve3302.home.lan pvestatd[2408]: status update time (10.183 seconds)
Oct 25 10:18:45 sofx1010pve3302.home.lan pvestatd[2408]: storage 'truenas-nfs' is not online
Oct 25 10:18:41 sofx1010pve3302.home.lan pacemakerd[6229]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Oct 25 10:18:41 sofx1010pve3302.home.lan systemd[1]: Started pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:18:41 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:18:41 sofx1010pve3302.home.lan systemd[1]: Stopped pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:18:41 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 55.
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Failed with result 'exit-code'.
Oct 25 10:18:40 sofx1010pve3302.home.lan systemd[1]: pacemaker.service: Main process exited, code=exited, status=69/UNAVAILABLE
Oct 25 10:18:40 sofx1010pve3302.home.lan pacemakerd[6164]:  crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY
Oct 25 10:18:35 sofx1010pve3302.home.lan pvestatd[2408]: status update time (10.183 seconds)

Oct 25 10:20:17 sofx1010pve3303.home.lan systemd[1]: pacemaker.service: Failed with result 'exit-code'.
Oct 25 10:20:17 sofx1010pve3303.home.lan systemd[1]: pacemaker.service: Main process exited, code=exited, status=69/UNAVAILABLE
Oct 25 10:20:17 sofx1010pve3303.home.lan pacemakerd[5284]:  crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY
Oct 25 10:20:16 sofx1010pve3303.home.lan pvestatd[1343]: status update time (10.166 seconds)
Oct 25 10:20:16 sofx1010pve3303.home.lan pvestatd[1343]: storage 'truenas-nfs' is not online
Oct 25 10:20:06 sofx1010pve3303.home.lan pvestatd[1343]: status update time (10.166 seconds)
Oct 25 10:20:06 sofx1010pve3303.home.lan pvestatd[1343]: storage 'truenas-nfs' is not online
Oct 25 10:20:02 sofx1010pve3303.home.lan pacemakerd[5284]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Oct 25 10:20:02 sofx1010pve3303.home.lan systemd[1]: Started pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:20:02 sofx1010pve3303.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
Oct 25 10:20:02 sofx1010pve3303.home.lan systemd[1]: Stopped pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 10:20:02 sofx1010pve3303.home.lan systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 60.
Oct 25 10:20:02 sofx1010pve3303.home.lan CRON[5263]: pam_unix(cron:session): session closed for user root
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd-logind[758]: Removed session 18.
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd-logind[758]: Session 18 logged out. Waiting for processes to exit.
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd[1]: session-18.scope: Deactivated successfully.
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: pam_unix(sshd:session): session closed for user root
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: Disconnected from user root 192.168.30.2 port 36358
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: Received disconnect from 192.168.30.2 port 36358:11: disconnected by user
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: pam_env(sshd:session): deprecated reading of user environment enabled
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd[1]: Started session-18.scope - Session 18 of User root.
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd-logind[758]: New session 18 of user root.
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 25 10:20:01 sofx1010pve3303.home.lan sshd[5260]: Accepted publickey for root from 192.168.30.2 port 36358 ssh2: RSA SHA256:rbXel5Ru72ZLUZGrasjIV8XP4pn95nU/7r8Qmgx8lJ0
Oct 25 10:20:01 sofx1010pve3303.home.lan CRON[5262]: pam_unix(cron:session): session closed for user root
Oct 25 10:20:01 sofx1010pve3303.home.lan CRON[5265]: (root) CMD (/usr/local/sbin/check_interfaces_realtime.sh)
Oct 25 10:20:01 sofx1010pve3303.home.lan CRON[5264]: (root) CMD (unison profile-var-lib-vz.prf >/dev/null 2>&1)
Oct 25 10:20:01 sofx1010pve3303.home.lan CRON[5262]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 25 10:20:01 sofx1010pve3303.home.lan CRON[5263]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd[1]: pacemaker.service: Failed with result 'exit-code'.
Oct 25 10:20:01 sofx1010pve3303.home.lan systemd[1]: pacemaker.service: Main process exited, code=exited, status=69/UNAVAILABLE
Oct 25 10:20:01 sofx1010pve3303.home.lan pacemakerd[5189]:  crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY
Oct 25 10:19:56 sofx1010pve3303.home.lan pvestatd[1343]: status update time (10.167 seconds)
 
Hmm, I suspect pacemaker is interfering with corosync, as you can see from your logs:
pacemakerd[5189]: crit: Could not connect to Corosync CMAP: CS_ERR_LIBRARY
please remove this before continuing with anything else.
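A rough sketch for taking it out of the picture, assuming it was installed as the regular Debian pacemaker/pcs packages:

Code:
# stop and disable the service first, then optionally remove the packages entirely
systemctl disable --now pacemaker.service
apt-get remove --purge pacemaker pcs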

The rest seems okay, corosync and the proxmox cluster seem fine so far.

Further, check the output of ls -la /etc/pve/nodes on all the nodes.
 
I'm not sure what you mean by "please remove this before continuing with anything else."
I have checked pacemaker and it complains that it cannot connect to corosync. I think this is normal because of the uninitialized cluster config.

Code:
root@sofx1010pve3302.home.lan:~# pcs status corosync
Error: corosync not running
root@sofx1010pve3302.home.lan:~# ls -la /etc/corosync/
total 20
drwxr-xr-x   3 root root  4096 Oct 25 09:22 .
drwxr-xr-x 146 root root 12288 Oct 25 08:45 ..
drwxr-xr-x   2 root root  4096 Jun  8  2017 uidgid.d
root@sofx1010pve3302.home.lan:~# pcs status
Error: error running crm_mon, is pacemaker running?
  error: Could not connect to launcher: Connection refused
  crm_mon: Connection to cluster failed: Connection refused
root@sofx1010pve3302.home.lan:~# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
     Loaded: loaded (/lib/systemd/system/pacemaker.service; enabled; preset: enabled)
     Active: active (running) since Wed 2023-10-25 11:47:49 EEST; 5s ago
       Docs: man:pacemakerd
             https://clusterlabs.org/pacemaker/doc/
   Main PID: 27603 (pacemakerd)
      Tasks: 1
     Memory: 1.4M
     CGroup: /system.slice/pacemaker.service
             └─27603 /usr/sbin/pacemakerd

Oct 25 11:47:49 sofx1010pve3302.home.lan systemd[1]: Started pacemaker.service - Pacemaker High Availability Cluster Manager.
Oct 25 11:47:49 sofx1010pve3302.home.lan pacemakerd[27603]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
root@sofx1010pve3302.home.lan:~#

So I'm in a cycle: I have to get the corosync cluster up and running, but I can't, because I can't join the already existing cluster:

Code:
root@sofx1010pve3302.home.lan:~# pvecm add 192.168.30.7 -use_ssh
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!
root@sofx1010pve3302.home.lan:~#
 
I have checked pacemaker and it complains that it cannot connect to corosync
Pacemaker is not part of our software stack, so I cannot guarantee it is not interfering; therefore my suggestion is to uninstall it for now, or at least disable the service.

* this host already contains virtual guests
During the join, virtual guests were detected, so did you check the suggested folder for leftover guest configs? Otherwise, a reinstall of the nodes will bring you to a clean state.

Do you have backups of your previous VMs/CTs? Otherwise you should definitely create backups of their data before going further.
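For example (a minimal sketch, not a full backup strategy), you can at least snapshot the configuration tree and the pmxcfs backing database before changing anything:

Code:
# copy the config tree and the pmxcfs database (ideally with pve-cluster stopped for the latter)
cp -a /etc/pve /root/pve-config-backup-$(date +%F)
cp /var/lib/pve-cluster/config.db /root/config.db.backup-$(date +%F)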
 
Pacemaker is not part of our software stack, so I cannot guarantee it is not interfering; therefore my suggestion is to uninstall it for now, or at least disable the service.


During the join, virtual guests were detected, so did you check the suggested folder for leftover guest configs? Otherwise, a reinstall of the nodes will bring you to a clean state.

Do you have backups of your previous VMs/CTs? Otherwise you should definitely create backups of their data before going further.
I see, no worries, pacemaker is now uninstalled.
By the way, I've managed to log in to the WebUI by disabling the firewall. Probably some misconfiguration.
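If the firewall really is the culprit, I guess the datacenter and host rule sets are the place to look, along these lines (the .fw files only exist if rules were ever configured):

Code:
# show firewall status and the cluster-wide / per-host rule files
pve-firewall status
cat /etc/pve/firewall/cluster.fw
cat /etc/pve/nodes/$(hostname -s)/host.fw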

Code:
root@sofx1010pve3302.home.lan:~# dpkg -l |grep pacemaker
root@sofx1010pve3302.home.lan:~# systemctl -a |grep pacemaker
root@sofx1010pve3302.home.lan:~# ls -la /etc/pve/qemu-server/
total 0
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 .
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 ..
root@sofx1010pve3302.home.lan:~# ls -la /etc/pve/lxc/
total 0
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 .
drwxr-xr-x 2 root www-data 0 Oct 25 08:07 ..
root@sofx1010pve3302.home.lan:~# systemctl status corosync
○ corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Wed 2023-10-25 13:00:20 EEST; 58s ago
             └─ ConditionPathExists=/etc/corosync/corosync.conf was not met
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview

Oct 25 13:00:20 sofx1010pve3302.home.lan systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.c>
root@sofx1010pve3302.home.lan:~# pvecm add 192.168.30.7 -use_ssh
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!
root@sofx1010pve3302.home.lan:~#

I see, there are some leftovers somewhere in the system, but I can't really figure out where...

Code:
root@sofx1010pve3302.home.lan:~# qm list
root@sofx1010pve3302.home.lan:~# pct list
root@sofx1010pve3302.home.lan:~#
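One more place I could look, in case something is registered that the filesystem view doesn't show, is the pmxcfs backing database itself. A cautious, read-only sketch (assuming the sqlite3 CLI is installed):

Code:
# list any config entries known to pmxcfs (read-only query; safer against a copy of the file)
sqlite3 /var/lib/pve-cluster/config.db "SELECT name FROM tree WHERE name LIKE '%.conf';"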

A reinstall is left as the only possible solution.
 
