Proxmox Cluster lost node

daros

Renowned Member
Jul 22, 2014
Hello,

I had switch trouble this weekend; after swapping out the network switch everything was fine again.
But one node is currently out of the cluster.
I tried to add it back to the cluster, but it has running VMs, so that's not possible.

Hope you guys can point me to the right solution.

Code:
Cluster information
-------------------
Name:             prox-cluster01
Config Version:   14
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Feb 15 16:49:58 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.1676
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      5
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.15.91 (local)
0x00000002          1 192.168.15.92
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
0x00000007          1 192.168.15.96
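
(So the cluster itself is still quorate: quorum needs floor(6/2) + 1 = 4 votes, and 5 of the 6 expected votes are present. Only prox-s05, node ID 5, is missing from the membership.)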



I tried to add it back to the cluster after running:
Code:
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster

But I got this error:
Code:
root@prox-s05:~# pvecm add 192.168.15.91
Please enter superuser (root) password for '192.168.15.91': ****************
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!

I tried shutting down the VM and creating its .conf file on another working node, but I could not create the file: the system says it already exists, yet it is not visible there.

So basically I want to move the VMs to another node so I can re-add this node.
Ceph is running and working on the lost node.
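
(I assume moving a guest here would just mean moving its config file within the cluster filesystem from a node that has quorum, something like the sketch below, where VMID 100 stands in for the real ID:)

Code:
# run on a node with quorum; the VM must be powered off and its disks on
# shared storage (Ceph in this case)
mv /etc/pve/nodes/prox-s05/qemu-server/100.conf /etc/pve/nodes/prox-s01/qemu-server/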
 
You don't need to re-add the node with "pvecm add" if the node is still in /etc/pve/corosync.conf (Expected votes: 6).

A node being out of the cluster doesn't mean it was "removed" from the cluster; it's just not able to join it.
(journalctl -u corosync could give information why.)
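
For example, on the lost node something like this should show why it cannot join (a sketch):

Code:
# recent corosync messages from the current boot on the lost node
journalctl -u corosync -b --no-pager | tail -n 50
# on a working node, confirm prox-s05 is still listed in the cluster config
grep -B 1 -A 3 'prox-s05' /etc/pve/corosync.conf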


So, try this on the node:

Code:
systemctl stop pve-cluster
systemctl stop corosync
# copy back /etc/corosync/* from another working node, e.g.:
scp root@192.168.15.91:/etc/corosync/* /etc/corosync/
# you made changes in /etc/pve/..., so better to flush the local copy of the
# cluster database; a fresh version will be synced from the other nodes
rm -rf /var/lib/pve-cluster/*
systemctl start corosync
# check that corosync is starting: journalctl -u corosync -f
systemctl start pve-cluster
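
Once both services are up, the rejoin can be verified with something like:

Code:
# should report Quorate: Yes and, after the node is back, 6 total votes
pvecm status
# both services should be active
systemctl status corosync pve-cluster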
 
Hello,

No, I did not remove the node from the cluster.
I did what you suggested; the weird thing is that I can ping all the other nodes, but I get the following logging:

Code:
Feb 16 08:59:50 prox-s05 systemd[1]: Stopped Corosync Cluster Engine.
Feb 16 08:59:50 prox-s05 systemd[1]: Starting Corosync Cluster Engine...
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Please migrate config file to nodelist.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] Initializing transport (Kronosnet).
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] totemknet initialized
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cmap
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cfg
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cpg
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] Watchdog not enabled by configuration
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] resource load_15min missing a recovery key.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] resource memory_used missing a recovery key.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] no resources configured.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: votequorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: quorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] A new membership (5.18e9) was formed. Members joined: 5
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 systemd[1]: Started Corosync Cluster Engine.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QUORUM] Members[1]: 5
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 7 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 4 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 2 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: Global data MTU changed to: 1397


Corosync:
Code:
root@prox-s05:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox-s01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.15.91
  }
  node {
    name: prox-s02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.15.92
  }
  node {
    name: prox-s03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.15.93
  }
  node {
    name: prox-s04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.15.94
  }
  node {
    name: prox-s05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.15.95
  }
  node {
    name: prox-s06
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.15.96
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: prox-cluster01
  config_version: 14
  interface {
    bindnetaddr: 192.168.15.91
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
It's working now.
It looks like the fix from @spirit was correct, but the SSH keys were missing.
After SSHing into another Proxmox server, it started to work again.

I still think it's weird what happened.
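
For anyone hitting the same thing later: I believe the cluster's SSH known_hosts and certificates can also be refreshed with (a sketch):

Code:
# regenerate/merge the node's SSH known_hosts and cluster certificates
pvecm updatecerts
# or just SSH once from the node to a peer so the host key gets accepted
ssh root@192.168.15.91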