Proxmox Cluster lost node

daros

Renowned Member
Jul 22, 2014
Hello,

I had switch trouble this weekend; after swapping out the network switch everything was fine again.
But one node is currently out of the cluster.
I tried to add it back to the cluster, but it has running VMs, so that's not possible.

Hope you guys can point me to the right solution.

Code:
Cluster information
-------------------
Name:             prox-cluster01
Config Version:   14
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Feb 15 16:49:58 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.1676
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      5
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.15.91 (local)
0x00000002          1 192.168.15.92
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94
0x00000007          1 192.168.15.96
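
(So the cluster itself is still quorate: quorum needs floor(6/2) + 1 = 4 votes, and 5 of the 6 expected votes are present. Only prox-s05, node ID 5, is missing from the membership.)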



I tried to add it back to the cluster after running:
Code:
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster

But I got this error:
Code:
root@prox-s05:~# pvecm add 192.168.15.91
Please enter superuser (root) password for '192.168.15.91': ****************
detected the following error(s):
* this host already contains virtual guests
Check if node may join a cluster failed!

I tried shutting down the VM and creating its .conf file on another working node, but I could not create the file: the system says it already exists, yet it is not visible there.

So basically I want to move the VMs to another node so I can re-add this node.
Ceph is running and working on the lost node.
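
(I assume moving a guest here would just mean moving its config file within the cluster filesystem from a node that has quorum, something like the sketch below, where VMID 100 stands in for the real ID:)

Code:
# run on a node with quorum; the VM must be powered off and its disks on
# shared storage (Ceph in this case)
mv /etc/pve/nodes/prox-s05/qemu-server/100.conf /etc/pve/nodes/prox-s01/qemu-server/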
 
You don't need to re-add the node with "pvecm add" if the node is still in /etc/pve/corosync.conf (Expected votes: 6).

A node being out of the cluster doesn't mean it was "removed" from the cluster; it's just not able to join it.
(journalctl -u corosync could give information why.)
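
For example, on the lost node something like this should show why it cannot join (a sketch):

Code:
# recent corosync messages from the current boot on the lost node
journalctl -u corosync -b --no-pager | tail -n 50
# on a working node, confirm prox-s05 is still listed in the cluster config
grep -B 1 -A 3 'prox-s05' /etc/pve/corosync.conf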


So, try this on the node:

Code:
systemctl stop pve-cluster
systemctl stop corosync
# copy back /etc/corosync/* from another working node, e.g.:
scp root@192.168.15.91:/etc/corosync/* /etc/corosync/
# you made changes in /etc/pve/..., so better to flush the local copy of the
# cluster database; a fresh version will be synced from the other nodes
rm -rf /var/lib/pve-cluster/*
systemctl start corosync
# check that corosync is starting: journalctl -u corosync -f
systemctl start pve-cluster
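
Once both services are up, the rejoin can be verified with something like:

Code:
# should report Quorate: Yes and, after the node is back, 6 total votes
pvecm status
# both services should be active
systemctl status corosync pve-cluster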
 
Hello,

No, I did not remove the node from the cluster.
I did what you suggested; the weird thing is that I can ping all the other nodes, but I get the following logging:

Code:
Feb 16 08:59:50 prox-s05 systemd[1]: Stopped Corosync Cluster Engine.
Feb 16 08:59:50 prox-s05 systemd[1]: Starting Corosync Cluster Engine...
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Please migrate config file to nodelist.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] Initializing transport (Kronosnet).
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] kronosnet crypto initialized: aes256/sha256
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] totemknet initialized
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cmap
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cfg
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: cpg
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] Watchdog not enabled by configuration
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] resource load_15min missing a recovery key.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] resource memory_used missing a recovery key.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [WD    ] no resources configured.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: votequorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QB    ] server name: quorum
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [TOTEM ] A new membership (5.18e9) was formed. Members joined: 5
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 systemd[1]: Started Corosync Cluster Engine.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [QUORUM] Members[1]: 5
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:51 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 has no active links
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 7 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 4 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 2 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 16 08:59:53 prox-s05 corosync[2593136]:   [KNET  ] pmtud: Global data MTU changed to: 1397


Corosync:
Code:
root@prox-s05:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox-s01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.15.91
  }
  node {
    name: prox-s02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.15.92
  }
  node {
    name: prox-s03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.15.93
  }
  node {
    name: prox-s04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.15.94
  }
  node {
    name: prox-s05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.15.95
  }
  node {
    name: prox-s06
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.15.96
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: prox-cluster01
  config_version: 14
  interface {
    bindnetaddr: 192.168.15.91
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
It's working now.
It looks like the fix from @spirit was correct, but the SSH keys were missing.
After SSHing into another Proxmox server, it started to work again.

I still think it's weird what happened.
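
For anyone hitting the same thing later: I believe the cluster's SSH known_hosts and certificates can also be refreshed with (a sketch):

Code:
# regenerate/merge the node's SSH known_hosts and cluster certificates
pvecm updatecerts
# or just SSH once from the node to a peer so the host key gets accepted
ssh root@192.168.15.91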