jarenas

Member
Mar 7, 2018
I have created a new cluster with 4 nodes. The problem is that when I reboot them, all of them are working at first, but after some minutes some of them say they are disconnected.

[Attached screenshot: Captura.jpg]


When this happens, I execute this in the node's shell:

service corosync restart

After executing this command the node is online again, but then the next one (cp3) turns "offline".

I don't know what is happening.

Can somebody help me?

Thanks, regards!
 
Hi,

try restarting pvestatd on the whole cluster.
 
Try all 4

service pve-cluster restart
service pveproxy restart
service pvedaemon restart
service pvestatd restart

I've run these commands on three nodes, but on one of them I'm getting this:

root@cp2:~# service pve-cluster restart
Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xe" for details.


journalctl output:

Apr 15 11:15:28 cp2 pvestatd[232476]: status update error: Connection refused
Apr 15 11:15:29 cp2 pveproxy[510975]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510976]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510975 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: starting 1 worker(s)
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510976 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510978 started
Apr 15 11:15:29 cp2 pveproxy[510977]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510978]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510977 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: starting 2 worker(s)
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510979 started
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510980 started
Apr 15 11:15:29 cp2 pveproxy[510979]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:29 cp2 pveproxy[510980]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[1] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[2] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[3] failed: Connection refused
Apr 15 11:15:34 cp2 pveproxy[510978]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510978 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: starting 1 worker(s)
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510986 started
Apr 15 11:15:34 cp2 pveproxy[510979]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510980]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510986]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510979 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510980 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: starting 2 worker(s)
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510987 started
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510988 started
Apr 15 11:15:34 cp2 pveproxy[510987]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:34 cp2 pveproxy[510988]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.


I think this is happening because it can't mount the pve configuration filesystem.
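If it helps, I assume a quick way to check is whether /etc/pve still shows up as a fuse mount (findmnt is standard util-linux; this is just my guess at a useful check, not something from the docs):

Code:
findmnt /etc/pve
systemctl status pve-cluster

If findmnt prints nothing, the configuration filesystem is not mounted.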
 
What does
Code:
$ systemctl
show? Any errors?

Especially look for "pve-ha-crm.service".
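To list only the failed units, something like this should also work:

Code:
systemctl --failed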

These have errors:

pve-cluster.service loaded failed failed The Proxmox VE cluster filesystem
pvesr.service loaded failed failed Proxmox VE replication runner
zfs-mount.service loaded failed failed Mount ZFS filesystems
zfs-share.service loaded failed failed ZFS file system shares

And I also get this when I execute systemctl status pve-ha-crm.service:

systemctl status pve-ha-crm.service
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2018-04-12 16:44:08 CEST; 2 days ago
Main PID: 3460 (pve-ha-crm)
Tasks: 1 (limit: 36864)
Memory: 81.2M
CPU: 19.103s
CGroup: /system.slice/pve-ha-crm.service
└─3460 pve-ha-crm

Apr 15 13:00:00 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused


Thanks! Regards

 
Do you have the right IP set for "pvelocalhost"? I've seen this before where pvelocalhost was wrong, but that was a single node.
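For reference, a standard install usually has a line in /etc/hosts that maps the node's cluster IP to its hostname with a "pvelocalhost" alias, roughly like this (the address and domain below are made up, not taken from your setup):

Code:
10.85.20.102  cp2.yourdomain.local cp2 pvelocalhost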
I don't know what you mean by the IP set for pvelocalhost. I've got two IPs configured in /etc/network/interfaces, one for my LAN network and another for Ceph.
 
For node "cp2" that shows offline, could you post that server's output of

#cat /etc/corosync/corosync.conf
 
For node "cp2" that shows offline, could you post that server's output of

#cat /etc/corosync/corosync.conf

cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cp1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: cp1
  }
  node {
    name: cp2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: cp2
  }
  node {
    name: cp3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cp3
  }
  node {
    name: cp4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cp4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cpx
  config_version: 4
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
Hi jarenas,

I've had a talk with Alwin; the root cause could be either your switch and multicast packets, or an incorrect /etc/hosts on each node.

First make sure you can resolve "cpX" from each node; if that works, have a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
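For example, something like this on each node should return the address you expect for every node (adjust the names to yours):

Code:
getent hosts cp1 cp2 cp3 cp4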

Start omping on each node, then restart corosync. If omping reports errors at the same time a host goes down in PVE, it's your switch.
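The admin guide linked above has the exact test; from memory it is roughly this, run on all nodes at the same time (substitute your own hostnames):

Code:
omping -c 10000 -i 0.001 -F -q cp1 cp2 cp3 cp4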

Ask here if you need help with your switch.

I hope you get that fixed.
 
Hi jarenas,

I've had a talk with Alwin; the root cause could be either your switch and multicast packets, or an incorrect /etc/hosts on each node.

First make sure you can resolve "cpX" from each node; if that works, have a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

Start omping on each node, then restart corosync. If omping reports errors at the same time a host goes down in PVE, it's your switch.

Ask here if you need help with your switch.

I hope you get that fixed.
Yes, it is a switch problem (with multicast), so I finally tried to convert the cluster to unicast, following the steps on this page:

https://pve.proxmox.com/wiki/Multicast_notes

But I had problems:

https://forum.proxmox.com/threads/need-to-restore-corosync-conf-file.43286/#post-207623

This is my original corosync.conf, and I have marked with arrows the things that I think I have to modify in the file; a sketch of how I think the totem section would end up is below the config:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cp1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: cp1
  }
  node {
    name: cp2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: cp2
  }
  node {
    name: cp3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cp3
  }
  node {
    name: cp4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cp4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  <-------------------------------------------------- Here I think that I have to add "transport: udpu"

  cluster_name: cp-oficina
  config_version: 4
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2 <------------------------------------ And here I think that I have to change the version
}
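As far as I understand the Multicast notes page, the totem section would end up roughly like this (config_version bumped by one, transport line added; just my sketch, not a tested config):

Code:
totem {
  cluster_name: cp-oficina
  config_version: 5
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}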
 
Don't go UNICAST, you'll likely get a lot of trouble (lots of traffic); fix your switch :)
I would like to, but at this moment it's not possible because the switch has a lot of important configuration on it and it needs an update to work with multicast, and if we update it, maybe this configuration won't work anymore.

Thanks, regards!
 
Hello, I just had the same problem after a power outage. Just run
Code:
systemctl restart corosync
and it's connected to the cluster again, just like that.

Thank you.
 
