jarenas

Member
Mar 7, 2018
I have created a new cluster with 4 nodes. The problem is that when I reboot them, all of them are working at first, but after some minutes some of them say they are disconnected.

[Attached screenshot: Captura.jpg]


When this happens, I execute this in the node's shell:

service corosync restart

After executing this command the node is online again, but then the next one (cp3) turns "offline".

I don't know what is happening.

Can somebody help me?

Thanks, regards!
 
Hi,

try restarting pvestatd on the whole cluster.
 
Try all 4

service pve-cluster restart
service pveproxy restart
service pvedaemon restart
service pvestatd restart

I've run these commands on three nodes, but on one of them I'm getting this:

root@cp2:~# service pve-cluster restart
Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xe" for details.


journalctl output:

Apr 15 11:15:28 cp2 pvestatd[232476]: status update error: Connection refused
Apr 15 11:15:29 cp2 pveproxy[510975]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510976]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510975 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: starting 1 worker(s)
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510976 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510978 started
Apr 15 11:15:29 cp2 pveproxy[510977]: worker exit
Apr 15 11:15:29 cp2 pveproxy[510978]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510977 finished
Apr 15 11:15:29 cp2 pveproxy[510791]: starting 2 worker(s)
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510979 started
Apr 15 11:15:29 cp2 pveproxy[510791]: worker 510980 started
Apr 15 11:15:29 cp2 pveproxy[510979]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:29 cp2 pveproxy[510980]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[1] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[2] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 11:15:32 cp2 pve-ha-lrm[4838]: ipcc_send_rec[3] failed: Connection refused
Apr 15 11:15:34 cp2 pveproxy[510978]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510978 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: starting 1 worker(s)
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510986 started
Apr 15 11:15:34 cp2 pveproxy[510979]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510980]: worker exit
Apr 15 11:15:34 cp2 pveproxy[510986]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510979 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510980 finished
Apr 15 11:15:34 cp2 pveproxy[510791]: starting 2 worker(s)
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510987 started
Apr 15 11:15:34 cp2 pveproxy[510791]: worker 510988 started
Apr 15 11:15:34 cp2 pveproxy[510987]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.
Apr 15 11:15:34 cp2 pveproxy[510988]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1642.


I think this is happening because it can't mount the pve configuration filesystem.
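If it helps, I assume a quick way to check is whether /etc/pve still shows up as a fuse mount (findmnt is standard util-linux; this is just my guess at a useful check, not something from the docs):

Code:
findmnt /etc/pve
systemctl status pve-cluster

If findmnt prints nothing, the configuration filesystem is not mounted.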
 
What does
Code:
$ systemctl
show? Any errors?

Especially look for "pve-ha-crm.service".
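To list only the failed units, something like this should also work:

Code:
systemctl --failed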

These have errors:

pve-cluster.service loaded failed failed The Proxmox VE cluster filesystem
pvesr.service loaded failed failed Proxmox VE replication runner
zfs-mount.service loaded failed failed Mount ZFS filesystems
zfs-share.service loaded failed failed ZFS file system shares

And I also get this when I execute systemctl status pve-ha-crm.service:

systemctl status pve-ha-crm.service
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2018-04-12 16:44:08 CEST; 2 days ago
Main PID: 3460 (pve-ha-crm)
Tasks: 1 (limit: 36864)
Memory: 81.2M
CPU: 19.103s
CGroup: /system.slice/pve-ha-crm.service
└─3460 pve-ha-crm

Apr 15 13:00:00 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:05 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:10 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[1] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[2] failed: Connection refused
Apr 15 13:00:15 cp2 pve-ha-crm[3460]: ipcc_send_rec[3] failed: Connection refused


Thanks! Regards

 
Do you have the right IP set for "pvelocalhost"? I've seen this before where pvelocalhost was wrong, but that was a single node.
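For reference, a standard install usually has a line in /etc/hosts that maps the node's cluster IP to its hostname with a "pvelocalhost" alias, roughly like this (the address and domain below are made up, not taken from your setup):

Code:
10.85.20.102  cp2.yourdomain.local cp2 pvelocalhost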
I don't know what you mean by the IP set for pvelocalhost. I've got two IPs configured in /etc/network/interfaces, one for my LAN network and another for Ceph.
 
For node "cp2" that shows offline, could you post that server's output of

#cat /etc/corosync/corosync.conf
 
For node "cp2" that shows offline, could you post that server's output of

#cat /etc/corosync/corosync.conf

cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cp1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: cp1
  }
  node {
    name: cp2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: cp2
  }
  node {
    name: cp3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cp3
  }
  node {
    name: cp4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cp4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cpx
  config_version: 4
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
Hi jarenas,

I've had a talk with Alwin; the root cause could be either your switch and multicast packets, or an incorrect /etc/hosts on each node.

First make sure you can resolve "cpX" from each node; if that works, have a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
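For example, something like this on each node should return the address you expect for every node (adjust the names to yours):

Code:
getent hosts cp1 cp2 cp3 cp4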

Start omping on each node, then restart corosync. If omping reports errors at the same time a host goes down in PVE, it's your switch.
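The admin guide linked above has the exact test; from memory it is roughly this, run on all nodes at the same time (substitute your own hostnames):

Code:
omping -c 10000 -i 0.001 -F -q cp1 cp2 cp3 cp4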

Ask here if you need help with your switch.

I hope you get that fixed.
 
Hi jarenas,

I've had a talk with Alwin; the root cause could be either your switch and multicast packets, or an incorrect /etc/hosts on each node.

First make sure you can resolve "cpX" from each node; if that works, have a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

Start omping on each node, then restart corosync. If omping reports errors at the same time a host goes down in PVE, it's your switch.

Ask here if you need help with your switch.

I hope you get that fixed.
Yes, it is a switch problem (with multicast), so I finally tried to convert the cluster to unicast, following the steps on this page:

https://pve.proxmox.com/wiki/Multicast_notes

But I had problems:

https://forum.proxmox.com/threads/need-to-restore-corosync-conf-file.43286/#post-207623

This is my original corosync.conf, and I have marked with arrows the things that I think I have to modify in the file; a sketch of how I think the totem section would end up is below the config:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cp1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: cp1
  }
  node {
    name: cp2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: cp2
  }
  node {
    name: cp3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cp3
  }
  node {
    name: cp4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: cp4
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  <-------------------------------------------------- Here I think that I have to add "transport: udpu"

  cluster_name: cp-oficina
  config_version: 4
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2 <------------------------------------ And here I think that I have to change the version
}
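As far as I understand the Multicast notes page, the totem section would end up roughly like this (config_version bumped by one, transport line added; just my sketch, not a tested config):

Code:
totem {
  cluster_name: cp-oficina
  config_version: 5
  interface {
    bindnetaddr: 10.85.20.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}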
 
Don't go UNICAST, you'll likely get a lot of trouble (lots of traffic); fix your switch :)
I would like to, but at this moment it's not possible because the switch has a lot of important configuration on it and it needs an update to work with multicast, and if we update it, maybe this configuration won't work anymore.

Thanks, regards!
 
Hello, I just had the same problem after a power outage. Just run
Code:
systemctl restart corosync
and it's connected to the cluster again, just like that.

Thank you.
 
