Nodes keep dropping out

wheelsca

New Member
Jan 18, 2019
I am new here and new to Proxmox, and I am having an issue: I have rebuilt a cluster twice now and keep having nodes drop out.

The first time I had a couple of NFS shares, a couple of Windows VMs, and a Vesta web server. The second time I ran the same VMs but no shares.

Hopefully someone has some ideas about what's going on.

Thanks.
 

Attachments

  • droppednodes.jpg (335.2 KB)
Hello,

We are seeing the same problem as wheelsca mentions above. We rebuilt our Proxmox cluster from 4.3 to 5.3, using the same network switches and server hardware, and we see random nodes disappear, which then results in loss of quorum. The way we can get around it without restarting most of the nodes is to restart three services: pvestatd, pvedaemon and pveproxy (the exact commands are shown after the log below). There is one node which we reloaded again yesterday and gave a different IP, hostname, etc., hoping this might help. It runs for about 10 minutes and then goes offline in the web GUI. The logs show the following:

Mar 19 09:15:01 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:02 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:03 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:04 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:05 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:06 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:07 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:08 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:09 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:10 pve5n pvesr[6291]: error with cfs lock 'file-replication_cfg': no quorum!
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 19 09:15:10 pve5n systemd[1]: Failed to start Proxmox VE replication runner.
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Unit entered failed state.
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Failed with result 'exit-code'.

We do end up rebooting this node, but then it goes offline again after a while, and so on.
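For reference, the workaround mentioned above is nothing more than restarting the three Proxmox services on the affected node (no custom scripts involved); something along these lines:

# on the node that has gone offline in the GUI
systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy

That brings the node back into the web GUI for a while, but it does not address whatever is breaking quorum underneath.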

Any ideas what is causing this? When we were running Proxmox 4.3 on the same server hardware and the same switch we never had any of these issues. We have tried a lot of things to resolve it, but we are now at a bit of a loss...

We have performed the multicast test and everything looks fine - again, we were running Proxmox 4.3 on the same switch without any problems.
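For anyone wanting to reproduce it, the multicast test we mean is the usual omping test from the Proxmox documentation, run on all nodes at the same time. Roughly (the hostnames are just our node names; adjust for your cluster):

# short burst test (~10 seconds), checks basic multicast delivery
omping -c 10000 -i 0.001 -F -q pve1 pve2 pve3 pve4 pve5n pve6n pve7

# longer test (~10 minutes), catches IGMP snooping/querier timeouts
omping -c 600 -i 1 -q pve1 pve2 pve3 pve4 pve5n pve6n pve7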

Any ideas?

Thanks,
 
Hello,

Does anyone have any suggestions on what to try? We have still been unable to resolve the problem.

If you need any more information please let me know.

Thanks,
 
Hi, seeing the same issues with 5.3.

The replication ports are on:
11.0.0.1, 11.0.0.2, 11.0.0.3, 11.0.0.4, 11.0.0.15, 11.0.0.16 and 11.0.0.7, with subnet mask 255.255.255.224 (broadcast 11.0.0.31).
Running through an HP 1820 switch; multicast is working.

corosync.conf

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 194.1.1.81
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 194.1.1.82
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 194.1.1.83
  }
  node {
    name: pve4
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 194.1.1.84
  }
  node {
    name: pve5n
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 194.1.1.95
  }
  node {
    name: pve6n
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 194.1.1.96
  }
  node {
    name: pve7
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 194.1.1.87
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: hostingcluster
  config_version: 26
  interface {
    bindnetaddr: 194.1.1.81
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
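In case it is useful, the commands we use to sanity-check the ring and quorum state on each node are just the stock tools, nothing custom:

corosync-cfgtool -s     # local ring status for ring 0
corosync-quorumtool -s  # quorum/membership as corosync sees it
pvecm status            # the Proxmox view of the same information

The pvecm status output below was taken with one node (pve5n) already missing.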

pvecm status - pve5n gone missing (194.1.1.95)

root@pve1:/etc/corosync# pvecm status
Quorum information
------------------
Date:             Mon Mar 25 10:46:20 2019
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          1/295504
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 194.1.1.81 (local)
0x00000002          1 194.1.1.82
0x00000003          1 194.1.1.83
0x00000005          1 194.1.1.84
0x00000004          1 194.1.1.87
0x00000007          1 194.1.1.96
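On the missing node itself, the corosync and pve-cluster logs around the time it disappears can be checked with something like the following (standard systemd units; adjust the time window as needed):

# run on the node that went missing (pve5n)
journalctl -u corosync -u pve-cluster --since "1 hour ago"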

Screenshot: upload_2019-3-25_10-53-48.png

Any suggestions at all??
 
