Primary cluster NIC stopped communicating

cdukes

Active Member
Sep 11, 2015
88
4
28
Raleigh, NC
www.logzilla.net
I can't figure out why this is happening.
I have 3 servers, and all of them can ping each other on the backend cluster VLAN except the primary.
The primary (192.168.29.14) can ping all other nodes on their other VLANs, but not on the cluster VLAN:
vlan 29 - cluster comms - no ping
vlan 28 - works
vlan 10 - works

As a result, the cluster won't come up.
Nothing has changed on the switch these are connected to.
However, the problem started when I tried to add a new server to the cluster yesterday via the web UI.
Code:
# pveversion
pve-manager/6.0-15/52b91481 (running kernel: 5.0.21-5-pve)
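
For anyone who wants to repeat the ping checks described above, here is a rough sketch of how they could be scripted. The function name and the PING_CMD override are made up for illustration; only 192.168.29.14 comes from my setup, and the other node addresses would have to be filled in.

```shell
# Ping each address once and report whether it answered.
# PING_CMD is a hypothetical override (handy for a dry run); it defaults to the system ping.
check_reachability() {
    for addr in "$@"; do
        # -c 1: send a single probe, -W 1: wait at most one second for the reply (Linux iputils ping)
        if ${PING_CMD:-ping} -c 1 -W 1 "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr unreachable"
        fi
    done
}

# Example call, run from each node in turn:
#   check_reachability 192.168.29.14 <other-node-ip> <other-node-ip>
```

Running the same call from every node gives a quick picture of which direction fails on which VLAN.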

I'm out of ideas and need some help :)


 

cdukes

I have verified 100% that the port on the switch this is connected to is correct
I have replaced the cable (even though there were no errors on the switch - just in case)
Both the switch and the server (ethtool) say the port is up
When the cable is unplugged, both the server and the switch show the interface down (proving it is in the correct port)
I have verified that the port on the switch is vlan 29 (correct vlan)
I even tried plugging into a different switch in the same rack and it still had the same problem.
So that eliminates Layers 1 and 2 from being the problem
I have rebooted all 3 servers
pve0 can ping itself but nothing else (not even the switch)
pve1 and pve2 can ping each other (and the switch), but can't ping pve0
I have also verified that all servers have matching /etc/hosts
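
The /etc/hosts comparison can be done by checksum rather than by eye. This is only a sketch: it assumes a copy of each node's /etc/hosts has already been fetched to the local machine (for example with scp), and the function name and file paths are hypothetical.

```shell
# Compare the content of several files by SHA-256 checksum.
# Prints "match" if all are identical, otherwise names the first file that differs.
hosts_match() {
    first_sum=$(sha256sum < "$1" | cut -d' ' -f1)
    for f in "$@"; do
        sum=$(sha256sum < "$f" | cut -d' ' -f1)
        if [ "$sum" != "$first_sum" ]; then
            echo "differ: $f"
            return 1
        fi
    done
    echo "match"
}

# Example call (paths are placeholders for the fetched copies):
#   hosts_match /tmp/hosts.pve0 /tmp/hosts.pve1 /tmp/hosts.pve2
```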
The only thing I can think of is that this HAS to be a Proxmox issue. But there are no errors in dmesg and the only errors in the logs are:


Dec 5 23:13:08 pve0 pvesr[2467177]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 5 23:13:09 pve0 pvesr[2467177]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 5 23:13:10 pve0 pvesr[2467177]: error with cfs lock 'file-replication_cfg': no quorum!
Dec 5 23:13:10 pve0 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Dec 5 23:13:10 pve0 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Dec 5 23:13:10 pve0 systemd[1]: Failed to start Proxmox VE replication runner.
Dec 5 23:14:00 pve0 systemd[1]: Starting Proxmox VE replication runner...
Dec 5 23:14:01 pve0 pvesr[2467422]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 5 23:14:02 pve0 pvesr[2467422]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 5 23:14:03 pve0 pvesr[2467422]: trying to acquire cfs lock 'file-replication_cfg' ...
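
To see how often the quorum error shows up, something like the following could be used. It is a minimal sketch that simply counts lines containing "no quorum" on standard input; the function name is made up, and the log location in the example call is an assumption about a default install.

```shell
# Count log lines that report a loss of quorum; reads the log from stdin.
count_no_quorum() {
    # grep -c prints the number of matching lines; "|| true" keeps the exit
    # status zero when there are no matches (grep itself returns 1 then).
    grep -c 'no quorum' || true
}

# Example calls (log source is an assumption, adjust as needed):
#   count_no_quorum < /var/log/syslog
#   journalctl -u pvesr.service | count_no_quorum
```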
 
