Hosts in GUI turn red every few days until reboot

athompso

Renowned Member
Sep 13, 2013
I've got a 3-node PVE 4.1 cluster running using UDPU.
"pvecm status" and "pvecm nodes" indicate all is well.
I can SSH back and forth between all three nodes at will.
After a fresh boot, I can manage all three hosts from any of the three hosts.

But after about a week or so, each host can only manage itself, and shows the other two hosts as unreachable/down with the red icon instead of the green icon in the GUI.

Restarting pvestatd does not clear up the problem.
I believe restarting pvemanager does, but that restarts all the VMs, too!
This has been a recurring problem for many people since 2.x, surely someone knows a way to resolve it???

Other info: pveproxy does not show any errors, nor any access attempts. Corosync claims all is well, and is not logging any errors or info at all.

Help?
Code:
root@pve3:~# pveversion -v
proxmox-ve: 4.1-41 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
 
I believe restarting pvemanager does, but that restarts all the VMs, too!

What do the logs say? Use journalctl, or filter it a bit with:
Code:
journalctl -u pve-cluster.service -u pve-manager.service

Can you try restarting pve-cluster or pveproxy/pvedaemon (that won't restart VMs)?
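For reference, on PVE 4.x that would be something like the following (guests keep running; service names may differ slightly on other versions):
Code:
systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy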
 
What do the logs say? Use journalctl, or filter it a bit with:
Code:
journalctl -u pve-cluster.service -u pve-manager.service
Can you try restarting pve-cluster or pveproxy/pvedaemon (that won't restart VMs)?

Firstly: output is too long to paste here, so http://pastebin.com/Bam4YURH

Secondly: when I try "service pve-cluster restart" on pve3, I get these additional log entries:
Code:
Apr 20 07:22:54 pve3 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Apr 20 07:22:54 pve3 pmxcfs[1519]: [main] notice: teardown filesystem
Apr 20 07:22:57 pve3 pmxcfs[1519]: [main] notice: exit proxmox configuration filesystem (0)
Apr 20 07:22:57 pve3 systemd[1]: Starting The Proxmox VE cluster filesystem...
Apr 20 07:22:57 pve3 pmxcfs[10328]: [status] notice: update cluster info (cluster name  SCOUT3, version = 1)
Apr 20 07:22:57 pve3 pmxcfs[10328]: [status] notice: node has quorum
Apr 20 07:22:57 pve3 pmxcfs[10328]: [dcdb] notice: members: 1/30818, 2/1595, 3/10328
Apr 20 07:22:57 pve3 pmxcfs[10328]: [dcdb] notice: starting data syncronisation
Apr 20 07:22:57 pve3 pmxcfs[10328]: [status] notice: members: 1/30818, 2/1595, 3/10328
Apr 20 07:22:57 pve3 pmxcfs[10328]: [status] notice: starting data syncronisation
Apr 20 07:22:58 pve3 pvecm[10330]: ipcc_send_rec failed: Connection refused
Apr 20 07:22:58 pve3 pvecm[10330]: ipcc_send_rec failed: Connection refused
Apr 20 07:22:58 pve3 pvecm[10330]: ipcc_send_rec failed: Connection refused
Apr 20 07:23:26 pve3 pmxcfs[10328]: [dcdb] notice: received sync request (epoch 1/30818/00000009)
Apr 20 07:23:26 pve3 pmxcfs[10328]: [status] notice: received sync request (epoch 1/30818/00000009)
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: received all states
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: leader is 1/30818
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: synced members: 1/30818, 2/1595
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: waiting for updates from leader
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: dfsm_deliver_queue: queue length 4
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: all data is up to date
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 2
Apr 20 07:23:31 pve3 pmxcfs[10328]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Apr 20 07:23:31 pve3 pmxcfs[10328]: [status] notice: received all states
Apr 20 07:23:31 pve3 pmxcfs[10328]: [status] notice: all data is up to date
Apr 20 07:23:31 pve3 pmxcfs[10328]: [status] notice: dfsm_deliver_queue: queue length 86
Apr 20 07:23:31 pve3 systemd[1]: Started The Proxmox VE cluster filesystem.

...and suddenly pve1, pve2 and pve3 can all see each other.

So all I had to do was restart pve-cluster. But why do I have to do this periodically in the first place?
(FYI - these hosts only have a community subscription, so I won't be opening a ticket for this. Bugzilla, maybe, if I can narrow it down.)
 
I had a similar issue. Turning off IGMP snooping on my switches solved the issue for me.
 
Firstly: output is too long to paste here, so http://pastebin.com/Bam4YURH

Hmm, I forgot to add -u corosync to my command. Btw, you can narrow journalctl's output down further with --since/--until, which take a date/time string (look at the man page if interested).
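For example (the time strings are only placeholders for the window in which a node went red):
Code:
journalctl -u pve-cluster -u corosync --since "2016-04-18 00:00" --until "2016-04-20 08:00"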

Otherwise, yes: your node loses quorum, which is pretty much the reason it goes red, and the cluster filesystem isn't able to reconnect after quorum is established again. Additional corosync log output may help more here.

So all I had to do was restart pve-cluster. But why do I have to do this periodically in the first place?
I had a similar issue. Turning off IGMP snooping on my switches solved the issue for me.

micush may be right: switch problems with multicast could be a reason for this, although 5 days is a little long for them to appear, IMO. But I would follow his advice and disable IGMP snooping in the switch settings if possible.
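If the cluster traffic passes through Linux bridges on the hosts, snooping can also be checked and disabled there, and multicast between the nodes can be tested with omping (vmbr0 is just an example bridge name, and omping has to be installed separately):
Code:
cat /sys/class/net/vmbr0/bridge/multicast_snooping
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping
omping -c 600 -i 1 -q pve1 pve2 pve3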
 
Pretty sure it's not an IGMP problem since, as I said, I'm running in UDPU mode because I have no multicast capability whatsoever between the nodes.
 
Pretty sure it's not an IGMP problem since, as I said, I'm running in UDPU mode because I have no multicast capability whatsoever between the nodes.
Sorry, missed that... For a reliable cluster configuration multicast is really recommended. If it is not available, at least make sure the cluster network has no congestion; at best it would be on its own network. Otherwise, even on a small network, unicast can simply be too slow / problem-prone.

Can you ensure a latency of < 2ms (<1ms would be better)?
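A quick check from each node (hostnames as used in this thread) could be something like:
Code:
ping -c 100 -q pve1
ping -c 100 -q pve2
The rtt min/avg/max/mdev summary at the end shows whether you stay under 1-2ms; a high mdev would hint at occasional spikes.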
 
There should not be; although latency is normally well under 1ms, it's possible there are occasional spikes.
I would have expected the upper layers to recover automatically once quorum was reestablished, though...?
Manually restarting one daemon from time to time isn't a big deal, just a nuisance.
I'm not crazy enough to attempt HA in this environment; I just wanted a single management console for 3 mostly-independent servers so I could clone templates across nodes.
-Adam
 
I would have expected the upper layers to recover automatically once quorum was reestablished, though...?

Normally that could be expected; your setup is a little less tested, though, and such spikes plus a lot of traffic on the network could make some re-connection attempts fail. It's hard to tell without reproducing it.
Can you add the output of the following command to a pastebin another time? Sorry that I forgot to include the corosync service in my first post.
Code:
journalctl -u pve-cluster  -u corosync


Manually restarting one daemon from time to time isn't a big deal, just a nuisance.
I'm not crazy enough to attempt HA in this environment; I just wanted a single management console for 3 mostly-independent servers so I could clone templates across nodes.

Understandable. You could try tweaking the corosync config in favor of your setup, e.g. use a higher token timeout (and also the consensus timeout, as they are related). Give man corosync.conf a read and/or ask on the clusterlabs list; there are always a few corosync people around who may know better tweaking options than I do, as I do not have that much experience with unicast compared to multicast.
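Just as a rough sketch of what such a tweak could look like in the totem section (the timeout values are only illustrative, not tested recommendations; iirc on PVE 4.x the cluster-wide copy lives in /etc/pve/corosync.conf and config_version has to be incremented on every change):
Code:
totem {
  version: 2
  cluster_name: SCOUT3
  config_version: 3
  transport: udpu
  token: 10000
  consensus: 12000
}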
 
