[SOLVED] Issue with cluster - showing red X'es and possible multicast problems

Mor H.

New Member
Jun 3, 2019
Hey everyone,

We run a cluster behind 2x Arista DCS-7050T-64-R switches, both in the same VLAN (512) and connected to each other.

The cluster has 9 nodes running Proxmox v4.4 (6 hypervisors, the rest storage nodes).

We had IGMP snooping fully disabled on the VLAN with this command:
Code:
no ip igmp snooping vlan 512
Everything worked fine for about 10 days, but today the cluster suddenly dropped out of nowhere.

The issue we are experiencing started with the web GUI suddenly showing red X'es next to each node; the problem seems to lie in the pvestatd daemon (at least that is my suspicion, though I'm not sure).

Several hypervisors kept hanging on the command 'pvecm nodes', but after restarting the pve-cluster service the command started working again (the X'es remained red).

It seems like when we fix one node by restarting the service, another node that worked before starts having the same issue.

We tried enabling IGMP snooping with a querier address, but that didn't solve it.

switch1:
Code:
ip igmp snooping vlan 512 querier
ip igmp snooping vlan 512 querier address 1.1.1.1

switch2:
Code:
ip igmp snooping vlan 512 querier
ip igmp snooping vlan 512 querier address 2.2.2.2

Code:
5.3-CLUS01#show ip igmp snooping vlan  512
  Global IGMP Snooping configuration:
-------------------------------------------
IGMP snooping                  : Enabled
IGMPv2 immediate leave         : Enabled
Robustness variable            : 2
Report flooding                : Disabled

Vlan 512 :
----------
IGMP snooping                  : Enabled
IGMPv2 immediate leave         : Default
Multicast router learning mode : pim-dvmrp
IGMP max group limit           : No limit set
Recent attempt to exceed limit : No
Report flooding                : Default
IGMP snooping pruning active   : True
Flooding traffic to VLAN       : False

We ran this command on all nodes:
Code:
omping -c 600 -i 1 -q <list of all hypervisors>

The results were:
Node11 and Node6 can't communicate at all; Node11 right away said:
Code:
Node6: server told us to stop

All other nodes could communicate, but returned something along these lines, pointing at a possible problem with Node6:
Code:
node06 :   unicast, xmt/rcv/%loss = 423/423/0%, min/avg/max/std-dev = 0.066/0.174/1.652/0.112
node06 : multicast, xmt/rcv/%loss = 423/423/0%, min/avg/max/std-dev = 0.079/0.192/1.657/0.111
node07 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.138/1.586/0.128
node07 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.148/1.609/0.129
node08 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.052/0.195/3.238/0.267
node08 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.058/0.211/3.242/0.269
node09 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.270/6.758/0.534
node09 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.067/0.285/6.804/0.536
node11 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.147/1.613/0.139
node11 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.066/0.158/1.618/0.139

Now this happens on Node6, but not on other nodes:
Code:
root@node06:~# cat /etc/pve/corosync.conf
cat: /etc/pve/corosync.conf: Transport endpoint is not connected

Here is some output that /var/log/syslog printed earlier today (CEST):
Code:
Jun 18 13:48:26 node05 pve-firewall[3285]: firewall update time (10.002 seconds)
Jun 18 13:48:27 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26850
Jun 18 13:48:27 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 10
Jun 18 13:48:28 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26860
Jun 18 13:48:28 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 20
Jun 18 13:48:29 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26870
Jun 18 13:48:29 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 30
Jun 18 13:48:30 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26880
Jun 18 13:48:30 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 40
Jun 18 13:48:31 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26890
Jun 18 13:48:31 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 50
Jun 18 13:48:32 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26900
Jun 18 13:48:32 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 60
Jun 18 13:48:33 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26910
Jun 18 13:48:33 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 70
Jun 18 13:48:34 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26920
Jun 18 13:48:34 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 80
Jun 18 13:48:35 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26930
Jun 18 13:48:35 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 90
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_leave retry 26940
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_send_message retry 100
Jun 18 13:48:36 node05 pmxcfs[3056]: [status] notice: cpg_send_message retried 100 times
..........
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328
Jun 18 18:03:01 node05 pmxcfs[39690]: [status] notice: remove message from non-member 10/91328

Can anyone help us debug this without having to reboot all hypervisors?

If more information is required, please let us know.
 
Things that should help in finding the issue:
check (and post) the output of:
* `pvecm status`

* check the journal (`journalctl -r` gives you the complete log in reverse order, newest entries first) for messages from 'pmxcfs', 'corosync', and other potential network-related problems.

* check the switch logs for anything related to your issues.

* compare the configs '/etc/corosync/corosync.conf' on _all_ nodes - they should contain the same data! Note the complete path: '/etc/corosync/corosync.conf' is the file that corosync uses for starting up and providing the necessary synchronization for pmxcfs (the Proxmox cluster filesystem), which is mounted on '/etc/pve'. '/etc/pve/corosync.conf' is synchronized by pmxcfs and copied over when changed, but your pmxcfs is currently not running properly:
root@node06:~# cat /etc/pve/corosync.conf
cat: /etc/pve/corosync.conf: Transport endpoint is not connected

* please run the exact `omping` commands from https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network on all cluster nodes at the same time and post the output. The first shows whether the latency is acceptable even when much traffic goes over the network; the second one usually spots problems with (missing) multicast queriers.

* With a consistent version of '/etc/corosync/corosync.conf' you can try to restart the corosync.service on all nodes. Note that if you currently have a quorate partition (at least half of your nodes seeing each other), it might lose quorum too, which would prevent you from making changes in that partition.

* Try to restart pmxcfs (`systemctl restart pve-cluster.service`) while keeping an eye on the logs

* Should the 'Transport endpoint is not connected' issue not go away, you might need to unmount /etc/pve first before restarting pmxcfs (see the combined sketch after this list):
`fusermount -u /etc/pve`
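Putting those steps together, a minimal sketch of the sequence on one node (the md5sum comparison is just one convenient way to verify the configs match; adjust to your setup):
Code:
# compare /etc/corosync/corosync.conf across nodes - the checksum must be identical everywhere
md5sum /etc/corosync/corosync.conf

# only if the 'Transport endpoint is not connected' error persists:
fusermount -u /etc/pve

# restart corosync first, then pmxcfs, while following the journal in another shell
systemctl restart corosync.service
systemctl restart pve-cluster.service
journalctl -f -u corosync -u pve-cluster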

We run a cluster behind 2x Arista DCS-7050T-64-R switches, both in the same vlan (512).
They are connected to each other also.
I assume that they also see each other's traffic of VLAN 512?

We had igmp snooping disabled in full on the vlan with command:
Usually you want your switches to snoop IGMP traffic so that they know where to forward the multicast messages - see https://pve.proxmox.com/wiki/Multicast_notes for some background.

The cluster has 9 nodes running Proxmox v4.4 (6 hypervisors, rest storage nodes).
On a side note - PVE 4.4 has been EOL for about a year now - please consider upgrading soon; there have been quite a few security issues in the meantime!


Hope this helps!
 
Of course, proceed as @Stoiko Ivanov said above. Also, you might consider:

1. Find the multicast address that corosync is using on the nodes: `netstat -uln`. You should see 'udp 239.x.x.x:nnnn' under Local Address.

2. On both switches under the VLAN 512 interface, configure `ip igmp static-group <mcast group>` using the address from step 1.

3. Repeat the omping commands while specifying that multicast address.
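For example (a hedged sketch - the group address and port shown below are only illustrative; use whatever your nodes actually report):
Code:
# on any cluster node: list UDP listeners and pick out the 239.x.x.x group
netstat -uln | grep 239
# udp        0      0 239.192.34.12:5405      0.0.0.0:*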

EDIT: bad BB codes
 

Thanks for your help! Here's what we got:

Code:
root@hyp06-nl-ams:/etc/corosync# pvecm status
pve configuration filesystem not mounted
root@hyp06-nl-ams:/etc/corosync# journalctl -r
-- Logs begin at Fri 2019-06-14 23:33:58 CEST, end at Tue 2019-06-18 20:04:33 CEST. --
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:33 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:32 hyp06-nl-ams.mydomain.com sshd[33452]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=181.191.241.6
Jun 18 20:04:32 hyp06-nl-ams.mydomain.com sshd[33452]: pam_unix(sshd:auth): check pass; user unknown
Jun 18 20:04:32 hyp06-nl-ams.mydomain.com sshd[33452]: input_userauth_request: invalid user smmsp [preauth]
Jun 18 20:04:32 hyp06-nl-ams.mydomain.com sshd[33452]: Invalid user smmsp from 181.191.241.6
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:28 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:23 hyp06-nl-ams.mydomain.com pve-ha-crm[2774]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pvestatd[2731]: ipcc_send_rec failed: Connection refused
Jun 18 20:04:18 hyp06-nl-ams.mydomain.com pve-ha-lrm[2784]: ipcc_send_rec failed: Connection refused

The /etc/corosync/corosync.conf file has been checked on hyp06 and it matches all the others.

I have tried unmounting /etc/pve and now no data exists there. Restarting the service ('systemctl restart pve-cluster.service') keeps failing even after the unmount:
Code:
root@hyp06-nl-ams:/etc/corosync# systemctl restart pve-cluster.service
Job for pve-cluster.service failed. See 'systemctl status pve-cluster.service' and 'journalctl -xn' for details.
root@hyp06-nl-ams:/etc/corosync# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
  Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
  Active: failed (Result: signal) since Tue 2019-06-18 20:07:39 CEST; 5s ago
 Process: 33520 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 33528 (code=killed, signal=KILL)
Jun 18 20:05:58 hyp06-nl-ams.mydomain.com pmxcfs[33528]: [dcdb] notice: starting data syncronisation
Jun 18 20:05:58 hyp06-nl-ams.mydomain.com pmxcfs[33528]: [dcdb] notice: received sync request (epoch 1/158816/00000016)
Jun 18 20:05:58 hyp06-nl-ams.mydomain.com pmxcfs[33528]: [status] notice: members: 1/158816, 4/1770, 5/39690, 6/33528, 7/3071, 8/1460, 9/1927, 10/91328
Jun 18 20:05:58 hyp06-nl-ams.mydomain.com pmxcfs[33528]: [status] notice: starting data syncronisation
Jun 18 20:06:02 hyp06-nl-ams.mydomain.com pmxcfs[33528]: [status] notice: received sync request (epoch 1/158816/00000016)
Jun 18 20:07:28 hyp06-nl-ams.mydomain.com systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Jun 18 20:07:39 hyp06-nl-ams.mydomain.com systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Jun 18 20:07:39 hyp06-nl-ams.mydomain.com systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Jun 18 20:07:39 hyp06-nl-ams.mydomain.com systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jun 18 20:07:39 hyp06-nl-ams.mydomain.com systemd[1]: Unit pve-cluster.service entered failed state.

All the nodes can communicate with each other within VLAN 512.

Regarding upgrading PVE 4.4: we've wanted to get around to this for some time now, but we've seen lots of issues that Proxmox users have had after the upgrade, and since we're running a production environment with lots of users, we're very concerned about what can happen afterwards. Would you say most upgrades of PVE across a cluster our size go without issues? How complex is the upgrade process and what should we expect?
 

Thanks!

That command doesn't work here; instead we must run this:
Code:
5.3-LAN01(config-vlan-512)#ip igmp snooping vlan 512 static 239.192.34.12 ?
 interface  Specify interface
But which interface should we add after the IP address?
 
I am not familiar with that Arista EOS syntax, but it should be: ip igmp snooping vlan <VLAN> static <IP address> interface <list of interfaces which should receive this traffic>.

EDIT: I googled it.
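Something along these lines should work (a hedged sketch - 'Ethernet10' is a placeholder; list the actual port(s) your cluster nodes are connected to and repeat on both switches):
Code:
5.3-LAN01(config)#vlan 512
5.3-LAN01(config-vlan-512)#ip igmp snooping vlan 512 static 239.192.34.12 interface Ethernet10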
 
The IGMP snooping static command has now been set up on all interfaces that are behind the cluster.

Thanks RokaKen.

Any ideas on what should be our next move with this rogue node?
 
Output of omping -c 10000 -i 0.001 -F -q
Code:
hyp06 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.046/0.129/7.753/0.295
hyp06 : multicast, xmt/rcv/%loss = 10000/9996/0% (seq>=5 0%), min/avg/max/std-dev = 0.052/0.137/7.755/0.297
hyp07 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.036/0.132/8.947/0.296
hyp07 : multicast, xmt/rcv/%loss = 10000/9995/0% (seq>=6 0%), min/avg/max/std-dev = 0.040/0.137/8.952/0.296
hyp08 :   unicast, xmt/rcv/%loss = 10000/9995/0%, min/avg/max/std-dev = 0.036/0.179/8.948/0.402
hyp08 : multicast, xmt/rcv/%loss = 10000/9991/0% (seq>=5 0%), min/avg/max/std-dev = 0.046/0.187/8.953/0.403
hyp09 :   unicast, xmt/rcv/%loss = 10000/9960/0%, min/avg/max/std-dev = 0.039/0.233/13.336/0.705
hyp09 : multicast, xmt/rcv/%loss = 10000/9956/0% (seq>=5 0%), min/avg/max/std-dev = 0.043/0.241/13.339/0.708
hyp11 :   unicast, xmt/rcv/%loss = 9786/9786/0%, min/avg/max/std-dev = 0.038/0.124/7.708/0.291
hyp11 : multicast, xmt/rcv/%loss = 9786/9782/0% (seq>=5 0%), min/avg/max/std-dev = 0.043/0.128/7.710/0.291

Output of omping -c 600 -i 1 -q
Code:
hyp05 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.120/0.234/2.801/0.217
hyp05 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.123/0.254/2.804/0.217
hyp06 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.085/0.227/2.748/0.173
hyp06 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.087/0.248/2.752/0.179
hyp07 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.062/0.139/2.715/0.168
hyp07 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.068/0.153/2.730/0.168
hyp09 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.096/0.237/4.421/0.312
hyp09 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.104/0.254/4.438/0.315
hyp11 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.073/0.159/2.737/0.172
hyp11 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.096/0.175/2.741/0.178

Output of systemctl status pve-cluster pveproxy pvedaemon
Code:
root@hyp09:~# systemctl status pve-cluster pveproxy pvedaemon
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Tue 2019-06-18 21:04:10 CEST; 1h 41min ago
  Process: 65724 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 65733 (code=killed, signal=KILL)
Jun 18 21:03:36 hyp09 pmxcfs[65733]: [dcdb] notice: queue not emtpy - resening 6 messages
Jun 18 21:03:36 hyp09 pmxcfs[65733]: [dcdb] notice: received sync request (epoch 1/158816/00000019)
Jun 18 21:03:36 hyp09 pmxcfs[65733]: [status] notice: members: 1/158816, 4/1770, 5/39690, 6/2521, 7/3071, 8/1460, 9/1927, 10/91328, 11/65733
Jun 18 21:03:36 hyp09 pmxcfs[65733]: [status] notice: queue not emtpy - resening 103767 messages
Jun 18 21:03:42 hyp09 pmxcfs[65733]: [status] notice: received sync request (epoch 1/158816/00000019)
Jun 18 21:03:59 hyp09 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Jun 18 21:04:10 hyp09 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Jun 18 21:04:10 hyp09 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Jun 18 21:04:10 hyp09 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jun 18 21:04:10 hyp09 systemd[1]: Unit pve-cluster.service entered failed state.
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: active (running) since Tue 2019-06-18 06:25:08 CEST; 16h ago
  Process: 21413 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
  Process: 21426 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
 Main PID: 21444 (pveproxy)
   CGroup: /system.slice/pveproxy.service
           ├─ 21444 pveproxy
           ├─ 88546 pveproxy worker
           ├─ 88551 pveproxy worker
           └─133714 pveproxy worker
Jun 18 22:45:19 hyp09 pveproxy[88529]: worker exit
Jun 18 22:45:19 hyp09 pveproxy[21444]: worker 88529 finished
Jun 18 22:45:19 hyp09 pveproxy[21444]: starting 1 worker(s)
Jun 18 22:45:19 hyp09 pveproxy[21444]: worker 88546 started
Jun 18 22:45:19 hyp09 pveproxy[88546]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/sha...ne 1618.
Jun 18 22:45:21 hyp09 pveproxy[88536]: worker exit
Jun 18 22:45:21 hyp09 pveproxy[21444]: worker 88536 finished
Jun 18 22:45:21 hyp09 pveproxy[21444]: starting 1 worker(s)
Jun 18 22:45:21 hyp09 pveproxy[21444]: worker 88551 started
Jun 18 22:45:21 hyp09 pveproxy[88551]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/sha...ne 1618.
● pvedaemon.service - PVE API Daemon
   Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled)
   Active: active (running) since Tue 2019-06-04 23:45:11 CEST; 1 weeks 6 days ago
 Main PID: 3352 (pvedaemon)
   CGroup: /system.slice/pvedaemon.service
           ├─  3352 pvedaemon
           ├─135948 pvedaemon worker
           ├─152837 pvedaemon worker
           └─170193 pvedaemon worker
Jun 18 22:32:41 hyp09 pvedaemon[135948]: <root@pam> successful auth for user 'root@pam'
Jun 18 22:32:41 hyp09 pvedaemon[135948]: writing cluster log failed: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: <root@pam> successful auth for user 'root@pam'
Jun 18 22:44:32 hyp09 pvedaemon[170193]: writing cluster log failed: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Jun 18 22:44:32 hyp09 pvedaemon[170193]: ipcc_send_rec failed: Connection refused
Hint: Some lines were ellipsized, use -l to show in full.

Running `systemctl restart pve-cluster.service` simply hangs. The logs don't show much more than what's shown above.

It's important to note that each time, a different node becomes the problematic one.
 
Ok, assuming you were running the omping on all 9 nodes, that looks better. Try the following on the current "problematic node", and then on each node that becomes problematic (if it does):

Code:
systemctl restart corosync pve-cluster

Then monitor the status of the services as you did previously.
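For example, one way to keep an eye on them after the restart (a hedged sketch; the service list is the one already discussed in this thread):
Code:
systemctl restart corosync pve-cluster
# in a second terminal:
watch -n 2 'systemctl is-active corosync pve-cluster pvestatd pveproxy pvedaemon'
journalctl -f -u corosync -u pve-cluster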
 
Your reply got me thinking, so I ran omping against all hypervisors and also all storage nodes.

It seems like hyp11 (one of the hypervisors) and nvme04 (one of the storage nodes) can't communicate over omping:
Code:
root@nvme04:~# omping -c 10000 -i 0.001 -F -q nvme04 hyp11
hyp11 : waiting for response msg
hyp11 : server told us to stop
hyp11 : response message never received

Hyp11 shows
Code:
nvme04 : waiting for response msg

What would cause this? How would you troubleshoot this?
 
I would check `ifconfig` on the nodes (link status) and `arp` for the MAC address of the other node's IP. On the switch(es), check the status of the ports and their traffic; if the switches are MLAG, also check MLAG status and traffic. If the problem is ONLY between those two nodes, it's hard to say remotely, but if one node has a problem with ALL other nodes, then change the cable, change the switchport, etc.
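A rough sketch of the node-side checks (hedged - 'vmbr0' and the 172.16.35.x addresses are placeholders; use your actual cluster bridge/interface and the other node's IP):
Code:
ifconfig vmbr0                 # link up? RX/TX errors or drops?
arp -n | grep 172.16.35.       # is there a valid MAC entry for the other node's IP?
ping -c 3 172.16.35.11         # basic unicast reachability between the two nodes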
 
The omping issue was a silly /etc/hosts file problem that is now fixed.

Here's our current situation: no VPSes are actually down, but the red Xs are still visible when accessing the web GUI.

All of the following services are Active on all nodes:
Code:
corosync pve-cluster pveproxy pvedaemon pvestatd

This worked after we killed the corosync process and restarted the services.

However, we're still seeing this in hyp11:
Code:
root@hyp11:~# systemctl status -l  pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled)
   Active: active (running) since Wed 2019-06-19 00:30:25 CEST; 1min 25s ago
  Process: 117610 ExecStop=/usr/bin/pvestatd stop (code=exited, status=0/SUCCESS)
  Process: 117633 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
 Main PID: 117636 (pvestatd)
   CGroup: /system.slice/pvestatd.service
           └─117636 pvestat
Jun 19 00:30:25 hyp11 pvestatd[117636]: starting server
Jun 19 00:30:25 hyp11 systemd[1]: Started PVE Status Daemon.
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused
Jun 19 00:30:35 hyp11 pvestatd[117636]: ipcc_send_rec failed: Connection refused

How can we troubleshoot this 'ipcc_send_rec failed: Connection refused' error?
 
Quick update:
Now all services (corosync pve-cluster pveproxy pvedaemon pvestatd) are running on all nodes, but there are still red Xs in the Proxmox panel.

The issues seem to be jumping from one node to the next.

omping had perfect results except for 1% loss between hyp11 and nvme03 and between hyp11 and nvme04.

We saw this in one of the nodes:
Code:
Jun 19 00:58:40 hyp09 corosync[106690]:  [QUORUM] Members[9]: 5 6 7 10 11 9 4 8 1
Jun 19 00:58:40 hyp09 corosync[106690]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 19 00:59:43 hyp09 corosync[106690]:  [TOTEM ] A processor failed, forming new configuration.
Jun 19 00:59:50 hyp09 corosync[106690]:  [TOTEM ] A new membership (172.16.35.10:32436) was formed. Members left: 10
Jun 19 00:59:50 hyp09 corosync[106690]:  [TOTEM ] Failed to receive the leave message. failed: 10
Jun 19 00:59:50 hyp09 corosync[106690]:  [QUORUM] Members[8]: 5 6 7 11 9 4 8 1
Jun 19 00:59:50 hyp09 corosync[106690]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 19 01:00:00 hyp09 corosync[106690]:  [TOTEM ] A new membership (172.16.35.10:32440) was formed. Members joined: 10
Jun 19 01:00:00 hyp09 corosync[106690]:  [QUORUM] Members[9]: 5 6 7 10 11 9 4 8 1
Jun 19 01:00:00 hyp09 corosync[106690]:  [MAIN  ] Completed service synchronization, ready to provide service.

What is going on here?

Any ideas?
 
Quick note:
The only consistent issue that I remember seeing throughout this whole thing is:
Code:
Jun 19 01:16:18 hyp05 pmxcfs[105036]: [status] notice: cpg_send_message retried 100 times
Jun 19 01:16:18 hyp05 pmxcfs[105036]: [status] crit: cpg_send_message failed: 6
Jun 19 01:16:19 hyp05 pmxcfs[105036]: [status] notice: cpg_leave retry 6120
Jun 19 01:16:19 hyp05 pmxcfs[105036]: [status] notice: cpg_send_message retry 10
Jun 19 01:16:20 hyp05 pmxcfs[105036]: [status] notice: cpg_leave retry 6130
Jun 19 01:16:20 hyp05 pmxcfs[105036]: [status] notice: cpg_send_message retry 20
Jun 19 01:16:21 hyp05 pmxcfs[105036]: [status] notice: cpg_leave retry 6140
Jun 19 01:16:21 hyp05 pmxcfs[105036]: [status] notice: cpg_send_message retry 30
Jun 19 01:16:22 hyp05 pmxcfs[105036]: [status] notice: cpg_leave retry 6150
Jun 19 01:16:22 hyp05 pmxcfs[105036]: [status] notice: cpg_send_message retry 40

pvecm status hangs on the one or two nodes whose turn it is to act up.

When we fix one node by killing corosync and starting all the services back up, another one shows the same behavior.
 
How can we troubleshoot this 'ipcc_send_rec failed: Connection refused' error?
This is a secondary problem - you first want to get:
* corosync up and running so that all nodes see each other (`corosync-cfgtool -s`)
* the pmxcfs (pve-cluster.service) mounted and running (`ls /etc/pve` shows some output)
* the output of `pvecm status` showing that all nodes are quorate

just keep in mind that your guests are still running, but you won't be able to start them if your cluster is not quorate.
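A quick way to run those checks on every node (a hedged sketch - the comments describe what a healthy node should roughly report):
Code:
corosync-cfgtool -s     # each ring should report 'no faults'
ls /etc/pve             # should list the cluster filesystem (no 'Transport endpoint is not connected')
pvecm status            # should show 'Quorate: Yes' and the expected number of members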
 
Thank you all very much for your help.

We ended up fixing this by enabling igmp snooping, giving the cluster some time and then following your recommended commands.

If anyone finds this on Google: we recommend running the omping commands to ensure multicast is set up properly, and killing the corosync process if certain commands hang and prevent progress. Killing corosync was a "life-saving" alternative to rebooting for us.

Thanks.
 
Glad to hear you managed to get your cluster running again!
Please mark the thread as 'SOLVED' (click on 'Thread Tools' above the first post of the thread, 'Edit Thread' and add a prefix of 'SOLVED')
so that others know what to expect.
Thanks.
 
