Some cluster nodes "grey", "System" hangs at "Loading"

encore

Hi,

I have some issues with my cluster. This morning it was completely down due to a hardware failure on two nodes.
It took some time to get everything sorted.
Now everything is green (27 nodes) except for 4 nodes, which are grey.
pvecm status looks fine:
Votequorum information
----------------------
Expected votes: 31
Highest expected: 31
Total votes: 31
Quorum: 16
Flags: Quorate
When I click on "Summary", everything looks fine.
When I click on "System", it gets stuck at "Loading".

omping between those nodes, and between them and one working node:
0% loss, good latency.
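
For anyone who wants to reproduce the check: we ran it roughly like this (node names are placeholders; omping has to run on all listed nodes at the same time):

# ~10 minute multicast test between the suspect nodes and one healthy node
omping -c 600 -i 1 -q nodeA nodeB nodeC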

systemctl status pve-cluster; systemctl status corosync; systemctl status cron; systemctl status ksmtuned; systemctl status postfix; systemctl status pve-firewall; systemctl status pve-ha-crm; systemctl status pve-ha-lrm; systemctl status pvedaemon; systemctl status pveproxy; systemctl status pvestatd; systemctl status spiceproxy; systemctl status syslog; systemctl status systemd-timesyncd;
tells me that all services are "active (running)".
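
For completeness, the same check as a quick loop over the services listed above:

for s in pve-cluster corosync pve-firewall pve-ha-crm pve-ha-lrm pvedaemon pveproxy pvestatd spiceproxy; do
  echo "$s: $(systemctl is-active $s)"
done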

Any ideas on how to debug it further?
 
The problem has worsened. Currently 6 nodes are affected.
I see:
May 19 10:03:18 captive006-72011-bl09 pveproxy[1604]: proxy detected vanished client connection
May 19 10:03:54 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:04:11 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:05:22 captive006-72011-bl09 pveproxy[1604]: proxy detected vanished client connection
May 19 10:05:31 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:07:31 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:07:36 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:07:54 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:08:16 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:08:33 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
on all affected nodes.

and also
May 19 02:16:57 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (50 seconds)
May 19 02:46:51 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (48 seconds)
May 19 03:16:54 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 03:48:34 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 04:18:47 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 04:48:33 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (50 seconds)
May 19 05:18:33 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 05:48:35 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (51 seconds)
May 19 06:18:35 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (45 seconds)
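
To get a feel for how often these show up on a node, I just count the messages (the date is taken from the logs above):

journalctl -u pveproxy --since "2019-05-19" | grep -c 'vanished client connection'
journalctl -u pve-ha-lrm --since "2019-05-19" | grep -c 'loop take too long'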

Restarting pvedaemon fixes the problem for 5-15 minutes, then it starts again, node by node, until ~6 nodes are affected.

Also, after restarting pvedaemon, the node remains "grey" in the node list. "System" loads fine for the next few minutes, and everything shows as "running".
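
What I currently do to get a node responsive again is just a rough workaround; as far as I understand, the grey icon in the GUI comes from pvestatd not reporting, so I restart that together with the API daemons:

systemctl restart pvedaemon pveproxy pvestatd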
 
This is the output of systemctl status pvedaemon:
root@captive015-74050-bl05:~# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-05-20 02:27:37 CEST; 11h ago
Process: 2473 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
Main PID: 2486 (pvedaemon)
Tasks: 7 (limit: 9830)
Memory: 209.3M
CPU: 12min 11.308s
CGroup: /system.slice/pvedaemon.service
├─ 2486 pvedaemon
├─ 2489 pvedaemon worker
├─ 2490 pvedaemon worker
├─ 2491 pvedaemon worker
├─36381 lxc-info -n 1199234 -p
├─37131 lxc-info -n 1199234 -p
└─37149 lxc-info -n 1199234 -p

May 20 06:40:11 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:13 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:16 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:18 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:19 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:20 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:20 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:21 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:21 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:44 captive015-74050-bl05 pvedaemon[2491]: <zap@pve> end task UPID:captive015-74050-bl05:000089B8:00172ED9:5CE22F72:vzstart:1199234:zap@pve: command 'systemctl start pve-contain
Seems like it gets stuck at "lxc-info -n 1199234 -p".
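
To see what those hanging lxc-info processes are actually doing, something like this should help (PIDs taken from the CGroup listing above; strace needs to be installed separately):

# kernel side: where is the process blocked?
cat /proc/36381/stack
# userspace side: which syscall is it sitting in?
strace -p 36381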

Restarting pvedaemon helps for some time, then it occurs again.

Some LXC containers are running on that node. Trying to start the ones that are stopped leads to:
root@captive015-74050-bl05:~# pct start 1012565
Job for pve-container@1012565.service failed because a timeout was exceeded.
See "systemctl status pve-container@1012565.service" and "journalctl -xe" for details.
command 'systemctl start pve-container@1012565' failed: exit code 1

May 20 14:28:26 captive015-74050-bl05 pct[62226]: <root@pam> starting task UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam:
May 20 14:28:26 captive015-74050-bl05 pct[62298]: starting CT 1012565: UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam:
May 20 14:28:26 captive015-74050-bl05 systemd[1]: Starting PVE LXC Container: 1012565...
-- Subject: Unit pve-container@1012565.service has begun start-up
-- Unit pve-container@1012565.service has begun starting up.
May 20 14:28:27 captive015-74050-bl05 kernel: IPv6: ADDRCONF(NETDEV_UP): veth1012565i0: link is not ready
May 20 14:28:27 captive015-74050-bl05 kernel: vmbr0: port 73(veth1012565i0) entered blocking state
May 20 14:28:27 captive015-74050-bl05 kernel: vmbr0: port 73(veth1012565i0) entered disabled state
May 20 14:28:27 captive015-74050-bl05 kernel: device veth1012565i0 entered promiscuous mode
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Start operation timed out. Terminating.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: Failed to start PVE LXC Container: 1012565.
-- Subject: Unit pve-container@1012565.service has failed
-- Unit pve-container@1012565.service has failed.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Unit entered failed state.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Failed with result 'timeout'.
May 20 14:29:56 captive015-74050-bl05 pct[62298]: command 'systemctl start pve-container@1012565' failed: exit code 1
May 20 14:29:56 captive015-74050-bl05 pct[62226]: <root@pam> end task UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam: command 'systemctl start pve-container@1012565' failed: exit code 1

Rebooting the node helps and the problematic LXC containers work again, but after 30-120 minutes the same thing happens again.
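
For the record, one way to get more detail on the hang is to start the container in the foreground with LXC debug logging (the log path is arbitrary):

lxc-start -n 1012565 -F -l DEBUG -o /tmp/lxc-1012565.log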

We are using local storages (dir, ext4 + .raw)

Like I said, this has been happening on several nodes since Friday.

Any ideas?
 
Hi,

Do you use a proxy in the corosync network?
Do you have a dedicated corosync network?
Because if not (the VLAN is not dedicated), the load on the network can rise when 2 nodes are gone.
How much latency do you get on the network? The interesting value is the max, not the avg.
 
Do you use a proxy in the corosync network?
No.

Do you have a dedicated corosync network?
Not yet; we are setting this up with a dedicated internal 10G SFP+ network today (see the corosync.conf sketch below).
We don't see any retransmits or high latencies, so I don't think that is causing the issue.

How much latency do you get on the network? The interesting value is the max, not the avg.
Less than 1 ms.
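
For reference, this is roughly what a node entry in /etc/pve/corosync.conf should look like once the dedicated network is in place (the name, nodeid and 10.10.10.x address are just examples; config_version in the totem section has to be bumped when editing):

nodelist {
  node {
    name: captive006-72011-bl09
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.6
  }
  # ... one entry per node, all pointing at the dedicated 10G network
}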

Day by day, more nodes get that issue. Currently 18 nodes are affected :(

All nodes with that issue have a kworker process at 100% CPU:
60927 root 20 0 0 0 0 R 99.7 0.0 1357:07 kworker/u128:2

Resource usage of those nodes is okay: enough free memory, < 20% CPU load, 0% I/O wait, and the DIR storages have no fs errors.
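
To find out what that kworker is actually busy with, I plan to sample its kernel stack a few times (PID from the top output above; needs root, and the sysrq part needs kernel.sysrq enabled):

# sample the kernel stack of the busy kworker repeatedly
cat /proc/60927/stack
# or dump backtraces of all active CPUs into the kernel log
echo l > /proc/sysrq-trigger && dmesg | tail -50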

Pvedaemon keeps hanging up:
http://prntscr.com/nrashu

@wolfgang
 
Update: Enabling our Cluster Firewall again (we disabled it last week for testing purposes) fixed the issue.
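
For anyone hitting the same thing: the switch we toggled is the datacenter-level firewall option, which (as far as I know) lives in /etc/pve/firewall/cluster.fw, and pve-firewall can show the current state:

# check the firewall service state on a node
pve-firewall status

# /etc/pve/firewall/cluster.fw (cluster-wide on/off switch)
[OPTIONS]
enable: 1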
 
