Some cluster nodes "grey", "System" hangs at "Loading"

encore

Well-Known Member
May 4, 2018
Hi,

I have some issues with my cluster. This morning it was completely down due to a hardware failure on two nodes.
It took some time to get everything sorted.
Now everything is green (27 nodes) except for 4 nodes, which are grey.
pvecm status
looks fine.
Votequorum information
----------------------
Expected votes: 31
Highest expected: 31
Total votes: 31
Quorum: 16
Flags: Quorate
When I click on "sumary", everything looks fine.
When I click on "system" it stucks at "loading".

omping between the affected nodes and against one working node:
0% loss, good latency.

systemctl status pve-cluster; systemctl status corosync; systemctl status cron; systemctl status ksmtuned; systemctl status postfix; systemctl status pve-firewall; systemctl status pve-ha-crm; systemctl status pve-ha-lrm; systemctl status pvedaemon; systemctl status pveproxy; systemctl status pvestatd; systemctl status spiceproxy; systemctl status syslog; systemctl status systemd-timesyncd;
tells me that all services are "active (running)".
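
For reference, the same check can be written as a small loop over the same service list (nothing new assumed, just a more compact form):

for svc in pve-cluster corosync cron ksmtuned postfix pve-firewall pve-ha-crm pve-ha-lrm pvedaemon pveproxy pvestatd spiceproxy syslog systemd-timesyncd; do
    # print the unit name followed by its active state
    echo "$svc: $(systemctl is-active $svc)"
done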

Any ideas on how to debug it further?
 
The problem has worsened. Currently 6 nodes are affected.
I see:
May 19 10:03:18 captive006-72011-bl09 pveproxy[1604]: proxy detected vanished client connection
May 19 10:03:54 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:04:11 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:05:22 captive006-72011-bl09 pveproxy[1604]: proxy detected vanished client connection
May 19 10:05:31 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:07:31 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:07:36 captive006-72011-bl09 pveproxy[1608]: proxy detected vanished client connection
May 19 10:07:54 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:08:16 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
May 19 10:08:33 captive006-72011-bl09 pveproxy[1607]: proxy detected vanished client connection
on all affected nodes.
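
In case it helps to quantify this, the messages can be counted per node with something like the following (the one-hour window is just an example):

# count pveproxy's "vanished client connection" messages from the last hour
journalctl -u pveproxy --since "1 hour ago" | grep -c "vanished client connection"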

and also
May 19 02:16:57 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (50 seconds)
May 19 02:46:51 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (48 seconds)
May 19 03:16:54 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 03:48:34 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 04:18:47 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 04:48:33 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (50 seconds)
May 19 05:18:33 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (49 seconds)
May 19 05:48:35 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (51 seconds)
May 19 06:18:35 captive006-72011-bl09 pve-ha-lrm[2023]: loop take too long (45 seconds)

Restarting pvedaemon fixes the problem for 5-15 minutes, then it starts again, node by node, until ~6 nodes are affected.

Also, after restarting pvedaemon the node remains "grey" in the node list. "System" loads fine for the next few minutes and everything shows "running".
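
As far as I understand, the green/grey icon in the tree comes from pvestatd's status broadcasts rather than from pvedaemon, so restarting that service as well might be worth a try (just a guess, not a confirmed fix):

# restart the status daemon together with the API daemon
systemctl restart pvestatd pvedaemon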
 
This is the output of systemctl status pvedaemon:
root@captive015-74050-bl05:~# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-05-20 02:27:37 CEST; 11h ago
Process: 2473 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
Main PID: 2486 (pvedaemon)
Tasks: 7 (limit: 9830)
Memory: 209.3M
CPU: 12min 11.308s
CGroup: /system.slice/pvedaemon.service
├─ 2486 pvedaemon
├─ 2489 pvedaemon worker
├─ 2490 pvedaemon worker
├─ 2491 pvedaemon worker
├─36381 lxc-info -n 1199234 -p
├─37131 lxc-info -n 1199234 -p
└─37149 lxc-info -n 1199234 -p

May 20 06:40:11 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:13 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:16 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:18 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:19 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:20 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:20 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:21 captive015-74050-bl05 pvedaemon[2491]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:21 captive015-74050-bl05 pvedaemon[2489]: <root@pam> successful auth for user 'zap@pve'
May 20 06:40:44 captive015-74050-bl05 pvedaemon[2491]: <zap@pve> end task UPID:captive015-74050-bl05:000089B8:00172ED9:5CE22F72:vzstart:1199234:zap@pve: command 'systemctl start pve-contain
It seems like it gets stuck at "lxc-info -n 1199234 -p".
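
To check whether those lxc-info calls are actually hung (for example in uninterruptible sleep), something along these lines could be used; the PID is taken from the process tree above:

# show process state and kernel wait channel of the lxc-info processes
ps -o pid,stat,wchan:32,cmd -C lxc-info
# for a process stuck in D state, the kernel stack usually shows where it hangs
cat /proc/36381/stack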

Restarting pvedaemon helps for some time, then it occurs again.

Some LXC containers are running on that node. Trying to start LXC containers that are stopped leads to:
root@captive015-74050-bl05:~# pct start 1012565
Job for pve-container@1012565.service failed because a timeout was exceeded.
See "systemctl status pve-container@1012565.service" and "journalctl -xe" for details.
command 'systemctl start pve-container@1012565' failed: exit code 1

May 20 14:28:26 captive015-74050-bl05 pct[62226]: <root@pam> starting task UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam:
May 20 14:28:26 captive015-74050-bl05 pct[62298]: starting CT 1012565: UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam:
May 20 14:28:26 captive015-74050-bl05 systemd[1]: Starting PVE LXC Container: 1012565...
-- Subject: Unit pve-container@1012565.service has begun start-up
-- Unit pve-container@1012565.service has begun starting up.
May 20 14:28:27 captive015-74050-bl05 kernel: IPv6: ADDRCONF(NETDEV_UP): veth1012565i0: link is not ready
May 20 14:28:27 captive015-74050-bl05 kernel: vmbr0: port 73(veth1012565i0) entered blocking state
May 20 14:28:27 captive015-74050-bl05 kernel: vmbr0: port 73(veth1012565i0) entered disabled state
May 20 14:28:27 captive015-74050-bl05 kernel: device veth1012565i0 entered promiscuous mode
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Start operation timed out. Terminating.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: Failed to start PVE LXC Container: 1012565.
-- Subject: Unit pve-container@1012565.service has failed
-- Unit pve-container@1012565.service has failed.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Unit entered failed state.
May 20 14:29:56 captive015-74050-bl05 systemd[1]: pve-container@1012565.service: Failed with result 'timeout'.
May 20 14:29:56 captive015-74050-bl05 pct[62298]: command 'systemctl start pve-container@1012565' failed: exit code 1
May 20 14:29:56 captive015-74050-bl05 pct[62226]: <root@pam> end task UPID:captive015-74050-bl05:0000F35A:00422397:5CE29D6A:vzstart:1012565:root@pam: command 'systemctl start pve-container@1012565' failed: exit code 1
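
To get more detail than the systemd timeout message, the container could also be started in the foreground with LXC debug logging (the log path is just an example):

# start the container in the foreground with debug logging
lxc-start -n 1012565 -F -l DEBUG -o /tmp/lxc-1012565.log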

Rebooting the node helps and the problematic LXC containers work again, but after 30-120 minutes the same thing happens again.

We are using local storage (dir, ext4 + .raw).

Like I said, this has been happening on several nodes since Friday.

Any ideas?
 
Hi,

Do you use a proxy in the corosync network?
Do you have a dedicated corosync network?
Because if not (a VLAN is not dedicated), the load on the network can rise when 2 nodes are gone.
How much latency do you get on the network? The interesting value is the max, not the avg.
 
Do you use a proxy in the corosync network?
No.

Do you have a dedicated corosync network?
Not yet; we are setting this up with a dedicated internal 10G SFP+ network today.
We don't see any retransmits or high latencies, so I don't think that is causing the issue.
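
For reference, a rough sketch of what the dedicated ring could look like in /etc/pve/corosync.conf; the 10.10.10.x subnet and the node entries below are made up, and only two nodes are shown:

totem {
  version: 2
  cluster_name: mycluster
  interface {
    ringnumber: 0
    # dedicated corosync subnet
    bindnetaddr: 10.10.10.0
  }
}

nodelist {
  node {
    name: node1
    nodeid: 1
    # address on the dedicated network
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    ring0_addr: 10.10.10.2
  }
}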

How much latency do you get on the network? The interesting value is the max, not the avg.
Less than 1 ms.
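
For the max value, omping prints min/avg/max/std-dev per peer at the end of a run; something like this should show it (node names are placeholders):

# ~10 minutes of probes, quiet mode prints only the per-node summary
omping -c 600 -i 1 -q node1 node2 node3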

Day by day, more nodes show this issue. Currently 18 nodes are affected :(

All nodes with this issue have a kworker process at 100% CPU:
60927 root 20 0 0 0 0 R 99.7 0.0 1357:07 kworker/u128:2
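
To get an idea of what that kworker is busy with, its kernel stack or a short perf sample might help (PID taken from the top output above):

# kernel stack of the busy kworker (run it a few times to see where it spins)
cat /proc/60927/stack
# or sample it for ~10 seconds with perf
perf record -g -p 60927 -- sleep 10 && perf report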

Resource usage on those nodes is okay: enough free memory, < 20% CPU load, 0% I/O wait, and the dir storages show no filesystem errors.

pvedaemon keeps hanging:
http://prntscr.com/nrashu

@wolfgang
 
Update: Enabling our Cluster Firewall again (we disabled it last week for testing purposes) fixed the issue.
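
For reference, the cluster-wide firewall switch lives in /etc/pve/firewall/cluster.fw and the state can be checked per node with pve-firewall; a minimal sketch of the relevant bits (not our exact file):

# /etc/pve/firewall/cluster.fw
[OPTIONS]
enable: 1

# verify on a node that the firewall is active again
pve-firewall status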
 
