[SOLVED] One node in cluster going Grey in GUI after upgrading to 8.0.4

helojunkie

Well-Known Member
Jul 28, 2017
San Diego, CA
I have a 6-node cluster that I upgraded to 8.0.4 two nights ago. The cluster had been running with no issues at all until that upgrade.

The upgrade went smoothly with zero issues at all.

Now, one node in the cluster keeps 'greying out' in the GUI. I rebooted the server and it came back, but within minutes it greyed out again. All LXC containers and VMs on it are running and working fine, but I am having trouble determining why the node keeps losing its status in the GUI.

Corosync traffic is on a dedicated VLAN: all nodes are connected via 10G for Corosync, with a separate 10G trunk link for all other traffic. Nothing else shares the network with Corosync.

PVE version on all nodes:
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-12-pve)

I tried the following to get it back:
systemctl restart pve-cluster (does not bring it back)

systemctl restart pvestatd
This sort of brings it back: all the VMs go green, but the LXC containers and the listed storage stay grey. It stays that way for about 3 to 5 minutes, then everything goes grey again. Rerunning the command brings it back temporarily, but then it fails again.

systemctl restart pveproxy does not bring it back
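
For anyone else in the same spot, this is roughly the sequence I was cycling through (just my commands, not an official procedure; the status/journal checks at the end are only there to watch when it greys out again):

# cluster filesystem - restarting it did not bring the node back
systemctl restart pve-cluster

# status daemon - brings the VMs back green for a few minutes
systemctl restart pvestatd

# web proxy - no effect here either
systemctl restart pveproxy

# keep an eye on pvestatd to catch the moment it hangs again
systemctl status pvestatd
journalctl -u pvestatd -f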


Sep 20 12:46:22 proxmox02 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Sep 20 12:46:22 proxmox02 pmxcfs[1899301]: [main] notice: resolved node name 'proxmox02' to '10.200.70.3' for default node IP address
Sep 20 12:46:22 proxmox02 pmxcfs[1899301]: [main] notice: resolved node name 'proxmox02' to '10.200.70.3' for default node IP address
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: update cluster info (cluster name Proxmox, version = 6)
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: node has quorum
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: members: 1/5120, 2/1899303, 3/3553, 4/5629, 5/4677, 6/28004
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: starting data syncronisation
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: received sync request (epoch 1/5120/0000000C)
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: members: 1/5120, 2/1899303, 3/3553, 4/5629, 5/4677, 6/28004
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: starting data syncronisation
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: received sync request (epoch 1/5120/0000000C)
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: received all states
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: leader is 1/5120
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: synced members: 1/5120, 2/1899303, 3/3553, 4/5629, 5/4677, 6/28004
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [dcdb] notice: all data is up to date
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: received all states
Sep 20 12:46:22 proxmox02 pmxcfs[1899303]: [status] notice: all data is up to date

root@proxmox02:~# pvecm status
Cluster information
-------------------
Name: Proxmox
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Sep 20 14:02:37 2023
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000002
Ring ID: 1.447
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.200.80.2
0x00000002 1 10.200.80.3 (local)
0x00000003 1 10.200.80.4
0x00000004 1 10.200.80.5
0x00000005 1 10.200.80.6
0x00000006 1 10.200.80.7

Sep 20 12:53:08 proxmox02 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 20 12:53:10 proxmox02 pvestatd[7144]: received signal TERM
Sep 20 12:53:10 proxmox02 pvestatd[7144]: server closing
Sep 20 12:53:10 proxmox02 pvestatd[7144]: server stopped
Sep 20 12:53:11 proxmox02 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 20 12:53:11 proxmox02 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 20 12:53:11 proxmox02 systemd[1]: pvestatd.service: Consumed 1min 59.482s CPU time.
Sep 20 12:53:11 proxmox02 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 20 12:53:12 proxmox02 pvestatd[1940505]: starting server
Sep 20 12:53:12 proxmox02 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 20 13:48:28 proxmox02 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 20 13:48:29 proxmox02 pvestatd[1940505]: received signal TERM
Sep 20 13:48:29 proxmox02 pvestatd[1940505]: server closing
Sep 20 13:48:29 proxmox02 pvestatd[1940505]: server stopped
Sep 20 13:48:30 proxmox02 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 20 13:48:30 proxmox02 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 20 13:48:30 proxmox02 systemd[1]: pvestatd.service: Consumed 3.900s CPU time.
Sep 20 13:48:31 proxmox02 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 20 13:48:32 proxmox02 pvestatd[2274251]: starting server
Sep 20 13:48:32 proxmox02 systemd[1]: Started pvestatd.service - PVE Status Daemon.

● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Wed 2023-09-20 14:06:18 PDT; 33s ago
Process: 2374955 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 2375152 (pvestatd)
Tasks: 2 (limit: 309322)
Memory: 83.0M
CPU: 2.334s
CGroup: /system.slice/pvestatd.service
├─2375152 pvestatd
└─2375967 lxc-info -n 120 -p

Sep 20 14:06:17 proxmox02 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 20 14:06:18 proxmox02 pvestatd[2375152]: starting server
Sep 20 14:06:18 proxmox02 systemd[1]: Started pvestatd.service - PVE Status Daemon.

root@proxmox02:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-12-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
proxmox-kernel-helper: 8.0.3
pve-kernel-5.15: 7.4-6
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
proxmox-kernel-6.2: 6.2.16-12
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.8
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-5
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1


At this point I am not sure what other information or logs might be useful.
 
In case anyone runs across this thread, I determined the problem. It was a single LXC container running TurnKey Core (16.1) as a media Docker host. It had been running fine for several years, but I think the update from Proxmox 7.4 to 8.0.4 is what triggered the issue.

It took some time to figure out, but eventually I started migrating a couple of containers to other nodes and noticed that the new node would go down in the same way. There was nothing in the logs that I could see, so from that point it was just a matter of tracking down the one container responsible. I isolated the offending LXC container and dumped it on our pre-release node, where I didn't care if I had to reboot.
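
For reference, moving containers around to see if the problem follows them can be done from the CLI; something like this (container ID and node name are placeholders, and if I recall correctly a running container needs the restart mode):

# move a suspect container to another node and watch whether that node greys out
pct migrate <CTID> <target-node> --restart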

I also figured out a way to get everything back without rebooting. Basically, you kill the offending container from the CLI, the hard way:
ps aux | grep <ID>, then kill -9 the <PID> that is returned. This immediately gave me back full control of the node without rebooting.
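
In other words, something like this (the container ID is a placeholder; double-check which PID actually belongs to the container before killing it, since grep will also match itself):

# find the process(es) belonging to the hung container
ps aux | grep <CTID>

# force-kill the PID returned for that container
kill -9 <PID>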

So I built a new container using the latest TurnKey Core, installed Docker on it, migrated my Docker stack over, and there have been no more failures.
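
If it helps anyone, the rebuild was nothing fancy; a rough sketch of the steps (the template name, container ID, storage and sizes below are just examples, use whatever pveam lists and fits your setup, and Docker inside LXC typically wants nesting enabled):

# refresh the template index and grab a current TurnKey Core template
pveam update
pveam available --section turnkeylinux
pveam download local <turnkey-core-template>.tar.gz

# create the replacement container (all values here are placeholders)
pct create <NEW_CTID> local:vztmpl/<turnkey-core-template>.tar.gz --hostname media-docker --rootfs local-lvm:16 --memory 2048 --net0 name=eth0,bridge=vmbr0,ip=dhcp --features nesting=1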

Anyway, I hope this helps someone else out who may run across the same issue.
 
I've had exactly the same symptoms, but a different issue and resolution. Since I stumbled upon this post while googling these symptoms, I figured I'd post my resolution to the same thread.

The symptoms were the same: restarting the services would make everything go green for a while, and then it would revert to the grey question mark.

In my case the issue was caused by starting many migrations and cancelling them before they finished. That left behind a few unkilled LVM processes (vgs and lvcreate).

Once I had identified and killed all those processes, I was able to restart all the services once and for all. I imagine a reboot would also have solved the issue, but I couldn't afford to do that.
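
For anyone hitting the same thing, a rough sketch of how I hunted them down (the process names are from my case; check what ps actually shows before killing anything):

# look for LVM commands left over from the cancelled migrations
ps aux | grep -E '[v]gs|[l]vcreate'

# kill the stuck ones by PID
kill -9 <PID>

# then restart the status daemon and proxy once more
systemctl restart pvestatd pveproxy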
 
