I have a 4-node Proxmox cluster on consumer-grade machines. One of the nodes keeps going unresponsive. How can I fix this?
I can still log in to the host,
and from what I have observed, if I kill some process the node turns green again.
root@pve02:~# pvecm status
Cluster information
-------------------
Name: pve-clus
Config Version: 6
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Apr 25 08:21:10 2024
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000003
Ring ID: 1.372
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.12.0.4
0x00000002 1 10.12.0.3
0x00000003 1 10.12.0.2 (local)
0x00000004 1 10.12.0.6
root@pve02:~#
==
root@pve02:~# systemctl status pvestatd.service
× pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: failed (Result: signal) since Thu 2024-04-25 03:58:23 EDT; 4h 25min ago
Duration: 7h 2min 1.990s
Process: 22840 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
Main PID: 982 (code=killed, signal=SEGV)
CPU: 30min 8.321s
Apr 24 20:57:10 pve02 pvestatd[982]: modified cpu set for lxc/119: 0
Apr 24 21:08:59 pve02 systemd[1]: Reloading pvestatd.service - PVE Status Daemon...
Apr 24 21:09:00 pve02 pvestatd[22840]: send HUP to 982
Apr 24 21:09:00 pve02 pvestatd[982]: received signal HUP
Apr 24 21:09:00 pve02 pvestatd[982]: server shutdown (restart)
Apr 24 21:09:00 pve02 systemd[1]: Reloaded pvestatd.service - PVE Status Daemon.
Apr 24 21:09:01 pve02 pvestatd[982]: restarting server
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Failed with result 'signal'.
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Consumed 30min 8.321s CPU time.
root@pve02:~# journalctl -b0 -u pvestatd.service
Apr 24 20:56:19 pve02 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Apr 24 20:56:20 pve02 pvestatd[982]: starting server
Apr 24 20:56:21 pve02 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Apr 24 20:56:44 pve02 pvestatd[982]: modified cpu set for lxc/126: 0
Apr 24 20:57:07 pve02 pvestatd[982]: status update time (36.172 seconds)
Apr 24 20:57:10 pve02 pvestatd[982]: modified cpu set for lxc/119: 0
Apr 24 21:08:59 pve02 systemd[1]: Reloading pvestatd.service - PVE Status Daemon...
Apr 24 21:09:00 pve02 pvestatd[22840]: send HUP to 982
Apr 24 21:09:00 pve02 pvestatd[982]: received signal HUP
Apr 24 21:09:00 pve02 pvestatd[982]: server shutdown (restart)
Apr 24 21:09:00 pve02 systemd[1]: Reloaded pvestatd.service - PVE Status Daemon.
Apr 24 21:09:01 pve02 pvestatd[982]: restarting server
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Failed with result 'signal'.
Apr 25 03:58:23 pve02 systemd[1]: pvestatd.service: Consumed 30min 8.321s CPU time.
root@pve02:~#
Solution: after restarting the pvestatd service, the node comes back online.
root@pve02:~# systemctl start pvestatd.service
root@pve02:~# systemctl status pvestatd.service
● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Thu 2024-04-25 08:26:44 EDT; 3s ago
Process: 287846 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 287847 (pvestatd)
Tasks: 1 (limit: 18944)
Memory: 96.5M
CPU: 745ms
CGroup: /system.slice/pvestatd.service
└─287847 pvestatd
Apr 25 08:26:43 pve02 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Apr 25 08:26:44 pve02 pvestatd[287847]: starting server
Apr 25 08:26:44 pve02 systemd[1]: Started pvestatd.service - PVE Status Daemon.
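As a stopgap until the cause of the segfault is found, systemd could be told to restart pvestatd automatically after a crash. This is only a minimal sketch of that idea, not a fix: it assumes the standard systemd drop-in directory for the unit, and the file name restart.conf, the helper function name, and the 10-second delay are my own choices, not anything Proxmox ships.

```shell
#!/bin/sh
# Write a systemd drop-in that restarts pvestatd after an unclean exit
# (such as the SEGV shown in the log above).
# The file name restart.conf and this helper are illustrative assumptions.
write_pvestatd_dropin() {
    # $1: drop-in directory (defaults to the standard systemd override path)
    dir="${1:-/etc/systemd/system/pvestatd.service.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=10
EOF
}

# On the affected node, as root:
#   write_pvestatd_dropin
#   systemctl daemon-reload
#   systemctl restart pvestatd.service
```

This only hides the symptom. The daemon would still be crashing, so the underlying cause of the SEGV still needs to be tracked down.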
Can someone please suggest a fix for this?