pveproxy stuck

greg

Greetings

In the first days of this new year, my Proxmox cluster is in bad shape...


On one node, pveproxy is badly stuck:

Bash:
root     15639  0.0  0.5 295812 89408 pts/26   D     2021   0:00 /usr/bin/perl -T /usr/bin/pvesr status
root     24233  0.0  0.5 283276 83712 ?        Ds    2021   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     23262  0.1  0.5 283252 92200 ?        Ds   15:58   0:00 /usr/bin/perl -T /usr/bin/pveproxy stop

I cannot even force-kill them:

Code:
kill -9 15639 24233 23262

does nothing. The service status is weird:

Code:
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2022-01-06 16:05:35 CET; 6min ago
 Main PID: 19825 (code=exited, status=0/SUCCESS)
    Tasks: 2 (limit: 4915)
   Memory: 164.9M
   CGroup: /system.slice/pveproxy.service
           ├─23262 /usr/bin/perl -T /usr/bin/pveproxy stop
           └─24233 /usr/bin/perl -T /usr/bin/pveproxy restart

janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: Killing process 23262 (pveproxy) with signal SIGKILL.
janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: Killing process 24233 (pveproxy) with signal SIGKILL.
janv. 06 16:02:35 sysv6 systemd[1]: pveproxy.service: Processes still around after SIGKILL. Ignoring.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: Killing process 24233 (pveproxy) with signal SIGKILL.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: Killing process 23262 (pveproxy) with signal SIGKILL.
janv. 06 16:05:35 sysv6 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
janv. 06 16:05:35 sysv6 systemd[1]: pveproxy.service: Failed with result 'timeout'.
janv. 06 16:05:35 sysv6 systemd[1]: Stopped PVE API Proxy Server.
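
The D / Ds state in the ps output above means uninterruptible sleep: the processes are stuck inside a kernel call (typically I/O, or a hung FUSE mount such as /etc/pve), which is why even SIGKILL has no effect. As a rough sketch of how to see what they are blocked on (assuming root and a kernel that exposes /proc/<pid>/stack):

Code:
# kernel wait channel of the stuck PIDs
ps -o pid,stat,wchan:32,cmd -p 15639,24233,23262
# kernel stack of one of them
cat /proc/24233/stack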

Short of unplugging the server, what can I do?

Thanks in advance

Regards
 
BTW I tried various commands, such as
Code:
systemctl restart pveproxy pvedaemon

pvecm updatecerts

all hang.
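
When every pve* command hangs like this, the usual suspect is the cluster configuration filesystem (pmxcfs), mounted on /etc/pve by the pve-cluster service: all pve* tools block as soon as they touch it. A minimal check (using the coreutils timeout command so your shell does not get stuck too):

Code:
systemctl status pve-cluster corosync
# if pmxcfs is wedged, even a plain listing blocks forever
timeout 5 ls /etc/pve || echo "/etc/pve is not responding"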
 
Also: all PVE-related processes are stuck on half of the cluster's nodes...
 
I had to power-cycle the server. Now it doesn't boot; I guess that's because there are only ZFS partitions and GRUB cannot access any of them.

So basically my server is dead. Happy new year!
 
Good news: with iDRAC I was able to boot it by forcing UEFI; for some reason it wasn't...
 
So back to the original problem... what can I do when all pve commands hang?
 
What does `journalctl -u pveproxy -u pvedaemon` say on these nodes?

Are these nodes part of a cluster that has quorum? Does `pvecm status` on each affected node say Quorate: Yes?
 
Maybe you both can tell us more about your setups? For example: how the problem becomes noticeable, whether it appears after a certain action, and whether it can be reproduced. What is in the logs, and what does the system itself look like (is monitoring/metrics data available)?

There must be a cause for this and it can certainly be solved.
 
Actually, for me this was two years ago, so my cluster has evolved since :) Upgrades were made, and this machine has been retired.
I hope @henryd99 will find a solution! (fingers crossed).

Regards
 
The issue was NTP not being in sync across all machines. The two servers it relied on were offline.
This created issues on one of the nodes and messed up the chrony sync, leading to lots of services stuck in a sleep state. For me this could only be fixed with a reboot. It works now. This is the thread > https://forum.proxmox.com/threads/master-node-in-cluster-cant-restart-pvedeamon-or-pveproxy.138080/
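
For reference, a quick way to verify time sync on each node (a sketch, assuming chrony, the default NTP daemon on current Proxmox VE releases):

Code:
# offset, stratum and sync status of the local clock
chronyc tracking
# reachability and offsets of the configured time sources
chronyc sources -v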
If you have to reboot anything in order to fix what is essentially a time-sync issue, there's something terribly wrong with the setup.
 
Similar situation here. It started with a cluster node coming back online after physical maintenance (node A).


15:20: node A back online, no running VMs


I noticed a question mark icon on two other nodes in the cluster of 5 (nodes B/C). I've had these before; sometimes it fixed itself after a day, sometimes after a week, and I've also rebooted once in the past to get it fixed. (Back then I was able to migrate all VMs away first.)




22:02: I was able to migrate a test VM to node A (from D), which worked normally.



22:09: node B: pvedaemon.service: State 'stop-sigterm' timed out. Killing.

  • node B: around midnight the load had increased to 50+ due to all the hanging processes; the running VMs still performed normally.
  • node C: no problems




2024-07-20:


  • 11:07 node B systemd[1]: pveproxy.service: Scheduled restart job, restart counter is at 100.
  • node B: load 100+ from the hanging processes, not actually very busy
  • node E reports pvecm status OK, with all 5 nodes in quorum
  • node B: corosync daemon status OK


node B:

Code:
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
Active: deactivating (final-sigkill) (Result: timeout) since Fri 2024-07-19 22:21:44 CEST; 14h ago
Process: 1332441 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=killed, signal=KILL)
Tasks: 115 (limit: 308748)
Memory: 5.0G
CPU: 332ms
CGroup: /system.slice/pveproxy.service
├─ 925626 /usr/bin/perl -T /usr/bin/pveproxy stop
├─ 933079 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 936551 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 940075 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 943420 /usr/bin/perl /usr/bin/pvecm updatecerts --silent


... and many more lines with updatecerts, followed by a dozen "pveproxy.service: Killing process 1298761 (pvecm) with signal SIGKILL."
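
That pile-up is consistent with the ExecStartPre= line above: every scheduled restart spawns a fresh `pvecm updatecerts --silent`, which then blocks on /etc/pve as well. A small sketch (not from the original post) to count the stragglers:

Code:
# list and count the stuck updatecerts hooks
pgrep -af 'pvecm updatecerts'
pgrep -fc 'pvecm updatecerts'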




Node B: `qm list` also hangs


Node B: `strace -p` on a hanging process gave no new info

Node B: pveversion -v
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2



Node E # pvecm status
Code:
Cluster information
-------------------
Name:             mox
Config Version:   20
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 20 09:53:13 2024
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000004
Ring ID:          1.1691
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
Nodeid      Votes Name
0x00000001          1 10.0.0.5
0x00000002          1 10.0.0.43
0x00000004          1 10.0.0.42 (local)
0x00000006          1 10.0.0.15
0x00000007          1 10.0.0.239



- NTP is monitored on all nodes and is within 0.0001 seconds across them

I don't think I can live-migrate: the UI refuses with "Connection error 595: Connection refused" and the CLI already hangs on `qm list` ... so a reboot will take all the VMs down with it.
 
TLDR: Solved with `killall -9 pmxcfs; pmxcfs`


I did some more digging...


Code:
root@nodeB:~# mount -o remount /etc/pve
/bin/sh: 1: /dev/fuse: Permission denied


Node B~# find /etc/pve
... lists normal output, with "/etc/pve/priv" being the last line, and then it hangs


Code:
ls  /etc/pve/priv/lock/
***HANG***
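
At this point everything pointed at the FUSE mount behind /etc/pve being wedged while the pmxcfs process serving it was stuck. A small sketch of how that can be confirmed (not commands from the original post):

Code:
# /etc/pve should show as a fuse filesystem backed by /dev/fuse
findmnt /etc/pve
# state and wait channel of the pmxcfs process behind it
ps -C pmxcfs -o pid,stat,wchan:32,cmd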


Code:
root@nodeB:~# /usr/bin/pmxcfs
[main] notice: resolved node name 'nodeb' to '10.0.0.43' for default node IP address
[main] notice: unable to acquire pmxcfs lock - trying again

[main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
[main] notice: exit proxmox configuration filesystem (-1)


What finally resolved it:
Code:
ps xaf | grep pmxcfs

kill -9 <PID>

# pmxcfs
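
An alternative, assuming the standard pve-cluster unit: once the stuck pmxcfs process is gone, let systemd bring the cluster filesystem back up through its own service and then restart the daemons that were blocked on it.

Code:
systemctl start pve-cluster
systemctl restart pveproxy pvedaemon pvestatd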