pveproxy stuck

greg

Greetings

In the first days of this new year, my Proxmox cluster is in bad shape...


On one node, "pveproxy" is badly stuck:

Bash:
root     15639  0.0  0.5 295812 89408 pts/26   D     2021   0:00 /usr/bin/perl -T /usr/bin/pvesr status
root     24233  0.0  0.5 283276 83712 ?        Ds    2021   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     23262  0.1  0.5 283252 92200 ?        Ds   15:58   0:00 /usr/bin/perl -T /usr/bin/pveproxy stop

I cannot even force kill it:

Code:
kill -9 15639 24233 23262

does nothing. The service status is weird:

Code:
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2022-01-06 16:05:35 CET; 6min ago
 Main PID: 19825 (code=exited, status=0/SUCCESS)
    Tasks: 2 (limit: 4915)
   Memory: 164.9M
   CGroup: /system.slice/pveproxy.service
           ├─23262 /usr/bin/perl -T /usr/bin/pveproxy stop
           └─24233 /usr/bin/perl -T /usr/bin/pveproxy restart

janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: Killing process 23262 (pveproxy) with signal SIGKILL.
janv. 06 16:01:04 sysv6 systemd[1]: pveproxy.service: Killing process 24233 (pveproxy) with signal SIGKILL.
janv. 06 16:02:35 sysv6 systemd[1]: pveproxy.service: Processes still around after SIGKILL. Ignoring.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: Killing process 24233 (pveproxy) with signal SIGKILL.
janv. 06 16:04:05 sysv6 systemd[1]: pveproxy.service: Killing process 23262 (pveproxy) with signal SIGKILL.
janv. 06 16:05:35 sysv6 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
janv. 06 16:05:35 sysv6 systemd[1]: pveproxy.service: Failed with result 'timeout'.
janv. 06 16:05:35 sysv6 systemd[1]: Stopped PVE API Proxy Server.
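
For anyone landing here with the same symptoms: the `D` in the STAT column of the `ps` output means uninterruptible sleep. The process is blocked inside the kernel, and signals, SIGKILL included, are only acted on once it leaves that state. A minimal sketch to list such processes together with the kernel function they are waiting in; `dstate_only` is just an illustrative helper name, and the `ps` column selection assumes procps as shipped with Debian:

```shell
# Keep the header line plus every process whose state starts with D
# (uninterruptible sleep). Expects `ps -eo pid,stat,wchan:32,args`
# style input on stdin, i.e. the state in the second column.
dstate_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage:
#   ps -eo pid,stat,wchan:32,args | dstate_only
```

If most of the stuck PVE tools show up in that list, whatever they are all waiting on (typically a filesystem or mount) is probably a better place to look than the services themselves.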

Short of unplugging the server, what can I do?

Thanks in advance

Regards
 
BTW I tried various commands, such as
Code:
systemctl restart pveproxy pvedaemon

pvecm updatecerts

They all hang.
 
Also: all PVE-related processes are stuck on half the nodes of the cluster...
 
I had to power-cycle the server. Now it doesn't boot; I guess it's because there are only ZFS partitions and GRUB cannot access any of them.

So basically my server is dead. Happy new year!
 
Good news: with iDRAC I was able to boot it by forcing UEFI; for some reason it wasn't...
 
So back to the original problem... what can I do when all pve commands hang?
 
What does journalctl -u pveproxy -u pvedaemon say on these?

Do all of these cluster nodes have quorum? Does `pvecm status` report `Quorate: Yes` on each affected node?
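
A quick way to check this on every node, as a sketch: the helper just pulls the `Quorate:` line out of `pvecm status` output. `quorate` is a made-up helper name, and the parsing assumes the usual `Quorate:          Yes` layout of the "Quorum information" section.

```shell
# Read `pvecm status` output on stdin and print the Quorate value in
# lower case ("yes" or "no"). Exits non-zero if no Quorate line is found.
quorate() {
    awk -F: '
        /^Quorate:/ { gsub(/[ \t]/, "", $2); print tolower($2); found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage (on each node):
#   pvecm status | quorate
```

Note that `pvecm status` may itself hang on a node where the PVE tools are stuck, so it is probably best to try it on a healthy node first.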
 
Maybe you both can tell us more about your setups? Perhaps a description of how this becomes noticeable, whether it is after a certain action, whether it can be reproduced. What might be in all the logs, what does the system itself look like (monitoring/metrics available?)?

There must be a cause for this and it can certainly be solved.
 
Actually for me this was 2 years ago, so my cluster evolved :) Upgrades were made, and this machine has been retired.
I hope @henryd99 will find a solution! (fingers crossed).

Regards
 
The issue was NTP not being in sync across all machines: the 2 servers it relied on were offline.
This created issues with 1 of the nodes and messed chronosync up, leaving lots of services stuck in a sleep state. This could only be fixed with a reboot. Works now. This is the thread: https://forum.proxmox.com/threads/master-node-in-cluster-cant-restart-pvedeamon-or-pveproxy.138080/
If you have to reboot anything in order to fix what is essentially a time-sync issue, there's something terribly wrong with the setup.
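
To rule time sync in or out quickly, something like the following could be run on each node. This is only a sketch: it assumes chrony is the time daemon and parses the `System time` line of `chronyc tracking`; `chrony_offset` is an illustrative helper name, not part of any tool.

```shell
# Read `chronyc tracking` output on stdin and print the absolute offset
# of the system clock from NTP time, in seconds.
chrony_offset() {
    awk -F: '/^System time/ { split($2, f, " "); print f[1] }'
}

# Usage:
#   chronyc tracking | chrony_offset
#   timedatectl show -p NTPSynchronized --value   # prints yes or no
```

On nodes that use systemd-timesyncd instead of chrony, only the `timedatectl` line applies.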
 
Similar situation here. It started with a cluster node (node A) coming back online after physical maintenance.

15:20: node A back online, no running VMs

I noticed a question mark icon on two other nodes (nodes B and C) in the cluster of 5. I've had these before: sometimes it fixed itself after a day, sometimes after a week, and once I rebooted to get it fixed. (Back then I was able to migrate all VMs away first.)

22:02: I was able to migrate a test VM to node A (from node D), which worked normally.

22:09: node B: pvedaemon.service: State 'stop-sigterm' timed out. Killing.

  • node B: around midnight the load had increased to 50+ due to all the hanging processes; the running VMs still perform normally.
  • node C: no problems

2024-07-20:

  • 11:07 node B systemd[1]: pveproxy.service: Scheduled restart job, restart counter is at 100.
  • node B: load 100+ from the hanging processes, not actually very busy
  • node E reports pvecm status OK, with all 5 nodes in quorum
  • node B: corosync daemon status OK


node B:

Code:
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
Active: deactivating (final-sigkill) (Result: timeout) since Fri 2024-07-19 22:21:44 CEST; 14h ago
Process: 1332441 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=killed, signal=KILL)
Tasks: 115 (limit: 308748)
Memory: 5.0G
CPU: 332ms
CGroup: /system.slice/pveproxy.service
├─ 925626 /usr/bin/perl -T /usr/bin/pveproxy stop
├─ 933079 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 936551 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 940075 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
├─ 943420 /usr/bin/perl /usr/bin/pvecm updatecerts --silent


... and many more lines with updatecerts, followed by a dozen "pveproxy.service: Killing process 1298761 (pvecm) with signal SIGKILL."




Node B: `qm list` also hangs


Node B: `strace -p` on a hanging process gave no new info

Node B: pveversion -v
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2



Node E # pvecm status
Code:
Cluster information
-------------------
Name:             mox
Config Version:   20
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 20 09:53:13 2024
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000004
Ring ID:          1.1691
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
Nodeid      Votes Name
0x00000001          1 10.0.0.5
0x00000002          1 10.0.0.43
0x00000004          1 10.0.0.42 (local)
0x00000006          1 10.0.0.15
0x00000007          1 10.0.0.239



NTP is monitored on all nodes, and the offsets are within 0.0001 seconds.

I don't think I can live-migrate: the UI refuses with "Connection error 595: Connection refused" and the CLI already hangs on `qm list`, so if I reboot I have to take all the VMs down with the node.
 
TLDR: Solved with `killall -9 pmxcfs; pmxcfs`


I did some more digging...


Code:
root@nodeB:~# mount -o remount /etc/pve
/bin/sh: 1: /dev/fuse: Permission denied


Node B~# find /etc/pve
.. lists normal output ... with "/etc/pve/priv" being the last line and then it hangs


Code:
ls  /etc/pve/priv/lock/
***HANG***
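
Since a plain `ls` or `find` on `/etc/pve` blocks the shell when this happens, it can help to probe the mount from a background job instead. A rough sketch, not an official tool: `probe` is a hypothetical helper that runs a command in a background subshell and only reports whether it returned within the given number of seconds. It deliberately avoids `timeout`, which waits for its child to exit and so can get stuck as well if the child cannot be killed.

```shell
# probe SECONDS COMMAND [ARGS...]
# Returns 0 if COMMAND finished (with any exit status) within SECONDS,
# 1 if it is still running after that, 2 if no temp file could be created.
# The command is never waited on, so a child stuck in the kernel cannot
# block the caller.
probe() {
    limit=$1
    shift
    marker=$(mktemp) || return 2
    # The subshell writes to the marker file only after the command returns.
    ( "$@" >/dev/null 2>&1; echo done >"$marker" ) &
    elapsed=0
    while [ "$elapsed" -lt "$limit" ]; do
        if [ -s "$marker" ]; then
            rm -f "$marker"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last check after the final sleep.
    if [ -s "$marker" ]; then
        rm -f "$marker"
        return 0
    fi
    return 1
}

# Usage:
#   probe 5 ls /etc/pve/priv/lock || echo "/etc/pve is not answering"
```

If the command never returns, the marker file stays behind in the temp directory, which seemed an acceptable price for never blocking.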


Code:
root@nodeB:~# /usr/bin/pmxcfs
[main] notice: resolved node name 'nodeb' to '10.0.0.43' for default node IP address
[main] notice: unable to acquire pmxcfs lock - trying again

[main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
[main] notice: exit proxmox configuration filesystem (-1)


What finally resolved it:
Code:
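# find the PID of the stuck pmxcfs process
ps xaf | grep pmxcfs

# force-kill it, replacing <PID> with the PID found above
kill -9 <PID>

# start pmxcfs again (as root)
pmxcfs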
 
