pveproxy becomes blocked (D state) and cannot be killed

Nope, we have since replaced all production machines this error occurred on, as there seems to be no fix for this fairly common problem (they were older AMD machines anyway). My testing machine, however, still runs into this mysterious problem every few weeks or so if left on long enough...

The only thing I could add is that my suspicion still is with ZFS, because I've only seen the error occur on ZFS RAIDZ machines...
Mine is a ZFS mirror on NVMe disks, CPU E3-1270v6, ECC RAM. Standalone Proxmox works fine; only clustering has this problem.
 
Good afternoon. I ran into the same problem. I have a cluster of 4 servers. On 3 of the 4, pveproxy does not work; it hangs in the "Ds" state.

root 34194 0.0 0.0 239612 66012 ? Ds 11:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start

# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.13-2-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.2.2-1-pve: 4.2.2-16
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-96
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
 
The NFS share is mounted, and it is available.

pvesm status
WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
hetzner_backup_0 nfs 1 10735915008 86672384 10649226240 1.31%
local dir 1 163036644 21889292 141130968 13.93%
qsan-01 lvm 1 6884945920 4439179264 2445766656 64.98%
ssd-local dir 1 3457442276 3261589540 20201720 99.88%
v3700ctrl1 iscsi 1 0 0 0 100.00%
vmstore lvm 1 6442385408 6320885760 121499648 98.61%
vmstore_01 lvm 1 7516127232 5761925120 1754202112 77.16%
vmstore_02 lvm 1 15355342848 15351152640 4190208 100.47%


ls -la /mnt/pve/hetzner_backup_0
total 32
drwxrwxrwx 5 backup backup 4096 Jul 31 17:14 .
drwxr-xr-x 3 root root 4096 Jun 21 16:26 ..
drwxr-xr-x 3 backup backup 4096 Aug 1 13:11 dump
drwx------ 2 backup backup 16384 May 29 18:16 lost+found
-rw-r--r-- 1 backup backup 0 May 29 18:19 _NFS
drwxr-xr-x 4 backup backup 4096 May 31 14:01 template
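
A plain ls can succeed while individual operations still hang, so a probe with a timeout is a bit more telling. A rough sketch, using the mount point from the listing above (note that on a truly hard-hung mount even this probe can get stuck in D state, which is itself a strong hint):

# Probe the NFS mount with a timeout: a healthy mount answers right away.
timeout -k 5 5 stat /mnt/pve/hetzner_backup_0 >/dev/null \
    && echo "mount responds" \
    || echo "mount did not respond within 5s - suspect NFS"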
 
And if the storage is OK, check whether you're actively swapping at the moment.

When executing the command
vmstat 2

the columns si/so (swap in / swap out) should be around 0 most of the time.
NB: used swap as seen with free/top is totally fine! It's *active* swapping, when the values si/so constantly change, that we try to avoid.
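
If you want to watch for this without staring at the screen, a one-liner along these lines (just a sketch; si and so are columns 7 and 8 of the vmstat output) prints a line only while the host is actively swapping:

# Sample every 2 seconds; print a warning whenever si (swap-in) or
# so (swap-out) is non-zero, i.e. the host is actively swapping.
vmstat 2 | awk 'NR > 2 && ($7 + $8 > 0) { print "active swapping: si=" $7 " so=" $8 }'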
 
At the moment there is some swap activity.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 0 15709016 36706880 198028 1721140 0 0 473 434 0 0 7 2 91 0 0
10 1 15709000 36706784 198028 1721172 12 0 6635 10138 32683 66340 5 27 67 1 0
12 0 15708984 36703236 198032 1721176 12 0 3439 11432 31073 88710 6 21 73 0 0
9 0 15708984 36686760 198032 1721180 0 0 3248 18598 32198 65791 9 23 68 0 0
9 0 15708976 36679480 198032 1721180 4 0 4118 17468 27161 48179 7 22 71 0 0
16 0 15708964 36679168 198040 1721180 6 0 5982 19964 30380 63234 7 16 77 0 0
13 0 15708892 36680800 198040 1721176 44 0 5318 16274 26140 42718 5 16 77 1 0
12 0 15708884 36671328 198048 1721184 4 0 4108 25078 24655 52191 6 15 80 0 0
11 0 15708876 36668220 198048 1721200 6 0 4434 24136 23467 54419 5 23 72 0 0
6 0 15708860 36669632 198052 1721196 12 0 5862 23472 24619 54578 5 17 78 0 0


I will wait until the swap is cleared and try again.
 
Is the process still in D state? (That's the important state.)
As soon as this process is in D state, please send:

dmesg -T | ack "blocked for more than 120 seconds"
the first 10 lines of vmstat --unit M 2 output

I would also like to know: are you using mechanical hard drives or SSDs?
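
If it does get stuck again, the kernel stack of the blocked process is also worth a look. A rough sketch (needs root, and a kernel that exposes /proc/<pid>/stack; the pgrep pattern is just an assumption about how the process shows up in ps):

# Grab the PID of the stuck pveproxy and dump where it is blocked in the kernel.
PID=$(pgrep -of 'pveproxy start')
cat /proc/"$PID"/wchan; echo     # single symbol the task is currently waiting in
cat /proc/"$PID"/stack           # full kernel stack; may be empty on hardened kernels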
 
The system is installed on mechanical hard drives.

Below I tried to start pveproxy and followed the recommendations.

# free -h
total used free shared buffers cached
Mem: 314G 196G 118G 4.2G 390M 5.6G
-/+ buffers/cache: 190G 124G
Swap: 34G 0B 34G

# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Tue 2017-08-01 11:48:34 MSK; 1 day 4h ago
Main PID: 106838 (code=exited, status=0/SUCCESS)

Aug 01 11:45:34 pve-01 systemd[1]: pveproxy.service start operation timed out. Terminating.
Aug 01 11:47:04 pve-01 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Aug 01 11:48:34 pve-01 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Aug 01 11:48:34 pve-01 systemd[1]: Failed to start PVE API Proxy Server.
Aug 01 11:48:34 pve-01 systemd[1]: Unit pveproxy.service entered failed state.

# systemctl start pveproxy.service
Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.

# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Wed 2017-08-02 16:01:46 MSK; 11min ago
Main PID: 106838 (code=exited, status=0/SUCCESS)

Aug 02 15:58:46 pve-01 systemd[1]: pveproxy.service start operation timed out. Terminating.
Aug 02 16:00:16 pve-01 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Aug 02 16:01:46 pve-01 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Aug 02 16:01:46 pve-01 systemd[1]: Failed to start PVE API Proxy Server.
Aug 02 16:01:46 pve-01 systemd[1]: Unit pveproxy.service entered failed state.

# ps aux | grep proxy
root 84839 0.0 0.0 239688 66060 ? Ds 15:57 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 95849 0.0 0.0 12728 2264 pts/0 S+ 16:08 0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn proxy

# dmesg -T | ack "blocked for more than 120 seconds"
--> No result

# vmstat --unit M 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 121239 391 5779 0 0 530 423 0 0 4 1 96 0 0
2 0 0 121240 391 5779 0 0 8106 11692 15289 56890 9 1 90 0 0
3 0 0 121239 391 5779 0 0 9950 11780 15257 52021 8 1 92 0 0
5 0 0 121240 391 5779 0 0 17534 12906 15075 52346 7 1 92 0 0
6 0 0 121240 391 5779 0 0 10518 8698 14516 50207 9 1 90 0 0
6 0 0 121240 391 5779 0 0 8857 12230 17148 57759 8 1 91 0 0
3 0 0 121243 391 5779 0 0 8208 11890 14731 51263 7 1 92 0 0
2 0 0 121243 391 5779 0 0 13388 11658 14099 49490 7 1 92 0 0
7 0 0 121243 391 5779 0 0 12550 10178 16194 57570 9 1 90 0 0
2 0 0 121242 391 5779 0 0 17160 10488 15526 51123 8 1 91 0 0
6 0 0 121243 391 5779 0 0 8541 11636 17159 56989 9 1 90 0 0
4 0 0 121244 391 5779 0 0 38004 7576 15762 49063 10 1 90 0 0
3 0 0 121243 391 5779 0 0 9136 12454 16873 60225 9 1 91 0 0
3 0 0 121243 391 5779 0 0 39230 9784 16795 62843 9 1 90 0 0
2 0 0 121243 391 5779 0 0 26886 9410 14884 51273 9 1 90 0 0
4 0 0 121243 391 5779 0 0 19485 11136 17958 62591 10 1 89 0 0
 
Is pveproxy the only process in D state?
Try the following command to check that:

ps faxl | awk '$10~"D" {print $0}'

Sorry, the command for checking the kernel log is:
dmesg | grep "blocked for more than 120 seconds"
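
If the D state is intermittent, a small logging loop like this (just a sketch; file path and interval are arbitrary) makes it easier to correlate the hangs with storage or migration activity in syslog later:

# Log every process in uninterruptible sleep ("D") once per 10 seconds.
while true; do
    date '+%F %T'
    ps -eo state,pid,ppid,wchan:32,cmd | awk '$1 ~ /^D/'
    sleep 10
done >> /var/tmp/d-state.log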
 
# ps faxl | awk '$10~"D" {print $0}'
4 0 30942 30941 20 0 19260 3768 iterat D ? 0:00 \_ git add .
4 0 84839 1 20 0 239688 66060 filena Ds ? 0:00 /usr/bin/perl -T /usr/bin/pveproxy start

# dmesg | grep "blocked for more than 120 seconds"
--> no result
 
Good day. I managed to work around the problem using the following commands:

# systemctl restart pve-cluster.service
# pvecm updatecerts -f
# systemctl restart pveproxy.service


Strange situation. The old certificates were valid until 2025.
After pvecm updatecerts -f,
pveproxy started and everything works.
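
For what it's worth, the certificate dates can be checked directly; the paths below are the usual PVE locations (adjust if yours differ), and reading them goes through /etc/pve, so pve-cluster has to be working:

# Show the validity window of the node certificate and the cluster CA.
openssl x509 -in /etc/pve/local/pve-ssl.pem -noout -subject -dates
openssl x509 -in /etc/pve/pve-root-ca.pem -noout -subject -dates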

In a couple of days I'll let you know how everything works.
 
This solution does not help for long.

Today the pveproxy process got stuck again on 3 nodes.

And restarting pveproxy without restarting pve-cluster is impossible.

What could be the reason for this behavior?
 
Maybe this is a coincidence, but after I disabled NFS on all nodes, pveproxy began to work more stably.
 
Maybe it's not related to your problems, but I want to share my experience so you can double-check your storage health: on my 2-node cluster running Proxmox VE 3.4 with ZFS and DRBD, I just had some I/O troubles leading to pveproxy being stuck at 100%. After one hour looking for any possible problem, I figured out that one of the spinning disks mirrored with ZFS was experiencing CKSUM errors: the command zpool status showed all the disks as ONLINE, but one of them had CKSUM 2 instead of 0, and the status reported "One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."
It turned out that the disk had troubles, but not enough to be marked as broken by ZFS. So ZFS was periodically saturating I/O because it needed to fetch data from the other disks, checksum it, and re-write it to the partially broken device. I issued zpool offline rpool sdd3 (it is a 3-way RAID1 mirror, so even with one disk offlined I still have redundancy) and finally all the problems are gone.
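
A quick way to spot that situation (just a sketch; error counters are formatted slightly differently between ZFS versions, e.g. with K/M suffixes on newer releases):

# "zpool status -x" prints something other than "all pools are healthy" only
# when a pool has a problem; the awk pass additionally names any vdev whose
# READ/WRITE/CKSUM counters (columns 3-5 of the config section) are non-zero.
zpool status -x
zpool status | awk '$3 ~ /^[0-9]+$/ && $3 + $4 + $5 > 0 { print "errors on " $1 ": READ=" $3 " WRITE=" $4 " CKSUM=" $5 }'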
 
I have had this happen a few times with 4.4. Each time the problem is that corosync isn't running on one of the machines. It's always the machine not showing the 'D' states.
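
That is quick to rule out with something like the loop below (a sketch; node1..node4 are placeholder hostnames, and it assumes root SSH between the nodes):

# On every node: are corosync and pmxcfs alive, and does the cluster have quorum?
for n in node1 node2 node3 node4; do
    echo "== $n =="
    ssh root@"$n" 'systemctl is-active corosync pve-cluster; pvecm status | grep -E "Quorate|votes"'
done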
 
Even though this started ages ago and the last post is from April, I think I'll join the party.
Had this today on PVE 6.
3-node cluster; node 1 was sending a lot of migration data to node 3.
While in the middle of the migration, I had the great idea to migrate an empty template (basically just a machine definition with no disks or storage).
This was the moment pveproxy locked up.
I could not restart the cluster service, as it told me it cannot access /etc/pve (transport endpoint not connected).

All right, so I tried to cd into /etc/pve, and hey, now the shell is in a locked state too.
corosync is running. It shows everything is fine on the other nodes too.
The migration process is still running (but does not show any information), yet data keeps coming in...
So I really don't want to reboot; I don't want to re-transfer the 2 TB that are already on the destination...

A little digging brought me to this:
https://commitandquit.wordpress.com/2016/10/29/proxmox-etcpve-blocked/

And that's actually the solution. I deleted /var/lib/pve-cluster/.pmxcfs.lockfile on the machine that was locked.
Restarted the cluster service (this time it worked) and restarted pveproxy.
Now everything is up and running again.
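
For the record, the sequence boils down to roughly this (only on the affected node, and only after making sure pmxcfs is really no longer running):

# Stop the cluster filesystem, verify pmxcfs is gone, remove the stale
# lock file, then bring everything back up.
systemctl stop pve-cluster
pgrep -a pmxcfs                               # should print nothing before continuing
rm /var/lib/pve-cluster/.pmxcfs.lockfile
systemctl start pve-cluster
systemctl restart pveproxy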

Even the migration process now shows proper data. I don't know if the migration will succeed because it's still running; we will see.
At least I got out of the locked state without rebooting everything.
 
