"pct list" time out

greg

Renowned Member
Apr 6, 2011
Greetings
On my new Proxmox 6.4-13 node, something got "stuck": pct list hangs, and so does stopping any pve-<whatever> service. In the log I see lines like:

Code:
systemd[1]: pvestatd.service: Stopping timed out. Terminating.
scwv10 systemd[1]: pvedaemon.service: State 'stop-sigterm' timed out. Killing
scwv10 systemd[1]: pvedaemon.service: Killing process 40734 (pvedaemon) with signal SIGKILL.
scwv10 systemd[1]: pvestatd.service: State 'stop-sigterm' timed out. Killing.

etc

Note that kill -9 <pct process> doesn't work.
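One possible explanation (an assumption based on the symptoms, not something confirmed here) is that the process is stuck in uninterruptible sleep (state "D"), typically waiting on I/O inside the kernel; in that state signals, including SIGKILL, only take effect once the blocking call returns. A minimal check, with <pid> as a placeholder:

Bash:
# STAT "D" means uninterruptible sleep; WCHAN shows the kernel function it is waiting in
ps -o pid,stat,wchan:32,cmd -p <pid>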

I don't understand what could block "pct list"... any idea?

Thanks in advance

Regards
 
It looks like you have an I/O issue on your node; please post the output of the commands below for more information:

Bash:
uptime
free
pveversion -v
 
Hello
I'm still having this problem across several nodes.

'top' and 'iotop' say the machine is basically idle.

Code:
# uptime
 09:24:30 up 400 days,  9:22, 22 users,  load average: 21,08, 20,82, 20,71
 
 # free
              total        used        free      shared  buff/cache   available
Mem:       65755788    43674868    18835076     1731256     3245844    19641952
Swap:      20971512     2162860    18808652

# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-12
pve-kernel-helper: 6.4-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
pve-zsync: 2.2
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
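For what it's worth, a load average around 21 on a machine that top/iotop report as idle usually means tasks blocked in uninterruptible sleep (which count towards the load average) rather than CPU load. A small sketch to list them, plus the kernel's hung-task messages if the detector is enabled:

Bash:
# processes currently in uninterruptible sleep ("D") and what they are waiting on
ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
# kernel hung-task reports, if any
dmesg | grep -i "blocked for more than"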
 
Hello,

Do you see an IO delay on your node (Datacenter -> NodeName -> Summary -> IO delay)?

Have you tried to restart the pvestatd service?
Bash:
pvestatd restart
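If that restart also hangs, the systemd view of the PVE services is usually still responsive and can show which units are stuck; a sketch:

Bash:
# unit states of the main PVE services (read-only query)
systemctl status pvestatd pvedaemon pveproxy pve-cluster --no-pager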
 
Thanks for your answer.
pvestatd restart hangs, like most pve commands.
I cannot reach the GUI on the node itself; it says "Connection failure. Network error or Proxmox VE services not running?". On the GUIs of the other nodes the node is greyed out with "permission denied - invalid PVE ticket (401)" (it wasn't yesterday).
On the other nodes, IO delay is 0, except for one node where it fluctuates a lot between 0 and 30%.

As an example, I ran pct delsnapshot yesterday and there's no output yet.
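One guess worth checking (an assumption based on the symptoms: almost every pve command, including pct, reads its configuration from /etc/pve, the FUSE mount provided by pve-cluster/pmxcfs): whether that mount is the thing that is blocked.

Bash:
# wrapped in timeout so the check itself cannot hang forever
timeout 5 ls /etc/pve >/dev/null && echo "pmxcfs answers" || echo "pmxcfs blocked or timed out"
timeout 5 pvecm status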
 
Hmmm,

Can you attach the syslog (`/var/log/syslog`)?

Are you sure the entries in /etc/hosts and /etc/hostname are correct?
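A quick way to verify them, assuming the usual expectation that the node's hostname resolves to its own cluster IP (not 127.0.0.1):

Bash:
hostname
cat /etc/hostname
# both of these should return the node's cluster IP
getent hosts "$(hostname)"
hostname --ip-address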
 
/etc/hosts and /etc/hostname seem to be fine (they are the same as they have always been).

syslog doesn't show anything but this:

Code:
Feb  7 12:58:02 sysv5 corosync[12754]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Feb  7 12:58:02 sysv5 corosync[12754]:   [TOTEM ] A new membership (1.8685a) was formed. Members
Feb  7 12:58:02 sysv5 corosync[12754]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb  7 12:58:02 sysv5 corosync[12754]:   [MAIN  ] Completed service synchronization, ready to provide service.

and then:

Code:
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] link: host: 5 link: 0 is down
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] host: host: 5 has no active links
Feb  7 13:03:45 sysv5 corosync[12754]:   [KNET  ] rx: host: 5 link: 0 is up
Feb  7 13:03:45 sysv5 corosync[12754]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)

Not sure what it is, but it's always been like this.
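Even if it has always looked like this, the link flapping may be worth a closer look; corosync can report the current state of its links, and pvecm shows quorum and membership as seen by this node:

Bash:
# per-link status of the local corosync node
corosync-cfgtool -s
# quorum and membership from this node's point of view
pvecm status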
 
I was able to "unfreeze" all the hang pve command like this:

- disconnected the private network used for cluster communication
- pvecm e 1
- systemctl restart pve-cluster.service

The GUI is now available. Now I'll try to bring the node back into the cluster.
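For reference, the three steps above roughly as commands (a sketch: the interface name is a placeholder for whatever carries the corosync traffic, and pvecm e is short for pvecm expected):

Bash:
# 1) take the cluster network down (interface name is site-specific)
ip link set <cluster-iface> down
# 2) lower the expected vote count so the isolated node keeps quorum
pvecm expected 1
# 3) restart the cluster filesystem so /etc/pve and the pve commands unblock
systemctl restart pve-cluster.service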
 
It seems to be the same on other nodes: to be able to do anything, I have to "isolate" the node... :(