"pct list" time out

greg

Greetings
On my new Proxmox 6.4-13 node, something got "stuck": pct list, or stopping any pve-* service, hangs. In the log I see lines like:

Code:
systemd[1]: pvestatd.service: Stopping timed out. Terminating.
scwv10 systemd[1]: pvedaemon.service: State 'stop-sigterm' timed out. Killing
scwv10 systemd[1]: pvedaemon.service: Killing process 40734 (pvedaemon) with signal SIGKILL.
scwv10 systemd[1]: pvestatd.service: State 'stop-sigterm' timed out. Killing.

etc

Note that kill -9 <pct process> doesn't work.

I don't understand what could block "pct list"... any idea?
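For what it's worth, a process that survives kill -9 is almost always stuck in uninterruptible sleep (state D), typically waiting on I/O or on a hung FUSE mount such as /etc/pve. A quick way to check (a sketch; <pid> stands for the hung pct PID):

Bash:
# show the state and kernel wait channel of the hung process
ps -o pid,stat,wchan:32,cmd -p <pid>

# STAT "D" means uninterruptible sleep; the kernel stack (root only)
# shows where it is blocked
cat /proc/<pid>/stack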

Thanks in advance

Regards
 
It looks like you have an I/O problem on your node; please post the output of the commands below for more information:

Bash:
uptime
free
pveversion -v
 
Hello
I'm still having this problem across several nodes.

'top' and 'iotop' say the machine is basically idle.

Code:
# uptime
 09:24:30 up 400 days,  9:22, 22 users,  load average: 21,08, 20,82, 20,71
 
 # free
              total        used        free      shared  buff/cache   available
Mem:       65755788    43674868    18835076     1731256     3245844    19641952
Swap:      20971512     2162860    18808652

# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-12
pve-kernel-helper: 6.4-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
pve-zsync: 2.2
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
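Side note: a load average of ~21 on a machine that top and iotop both report as idle points at tasks stuck in state D, since uninterruptible sleepers count toward load without using any CPU. A minimal sketch to list them (standard procps ps):

Bash:
# list tasks in uninterruptible sleep (state D) with their kernel wait channel
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'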
 
Hello,

Do you see IO delay on your node? (Datacenter -> NodeName -> Summary -> IO delay section)

Have you tried restarting the pvestatd service?
Bash:
pvestatd restart
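If the daemon's own CLI hangs, the equivalent restart through systemd (the service is pvestatd.service, as seen in the log above) can be tried instead:

Bash:
# restart pvestatd via systemd rather than the daemon's own CLI
systemctl restart pvestatd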
 
Thanks for your answer.
pvestatd restart hangs, like most pve commands.
I cannot see the GUI for the node: it says "Connection failure. Network error or Proxmox VE services not running?", and on the GUI of the other nodes it is greyed out ("permission denied - invalid PVE ticket (401)"). It wasn't like that yesterday.
On the other nodes, iodelay is 0, except for one node where it fluctuates a lot between 0 and 30%.

As an example, I ran pct delsnapshot yesterday and there's no output yet.
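A delsnapshot that produces no output for a day is most likely blocked on a lock under /etc/pve (the pmxcfs FUSE mount). One way to see which processes are holding that mount busy (a sketch; fuser is part of the psmisc package):

Bash:
# list processes with open files on the pmxcfs mount
fuser -vm /etc/pve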
 
Hmmm,

Can you attach the Syslog? `/var/log/syslog`

Are you sure about the entries in /etc/hosts and /etc/hostname?
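For reference, Proxmox expects the node's hostname to resolve to the IP the cluster uses. A typical /etc/hosts looks like this (the IP and domain here are placeholders, not taken from this setup):

Code:
127.0.0.1   localhost
192.0.2.10  sysv5.example.com sysv5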
 
/etc/hosts and /etc/hostname seem to be fine (they are the same as they always have been).

syslog doesn't show anything but this:

Code:
Feb  7 12:58:02 sysv5 corosync[12754]:   [QUORUM] Sync members[7]: 1 2 3 4 5 6 7
Feb  7 12:58:02 sysv5 corosync[12754]:   [TOTEM ] A new membership (1.8685a) was formed. Members
Feb  7 12:58:02 sysv5 corosync[12754]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb  7 12:58:02 sysv5 corosync[12754]:   [MAIN  ] Completed service synchronization, ready to provide service.

and then:

Code:
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] link: host: 5 link: 0 is down
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Feb  7 13:03:41 sysv5 corosync[12754]:   [KNET  ] host: host: 5 has no active links
Feb  7 13:03:45 sysv5 corosync[12754]:   [KNET  ] rx: host: 5 link: 0 is up
Feb  7 13:03:45 sysv5 corosync[12754]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)

Not sure what it is, but it's always been like this.
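Those KNET lines mean the corosync link to node 5 dropped and came back a few seconds later. Even occasional flapping matters, because pmxcfs (/etc/pve) blocks cluster-wide while quorum is renegotiated. The current link state can be checked with corosync's own tool:

Bash:
# show knet link status to every other node, as corosync sees it
corosync-cfgtool -s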
 
I was able to "unfreeze" all the hung pve commands like this (commented sketch below):

- disconnected the private network used for cluster communication
- pvecm e 1
- systemctl restart pve-cluster.service
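For context, the same sequence with comments (a sketch; pvecm e appears to be shorthand for pvecm expected, which lowers the vote count needed for quorum so the isolated node's /etc/pve becomes writable again):

Bash:
# with the cluster network unplugged, tell corosync one vote is enough for quorum
pvecm expected 1
# restart pmxcfs so /etc/pve unblocks and the hung pve commands can proceed
systemctl restart pve-cluster.service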

The GUI is now available. Now I'll try to bring the node back into the cluster.
 
It seems to be the same on the other nodes: to be able to do anything, I have to "isolate" the node... :(
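If every node only becomes responsive once it is cut off from the cluster network, the corosync/pmxcfs layer is the prime suspect rather than local I/O. Comparing quorum state on each node may narrow it down:

Bash:
# on each node: membership, vote and quorum information
pvecm status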
 
