pct list endless

decibel83

Hi,
on my Proxmox host I can no longer run pct list: it never returns and produces no output:

Code:
root@node11:~# pct list

(no return to console...)

If I run it under strace I see an endless series of select() timeouts, but I cannot work out which program is causing them:

Code:
root@node11:~# strace pct list
execve("/usr/sbin/pct", ["pct", "list"], [/* 19 vars */]) = 0
brk(NULL)                               = 0x562d5ca9d000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff2ffe02000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=39906, ...}) = 0
mmap(NULL, 39906, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff2ffdf8000
close(3)                                = 0
[...]
close(5)                                = 0
close(8)                                = 0
close(11)                               = 0
getpid()                                = 4241
close(6)                                = 0
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(16, [7 9], NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[...]
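
One way to see what the select() call is waiting on is to look up file descriptors 7 and 9 under /proc while the command is still hanging. This is only a minimal sketch, assuming a Linux /proc filesystem; the helper name describe_fds is made up for illustration:

```shell
#!/bin/sh
# describe_fds PID FD...
# Print what each listed file descriptor of a running process points to,
# as found under /proc/PID/fd (Linux only). Hypothetical helper name.
describe_fds() {
    pid=$1
    shift
    for fd in "$@"; do
        # readlink fails if the descriptor is closed or the process is gone
        target=$(readlink "/proc/$pid/fd/$fd" 2>/dev/null) || target="(not open or not readable)"
        printf 'fd %s -> %s\n' "$fd" "$target"
    done
}

# Possible usage from a second terminal while "pct list" hangs:
#   describe_fds "$(pgrep -o -f 'pct list')" 7 9
```

If the targets turn out to be pipes, the process is probably waiting for output from a child process, so the process tree is the next place to look.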

I don't see any errors in the syslog, but this node is displayed as unknown in the Proxmox GUI:

Screenshot 2018-10-21 at 19.21.38.png

This is my pveversion:

Code:
root@node11:/# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.13.4-1-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.4-1-pve: 4.13.4-26
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Could you help me please?
 
From the screenshot I conclude that you run a cluster. Is the cluster healthy? Check with:

Code:
pvecm status

If not, check your network etc.

If so, restart some services:

Code:
systemctl restart pvestatd.service
systemctl restart corosync.service 
systemctl restart pveproxy.service
systemctl restart pve-cluster.service
 
Yes, the cluster is healthy:

Code:
root@node11:/# pvecm status
Quorum information
------------------
Date:             Fri Oct 26 21:53:55 2018
Quorum provider:  corosync_votequorum
Nodes:            11
Node ID:          0x00000002
Ring ID:          8/2380
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   11
Highest expected: 11
Total votes:      11
Quorum:           6 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000008          1 192.168.60.1
0x0000000a          1 192.168.60.2
0x00000007          1 192.168.60.3
0x00000009          1 192.168.60.4
0x00000001          1 192.168.60.5
0x00000003          1 192.168.60.6
0x00000004          1 192.168.60.7
0x00000005          1 192.168.60.8
0x0000000b          1 192.168.60.9
0x00000006          1 192.168.60.10
0x00000002          1 192.168.60.11 (local)
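
The quorum check can also be scripted, for example to run it repeatedly while waiting for the problem to come back. A minimal sketch that only looks at the "Quorate:" line of the output shown above; the function name is_quorate is an assumption, not part of Proxmox:

```shell
#!/bin/sh
# is_quorate: read "pvecm status" output on stdin and succeed
# only if it contains a "Quorate: Yes" line. Hypothetical helper name.
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit (found ? 0 : 1) }'
}

# Possible usage on a cluster node:
#   pvecm status | is_quorate && echo "cluster is quorate"
```

Note that this only confirms quorum. As this thread shows, a quorate cluster does not rule out a hanging pvestatd on a single node.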

I restarted the services and node11 is now green in the web interface.
But I had already tried this before writing the first time (sorry, I didn't mention it), and after a few minutes the situation reverted to what you see in my screenshot.

And even though node11 is green now, I cannot run pct at all, and the containers are displayed in grey with a question mark:

Screenshot 2018-10-26 at 21.53.35.png
 
Has anyone been able to figure out how to fix this without a full reboot?

This seems to me to be related to a kernel upgrade. Maybe I'm wrong, I don't know.
 
I have exactly the same problem when a backup job starts. It happens randomly.
Were you able to solve this issue?
 
So when a backup job starts, the node is marked with a grey question mark icon in the web GUI, and the same goes for all resources on it?
 
Yes, exactly. I suspect it is related to I/O pressure, but I'm not sure.

When I try to debug it, I always have to reboot the whole server eventually to make services available again. It's really annoying.

Another weird thing I noticed: services are not reachable for inbound connections, but they still seem able to initiate outbound connections themselves.

There is nothing in any journalctl unit when it happens. Restarting pvestatd makes the VMs available again, but containers are still greyed out.
Any attempt to use pct commands will hang.

What I'm suspecting is that pvestatd gets stuck with a lxc-info command, which triggers the grey question marks. However, I don't know why the pct/lxc-info commands are hanging.
 
You may try to limit the speed of a backup. If I remember correctly this limit affects also the read speed on the source side.

Datacenter --> Backup --> Edit:Backup Job --> Advanced --> Bandwidth Limit

The other mechanism to reduce stress is Fleecing, in the same dialog, but this works only for VMs.
 
I have already limited the backup I/O.
The problem is that when I try to troubleshoot it, there is hardly any activity on the disks; I/O is very low.
 
Where do you store the backups, e.g. on remote storage?
You might use strace to check where it hangs:
Code:
strace -f -p $(pgrep pvestatd)

Or, if you know a specific pct command that triggers the stall:
Code:
strace -f pct <command> <ctid>

Or, for the lxc-info command:
Code:
strace -f lxc-info -n <ctid>

The output stops when the process exits at the latest, or earlier if it stalls in some call.
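
Since a hanging command blocks the terminal, it can help to run it under a time limit. A rough sketch using timeout from coreutils; the function name probe_hang and the trace file path in the usage comment are made up for illustration:

```shell
#!/bin/sh
# probe_hang SECONDS COMMAND [ARGS...]
# Run COMMAND under coreutils "timeout" and report whether it finished.
# Hypothetical helper name. timeout exits with status 124 when the limit is hit.
probe_hang() {
    secs=$1
    shift
    status=0
    timeout "$secs" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung: no exit within ${secs}s"
    else
        echo "finished: exit status $status"
    fi
}

# Possible usage, writing the trace to a file for later inspection:
#   probe_hang 10 strace -f -o /tmp/pct-list.trace pct list
#   tail /tmp/pct-list.trace
```

One caveat: timeout sends SIGTERM by default, which a process in uninterruptible sleep (state D) will not act on, so in that case the wrapper itself may block as well.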
 
Hi,
another way to find out where pvestatd is stuck is to check the process tree (e.g. with ps faxl) to see whether it has spawned a child process. You can also use the following to see where in the Perl code it is:
Code:
apt install perl-stacktrace
perl-stacktrace $(pgrep pvestatd)
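
The process tree check can be narrowed down to just the children of pvestatd. A small sketch that filters ps output by parent PID; children_of is a hypothetical helper name, and the column layout is the one requested in the usage comment:

```shell
#!/bin/sh
# children_of PID: read "pid ppid stat etimes args..." lines on stdin
# and print those whose parent PID (second column) is PID.
# Hypothetical helper name.
children_of() {
    awk -v parent="$1" '$2 == parent'
}

# Possible usage (etimes is the elapsed run time in seconds):
#   ps -eo pid=,ppid=,stat=,etimes=,args= | children_of "$(pgrep -o pvestatd)"
```

A child that has been running for a long time, going by the etimes column, is a likely candidate for the call that blocks the status update.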
 
Ciao everyone,

Sorry for my late answer. The problem did not happen until today. You can find the strace attached.

When running systemctl status pvestatd, I can see that it hangs for lxc-info -n 103 -p:
Code:
root@RBT-HPRV:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/usr/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-03-19 08:41:26 CET; 1 week 4 days ago
 Invocation: a1e6a2eb4a584099a94160f1271bdfc7
   Main PID: 2792 (pvestatd)
      Tasks: 2 (limit: 154181)
     Memory: 212.8M (peak: 242.8M)
        CPU: 2d 17h 8min 55.180s
     CGroup: /system.slice/pvestatd.service
             ├─  2792 pvestatd
             └─744244 lxc-info -n 103 -p

Mar 30 18:20:05 RBT-HPRV pvestatd[2792]: status update time (5.297 seconds)
Mar 30 18:23:05 RBT-HPRV pvestatd[2792]: status update time (5.270 seconds)
Mar 30 18:24:05 RBT-HPRV pvestatd[2792]: status update time (5.040 seconds)
Mar 30 18:26:05 RBT-HPRV pvestatd[2792]: status update time (5.539 seconds)
Mar 30 18:28:05 RBT-HPRV pvestatd[2792]: status update time (5.329 seconds)
Mar 30 18:31:05 RBT-HPRV pvestatd[2792]: status update time (5.317 seconds)
Mar 30 22:57:43 RBT-HPRV pvestatd[2792]: auth key pair too old, rotating..
Mar 31 09:25:40 RBT-HPRV pvestatd[2792]: status update time (6.597 seconds)

Code:
root@RBT-HPRV:~# pstree -a -p 2792
pvestatd,2792
  └─lxc-info,744244 -n 103 -p

Code:
root@RBT-HPRV:~# perl-stacktrace $(lxc-info -n 103 -p)

^C (hangs)

root@RBT-HPRV:~#



Nothing seems to be eating all resources and/or IO

Code:
root@RBT-HPRV:~# iostat
Linux 6.17.2-2-pve (RBT-HPRV)   03/31/26        _x86_64_        (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.45    0.18    3.73    0.54    0.00   85.11

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda             114.63       403.60       622.93         0.00  418565942  646022984          0
sdb              94.79       312.36       591.53         0.00  323944614  613462276          0
sdc              90.25       410.44       622.92         0.00  425659338  646010200          0
sdd              91.86       408.95       622.87         0.00  424106362  645963524          0
sde              76.80       327.67       596.69         0.00  339816970  618807116          0
sdf              76.57       328.19       596.76         0.00  340357386  618884876          0
sdg              91.94       408.91       622.90         0.00  424070134  645992524          0
sdh              76.70       327.85       596.75         0.00  340008362  618869276          0
sdi               0.00         0.00         0.00         0.00       2992          0          0
zd0               0.02         8.41         0.00         0.00    8725245        205          0
zd112             0.00         0.05         0.00         0.00      48745          0          0
zd128             0.00         0.01         0.00         0.00       8448          0          0
zd144             8.74         0.39       636.97         0.00     404041  660581532          0
zd16              0.00         0.15         0.00         0.00     160052          0          0
zd160             8.28      1011.72       106.41         0.00 1049233141  110355768          0
zd32              0.29       303.37         0.00         0.00  314614413          0          0
zd48              0.01         1.34         0.00         0.00    1389665        193          0
zd64              1.73      2022.24         0.00         0.00 2097207461          0          0
zd80              0.00         0.01         0.00         0.00      12072          0          0
zd96             75.33       264.64       301.14         0.00  274452938  312303788          0

Code:
root@RBT-HPRV:~# pveperf
CPU BOGOMIPS:      210960.96
REGEX/SECOND:      1124646
HD SIZE:           1410.09 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     12127.42
DNS EXT:           15.99 ms
DNS INT:           10.32 ms



What would be the next steps? I will have to reboot the server and wait for the problem to reappear.
 

The attached strace just shows that something is hanging, but I can't tell what.
I would take a look at the status of the container's monitor process. If the State value is D, it is in uninterruptible sleep, waiting for something. If so, have a look at the call stack to find out what (there are many ways to get the monitor PID; use whatever you prefer, I just like this one).
cat /proc/$(systemctl show -p MainPID --value pve-container@103.service)/status
cat /proc/$(systemctl show -p MainPID --value pve-container@103.service)/stack

Looking at the container's init process could help too. It may be stuck or waiting for something.
cat /proc/$(pgrep -P $(systemctl show -p MainPID --value pve-container@103.service))/status
cat /proc/$(pgrep -P $(systemctl show -p MainPID --value pve-container@103.service))/stack
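
To avoid scanning the whole status file by eye, the State field can be extracted on its own. A minimal sketch that reads the status text on stdin; proc_state is a made-up helper name, and its exit status is 0 only for state D:

```shell
#!/bin/sh
# proc_state: read the contents of a /proc/PID/status file on stdin,
# print the one-letter process state, and return 0 only if that state
# is D (uninterruptible sleep). Hypothetical helper name.
proc_state() {
    awk '$1 == "State:" { s = $2 } END { print s; exit (s == "D" ? 0 : 1) }'
}

# Possible usage for the monitor process of container 103:
#   pid=$(systemctl show -p MainPID --value pve-container@103.service)
#   proc_state < "/proc/$pid/status" && cat "/proc/$pid/stack"
```

With this usage the kernel stack is printed only when the process really is in state D, which is the case where the stack is most informative.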
 
Thanks a lot for the hint!

I will report back once the problem appears again. It usually takes about 2 weeks.
 
Ciao again! It happened this morning.

Here are the command outputs (in order):
Code:
Name:   lxc-start
Umask:  0022
State:  S (sleeping)
Tgid:   988145
Ngid:   0
Pid:    988145
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 512
Groups:
NStgid: 988145
NSpid:  988145
NSpgid: 988145
NSsid:  988145
Kthread:        0
VmPeak:     6668 kB
VmSize:     6652 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      3924 kB
VmRSS:      3924 kB
RssAnon:             432 kB
RssFile:            3492 kB
RssShmem:              0 kB
VmData:      284 kB
VmStk:       132 kB
VmExe:        24 kB
VmLib:      4044 kB
VmPTE:        56 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
untag_mask:     0xffffffffffffffff
Threads:        1
SigQ:   299/513939
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe77fbfab7
SigIgn: 0000000000001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Seccomp_filters:        0
Speculation_Store_Bypass:       vulnerable
SpeculationIndirectBranch:      always enabled
Cpus_allowed:   ffff,ffffffff
Cpus_allowed_list:      0-47
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        3690953
nonvoluntary_ctxt_switches:     4313
x86_Thread_features:  
x86_Thread_features_locked:


Code:
[<0>] do_epoll_wait+0x51b/0x550
[<0>] __x64_sys_epoll_wait+0x6c/0x110
[<0>] x64_sys_call+0x1415/0x2330
[<0>] do_syscall_64+0x80/0xa30
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e



Code:
Name:   systemd
Umask:  0000
State:  S (sleeping)
Tgid:   988239
Ngid:   0
Pid:    988239
PPid:   988145
TracerPid:      0
Uid:    100000  100000  100000  100000
Gid:    100000  100000  100000  100000
FDSize: 64
Groups:
NStgid: 988239  1
NSpid:  988239  1
NSpgid: 988239  1
NSsid:  988239  1
Kthread:        0
VmPeak:   237492 kB
VmSize:   172304 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      9136 kB
VmRSS:      9136 kB
RssAnon:            4324 kB
RssFile:            4812 kB
RssShmem:              0 kB
VmData:    20368 kB
VmStk:       132 kB
VmExe:        44 kB
VmLib:     12788 kB
VmPTE:       104 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
untag_mask:     0xffffffffffffffff
Threads:        1
SigQ:   4/513939
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 7fe3c0fe28014a03
SigIgn: 0000000000001000
SigCgt: 00000001000004ec
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        2
Seccomp_filters:        1
Speculation_Store_Bypass:       vulnerable
SpeculationIndirectBranch:      always enabled
Cpus_allowed:   b2a2,3ff7ff71
Cpus_allowed_list:      0,4-6,8-18,20-29,33,37,39,41,44-45,47
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        90292
nonvoluntary_ctxt_switches:     4872
x86_Thread_features:  
x86_Thread_features_locked:


Code:
[<0>] get_signal+0x39e/0x880
[<0>] arch_do_signal_or_restart+0x41/0x260
[<0>] exit_to_user_mode_loop+0x91/0x170
[<0>] do_syscall_64+0x21e/0xa30
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Afterwards I ran `pkill -ef lxc-info` and the container/VM state switched back to normal in the GUI. However, connectivity to the containers was still broken and it was impossible to attach to them.

I had to reboot the host, as usual, to get a working system again.
 
A container with a process stuck in uninterruptible sleep (often loosely called a zombie) would cause what you are experiencing. Use ps to find the process; it would be permanently in 'D' state.

The good news is that the rest of your containers work normally and you can use pct commands to control them. The bad news is that if you can't remove the process using kill -9, the only way to regain control of the node is a hard reset (it will hang on reboot otherwise).

dmesg will usually be instructive as to why the process is hanging.
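
As a rough sketch of the ps check, the following filter keeps only processes whose STAT column starts with D. The helper name d_state_only is made up, and the grep pattern in the usage comment is the usual wording of the kernel's hung task warning, which may not appear on every system:

```shell
#!/bin/sh
# d_state_only: read "pid stat wchan args..." lines on stdin and keep
# those whose STAT column (second field) starts with D, i.e. processes
# in uninterruptible sleep. Hypothetical helper name.
d_state_only() {
    awk '$2 ~ /^D/'
}

# Possible usage:
#   ps -eo pid,stat,wchan:32,args | d_state_only
#   dmesg -T | grep -i 'blocked for more than'
```

A process that shows up here once may simply be doing I/O; one that shows up on every run over several minutes is the one to investigate.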