qmp command 'query-proxmox-support' failed - got timeout

Cris70
Hi all,
I have two Proxmox servers running on very similar hardware (Dell PowerEdge R630, two Xeon E5-2697 CPUs with 14 cores each plus HT, 512 GB ECC RAM).
Both servers run PVE 7.4-18 (yes, I know, I need to upgrade).
Both servers have PBS 2.4-7 installed alongside PVE.
Both servers mount a NAS share through NFS, and both have PBS configured to write to that mount point (i.e. the datastore points to it).
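
For reference, the PBS datastore on each host simply points at the NFS mount point; the configuration looks roughly like this (the path below is a placeholder, not my real mount point):
Code:
# /etc/proxmox-backup/datastore.cfg
datastore: BlueNAS_PBS
        path /mnt/nfs-nas/pbs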

The first server works without any problems (and its PBS instance also hosts backups for another, older PVE server, which runs PVE 6.4).
The second server cannot complete a single backup of any of its 8 VMs when using PBS as the storage. When using "local" storage (even a "directory"-type storage on the same NFS mount point) it completes the backups without issues.
In fact, when using PBS it cannot even start a backup.

Here is an example backup job log (the one that is running right now):
Code:
INFO: starting new backup job: vzdump --all 1 --quiet 1 --storage AMENDED --mailnotification failure --prune-backups 'keep-daily=1,keep-last=2,keep-monthly=1,keep-weekly=1,keep-yearly=1' --notes-template '{{guestname}}' --mode snapshot --mailto log@AMENDED
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-07-17 01:00:06
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'local-lvm:vm-100-disk-0' 40G
INFO: creating Proxmox Backup Server archive 'vm/100/2024-07-16T23:00:06Z'
INFO: starting kvm to execute backup task
ERROR: VM 100 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 100 qmp command 'backup-cancel' failed - got timeout
VM 100 qmp command 'query-status' failed - got timeout
ERROR: Backup of VM 100 failed - VM 100 qmp command 'backup' failed - got timeout
INFO: Failed at 2024-07-17 01:12:19
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2024-07-17 01:12:19
INFO: status = running
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'local-lvm:vm-101-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2024-07-16T23:12:19Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 101 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 101 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-07-17 01:25:10
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2024-07-17 01:25:10
INFO: status = running
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'vmzfs:vm-102-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/102/2024-07-16T23:25:10Z'
INFO: issuing guest-agent 'fs-freeze' command
...

Meanwhile, on PBS, each backup task stays in a running state even after PVE gives up with "INFO: aborting backup job".
An example task log on PBS looks like this:
Code:
2024-07-17T01:00:17+02:00: starting new backup on datastore 'BlueNAS_PBS': "vm/100/2024-07-16T23:00:06Z"
2024-07-17T01:00:17+02:00: GET /previous: 400 Bad Request: no valid previous backup
and it keeps running forever until I click on the "Stop" button.

Even the Garbage Collection task hangs forever on PBS, and - in this case - I cannot even stop it using the Stop button.
Actually, it seems that the GC process dies quite soon after starting, but somehow PBS doesn't notice this.
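
For what it's worth, the task list can also be inspected from the PBS CLI (not only the web UI), e.g.:
Code:
proxmox-backup-manager task list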

This is what I see in /var/log/daemon.log (excerpt):
Code:
Jul 17 01:33:36 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:33:41 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:33:46 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:33:47 host0 pvestatd[3992]: status update time (21.759 seconds)
Jul 17 01:33:58 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:03 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:08 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:09 host0 pvestatd[3992]: status update time (21.820 seconds)
Jul 17 01:34:20 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:25 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:30 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:30 host0 pvestatd[3992]: status update time (21.739 seconds)
Jul 17 01:34:41 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:46 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:51 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:52 host0 pvestatd[3992]: status update time (21.759 seconds)

One final note: when PVE gives up on the backup of a VM, the VM keeps running smoothly if it was already running before.
If instead it was powered off before the backup, it stays in a fake "running" state until I issue a "qm stop" ("qm shutdown" fails with the error: VM 100 qmp command 'query-status' failed - got timeout).
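
For completeness, this is how I recover such a VM (using VM 100 from the log above as the example):
Code:
# the VM is reported as running, but its QMP monitor no longer answers
qm status 100
# "qm shutdown 100" times out as described above, so only a hard stop works
qm stop 100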

I have read lots of threads regarding problems with qmp and the "got timeout" error; however, I haven't found an exact match for my situation, and even the solved threads have rather sketchy resolutions.

Please help.
I am available to give any missing details if needed.

Thank you in advance!
Cris
 
Adding some useful information:
Code:
root@host0:~# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 5.15.158-1-pve)
pve-manager: 7.4-18 (running version: 7.4-18/b1f94095)
pve-kernel-5.15: 7.4-14
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.158-1-pve: 5.15.158-1
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.14-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.3
libpve-apiclient-perl: 3.2-2
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.7-1
proxmox-backup-file-restore: 2.4.7-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.4
pve-cluster: 7.3-3
pve-container: 4.4-7
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+3
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.10-1
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-6
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.15-pve1

root@host0:~# kvm --version
QEMU emulator version 7.2.10 (pve-qemu-kvm_7.2.10-1)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
 
So does nobody have even the slightest idea, or a suggestion for something to try in order to debug this problem?
I have tried everything I could think of, and everything I found suggested in other threads, without success so far.
I even uninstalled and reinstalled PBS completely, after cleaning up any leftovers, but it didn't help.
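
(The reinstall was just the standard package purge and reinstall, roughly as below; the leftover paths I wiped are from memory:)
Code:
apt purge proxmox-backup-server
# remove leftover configuration before reinstalling
rm -rf /etc/proxmox-backup
apt install proxmox-backup-server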
 
Trying to further debug the problem, I tried to back up a local directory using the command-line utility proxmox-backup-client.

Here is the command and the output:
Code:
root@host0:~# proxmox-backup-client backup etc_pve.pxar:/etc/pve/ --repository BlueNAS_PBS --ns ServerFiles
Starting backup: [ServerFiles]:host/host0/2024-07-18T10:43:32Z
Client name: host0
Starting backup protocol: Thu Jul 18 12:43:32 2024
No previous manifest available.
Upload directory '/etc/pve/' to 'BlueNAS_PBS' as etc_pve.pxar.didx

The command has been running for a couple of hours, which is obviously wrong, as that directory contains only a few kilobytes of data.

The PBS web GUI shows only this:
[screenshot: 1721304762861.png]

Looking at the process list, I can see that the process is in state Sl+ (interruptible sleep, multi-threaded, foreground).

I tried to look at what the process is doing with strace, here is the result:
Code:
root@host0:~# strace -p 728661
strace: Process 728661 attached
futex(0x7f8a307798d0, FUTEX_WAIT_PRIVATE, 1, NULL
^Cstrace: Process 728661 detached
 <detached ...>

Unfortunately there's no way (that I know of) to tell what that futex belongs to.
I found a systemtap script that tries to identify the nature of the futex, but unfortunately the PVE custom kernel makes it impossible to run the tool.
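
A possible alternative I may try (assuming gdb and readable /proc stacks are available on this kernel) is dumping the kernel stacks of all the client's threads, or grabbing userspace backtraces, just to see where it is blocked:
Code:
# kernel-side stacks of every thread of the stuck client
cat /proc/728661/task/*/stack

# userspace backtraces (read-only inspection, requires gdb)
gdb -p 728661 -batch -ex 'thread apply all bt'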

One last bit of information, here is the output of lsof:
Code:
root@host0:~# lsof -p 728661
COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
proxmox-b 728661 root  cwd       DIR              253,1     4096 3538945 /root
proxmox-b 728661 root  rtd       DIR              253,1     4096       2 /
proxmox-b 728661 root  txt       REG              253,1  9978448 1835682 /usr/bin/proxmox-backup-client
proxmox-b 728661 root  mem       REG              253,1    51696 1839187 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
proxmox-b 728661 root  mem       REG              253,1    18688 1839144 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
proxmox-b 728661 root  mem       REG              253,1  1321344 1839145 /usr/lib/x86_64-linux-gnu/libm-2.31.so
proxmox-b 728661 root  mem       REG              253,1   149520 1839190 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
proxmox-b 728661 root  mem       REG              253,1   100736 1837844 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
proxmox-b 728661 root  mem       REG              253,1  1901536 1839143 /usr/lib/x86_64-linux-gnu/libc-2.31.so
proxmox-b 728661 root  mem       REG              253,1   890800 1838027 /usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8
proxmox-b 728661 root  mem       REG              253,1    38936 1837760 /usr/lib/x86_64-linux-gnu/libacl.so.1.1.2253
proxmox-b 728661 root  mem       REG              253,1  3076992 1838591 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
proxmox-b 728661 root  mem       REG              253,1   597792 1838593 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
proxmox-b 728661 root  mem       REG              253,1   256024 1841634 /usr/lib/x86_64-linux-gnu/libfuse3.so.3.10.3
proxmox-b 728661 root  mem       REG              253,1   177928 1836113 /usr/lib/x86_64-linux-gnu/ld-2.31.so
proxmox-b 728661 root    0u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    1u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    2u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    3u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root    4u  a_inode               0,14        0   12720 [eventfd]
proxmox-b 728661 root    5u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root    6u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root    7u     unix 0x000000000e458ad2      0t0 7703198 type=STREAM
proxmox-b 728661 root    8u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root    9u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root   10u  a_inode               0,14        0   12720 [eventfd]
proxmox-b 728661 root   11u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root   12u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root   13u     IPv4            7708017      0t0     TCP localhost.localdomain:56932->localhost.localdomain:8007 (ESTABLISHED)
proxmox-b 728661 root   14r      DIR               0,43        0       1 /etc/pve
proxmox-b 728661 root   15r      DIR               0,43        0       9 /etc/pve/priv

Oh, one last thing: investigating possible problems related to the remote nature of my datastore, I ran a benchmark.
Here is the result, which seems more than acceptable to me:
[screenshot: Screenshot_20240718_120744.png]
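
(For reference, the benchmark was the standard client one, i.e. something along the lines of:)
Code:
proxmox-backup-client benchmark --repository BlueNAS_PBS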

Thank you in advance to anybody who can help shed some light on this problem.

Cris
 
Both servers mount a NAS share through NFS, and both have PBS configured to write to that mount point (i.e. the datastore points to it).
Hi,
so you are saying that both PBS instances access the same datastore concurrently? That is asking for trouble: PBS is not designed to allow concurrent access to the same datastore by different instances.

I think you are faced with connectivity issues between the host and the NFS storage, as you state:
Even the Garbage Collection task hangs forever on PBS, and - in this case - I cannot even stop it using the Stop button.
Actually, it seems that the GC process dies quite soon after starting, but somehow PBS doesn't notice this.
This is purely on the PBS side (no client involved). You could check the output of ps auxwf and look for processes in the uninterruptible sleep state D, which would indicate that a process is waiting on IO.
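
For example, something like this (a quick filter on the STAT column) shows only processes stuck in uninterruptible sleep:
Code:
# keep the header line and any process whose state starts with 'D'
ps auxwf | awk 'NR==1 || $8 ~ /^D/'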

Both servers run PVE 7.4-18 (yes, I know, I need to upgrade).
Both servers have PBS 2.4-7 installed alongside PVE.
You should most definitely upgrade, but in general you should reconsider the whole setup. It is recommended to use a dedicated host for PBS with fast local storage [0].
Running Proxmox VE and PBS on the same host is not recommended, as the upgrade process might cause issues. Further, if you lose the host, you cannot restore from that PBS instance (which, I think, is what you intended to work around by accessing the same datastore from different PBS instances?). Also, do not allow concurrent access to the datastore.

[0] https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
 
Hi,
so you are saying that both PBS instances access the same datastore concurrently? That is asking for trouble: PBS is not designed to allow concurrent access to the same datastore by different instances.

Hi Chris, and thank you for your reply!
No, each server accesses its NAS via NFS, but they are two different NAS units (albeit the same make and model).

I think you are faced with connectivity issues between the host and the NFS storage, as you state:

This is purely on the PBS side (no client involved). You could check the output of ps auxwf and look for processes in the uninterruptible sleep state D, which would indicate that a process is waiting on IO.

Actually, it seems that the GC process dies right away, because just a few minutes after it starts I can no longer find the process. But PBS keeps thinking the process is still there.

You should most definitely upgrade, but in general you should reconsider the whole setup. It is recommended to use a dedicated host for PBS with fast local storage [0].

Yes, I know this is not the recommended setup, but it is serving me quite well for the moment (on the first server).
I would just like to make it work on the second too, at least until I'm ready to dedicate some hardware exclusively to PBS.

Thank you!
Cris
 
No, each server accesses its NAS via NFS, but they are two different NAS units (albeit the same make and model).
Oh, good! Then at least this should not be the issue.

Actually, it seems that the GC process dies right away, because just a few minutes after it starts I can no longer find the process. But PBS keeps thinking the process is still there.
Please share an excerpt of the systemd journal covering the timespan from the start of the garbage collection until it is no longer present; journalctl --since <DATETIME> --until <DATETIME> > journal.txt allows you to dump it to a file you can add as an attachment here.
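
For example (the timestamps below are just to show the expected format):
Code:
journalctl --since "2024-07-17 00:55:00" --until "2024-07-17 01:30:00" > journal.txt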

Please also share the task log of the garbage collection job.
 
Hi Chris, I just tried once more to launch a garbage collection (normally it would be scheduled, but I've disabled it while I debug the problem).

The behavior is as usual: PBS reports that the GC has started, but if I look for "UPID" in the process list I find nothing. Note that the process ID reported in the web UI is the PID of proxmox-backup-proxy, not the real PID of the job.
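
For reference, this is roughly what I am checking (the idea being that the GC worker might run inside the proxy daemon rather than as a separate process):
Code:
# the PID shown in the web UI is the proxy's own PID
pidof proxmox-backup-proxy

# threads of the proxy, in case the worker lives inside the daemon
ps -T -p "$(pidof proxmox-backup-proxy)"

# and the usual search for a separate worker process
ps auxwf | grep -i -e UPID -e garbage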

I let the GC go on (from the perspective of PBS) for about 10 minutes, then tried to stop it with the "Stop" button in the web UI.
This has no effect, as the process does not actually exist.
The only way to "clean" the task list in PBS is to restart the proxmox-backup-proxy daemon, which leaves the task in an unknown state.
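
(The restart is just the standard systemd one:)
Code:
systemctl restart proxmox-backup-proxy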

I am attaching a few screenshots.

This is right after using the "Stop" button:
[screenshots: Screenshot_20240719_120752.png, Screenshot_20240719_120814.png]
As you can see, PBS thinks the task is still running, even after using the "Stop" button.

Here is after restarting the proxmox-backup-proxy daemon:
[screenshot: Screenshot_20240719_121050.png]
As you can see, PBS now shows the task as finished with an unknown state.
But the "End time" of the task is the timestamp of when I pressed the "Stop" button (you can see it in the first image), not the actual time when I restarted the daemon.

Meanwhile, there's no trace of anything happening in journalctl:
Code:
-- Journal begins at Wed 2023-12-20 15:19:21 CET, ends at Fri 2024-07-19 12:12:32 CEST. --
Jul 19 11:57:32 host0 pvedaemon[2828970]: <root@pam> successful auth for user 'root@pam'
Jul 19 11:58:08 host0 sshd[2903211]: Accepted password for root from IP_AMENDED port 49922 ssh2
Jul 19 11:58:08 host0 sshd[2903211]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jul 19 11:58:08 host0 systemd-logind[2203]: New session 141 of user root.
Jul 19 11:58:08 host0 systemd[1]: Started Session 141 of user root.
Jul 19 12:09:27 host0 systemd[1]: Stopping Proxmox Backup API Proxy Server...
Jul 19 12:09:27 host0 systemd[1]: proxmox-backup-proxy.service: Succeeded.
Jul 19 12:09:27 host0 systemd[1]: Stopped Proxmox Backup API Proxy Server.
Jul 19 12:09:27 host0 systemd[1]: proxmox-backup-proxy.service: Consumed 4min 751ms CPU time.
Jul 19 12:09:27 host0 systemd[1]: Starting Proxmox Backup API Proxy Server...
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: service is ready
Jul 19 12:09:27 host0 systemd[1]: Started Proxmox Backup API Proxy Server.
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: applied rrd journal (3315 entries in 0.016 seconds)
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: write rrd data back to disk
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: starting rrd data sync
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: rrd journal successfully committed (20 files in 0.012 seconds)
Jul 19 12:12:32 host0 pvedaemon[2822457]: <root@pam> successful auth for user 'root@pam'


Cris
 

Attachments

  • journal.txt (1.7 KB)
Hi @Chris and everybody else.
After my latest reply nobody answered with more suggestions or insights.
I understand this is the time of year when most of us take our vacations, so I'm not complaining.
Actually, I just got back from my holidays too, and wanted to check if anything new had happened.
I hope, however, that this problem can be investigated further.
Thank you in advance.
Cris
 
