qmp command 'query-proxmox-support' failed - got timeout

Cris70
Hi all,
I have two Proxmox servers running on very similar hardware (Dell PowerEdge R630 with two Xeon E5-2697 CPUs, 14 cores each plus HT, and 512 GB of ECC RAM).
Both servers run PVE 7.4-18 (yes, I know, I have to upgrade).
Both servers have PBS 2.4-7 installed alongside PVE.
Both servers mount a NAS share through NFS, and both have PBS configured to write to that mount point (i.e. the datastore points to that mount point).

The first server works without any problems (and its PBS instance also hosts backups for another, older PVE server, which runs PVE 6.4).
The second server cannot complete a single backup of any of its 8 VMs when using PBS as storage. When using "local" storage (even with the same NFS mount point as a "directory" type storage), it completes the backups without issues.
Actually, when using PBS it cannot even start the backups.

Here is an example backup job log (the one that is running right now):
Code:
INFO: starting new backup job: vzdump --all 1 --quiet 1 --storage AMENDED --mailnotification failure --prune-backups 'keep-daily=1,keep-last=2,keep-monthly=1,keep-weekly=1,keep-yearly=1' --notes-template '{{guestname}}' --mode snapshot --mailto log@AMENDED
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-07-17 01:00:06
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'local-lvm:vm-100-disk-0' 40G
INFO: creating Proxmox Backup Server archive 'vm/100/2024-07-16T23:00:06Z'
INFO: starting kvm to execute backup task
ERROR: VM 100 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 100 qmp command 'backup-cancel' failed - got timeout
VM 100 qmp command 'query-status' failed - got timeout
ERROR: Backup of VM 100 failed - VM 100 qmp command 'backup' failed - got timeout
INFO: Failed at 2024-07-17 01:12:19
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2024-07-17 01:12:19
INFO: status = running
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'local-lvm:vm-101-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2024-07-16T23:12:19Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 101 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 101 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-07-17 01:25:10
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2024-07-17 01:25:10
INFO: status = running
INFO: VM Name: AMENDED
INFO: include disk 'scsi0' 'vmzfs:vm-102-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/102/2024-07-16T23:25:10Z'
INFO: issuing guest-agent 'fs-freeze' command
...

Meanwhile, on PBS, each backup task stays running even after PVE gives up with "INFO: aborting backup job".
On PBS one example task log is like this:
Code:
2024-07-17T01:00:17+02:00: starting new backup on datastore 'BlueNAS_PBS': "vm/100/2024-07-16T23:00:06Z"
2024-07-17T01:00:17+02:00: GET /previous: 400 Bad Request: no valid previous backup
and it keeps running forever until I click on the "Stop" button.

Even the Garbage Collection task hangs forever on PBS, and - in this case - I cannot even stop it using the Stop button.
Actually, it seems that the GC process dies quite soon after starting, but somehow PBS doesn't notice this.

This is what I see in /var/log/daemon.log (excerpt):
Code:
Jul 17 01:33:36 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:33:41 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:33:46 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:33:47 host0 pvestatd[3992]: status update time (21.759 seconds)
Jul 17 01:33:58 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:03 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:08 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:09 host0 pvestatd[3992]: status update time (21.820 seconds)
Jul 17 01:34:20 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:25 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:30 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:30 host0 pvestatd[3992]: status update time (21.739 seconds)
Jul 17 01:34:41 host0 pvestatd[3992]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:46 host0 pvestatd[3992]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jul 17 01:34:51 host0 pvestatd[3992]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Jul 17 01:34:52 host0 pvestatd[3992]: status update time (21.759 seconds)

One final note: when PVE gives up on the backup of a VM, the VM keeps running smoothly if it was already running before.
If instead it was powered off before the backup, it stays in a fake "running" state until I issue a "qm stop" ("qm shutdown" gives an error: VM 100 qmp command 'query-status' failed - got timeout).
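In case it helps anyone hitting the same fake "running" state, this is roughly how I check whether the QEMU process is really still alive before forcing a stop (a sketch using VMID 100 as an example; the ps check is just a quick sanity check):
Code:
# PVE records the QEMU PID here when it starts the VM
cat /var/run/qemu-server/100.pid
# check the process state and what it is waiting on (STAT/WCHAN columns)
ps -p "$(cat /var/run/qemu-server/100.pid)" -o pid,stat,wchan,cmd
# if the process is gone or wedged, force-stop to clean up PVE's view of the VM
qm stop 100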

I have read lots of threads regarding problems with qmp and the "got timeout" error; however, I haven't found an exact match for my situation, and even the solved threads have quite sketchy resolutions.

Please help.
I am available to give any missing details if needed.

Thank you in advance!
Cris
 
Adding some useful information:
Code:
root@host0:~# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 5.15.158-1-pve)
pve-manager: 7.4-18 (running version: 7.4-18/b1f94095)
pve-kernel-5.15: 7.4-14
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.158-1-pve: 5.15.158-1
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph-fuse: 15.2.14-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.3
libpve-apiclient-perl: 3.2-2
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.7-1
proxmox-backup-file-restore: 2.4.7-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.4
pve-cluster: 7.3-3
pve-container: 4.4-7
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+3
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.10-1
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-6
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.15-pve1

root@host0:~# kvm --version
QEMU emulator version 7.2.10 (pve-qemu-kvm_7.2.10-1)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
 
So nobody really has the slightest idea, or the slightest suggestion for something to try, to debug this problem?
I have tried everything I could think of, and everything I found suggested in other threads, without success so far.
I even uninstalled and reinstalled PBS completely, after totally cleaning up any leftovers, but it didn't help.
 
Trying to further debug the problem, I tried to back up a local directory using the command-line utility proxmox-backup-client.

Here is the command and the output:
Code:
root@host0:~# proxmox-backup-client backup etc_pve.pxar:/etc/pve/ --repository BlueNAS_PBS --ns ServerFiles
Starting backup: [ServerFiles]:host/host0/2024-07-18T10:43:32Z
Client name: host0
Starting backup protocol: Thu Jul 18 12:43:32 2024
No previous manifest available.
Upload directory '/etc/pve/' to 'BlueNAS_PBS' as etc_pve.pxar.didx

The command has been running for a couple of hours, which is obviously wrong, as that directory contains only a few kilobytes of data.

The PBS web GUI shows only this (see the attached screenshot).

Looking at the processes, I can see that the client is in state Sl+ (interruptible sleep, multi-threaded, foreground).

I tried to look at what the process is doing with strace, here is the result:
Code:
root@host0:~# strace -p 728661
strace: Process 728661 attached
futex(0x7f8a307798d0, FUTEX_WAIT_PRIVATE, 1, NULL
^Cstrace: Process 728661 detached
 <detached ...>

Unfortunately, there's no way (that I know of) to tell what that futex belongs to.
I found a SystemTap script that tries to identify the nature of the futex, but unfortunately the PVE custom kernel makes it impossible to run the tool.
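As a cruder alternative that does work on the PVE kernel, the kernel-side stack of every thread can be dumped from /proc (a sketch, reusing the PID from above; requires root):
Code:
# kernel stacks of all threads of the stuck client
for t in /proc/728661/task/*/stack; do echo "== $t"; cat "$t"; done
# or ask the kernel to log all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger && dmesg | tail -n 50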

One last bit of information, here is the output of lsof:
Code:
root@host0:~# lsof -p 728661
COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
proxmox-b 728661 root  cwd       DIR              253,1     4096 3538945 /root
proxmox-b 728661 root  rtd       DIR              253,1     4096       2 /
proxmox-b 728661 root  txt       REG              253,1  9978448 1835682 /usr/bin/proxmox-backup-client
proxmox-b 728661 root  mem       REG              253,1    51696 1839187 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
proxmox-b 728661 root  mem       REG              253,1    18688 1839144 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
proxmox-b 728661 root  mem       REG              253,1  1321344 1839145 /usr/lib/x86_64-linux-gnu/libm-2.31.so
proxmox-b 728661 root  mem       REG              253,1   149520 1839190 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
proxmox-b 728661 root  mem       REG              253,1   100736 1837844 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
proxmox-b 728661 root  mem       REG              253,1  1901536 1839143 /usr/lib/x86_64-linux-gnu/libc-2.31.so
proxmox-b 728661 root  mem       REG              253,1   890800 1838027 /usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8
proxmox-b 728661 root  mem       REG              253,1    38936 1837760 /usr/lib/x86_64-linux-gnu/libacl.so.1.1.2253
proxmox-b 728661 root  mem       REG              253,1  3076992 1838591 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
proxmox-b 728661 root  mem       REG              253,1   597792 1838593 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
proxmox-b 728661 root  mem       REG              253,1   256024 1841634 /usr/lib/x86_64-linux-gnu/libfuse3.so.3.10.3
proxmox-b 728661 root  mem       REG              253,1   177928 1836113 /usr/lib/x86_64-linux-gnu/ld-2.31.so
proxmox-b 728661 root    0u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    1u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    2u      CHR              136,2      0t0       5 /dev/pts/2
proxmox-b 728661 root    3u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root    4u  a_inode               0,14        0   12720 [eventfd]
proxmox-b 728661 root    5u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root    6u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root    7u     unix 0x000000000e458ad2      0t0 7703198 type=STREAM
proxmox-b 728661 root    8u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root    9u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root   10u  a_inode               0,14        0   12720 [eventfd]
proxmox-b 728661 root   11u  a_inode               0,14        0   12720 [eventpoll]
proxmox-b 728661 root   12u     unix 0x00000000ef865036      0t0 7703197 type=STREAM
proxmox-b 728661 root   13u     IPv4            7708017      0t0     TCP localhost.localdomain:56932->localhost.localdomain:8007 (ESTABLISHED)
proxmox-b 728661 root   14r      DIR               0,43        0       1 /etc/pve
proxmox-b 728661 root   15r      DIR               0,43        0       9 /etc/pve/priv

Oh, one last thing: investigating possible problems with the remote nature of my datastore, I ran a benchmark.
Here is the result, which seems more than acceptable to me (see the attached screenshot).

Thank you in advance to anybody who can help shed some light on this problem

Cris
 
Both servers mount a NAS share through NFS, and both have PBS configured to write to that mount point (i.e. the datastore points to that mount point).
Hi,
so you are saying that both PBS instances access the same datastore concurrently? This is asking for trouble; the PBS server is not designed to allow concurrent access to the same datastore by different instances.

I think you are faced with connectivity issues between the host and the NFS storage, as you state:
Even the Garbage Collection task hangs forever on PBS, and - in this case - I cannot even stop it using the Stop button.
Actually, it seems that the GC process dies quite soon after starting, but somehow PBS doesn't notice this.
This is purely on the PBS side (no client involved). You could check the output of ps auxwf and look for processes in the uninterruptible sleep state D, which would indicate that a process is waiting on IO.
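For example, something like this should list only the processes currently in state D (a sketch; the state is the 8th column of ps aux output):
Code:
ps auxwf | awk 'NR==1 || $8 ~ /^D/'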

Both servers run PVE 7.4-18 (yes, I know, I have to upgrade).
Both servers have PBS 2.4-7 installed alongside PVE.
You should most definitely upgrade, but in general you should reconsider the whole setup. It is recommended to use a dedicated host for PBS with fast local storage [0].
Running Proxmox VE and PBS on the same host is not recommended, as the upgrade process might cause issues. Further, if you lose the host, you cannot restore from that PBS instance (which I think you intended to work around by accessing the same datastore from different PBS instances?). And do not allow concurrent access to the datastore.

[0] https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
 
Hi,
so you are saying that both PBS instances access the same datastore concurrently? This is asking for trouble; the PBS server is not designed to allow concurrent access to the same datastore by different instances.

Hi Chris, and thank you for your reply!
No, each server is accessing a NAS via NFS, but they're two different NAS units (although of the same make and model).

I think you are faced with connectivity issues between the host and the NFS storage, as you state:

This is purely on the PBS side (no client involved). You could check the output of ps auxwf and look for processes in the uninterruptible sleep state D, which would indicate that a process is waiting on IO.

Actually, it seems that the GC process dies right away, because just a few minutes after it starts I cannot find the process any more. But PBS keeps thinking that the process is still there.

You should most definitely upgrade, but in general you should reconsider the whole setup. It is recommended to use a dedicated host for PBS with fast local storage [0].

Yes, I know this is not the recommended setup, but it is serving me quite well for the moment (on the first server).
I would just like to make it work on the second one too, at least until I'm ready to dedicate some hardware exclusively to PBS.

Thank you!
Cris
 
No, each server is accessing a NAS via NFS, but they're two different NAS units (although of the same make and model).
Oh, good! Then at least this should not be the issue.

Actually, it seems that the GC process dies right away, because just a few minutes after it starts I cannot find the process any more. But PBS keeps thinking that the process is still there.
Please share an excerpt of the systemd journal covering the timespan from the start of the garbage collection until the process is no longer present; journalctl --since <DATETIME> --until <DATETIME> > journal.txt allows you to dump it to a file you can add as an attachment here.
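For example (the timestamps are placeholders, adapt them to the GC run):
Code:
journalctl --since "2024-07-19 11:55:00" --until "2024-07-19 12:15:00" > journal.txt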

Please also share the task log of the garbage collection job.
 
Hi Chris, I just tried once more to launch a garbage collection (normally it would be scheduled, but I've disabled it while I debug the problem).

The behavior is as usual: PBS reports that the GC has started, but if I look for "UPID" in the process list I find nothing. Note that the process ID reported in the web UI is the PID of proxmox-backup-proxy, not the real PID of the job.
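For what it's worth, PBS's own view of its tasks can also be queried from the CLI (a sketch; the UPID placeholder stands for whatever the web UI reports):
Code:
# list tasks as PBS sees them
proxmox-backup-manager task list
# follow the log of a specific task by its UPID
proxmox-backup-manager task log 'UPID:host0:...'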

I let the GC run (from the perspective of PBS) for about 10 minutes, then tried to stop it with the "Stop" button in the web UI.
This has no effect, as the process does not actually exist.
The only way to "clean" the task list in PBS is to restart the proxmox-backup-proxy daemon, which leaves the task in an unknown state.

I am attaching a few screenshots.

This is right after using the "Stop" button (see the first two attached screenshots).
As you can see, PBS thinks the task is still running, even after using the "Stop" button.

Here is after restarting the proxmox-backup-proxy daemon (see the attached screenshot).
As you can see, PBS now sees the task as finished with an unknown state.
But the "End time" of the task is the timestamp of when I pressed the "Stop" button (you can see it in the first screenshot), not the real timestamp of when I restarted the daemon.

Meanwhile, there's no trace of anything happening in journalctl:
Code:
-- Journal begins at Wed 2023-12-20 15:19:21 CET, ends at Fri 2024-07-19 12:12:32 CEST. --
Jul 19 11:57:32 host0 pvedaemon[2828970]: <root@pam> successful auth for user 'root@pam'
Jul 19 11:58:08 host0 sshd[2903211]: Accepted password for root from IP_AMENDED port 49922 ssh2
Jul 19 11:58:08 host0 sshd[2903211]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jul 19 11:58:08 host0 systemd-logind[2203]: New session 141 of user root.
Jul 19 11:58:08 host0 systemd[1]: Started Session 141 of user root.
Jul 19 12:09:27 host0 systemd[1]: Stopping Proxmox Backup API Proxy Server...
Jul 19 12:09:27 host0 systemd[1]: proxmox-backup-proxy.service: Succeeded.
Jul 19 12:09:27 host0 systemd[1]: Stopped Proxmox Backup API Proxy Server.
Jul 19 12:09:27 host0 systemd[1]: proxmox-backup-proxy.service: Consumed 4min 751ms CPU time.
Jul 19 12:09:27 host0 systemd[1]: Starting Proxmox Backup API Proxy Server...
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: service is ready
Jul 19 12:09:27 host0 systemd[1]: Started Proxmox Backup API Proxy Server.
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: applied rrd journal (3315 entries in 0.016 seconds)
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: write rrd data back to disk
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: starting rrd data sync
Jul 19 12:09:27 host0 proxmox-backup-[2932246]: host0 proxmox-backup-proxy[2932246]: rrd journal successfully committed (20 files in 0.012 seconds)
Jul 19 12:12:32 host0 pvedaemon[2822457]: <root@pam> successful auth for user 'root@pam'


Cris
 


Hi @Chris and everybody else.
After my latest reply, nobody answered with more suggestions or insights.
I understand this is the time of year when most of us take our vacations, so I'm not complaining.
Actually, I just got back from my holidays too, and wanted to check whether anything new had happened.
I hope, however, that this problem can be further investigated.
Thank you in advance.
Cris
 
Hi @Chris and everybody else.
After my latest reply, nobody answered with more suggestions or insights.
I understand this is the time of year when most of us take our vacations, so I'm not complaining.
Actually, I just got back from my holidays too, and wanted to check whether anything new had happened.
I hope, however, that this problem can be further investigated.
Thank you in advance.
Cris
As you are running versions which are EOL, did you already perform the upgrade to the latest versions for both the PBS and PVE hosts, as suggested?
 
Hi @Chris, at last I could afford some time to take the server offline and upgrade both Proxmox VE and PBS.
I'm now on PVE 8.2.5 and PBS 3.2-7.
Unfortunately, that did not make any difference.

Note that, in the meantime, I configured PVE to do plain backups to a directory, which is in fact the mount point of the same NFS storage that PBS would use. This setup works without issues, but obviously without all the niceties of PBS.
You can see the two storages and the backup job configuration in the attached screenshots.

Today I tried once more to back up to PBS. I did not enable the backup job: instead, I simply selected it and pushed the "Run now" button.
It had the same outcome as before:
Code:
INFO: starting new backup job: vzdump --mailto log@AMENDED --storage BaR_PBS_Bluenas --prune-backups 'keep-daily=1,keep-last=2,keep-monthly=1,keep-weekly=1,keep-yearly=1' --bwlimit 20480 --mailnotification failure --mode snapshot --notes-template '{{guestname}}' --node host0 --fleecing '1,storage=vmzfs' --all 1 --exclude 100,999
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2024-09-18 14:17:50
INFO: status = running
INFO: VM Name: ParcelsFinder
INFO: include disk 'scsi0' 'local-lvm:vm-101-disk-0' 200G
INFO: backup mode: snapshot
INFO: bandwidth limit: 20480 KB/s
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2024-09-18T12:17:50Z'
INFO: drive-scsi0: attaching fleecing image vmzfs:vm-101-fleece-0 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 101 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 101 qmp command 'backup-cancel' failed - got timeout
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - got timeout
INFO: Failed at 2024-09-18 14:30:43
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2024-09-18 14:30:43
INFO: status = running
INFO: VM Name: DlgParcelsFinder
INFO: include disk 'scsi0' 'vmzfs:vm-102-disk-0' 200G
INFO: backup mode: snapshot
INFO: bandwidth limit: 20480 KB/s
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/102/2024-09-18T12:30:43Z'
INFO: drive-scsi0: attaching fleecing image vmzfs:vm-102-fleece-0 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 102 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: interrupted by signal
INFO: resuming VM again

As you can see, I get the usual timeouts.
After the second failure I stopped the backup job.

One thing that I may have overlooked in my past posts: the involved VMs become "sluggish" after the failed backup job is interrupted.
Neither the server nor the VMs show abnormal load, but the VMs that I tried to back up respond slowly, or not at all, to the GUI commands.
E.g. browsing to the summary of one of the involved VMs takes a few seconds to show the statistics, while the other VMs show theirs instantly; trying to connect to the console fails.
I was still able to SSH into the VMs, so I connected and issued a shutdown. Well, they never completed the shutdown, but I was not able to see what was actually happening, because SSH would no longer connect (obviously) and the console was non-functional.
A shutdown from the PVE GUI did not work either. In the end, I had to issue a STOP from the PVE GUI. After restarting, the VMs behaved normally again.

Syslog has been showing messages like these ever since the failed backups, until I restarted the VMs:
Code:
host0 pvestatd[1529357]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries

I also tried to issue some of the qmp commands that were timing out from the CLI of the server (using socat).
I had no problems issuing query-proxmox-support on one of the other VMs, but on the involved VMs no qmp command returned anything until I finally restarted them (although the qmp "greeting" did show as soon as I connected).
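For reference, this is roughly the manual procedure (a sketch; QMP requires the capabilities handshake before it accepts commands, and only one client can hold the socket at a time):
Code:
# connect to the QMP socket of VM 101 (Proxmox default path)
socat - UNIX-CONNECT:/var/run/qemu-server/101.qmp
# after the greeting arrives, type:
{"execute": "qmp_capabilities"}
{"execute": "query-proxmox-support"}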

One last thing: I noticed just now that if I connect to the qmp socket with socat and leave the connection open without doing anything, the involved VM becomes sluggish as described above. I cannot even connect to the console. As soon as I disconnect from the socket, the VM becomes normally available again.

I hope that this can shed some light on the problem... I have seen many other threads with similar problems in this forum, mostly without a solution. Where there was a solution, it did not apply to my server.

Cris

P.S. Meanwhile, the other server keeps working without any problems, backing up to PBS. It is still running PVE 7.4-18 and PBS 2.4-7.
 
Hi @Chris, at last I could afford some time to take the server offline and upgrade both Proxmox VE and PBS.
I'm now on PVE 8.2.5 and PBS 3.2-7.
Unfortunately, that did not make any difference.
Did you reboot the servers after the upgrade? Did you perform a full shutdown or migration of the VMs, so that the newly installed QEMU version is used? Please post the output of pveversion -v and proxmox-backup-manager version --verbose again, to get the full picture.

As you can see, I get the usual timeouts.
After the second failure I stopped the backup job.

One thing that I may have overlooked in my past posts: the involved VMs become "sluggish" after the failed backup job is interrupted.
Neither the server nor the VMs show abnormal load, but the VMs that I tried to back up respond slowly, or not at all, to the GUI commands.
E.g. browsing to the summary of one of the involved VMs takes a few seconds to show the statistics, while the other VMs show theirs instantly; trying to connect to the console fails.
I was still able to SSH into the VMs, so I connected and issued a shutdown. Well, they never completed the shutdown, but I was not able to see what was actually happening, because SSH would no longer connect (obviously) and the console was non-functional.
A shutdown from the PVE GUI did not work either. In the end, I had to issue a STOP from the PVE GUI. After restarting, the VMs behaved normally again.
What is running inside the VM? Is the qemu-guest-agent installed and running? Please also post the qm config 101 --current and qm status 101 --verbose output right after the failed backup run.

Further, you might want to debug by installing the QEMU debug symbols and attaching via gdb, as described in this post by @fiona: https://forum.proxmox.com/threads/e...roxmox-qmp-command-failed.139892/#post-633742
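Roughly, that amounts to something like the following (a sketch; the dbgsym package name may differ depending on your repositories):
Code:
# install the QEMU debug symbols and gdb
apt install pve-qemu-kvm-dbgsym gdb
# dump backtraces of all threads of the hanging VM process (VMID 101 as example)
gdb --batch -ex 'thread apply all backtrace' -p "$(cat /var/run/qemu-server/101.pid)" > gdb-vm101.txt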
 
Hi @Chris ,
sorry for the late reply. As usual, my job keeps me busy enough that, unfortunately, I do not have a lot of time for debugging.

Did you reboot the servers after the upgrade? Did you perform a full shutdown or migration of the VMs, so that the newly installed QEMU version is used?
Yes, I did reboot the server. And I did a full shutdown of all the VMs before the upgrade.
Please post the output of pveversion -v and proxmox-backup-manager version --verbose again, to get the full picture.
Tonight I did another test with another VM. I saw in another thread (the one that you mentioned) that one user had problems with disks bigger than 100 GB, so I chose another VM for the tests (the previous one had a 200 GB disk).
Admittedly, his problems were caused by using CIFS to map the backup storage, and he solved them by switching to NFS.
I have always been using NFS, so that should not apply to me... but since I had a much smaller VM available (32 GB disk), I decided it was best to try with that one.

So, you can find attached the output of the two commands for VM 105.

What is running inside the VM? Is the qemu-guest-agent installed and running? Please also post the qm config 101 --current and qm status 101 --verbose output right after the failed backup run.
This VM is running FreeBSD 14.0. The previous one (VM 101) was running Ubuntu 20.04.6 LTS.
I confirm that qemu-guest-agent is installed and running.

I do not have any Windows VMs, which seem to be the cause in some other reports similar to mine.

Attached, you can find the output of the two commands right after the failure of the backup (that is, after the PVE task reports "ERROR: VM 105 qmp command 'backup' failed - got timeout").
Further, you might want to debug by installing the QEMU debug symbols and attaching via gdb, as described in this post by @fiona: https://forum.proxmox.com/threads/e...roxmox-qmp-command-failed.139892/#post-633742
This is what I'm going to do as soon as I can. But if you have any insights before I try that, I'd like to hear from you.

Thank you in advance
Cris
 


Hi,
unfortunately the debug output is still missing the debug symbols related to proxmox-backup; please also install the libproxmox-backup-qemu0-dbgsym package (apt install libproxmox-backup-qemu0-dbgsym) and try again. Thank you.
 
Thanks for the logs; a quick glance unfortunately does not point to anything obvious. May I ask you to try to set up a fleecing device for this backup job, and see if the issue also appears with the fleecing storage while backing up to PBS? I guess the issue is rather located on the PBS/NFS side after all... Please also share the mount parameters you are using on the PBS side to mount the NFS share.
 
Thank you @Chris once again for your reply.
I had already added fleecing when I upgraded; the last tests that I made for you already included it.
These are the mount parameters:
Code:
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.55.252,mountvers=3,mountport=45914,port=42049,local_lock=none
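In case it is useful, I can also time a synchronous write directly against that mount to check the raw NFS write path (a sketch; /mnt/pve/BlueNAS is just a guess at the mount point, and oflag=direct bypasses the page cache):
Code:
time dd if=/dev/zero of=/mnt/pve/BlueNAS/ddtest bs=1M count=100 oflag=direct
rm /mnt/pve/BlueNAS/ddtest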
 
Can you also check the output of grep 'State' /proc/$(cat /var/run/qemu-server/105.pid)/task/*/status, adapting the VMID accordingly? You mentioned the issues with garbage collection before; is that solved after the upgrade, or does that issue persist as well?
 
Hi @Chris, when should I issue that command? Right after a backup failure?
Regarding the garbage collection: I never re-enabled it, but I can try and see what happens.
 
