VM hung during backup caused backup failure. (backup: 3.0.3-1 pvetest)

scyto

Logging this here (or does the team prefer bug reports for test items?)

Edit TL;DR: this is a hung guest VM. No idea why it hung; nothing in dmesg etc. Happy to try and figure it out. Looking at the log lower down, it seems the freeze command was issued and processed.

Then the thaw command was issued, but was not processed.

Code:
INFO: attaching TPM drive to QEMU for backup
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 111 qmp command 'guest-fsfreeze-thaw' failed - got timeout
ERROR: VM 111 qmp command 'backup' failed - got timeout

I assume something hung in the VM or at a KVM/QEMU level.

I cleared this by stopping the node.

----original post---
I thought disabling the pvetest deb would let me go back to previous versions, but I guess not. What's the apt command to revert to the previous version (or do I just uninstall and re-install the backup client and server packages with the pvetest deb disabled)?

Code:
INFO: status = running
INFO: VM Name: docker01
VM 111 qmp command 'query-status' failed - unable to connect to VM 111 qmp socket - timeout after 51 retries
ERROR: Backup of VM 111 failed - VM 111 qmp command 'query-version' failed - unable to connect to VM 111 qmp socket - timeout after 51 retries
INFO: Failed at 2023-10-03 14:01:17
INFO: Backup job finished with errors

Interestingly, everything was fine until 6am this morning; before that it was backing up fine. Now it has a weird stalled 1B thing...


no issue with the backup of another VM on the same node to the same pbs:


Ahh, this seems to be the first backup that failed, and it has more details:

Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: Backup started at 2023-10-03 06:01:41
INFO: status = running
INFO: VM Name: docker01
INFO: include disk 'scsi0' 'local-lvm:vm-111-disk-2' 127G
INFO: include disk 'scsi1' 'local-lvm:vm-111-disk-3' 128G
INFO: include disk 'efidisk0' 'local-lvm:vm-111-disk-0' 4M
INFO: include disk 'tpmstate0' 'local-lvm:vm-111-disk-1' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/111/2023-10-03T13:01:41Z'
INFO: attaching TPM drive to QEMU for backup
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 111 qmp command 'guest-fsfreeze-thaw' failed - got timeout
ERROR: VM 111 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 111 qmp command 'backup-cancel' failed - unable to connect to VM 111 qmp socket - timeout after 5986 retries
INFO: resuming VM again
ERROR: Backup of VM 111 failed - VM 111 qmp command 'cont' failed - unable to connect to VM 111 qmp socket - timeout after 450 retries

Package list
Code:
proxmox-ve: 8.0.2 (running kernel: 6.2.16-15-pve)
pve-manager: 8.0.6 (running version: 8.0.6/57490ff2c6a38448)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2: 6.2.16-15
proxmox-kernel-6.2.16-14-pve: 6.2.16-14
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 18.2.0-pve2
ceph-fuse: 18.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx5
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.26-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.9
libpve-guest-common-perl: 5.0.5
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.6
libpve-storage-perl: 8.0.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.3-1
proxmox-backup-file-restore: 3.0.3-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.8
pve-cluster: 8.0.4
pve-container: 5.0.5
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-6
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1
 
Hi,
is the VM responsive at all after the error or completely stuck?

Please share the VM configuration qm config <ID> --current and the output of qm status <ID> --verbose.

To debug the issue further, you can use apt install pve-qemu-kvm-dbgsym libproxmox-backup-qemu0-dbgsym gdb to install the relevant debug symbols and debugger. The next time it freezes or if it is still frozen, you can run
Code:
gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/<ID>.pid)
replacing <ID> with the ID of the stuck VM
 
Is the VM responsive at all after the error or completely stuck?
No, it couldn't be pinged and I couldn't SSH in.
It showed 49.9% CPU usage in the UI and as running.
The QEMU guest agent had stopped reporting the IP addresses to the UI.

So yes, completely hung.

I stopped the VM so I could move on; if it happens again I will absolutely run those debug commands.
I could only find one example in Bugzilla, and it seems no one knew what caused it...

Code:
root@pve1:~# qm config 111 --current
agent: 1
balloon: 2048
bios: ovmf
boot: order=scsi0
cores: 2
cpu: host
efidisk0: local-lvm:vm-111-disk-0,efitype=4m,size=4M
ide2: none,media=cdrom
machine: q35
memory: 3096
meta: creation-qemu=8.0.2,ctime=1695945271
name: docker01
net0: virtio=02:FF:44:DC:54:06,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-111-disk-2,cache=writethrough,discard=on,iothread=1,size=127G,ssd=1
scsi1: local-lvm:vm-111-disk-3,cache=writethrough,discard=on,iothread=1,size=128G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=62ec1712-c13d-4bcb-96b5-ac00510c821f
sockets: 1
tpmstate0: local-lvm:vm-111-disk-1,size=4M,version=v2.0
vmgenid: a3ac613e-fe75-4773-a42d-15b8a1fcb239


Code:
root@pve1:~# qm status 111 --verbose
balloon: 3246391296
balloon_min: 2147483648
ballooninfo:
        actual: 3246391296
        free_mem: 148815872
        last_update: 1696437246
        major_page_faults: 6521
        max_mem: 3246391296
        mem_swapped_in: 20480
        mem_swapped_out: 2752512
        minor_page_faults: 205612560
        total_mem: 3038420992
blockstat:
        efidisk0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 0
                flush_total_time_ns: 0
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 0
                rd_merged: 0
                rd_operations: 0
                rd_total_time_ns: 0
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 0
                wr_highest_offset: 47104
                wr_merged: 0
                wr_operations: 0
                wr_total_time_ns: 0
        ide2:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 0
                flush_total_time_ns: 0
                idle_time_ns: 59460854811537
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 446
                rd_merged: 0
                rd_operations: 21
                rd_total_time_ns: 40977
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 0
                wr_highest_offset: 0
                wr_merged: 0
                wr_operations: 0
                wr_total_time_ns: 0
        pflash0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 0
                flush_total_time_ns: 0
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 0
                rd_merged: 0
                rd_operations: 0
                rd_total_time_ns: 0
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 0
                wr_highest_offset: 0
                wr_merged: 0
                wr_operations: 0
                wr_total_time_ns: 0
        scsi0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 0
                flush_total_time_ns: 0
                idle_time_ns: 1036585435
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 1001842688
                rd_merged: 0
                rd_operations: 17228
                rd_total_time_ns: 6923150994
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 1654521856
                wr_highest_offset: 135476006912
                wr_merged: 0
                wr_operations: 133307
                wr_total_time_ns: 1450825833581
        scsi1:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 0
                flush_total_time_ns: 0
                idle_time_ns: 1497834655
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 390184448
                rd_merged: 0
                rd_operations: 48941
                rd_total_time_ns: 27673289312
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 2005151744
                wr_highest_offset: 134224343040
                wr_merged: 0
                wr_operations: 192873
                wr_total_time_ns: 709038576991
cpus: 2
disk: 0
diskread: 1392027582
diskwrite: 3659673600
freemem: 148815872
maxdisk: 136365211648
maxmem: 3246391296
mem: 2889605120
name: docker01
netin: 4752126353
netout: 2533316986
nics:
        tap111i0:
                netin: 4752126353
                netout: 2533316986
pid: 632638
proxmox-support:
        backup-max-workers: 1
        pbs-dirty-bitmap: 1
        pbs-dirty-bitmap-migration: 1
        pbs-dirty-bitmap-savevm: 1
        pbs-library-version: 1.4.0 (UNKNOWN)
        pbs-masterkey: 1
        query-bitmap-info: 1
qmpstatus: running
running-machine: pc-q35-8.0+pve0
running-qemu: 8.0.2
shares: 1000
status: running
uptime: 59474
vmid: 111
 
What happens between the freeze and thaw commands?

If there is a way for me to issue:
  • the freeze command
  • other commands that happen before the thaw command
  • the thaw command
I am happy to run those a few hundred times (if you can supply a looping script) to see if we can force a hang.
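For reference, something like this sketch is what I had in mind (untested; it assumes the qm agent subcommands fsfreeze-freeze and fsfreeze-thaw, and the command prefix is passed in as arguments so it can be dry-run with a stub first):

```shell
# Untested sketch: exercise one guest-agent freeze/thaw cycle.
# The command prefix ("qm agent 111" on the host) is passed as arguments,
# so it can be swapped for a harmless stub when trying the logic out.
fsfreeze_cycle() {
    # Issue one freeze followed immediately by one thaw.
    "$@" fsfreeze-freeze && "$@" fsfreeze-thaw
}

# On the PVE host (VM 111 from this thread), something like:
# for i in $(seq 1 200); do
#     fsfreeze_cycle qm agent 111 || { echo "hung on iteration $i"; break; }
# done
```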
 
This bug

https://gitlab.com/qemu-project/qemu/-/issues/520

and the one in the Proxmox Bugzilla both mention drives mounted via fstab (external drives, /tmp, loopbacks, etc.). The only thing unusual about my VM is that I am running Gluster in the VM, and that is mounted in fstab...

(I am stabbing in the dark here.)

This is my fstab:

Code:
alex@Docker01:~$ cat /etc/fstab

# / was on /dev/sda2 during installation
UUID=11b82f08-2174-43a2-bc29-072ed1ed73d7 /               ext4    errors=remount-ro 0       1

# /boot/efi was on /dev/sda1 during installation
UUID=E818-53C2  /boot/efi       vfat    umask=0077      0       1

# swap was on /dev/sda3 during installation
UUID=5e985494-eeed-4de6-9f2c-e22cba753664 none            swap    sw              0       0
/dev/sr0        /media/cdrom0   udf,iso9660 user,noauto     0       0

#mount glusterfs drive
UUID=c78ed594-ef62-445d-9486-10938f49b603 /mnt/glusterfs xfs  defaults  0  0

#mount glusterfs volume
localhost:/gluster-vol1  /mnt/gluster-vol1 glusterfs  defaults,_netdev,noauto,x-systemd.automount        0 0
 
Disks, FWIW:


Code:
root@pve1:~# qm guest cmd 111 get-fsinfo
[
   {
      "disk" : [
         {
            "bus" : 0,
            "bus-type" : "scsi",
            "dev" : "/dev/sda2",
            "pci-controller" : {
               "bus" : 9,
               "domain" : 0,
               "function" : 0,
               "slot" : 1
            },
            "serial" : "0QEMU_QEMU_HARDDISK_drive-scsi0",
            "target" : 0,
            "unit" : 0
         }
      ],
      "mountpoint" : "/var/lib/docker/plugins/cdb8c50f2ecb8df209503537c5bc7354150b1a96409ed8975ce5667fbebf2c56/propagated-mount",
      "name" : "sda2",
      "total-bytes" : 125329215488,
      "type" : "ext4",
      "used-bytes" : 20733399040
   },
   {
      "disk" : [
         {
            "bus" : 0,
            "bus-type" : "scsi",
            "dev" : "/dev/sdb1",
            "pci-controller" : {
               "bus" : 9,
               "domain" : 0,
               "function" : 0,
               "slot" : 2
            },
            "serial" : "0QEMU_QEMU_HARDDISK_drive-scsi1",
            "target" : 0,
            "unit" : 1
         }
      ],
      "mountpoint" : "/mnt/glusterfs",
      "name" : "sdb1",
      "total-bytes" : 137411674112,
      "type" : "xfs",
      "used-bytes" : 3011784704
   },
   {
      "disk" : [
         {
            "bus" : 0,
            "bus-type" : "scsi",
            "dev" : "/dev/sda1",
            "pci-controller" : {
               "bus" : 9,
               "domain" : 0,
               "function" : 0,
               "slot" : 1
            },
            "serial" : "0QEMU_QEMU_HARDDISK_drive-scsi0",
            "target" : 0,
            "unit" : 0
         }
      ],
      "mountpoint" : "/boot/efi",
      "name" : "sda1",
      "total-bytes" : 535801856,
      "type" : "vfat",
      "used-bytes" : 6070272
   },
   {
      "disk" : [
         {
            "bus" : 0,
            "bus-type" : "scsi",
            "dev" : "/dev/sda2",
            "pci-controller" : {
               "bus" : 9,
               "domain" : 0,
               "function" : 0,
               "slot" : 1
            },
            "serial" : "0QEMU_QEMU_HARDDISK_drive-scsi0",
            "target" : 0,
            "unit" : 0
         }
      ],
      "mountpoint" : "/",
      "name" : "sda2",
      "total-bytes" : 125329215488,
      "type" : "ext4",
      "used-bytes" : 20733399040
   }
]
 
What happens between the freeze and thaw commands?

If there is a way for me to issue:
  • the freeze command
  • other commands that happen before the thaw command
  • the thaw command
I am happy to run those a few hundred times (if you can supply a looping script) to see if we can force a hang.
The command that is run between freeze and thaw is starting the backup. So you can just run PBS backups in a loop. Since there is dirty bitmap tracking of the drives, that should be fast. Alternatively, you can just cancel the backup task after the thaw each time.

How much load was there on your PBS server and network connection to it around the time the issue happened?

If you can reproduce it, the output of qm status <ID> --verbose and debugger traces with GDB while it's stuck would be interesting.
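A minimal loop along those lines could look like this (sketch only; the actual backup invocation is passed in as a placeholder, so the loop itself can first be tried with a harmless stub):

```shell
# Sketch of a backup stress loop: run the given command repeatedly and
# stop at the first failure. The backup command is passed as arguments
# so a stub can stand in for the real vzdump call while testing.
stress_backup() {
    local runs=$1; shift
    local i
    for i in $(seq 1 "$runs"); do
        if ! "$@"; then
            echo "failed on run $i" >&2
            return 1
        fi
    done
    echo "completed $runs runs"
}

# On the PVE host, e.g. (placeholder storage name):
# stress_backup 200 vzdump 111 --storage <your-pbs-storage> --mode snapshot
```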
 
