VM crashed on unreachable backup device

Hi all,

we are backing up our VMs to an NFS storage. Recently, this backup target ran full and was consequently unreachable for some time:

Code:
Feb  6 16:27:33 kernel: [5717717.340258] nfs: server x.x.x.x not responding, still trying
Feb  6 16:27:33 kernel: [5717717.340268] nfs: server x.x.x.x not responding, still trying
Feb  6 16:27:33 kernel: [5717717.340289] nfs: server x.x.x.x OK
Feb  6 16:27:33 kernel: [5717717.340292] nfs: server x.x.x.x not responding, still trying
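
In hindsight, we could have caught this earlier by checking the fill level of the target from the PVE host. A minimal check (the mount point is assumed from the default /mnt/pve/<storage-id> layout; note that df itself can hang while the NFS server is unresponsive):

Code:
# check remaining space on the NFS backup target (mount point assumed)
df -h /mnt/pve/nas-backup-001-001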

Two relatively big VMs (4 TB of storage each) were being backed up at the time, which obviously failed. The strange thing is that those VMs crashed during the process: they were no longer reachable via network or noVNC.

Do you know what could have caused the VMs to crash during a failed backup?

Greetings
Lioh
 
Hi,
could you share the output of the backup task and the VM configuration?
 
In the bottom panel of the GUI, there should be an entry for the failed backup. When you double-click on it, you'll see the logged output. For the VM configuration, use qm config <ID>, replacing <ID> with the actual number.
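
For example, for VM 220 (replace with your own VM ID):

Code:
# print the configuration of VM 220 to the terminal
qm config 220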
 
Thanks, here is the VM configuration. The backup log was deleted during the backup share cleanup. I'm not sure if it was also written to syslog somehow.

Code:
agent: 1
bootdisk: scsi0
cores: 8
hotplug: disk,network,cpu
ide2: none,media=cdrom
memory: 197636
name: vapp-220-sql
net0: virtio=EA:C3:C1:25:57:3A,bridge=vmbr0,firewall=1
net1: virtio=D6:30:4C:84:8F:2A,bridge=vmbr1,firewall=1
numa: 1
ostype: win10
sata0: none,media=cdrom
sata1: none,media=cdrom
scsi0: san-002-001:vm-220-disk-4,size=100G,ssd=1
scsi1: san-002-001:vm-220-disk-0,size=1G,ssd=1
scsi10: nas-backup-001-001:vm-220-disk-0,size=4095G,ssd=1
scsi2: san-002-001:vm-220-disk-1,size=1G,ssd=1
scsi3: san-002-001:vm-220-disk-2,size=1G,ssd=1
scsi4: san-002-001:vm-220-disk-3,size=1536G,ssd=1
scsi5: san-002-002:vm-220-disk-0,size=400G,ssd=1
scsi6: san-002-002:vm-220-disk-2,size=300G,ssd=1
scsi7: san-002-002:vm-220-disk-1,size=500G,ssd=1
scsi8: san-002-002:vm-220-disk-3,size=200G,ssd=1
scsi9: san-002-002:vm-220-disk-4,size=200G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=f3e4fc16-4e9b-4f16-a5a9-b8a022305450
sockets: 2
vcpus: 12
vmgenid: 6b47aa34-a3e3-4b71-9c7f-be099ca83a8a
 
This is the only thing I have found in the syslog at the relevant time:

Code:
Feb  6 16:30:17 pvestatd[2972]: VM 220 qmp command failed - VM 220 qmp command 'query-proxmox-support' failed - unable to connect to VM 220 qmp socket - timeout after 31 retries
Feb  6 16:30:18 pvestatd[2972]: status update time (6.820 seconds)
Feb  6 16:30:27 pvestatd[2972]: VM 220 qmp command failed - VM 220 qmp command 'query-proxmox-support' failed - unable to connect to VM 220 qmp socket - timeout after 31 retries
Feb  6 16:30:27 pvestatd[2972]: status update time (6.782 seconds)
Feb  6 16:30:37 pvestatd[2972]: VM 220 qmp command failed - VM 220 qmp command 'query-proxmox-support' failed - unable to connect to VM 220 qmp socket - timeout after 31 retries
Feb  6 16:30:38 pvestatd[2972]: status update time (6.772 seconds)
Feb  6 16:30:47 pvestatd[2972]: VM 220 qmp command failed - VM 220 qmp command 'query-proxmox-support' failed - unable to connect to VM 220 qmp socket - timeout after 31 retries
Feb  6 16:30:48 pvestatd[2972]: status update time (6.654 seconds)
Feb  6 16:30:57 pvestatd[2972]: VM 220 qmp command failed - VM 220 qmp command 'query-proxmox-support' failed - unable to connect to VM 220 qmp socket - timeout after 31 retries
Feb  6 16:30:57 pvestatd[2972]: status update time (6.640 seconds)
 
[Screenshot attachment: Task_Log.png]
The task log should also be available at the bottom of the GUI when you double-click the relevant entry.
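
If the entry has already rotated out of the GUI, the task logs are also kept on the node itself; assuming the standard layout under /var/log/pve/tasks/, something like this can dig them out:

Code:
# list recent tasks recorded on this node and pick out the vzdump ones
pvenode task list | grep vzdump
# print the full log of a task by its UPID (taken from the list above)
pvenode task log 'UPID:...'
# the raw per-task log files live under /var/log/pve/tasks/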

It seems like the VM was already hanging or had crashed at that time. The good news is that with QEMU 5.2, more operations will be asynchronous, but I can't tell for sure yet whether that would have prevented your particular problem.
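
If it happens again, it might be worth checking whether the QEMU process is actually gone or merely stuck before restarting the VM. A quick sketch (VM 220 as in your log; the socket path follows the standard layout):

Code:
# does PVE still consider the VM running?
qm status 220
# the QMP socket the errors above refer to
ls -l /var/run/qemu-server/220.qmp
# is there still a kvm process for this VMID?
ps aux | grep -- '-id 220'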
 
Okay. In the meantime, I have found the relevant backup logs as well:

Code:
INFO: creating vzdump archive '/mnt/pve/../dump/vzdump-qemu-221-2021_02_06-06_37_47.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'fb6c5416-a9bc-47b6-a2f7-e1055e128920'
INFO: resuming VM again
INFO:   0% (634.4 MiB of 7.7 TiB) in  3s, read: 211.5 MiB/s, write: 167.9 MiB/s
INFO:   1% (78.4 GiB of 7.7 TiB) in  7m  6s, read: 188.4 MiB/s, write: 5.6 MiB/s
INFO:   2% (156.9 GiB of 7.7 TiB) in 13m 48s, read: 199.7 MiB/s, write: 52.0 KiB/s
INFO:   3% (235.1 GiB of 7.7 TiB) in 20m 11s, read: 209.1 MiB/s, write: 54.6 KiB/s
INFO:   4% (313.5 GiB of 7.7 TiB) in 27m 22s, read: 186.3 MiB/s, write: 24.3 MiB/s
INFO:   5% (391.7 GiB of 7.7 TiB) in 34m 18s, read: 192.6 MiB/s, write: 10.0 MiB/s
INFO:   6% (470.1 GiB of 7.7 TiB) in 40m 34s, read: 213.4 MiB/s, write: 231.2 KiB/s
INFO:   7% (548.5 GiB of 7.7 TiB) in 47m 29s, read: 193.6 MiB/s, write: 50.4 KiB/s
INFO:   8% (626.8 GiB of 7.7 TiB) in 54m 12s, read: 198.8 MiB/s, write: 54.4 KiB/s
INFO:   9% (705.1 GiB of 7.7 TiB) in  1h  1m  4s, read: 194.7 MiB/s, write: 7.2 MiB/s
INFO:  10% (783.4 GiB of 7.7 TiB) in  1h 10m 47s, read: 137.5 MiB/s, write: 76.3 MiB/s
INFO:  11% (861.9 GiB of 7.7 TiB) in  1h 17m 17s, read: 206.1 MiB/s, write: 35.8 KiB/s
INFO:  12% (940.1 GiB of 7.7 TiB) in  1h 23m 54s, read: 201.7 MiB/s, write: 35.1 KiB/s
INFO:  13% (1018.6 GiB of 7.7 TiB) in  1h 30m 40s, read: 197.9 MiB/s, write: 36.8 KiB/s
INFO:  14% (1.1 TiB of 7.7 TiB) in  1h 37m 20s, read: 200.2 MiB/s, write: 34.9 KiB/s
INFO:  15% (1.1 TiB of 7.7 TiB) in  1h 43m 59s, read: 201.0 MiB/s, write: 34.9 KiB/s
INFO:  16% (1.2 TiB of 7.7 TiB) in  1h 50m 22s, read: 209.5 MiB/s, write: 211.4 KiB/s
INFO:  17% (1.3 TiB of 7.7 TiB) in  1h 57m  2s, read: 200.6 MiB/s, write: 34.9 KiB/s
INFO:  18% (1.4 TiB of 7.7 TiB) in  2h  4m 21s, read: 182.6 MiB/s, write: 22.8 MiB/s
INFO:  19% (1.5 TiB of 7.7 TiB) in  2h 12m 32s, read: 163.8 MiB/s, write: 58.0 MiB/s
INFO:  20% (1.5 TiB of 7.7 TiB) in  2h 18m 55s, read: 209.1 MiB/s, write: 57.2 KiB/s
INFO:  21% (1.6 TiB of 7.7 TiB) in  2h 25m 33s, read: 201.6 MiB/s, write: 52.6 KiB/s
INFO:  22% (1.7 TiB of 7.7 TiB) in  2h 33m 24s, read: 170.2 MiB/s, write: 45.2 MiB/s
INFO:  23% (1.8 TiB of 7.7 TiB) in  2h 45m 47s, read: 108.0 MiB/s, write: 108.0 MiB/s
INFO:  24% (1.8 TiB of 7.7 TiB) in  2h 56m 57s, read: 119.7 MiB/s, write: 119.7 MiB/s
INFO:  25% (1.9 TiB of 7.7 TiB) in  3h  8m 58s, read: 111.4 MiB/s, write: 111.4 MiB/s
INFO:  26% (2.0 TiB of 7.7 TiB) in  3h 20m 50s, read: 112.6 MiB/s, write: 112.6 MiB/s
INFO:  27% (2.1 TiB of 7.7 TiB) in  3h 28m 31s, read: 173.9 MiB/s, write: 34.3 MiB/s
INFO:  28% (2.1 TiB of 7.7 TiB) in  3h 40m 59s, read: 107.3 MiB/s, write: 107.3 MiB/s
INFO:  29% (2.2 TiB of 7.7 TiB) in  3h 53m  8s, read: 110.1 MiB/s, write: 110.1 MiB/s
INFO:  30% (2.3 TiB of 7.7 TiB) in  4h  5m 44s, read: 106.1 MiB/s, write: 106.1 MiB/s
INFO:  31% (2.4 TiB of 7.7 TiB) in  4h 17m 42s, read: 111.6 MiB/s, write: 111.6 MiB/s
INFO:  32% (2.4 TiB of 7.7 TiB) in  4h 29m 34s, read: 112.7 MiB/s, write: 112.6 MiB/s
INFO:  33% (2.5 TiB of 7.7 TiB) in  4h 41m 34s, read: 111.5 MiB/s, write: 111.5 MiB/s
INFO:  34% (2.6 TiB of 7.7 TiB) in  4h 53m 23s, read: 113.2 MiB/s, write: 112.9 MiB/s
INFO:  35% (2.7 TiB of 7.7 TiB) in  5h  5m 47s, read: 107.8 MiB/s, write: 106.5 MiB/s
INFO:  36% (2.8 TiB of 7.7 TiB) in  5h 17m 42s, read: 112.3 MiB/s, write: 111.7 MiB/s
INFO:  37% (2.8 TiB of 7.7 TiB) in  5h 29m 34s, read: 112.6 MiB/s, write: 112.4 MiB/s
INFO:  38% (2.9 TiB of 7.7 TiB) in  5h 41m 17s, read: 114.2 MiB/s, write: 114.1 MiB/s
INFO:  39% (3.0 TiB of 7.7 TiB) in  5h 52m 52s, read: 115.4 MiB/s, write: 115.2 MiB/s
INFO:  40% (3.1 TiB of 7.7 TiB) in  6h  5m 16s, read: 107.8 MiB/s, write: 106.5 MiB/s
INFO:  41% (3.1 TiB of 7.7 TiB) in  6h 18m  5s, read: 104.3 MiB/s, write: 97.9 MiB/s
INFO:  42% (3.2 TiB of 7.7 TiB) in  6h 31m  0s, read: 103.5 MiB/s, write: 102.3 MiB/s
INFO:  43% (3.3 TiB of 7.7 TiB) in  6h 43m 37s, read: 106.0 MiB/s, write: 106.0 MiB/s
INFO:  44% (3.4 TiB of 7.7 TiB) in  6h 56m 30s, read: 103.8 MiB/s, write: 103.8 MiB/s
INFO:  45% (3.4 TiB of 7.7 TiB) in  7h  7m 31s, read: 121.3 MiB/s, write: 79.9 MiB/s
INFO:  46% (3.5 TiB of 7.7 TiB) in  7h 20m 37s, read: 102.1 MiB/s, write: 102.1 MiB/s
INFO:  47% (3.6 TiB of 7.7 TiB) in  7h 33m 53s, read: 100.8 MiB/s, write: 98.1 MiB/s
INFO:  48% (3.7 TiB of 7.7 TiB) in  7h 46m 58s, read: 102.2 MiB/s, write: 102.2 MiB/s
INFO:  49% (3.7 TiB of 7.7 TiB) in  8h  0m 35s, read: 98.2 MiB/s, write: 98.2 MiB/s
INFO:  50% (3.8 TiB of 7.7 TiB) in 10h 16m 27s, read: 9.8 MiB/s, write: 9.8 MiB/s
Warning: unable to close filehandle GEN67463 properly: No space left on device at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 754.
ERROR: interrupted by signal
INFO: aborting backup job
TASK ERROR: got unexpected control message:
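
To avoid running into "No space left on device" halfway through again, one option is a vzdump hook script that refuses to start the job when the target is low on free space. This is only a sketch built on the documented hook mechanism (see the example in /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl); the threshold and the assumption that a non-zero exit at job-start aborts the job should be verified:

Code:
#!/bin/bash
# /usr/local/bin/vzdump-space-check.sh
# Enable via 'script: /usr/local/bin/vzdump-space-check.sh' in /etc/vzdump.conf
MIN_FREE_GB=4500    # roughly the size of the largest disk being backed up

# vzdump calls the hook with the phase as $1 and exports DUMPDIR for
# directory-based storages
if [ "$1" = "job-start" ] && [ -n "$DUMPDIR" ]; then
    free_gb=$(df --output=avail -BG "$DUMPDIR" | tail -n1 | tr -dc '0-9')
    if [ "$free_gb" -lt "$MIN_FREE_GB" ]; then
        echo "only ${free_gb}G free on $DUMPDIR, refusing to start backup" >&2
        exit 1    # non-zero exit is expected to abort the job
    fi
fi
exit 0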