VM backup failure error 125

johnvick

Member
Mar 1, 2022
27
3
8
65
Using 7-1.12.

Backup of Ubuntu VM repeatedly fails with error 125 at 65% - same with stop and snapshot backups, same with local and two LAN storage drives. LXCs backup without error. Trying to clone the drive also fails. Any advice appreciated.

Code:
VMID    NAME    STATUS    TIME    SIZE    FILENAME
105    Jammy    FAILED    00:06:23    job failed with err -125 - Operation canceled
TOTAL    00:06:23    0KB   


Detailed backup logs:

vzdump 105 --remove 0 --storage vm --node pve --compress zstd --mode stop --mailto *****@gmail.com


105: 2022-05-06 19:30:01 INFO: Starting Backup of VM 105 (qemu)
105: 2022-05-06 19:30:01 INFO: status = running
105: 2022-05-06 19:30:01 INFO: backup mode: stop
105: 2022-05-06 19:30:01 INFO: ionice priority: 7
105: 2022-05-06 19:30:01 INFO: VM Name: Jammy
105: 2022-05-06 19:30:01 INFO: include disk 'scsi0' 'vm:105/vm-105-disk-0.qcow2' 80G
105: 2022-05-06 19:30:01 INFO: include disk 'scsi1' 'vm:105/vm-105-disk-1.qcow2' 100G
105: 2022-05-06 19:30:01 INFO: stopping virtual guest
105: 2022-05-06 19:30:15 INFO: creating vzdump archive '/mnt/storage/vm/dump/vzdump-qemu-105-2022_05_06-19_30_01.vma.zst'
105: 2022-05-06 19:30:15 INFO: starting kvm to execute backup task
105: 2022-05-06 19:30:16 INFO: started backup task '0c50dc63-2ede-45b6-8c04-7a65b77db7f7'
105: 2022-05-06 19:30:16 INFO: resuming VM again after 15 seconds
105: 2022-05-06 19:30:19 INFO:   0% (1.2 GiB of 180.0 GiB) in 3s, read: 406.0 MiB/s, write: 351.9 MiB/s
105: 2022-05-06 19:30:22 INFO:   1% (2.3 GiB of 180.0 GiB) in 6s, read: 392.7 MiB/s, write: 350.0 MiB/s
105: 2022-05-06 19:30:31 INFO:   2% (3.7 GiB of 180.0 GiB) in 15s, read: 155.9 MiB/s, write: 154.1 MiB/s
105: 2022-05-06 19:30:36 INFO:   3% (5.6 GiB of 180.0 GiB) in 20s, read: 391.5 MiB/s, write: 365.9 MiB/s

105: 2022-05-06 19:36:09 INFO:  64% (115.2 GiB of 180.0 GiB) in 5m 53s, read: 452.2 MiB/s, write: 270.2 MiB/s
105: 2022-05-06 19:36:16 INFO:  65% (117.1 GiB of 180.0 GiB) in 6m, read: 272.4 MiB/s, write: 270.5 MiB/s
105: 2022-05-06 19:36:21 INFO:  65% (118.2 GiB of 180.0 GiB) in 6m 5s, read: 232.3 MiB/s, write: 187.3 MiB/s
105: 2022-05-06 19:36:21 ERROR: job failed with err -125 - Operation canceled
105: 2022-05-06 19:36:21 INFO: aborting backup job
105: 2022-05-06 19:36:21 INFO: resuming VM again
105: 2022-05-06 19:36:24 ERROR: Backup of VM 105 failed - job failed with err -125 - Operation canceled
 
Last edited:
job failed with err -125 - Operation canceled
This sounds like the backup job is stopped by your system for some reason, perhaps the OOM Killer? You could try to monitor memory/system resource usage during the backup, are there any abnormalities?

Still, some more information would be nice. Please attach your VM config qm config <vmid>. Furthermore, journalctl may give you some more insights what exactly is going wrong.
 
I take it you mean out of memory killer? I'm not sure how to monitor system resources but it is an Intel i7 11 gen with 1 TB NVME and 32 GB ram so not short on resource.

The qm config is:

root@pve:~# qm config 105 agent: 1,fstrim_cloned_disks=1 boot: order=scsi0;net0 cores: 8 memory: 8192 meta: creation-qemu=6.1.1,ctime=1649458020 name: Jammy net0: virtio=CA:30:6A:32:38:51,bridge=vmbr0,firewall=1 numa: 0 onboot: 1 ostype: l26 scsi0: vm:105/vm-105-disk-0.qcow2,discard=on,size=80G,ssd=1 scsi1: vm:105/vm-105-disk-1.qcow2,discard=on,size=100G,ssd=1 scsihw: virtio-scsi-pci smbios1: uuid=15221f6b-f18a-4654-a900-66f27cef30a6 sockets: 1 vmgenid: 2c1e134c-88b9-4845-85da-ba6235c13330

journalctl shows:

May 06 20:56:03 pve kernel: blk_update_request: critical medium error, dev nvme0n1, sector 416706816 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 2

Which looks like a hardware error on the NVME drive?
 
critical medium error, dev nvme0n1, sector 416706816 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 2
You are right, this very much sounds like a problem with your device.

Please run smartctl -a <device>, this should be able to tell you if something is wrong.
 
Last edited:
  • Like
Reactions: Stoiko Ivanov
Code:
root@pve:~# smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.30-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ShiJi 1TB M.2-NVMe
Serial Number:                      2021121400001
Firmware Version:                   U0520A0L
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000001
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000001 0000000000
Local Time is:                      Fri May  6 21:17:47 2022 NZST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0015):     Comp DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     2200    3000
 4 -   0.0120W       -        -    4  4  4  4    43000   43000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

/CODE]

I'm no expert here but this looks OK?
 
SMART overall-health self-assessment test result: PASSED seems to imply that your device is healthy. Though, when I run smartctl locally there are a lot more entries after === START OF SMART DATA SECTION ===. Especially entries after Error Information might be of note here. Did you accidentally truncate that?

What I meant by "monitoring system resources" would be to check e.g. how much RAM and CPU your system is using and if it is reaching some kind of threshold. You can check your memory usage simply with free, something more sophisticated like htop might give you some more insights though.
 
The above has all the smartctl info, nil truncated. I removed the virtual 100 GB drive from the backup config and now I get a new error:

ERROR: Backup of VM 105 failed - job failed with err -61 - No data available

Googling this leads to several dead ends. Reboot does not change anything and I believe I have eliminated the LAN backup device as the problem.

The VM is working fine and all I have done since last backup is add a couple of Docker apps. The device has 8 GB allocated, plenty for what I have installed I think. Increase to 16 GB no difference.

Trying to clone the VM gives error:

drive-scsi0: transferred 16.8 GiB of 80.0 GiB (20.99%) in 28s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
TASK ERROR: clone failed: block job (mirror) error: drive-scsi0: 'mirror' has been cancelled

Following this advice:

https://supportportal.juniper.net/s...ntegrity-of-QCOW2-file-for-VNF?language=en_US

No errors were found. Rebooting to earlier kernel has not made any difference either.

Thanks for your input.
 
Last edited:
Ran smartctl -a /dev/nvme0n1 again - I must have missed below first time.

SMART/Health Information (NVMe Log 0x02) Critical Warning: 0x00 Temperature: 45 Celsius Available Spare: 100% Available Spare Threshold: 10% Percentage Used: 0% Data Units Read: 12,164,103 [6.22 TB] Data Units Written: 11,558,119 [5.91 TB] Host Read Commands: 65,126,755 Host Write Commands: 143,713,681 Controller Busy Time: 4,762 Power Cycles: 53 Power On Hours: 249 Unsafe Shutdowns: 9 Media and Data Integrity Errors: 127 Error Information Log Entries: 127 Warning Comp. Temperature Time: 0 Critical Comp. Temperature Time: 0 Thermal Temp. 1 Transition Count: 88 Thermal Temp. 1 Total Time: 1465 Error Information (NVMe Log 0x01, 16 of 64 entries) No Errors Logged
 
Just to finish this off in case anyone is interested or has a similar problem I think the problem was bad blocks on the NVME drive.

badblocks -v /dev/nvme0n1 > ~/bad_sectors.txt found eight bad blocks. This site proved useful:

https://www.debugpoint.com/2020/07/scan-repair-bad-sector-disk-linux/

I created a new VM with same specs as old one, rsync'd my stuff over and the new VM is working and able to be backed up.
 
Last edited:

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!