VM backup failure error 125

johnvick · May 6, 2022

Using 7-1.12.

Backup of Ubuntu VM repeatedly fails with error 125 at 65% - same with stop and snapshot backups, same with local and two LAN storage drives. LXCs backup without error. Trying to clone the drive also fails. Any advice appreciated.

Code:

VMID    NAME    STATUS    TIME    SIZE    FILENAME
105    Jammy    FAILED    00:06:23    job failed with err -125 - Operation canceled
TOTAL    00:06:23    0KB   


Detailed backup logs:

vzdump 105 --remove 0 --storage vm --node pve --compress zstd --mode stop --mailto *****@gmail.com


105: 2022-05-06 19:30:01 INFO: Starting Backup of VM 105 (qemu)
105: 2022-05-06 19:30:01 INFO: status = running
105: 2022-05-06 19:30:01 INFO: backup mode: stop
105: 2022-05-06 19:30:01 INFO: ionice priority: 7
105: 2022-05-06 19:30:01 INFO: VM Name: Jammy
105: 2022-05-06 19:30:01 INFO: include disk 'scsi0' 'vm:105/vm-105-disk-0.qcow2' 80G
105: 2022-05-06 19:30:01 INFO: include disk 'scsi1' 'vm:105/vm-105-disk-1.qcow2' 100G
105: 2022-05-06 19:30:01 INFO: stopping virtual guest
105: 2022-05-06 19:30:15 INFO: creating vzdump archive '/mnt/storage/vm/dump/vzdump-qemu-105-2022_05_06-19_30_01.vma.zst'
105: 2022-05-06 19:30:15 INFO: starting kvm to execute backup task
105: 2022-05-06 19:30:16 INFO: started backup task '0c50dc63-2ede-45b6-8c04-7a65b77db7f7'
105: 2022-05-06 19:30:16 INFO: resuming VM again after 15 seconds
105: 2022-05-06 19:30:19 INFO:   0% (1.2 GiB of 180.0 GiB) in 3s, read: 406.0 MiB/s, write: 351.9 MiB/s
105: 2022-05-06 19:30:22 INFO:   1% (2.3 GiB of 180.0 GiB) in 6s, read: 392.7 MiB/s, write: 350.0 MiB/s
105: 2022-05-06 19:30:31 INFO:   2% (3.7 GiB of 180.0 GiB) in 15s, read: 155.9 MiB/s, write: 154.1 MiB/s
105: 2022-05-06 19:30:36 INFO:   3% (5.6 GiB of 180.0 GiB) in 20s, read: 391.5 MiB/s, write: 365.9 MiB/s

105: 2022-05-06 19:36:09 INFO:  64% (115.2 GiB of 180.0 GiB) in 5m 53s, read: 452.2 MiB/s, write: 270.2 MiB/s
105: 2022-05-06 19:36:16 INFO:  65% (117.1 GiB of 180.0 GiB) in 6m, read: 272.4 MiB/s, write: 270.5 MiB/s
105: 2022-05-06 19:36:21 INFO:  65% (118.2 GiB of 180.0 GiB) in 6m 5s, read: 232.3 MiB/s, write: 187.3 MiB/s
105: 2022-05-06 19:36:21 ERROR: job failed with err -125 - Operation canceled
105: 2022-05-06 19:36:21 INFO: aborting backup job
105: 2022-05-06 19:36:21 INFO: resuming VM again
105: 2022-05-06 19:36:24 ERROR: Backup of VM 105 failed - job failed with err -125 - Operation canceled

datschlatscher · May 6, 2022

johnvick said:
job failed with err -125 - Operation canceled

This sounds like the backup job is stopped by your system for some reason, perhaps the OOM Killer? You could try to monitor memory/system resource usage during the backup, are there any abnormalities?

Still, some more information would be nice. Please attach your VM config qm config <vmid>. Furthermore, journalctl may give you some more insights what exactly is going wrong.

johnvick · May 6, 2022

I take it you mean out of memory killer? I'm not sure how to monitor system resources but it is an Intel i7 11 gen with 1 TB NVME and 32 GB ram so not short on resource.

The qm config is:


root@pve:~# qm config 105
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 8
memory: 8192
meta: creation-qemu=6.1.1,ctime=1649458020
name: Jammy
net0: virtio=CA:30:6A:32:38:51,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: vm:105/vm-105-disk-0.qcow2,discard=on,size=80G,ssd=1
scsi1: vm:105/vm-105-disk-1.qcow2,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=15221f6b-f18a-4654-a900-66f27cef30a6
sockets: 1
vmgenid: 2c1e134c-88b9-4845-85da-ba6235c13330

journalctl shows:


May 06 20:56:03 pve kernel: blk_update_request: critical medium error, dev nvme0n1, sector 416706816 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 2

Which looks like a hardware error on the NVME drive?

datschlatscher · May 6, 2022

johnvick said:
critical medium error, dev nvme0n1, sector 416706816 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 2

You are right, this very much sounds like a problem with your device.

Please run smartctl -a <device>, this should be able to tell you if something is wrong.

johnvick · May 6, 2022

Code:

root@pve:~# smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.30-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ShiJi 1TB M.2-NVMe
Serial Number:                      2021121400001
Firmware Version:                   U0520A0L
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000001
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000001 0000000000
Local Time is:                      Fri May  6 21:17:47 2022 NZST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0015):     Comp DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     2200    3000
 4 -   0.0120W       -        -    4  4  4  4    43000   43000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

/CODE]

I'm no expert here but this looks OK?

datschlatscher · May 6, 2022

SMART overall-health self-assessment test result: PASSED seems to imply that your device is healthy. Though, when I run smartctl locally there are a lot more entries after === START OF SMART DATA SECTION ===. Especially entries after Error Information might be of note here. Did you accidentally truncate that?

What I meant by "monitoring system resources" would be to check e.g. how much RAM and CPU your system is using and if it is reaching some kind of threshold. You can check your memory usage simply with free, something more sophisticated like htop might give you some more insights though.

johnvick · May 6, 2022

The above has all the smartctl info, nil truncated. I removed the virtual 100 GB drive from the backup config and now I get a new error:

ERROR: Backup of VM 105 failed - job failed with err -61 - No data available

Googling this leads to several dead ends. Reboot does not change anything and I believe I have eliminated the LAN backup device as the problem.

The VM is working fine and all I have done since last backup is add a couple of Docker apps. The device has 8 GB allocated, plenty for what I have installed I think. Increase to 16 GB no difference.

Trying to clone the VM gives error:

drive-scsi0: transferred 16.8 GiB of 80.0 GiB (20.99%) in 28s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
TASK ERROR: clone failed: block job (mirror) error: drive-scsi0: 'mirror' has been cancelled

Following this advice:

https://supportportal.juniper.net/s...ntegrity-of-QCOW2-file-for-VNF?language=en_US

No errors were found. Rebooting to earlier kernel has not made any difference either.

Thanks for your input.

johnvick · May 6, 2022

Somewhere in the midst of several reboots I am now on on version 7.2-3

johnvick · May 6, 2022

Ran smartctl -a /dev/nvme0n1 again - I must have missed below first time.


SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    12,164,103 [6.22 TB]
Data Units Written:                 11,558,119 [5.91 TB]
Host Read Commands:                 65,126,755
Host Write Commands:                143,713,681
Controller Busy Time:               4,762
Power Cycles:                       53
Power On Hours:                     249
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    127
Error Information Log Entries:      127
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   88
Thermal Temp. 1 Total Time:         1465

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

johnvick · May 7, 2022

Just to finish this off in case anyone is interested or has a similar problem I think the problem was bad blocks on the NVME drive.

badblocks -v /dev/nvme0n1 > ~/bad_sectors.txt found eight bad blocks. This site proved useful:

https://www.debugpoint.com/2020/07/scan-repair-bad-sector-disk-linux/

I created a new VM with same specs as old one, rsync'd my stuff over and the new VM is working and able to be backed up.

Search

Search

VM backup failure error 125

johnvick

Member

datschlatscher

Proxmox Retired Staff

johnvick

Member

datschlatscher

Proxmox Retired Staff

johnvick

Member

datschlatscher

Proxmox Retired Staff

johnvick

Member

johnvick

Member

johnvick

Member

johnvick

Member

We value your privacy