VM backup fails for unknown reason as of today?

Wouter Samaey

Starting today, backing up a VM fails on our Proxmox host.

I have no idea why and can see no clear cause. The disk is not full, the VM is very small, and the backup has worked every time before.

This is a single server (no cluster) and the backup of the small VM is saved locally. The filesystem is ZFS.

Below is the log output, but it doesn't say much.

Where should I go from here?

Version info:
root@s3:~# pveversion
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)






INFO: starting new backup job: vzdump 100 --mode snapshot --compress lzo --remove 0 --node s3 --storage local
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-01-03 10:21:28
INFO: status = running
INFO: update VM 100: -lock backup
INFO: VM Name: roompot
INFO: include disk 'scsi0' 'local-zfs:vm-100-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-100-2020_01_03-10_21_28.vma.lzo'
INFO: started backup task '593ab682-63d8-46fe-9aa9-4345bdc8027a'
INFO: status: 2% (1246756864/53687091200), sparse 0% (140873728), duration 3, read/write 415/368 MB/s
INFO: status: 4% (2337669120/53687091200), sparse 0% (265097216), duration 6, read/write 363/322 MB/s
INFO: status: 8% (4508680192/53687091200), sparse 2% (1341657088), duration 9, read/write 723/364 MB/s
INFO: status: 10% (5652217856/53687091200), sparse 2% (1399963648), duration 12, read/write 381/361 MB/s
INFO: status: 12% (6814498816/53687091200), sparse 2% (1533128704), duration 15, read/write 387/343 MB/s
INFO: status: 14% (7888502784/53687091200), sparse 2% (1533128704), duration 18, read/write 358/358 MB/s
INFO: status: 17% (9191030784/53687091200), sparse 3% (1681793024), duration 21, read/write 434/384 MB/s
INFO: status: 19% (10289545216/53687091200), sparse 3% (1681793024), duration 24, read/write 366/366 MB/s
INFO: status: 21% (11429871616/53687091200), sparse 3% (1720545280), duration 27, read/write 380/367 MB/s
INFO: status: 23% (12516982784/53687091200), sparse 3% (1742905344), duration 30, read/write 362/354 MB/s
INFO: status: 26% (14132445184/53687091200), sparse 4% (2275192832), duration 33, read/write 538/361 MB/s
INFO: status: 29% (15766913024/53687091200), sparse 4% (2426220544), duration 36, read/write 544/494 MB/s
INFO: status: 32% (17413242880/53687091200), sparse 5% (2760151040), duration 39, read/write 548/437 MB/s
INFO: status: 35% (18990235648/53687091200), sparse 5% (2831286272), duration 42, read/write 525/501 MB/s
INFO: status: 37% (19955712000/53687091200), sparse 5% (2869846016), duration 45, read/write 321/308 MB/s
INFO: status: 39% (21376204800/53687091200), sparse 5% (2920669184), duration 48, read/write 473/456 MB/s
INFO: status: 42% (22943367168/53687091200), sparse 5% (2999848960), duration 51, read/write 522/495 MB/s
INFO: status: 45% (24350162944/53687091200), sparse 5% (3094470656), duration 54, read/write 468/437 MB/s
INFO: status: 48% (25797984256/53687091200), sparse 5% (3113127936), duration 57, read/write 482/476 MB/s
INFO: status: 50% (27102871552/53687091200), sparse 5% (3172519936), duration 60, read/write 434/415 MB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
ERROR: Backup of VM 100 failed - job failed with err -5 - Input/output error
INFO: Failed at 2020-01-03 10:22:28
INFO: Backup job finished with errors
TASK ERROR: job errors
 
Hi,

ERROR: Backup of VM 100 failed - job failed with err -5 - Input/output error
hints at a disk issue, or maybe there is some data corruption going on.

First I'd check the health of ZFS and the disk (use smartctl). If those don't help, try upgrading to the latest packages and rebooting.
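For example (assuming the default pool name rpool and NVMe system disks; adjust the names to your setup):
Code:
# overall pool health, including read/write/checksum error counters
zpool status -v rpool
# SMART health of the physical disks
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1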

If none of that helps, let me know.
 
Hi @oguz,

I ran the smartctl command on both NVMe drives I have, but no issues were found. Then I updated both the host and the VM (apt update && apt upgrade) and rebooted both, but still no luck. The backup process still fails at 50% and the error is the same.

Below is the output of smartctl:

root@s3:~# smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZQLB960HAJR-00007
Serial Number: S437NX0M404609
Firmware Version: EDA5202Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 960,197,124,096 [960 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 960,197,124,096 [960 GB]
Namespace 1 Utilization: 953,906,315,264 [953 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Jan 3 19:52:56 2020 CET
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f): Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 87 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 10.60W - - 0 0 0 0 0 0

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 15,465,783 [7.91 TB]
Data Units Written: 22,505,765 [11.5 TB]
Host Read Commands: 653,036,404
Host Write Commands: 547,515,147
Controller Busy Time: 478
Power Cycles: 17
Power On Hours: 4,090
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 37 Celsius
Temperature Sensor 2: 41 Celsius
Temperature Sensor 3: 46 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

root@s3:~# smartctl -a /dev/nvme1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZQLB960HAJR-00007
Serial Number: S437NX0M404608
Firmware Version: EDA5202Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 960,197,124,096 [960 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 960,197,124,096 [960 GB]
Namespace 1 Utilization: 953,906,315,264 [953 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Jan 3 19:52:58 2020 CET
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f): Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 87 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 10.60W - - 0 0 0 0 0 0

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 14,845,521 [7.60 TB]
Data Units Written: 22,505,662 [11.5 TB]
Host Read Commands: 511,637,709
Host Write Commands: 516,378,113
Controller Busy Time: 477
Power Cycles: 17
Power On Hours: 4,090
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 35 Celsius
Temperature Sensor 2: 39 Celsius
Temperature Sensor 3: 44 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
 
Can you also post your VM config? (qm config VMID)

What should be the next step, @oguz?

I would check for data corruption inside the virtual disk.
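As a quick first look you could also grep the guest's kernel log for filesystem or I/O errors (just an example, run inside the VM):
Code:
dmesg -T | grep -iE 'ext4|i/o error'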
 
@oguz I've never done this kind of thing before and searching online doesn't tell me much.

Which commands would you suggest I run?

This is the disk in Proxmox (screenshot attached): as you can see there, Proxmox uses ZFS, and the disk is formatted as ext4 inside the VM.

And this is the output on the VM:
Code:
root@roompot:~# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                          4.9G     0  4.9G   0% /dev
tmpfs                        1000M  696K  999M   1% /run
/dev/mapper/roompot--vg-root   49G   30G   18G  64% /
tmpfs                         4.9G     0  4.9G   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                         4.9G     0  4.9G   0% /sys/fs/cgroup
tmpfs                        1000M     0 1000M   0% /run/user/0

Code:
root@roompot:~# fdisk -l
Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xea0a8bed

Device     Boot Start       End   Sectors Size Id Type
/dev/sda1  *     2048 104855551 104853504  50G 8e Linux LVM


Disk /dev/mapper/roompot--vg-root: 50 GiB, 53682896896 bytes, 104849408 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Code:
cat /etc/fstab 
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/mapper/roompot--vg-root /               ext4    errors=remount-ro 0       1
#/dev/mapper/roompot--vg-swap_1 none            swap    sw              0       0

# Mount swapfile
/swapfile swap swap defaults 0 0
 
@oguz I've never done this kind of thing before and searching online doesn't tell me much.

Boot into the VM and check the filesystems with fsck.
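Since the root filesystem can't be repaired while it is mounted, do this from a rescue or live ISO that includes lvm2. A rough sketch, using the volume group name from the output you posted (adjust if needed):
Code:
# boot the VM from a live/rescue ISO, then:
vgchange -ay roompot-vg          # activate the guest's LVM volume group
fsck -f /dev/roompot-vg/root     # full check; add -n first for a read-only pass
vgchange -an roompot-vg          # deactivate again before rebooting the VM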
 
Hi @oguz ,

I have done a few things, but no success.

When I run fsck on the VM in dry-run mode, it reports a few problems:
Code:
fsck -fn /dev/mapper/roompot--vg-root
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
Warning!  /dev/mapper/roompot--vg-root is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 3014684 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 3014685 was part of the orphaned inode list.  IGNORED.
Inode 3014697 was part of the orphaned inode list.  IGNORED.
Inode 3014698 was part of the orphaned inode list.  IGNORED.
Inode 3014699 was part of the orphaned inode list.  IGNORED.
Inode 3014700 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (5023101, counted=7054672).
Fix? no

Inode bitmap differences:  -(3014684--3014685) -(3014697--3014700)
Fix? no

Free inodes count wrong (997319, counted=2969886).
Fix? no


/dev/mapper/roompot--vg-root: ********** WARNING: Filesystem still has errors **********

/dev/mapper/roompot--vg-root: 2279481/3276800 files (0.0% non-contiguous), 8083075/13106176 blocks

But when I boot the VM from an Ubuntu live CD and check and repair the disk there, it reports no errors?! Very strange. I did this from the GUI, using the "Disks" app in Ubuntu; both the "Check" and "Repair" actions report no issues.

Unfortunately, backing up the VM still fails.

What can I try next?
 
Maybe you can try running fsck again in the VM, but not in dry-run mode this time, and see if that makes anything better.
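If booting a live ISO again is too much hassle, one alternative (assuming the guest runs systemd) is to force the check on the next boot via the kernel command line:
Code:
# at the VM's GRUB menu, press 'e' and append to the 'linux' line:
fsck.mode=force fsck.repair=yes
# then boot normally; systemd runs a full fsck on the root filesystem before mounting it read-write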

I'm also running out of ideas here, but we're probably missing something crucial.

When did this start happening? Maybe something you did on the VM, or some setting you changed, is causing it.
 
@oguz this started happening for no reason. It's a server that we don't touch; both the hypervisor and the VM had been unchanged for months.

Running fsck without dry-run on the mounted filesystem does not work; it refuses to run.

The only thing I can think of is that maybe the disk repair on the Ubuntu 19.04 live CD works differently from fsck? Sounds far-fetched, I know.

In any case, I am not alone with this issue: https://serverfault.com/questions/724471/fsck-finds-errors-in-os-but-not-using-livecd
But there is no solution there.

One thing I did discover was that I had one directory with an enormous number of small files. This was clearly out of control (caused by a bad configuration of an app on my part). I have deleted those small files, but the backup issue remains.

Any chance we could see a disk repair function in the Proxmox GUI? That would be amazing!
And yes, there are ways to mount the disk in the hypervisor, but I didn't go that way. There might be some security issues with that, though...

Also, maybe you could change the error message in the Proxmox backup logs? Error -5 could be replaced with an easier-to-understand message and/or some advice.
 
I'm guessing this might be a problem with your storage/disk/hardware,

because for the backup process the filesystem inside the VM shouldn't matter much (you could have completely random data with no filesystem at all, for example, and it would still back up).

What to do?

* Try creating a second VM on the same storage and see if you can back that one up.

* Also, if you can, add another storage and try migrating your VM's disk to it (see the sketch below).

That way we know where the problem lies.
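Roughly like this (the test VM ID 101 and the target storage name local2 are just examples; substitute your own):
Code:
# back up a freshly created test VM that lives on the same storage
vzdump 101 --mode snapshot --compress lzo --storage local
# move the problem VM's disk to a different storage
qm move_disk 100 scsi0 local2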
 
I am hitting a very similar issue on some machines with failing SSD drives. How does smartctl look, and if you run a self-test with smartctl, does it complete OK?
 
@oguz and @adamb, it turns out the disk image is damaged.
I ran a ZFS scrub, and it found that one file was damaged: the VM's disk image:

Code:
zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 00:01:45 with 6 errors on Thu Jan 16 20:08:00 2020
config:

    NAME                                                 STATE     READ WRITE CKSUM
    rpool                                                ONLINE       0     0     0
      mirror-0                                           ONLINE       0     0     0
        nvme-eui.343337304d4046090025385800000001-part3  ONLINE       0     0   356
        nvme-eui.343337304d4046080025385800000001-part3  ONLINE       0     0   356

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/dump/vzdump-qemu-100-2020_01_02-03_00_02.vma.lzo
        rpool/data/vm-100-disk-0:<0x1>

Since these are NVMe disks and smartctl already came back clean, I don't think it will tell us much more.

Is there a way to repair / recover the VM disk image? I started noticing this problem because I can't back up the VM, so I don't have a full VM backup.

What would you suggest at this point, @oguz? Can I attach a second disk to the VM and clone from A to B, or something like that?
 
What would you suggest at this point, @oguz? Can I attach a second disk to the VM and clone from A to B, or something like that?

do a "move disk" first, to another storage or another zfs pool

afterwards do another scrub and see if the new disk works

if it works, remove both the problematic files/zvols and do a zpool clear
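As a rough sketch (again assuming a second storage named local2; double-check the paths before deleting anything):
Code:
# 1. move the corrupted disk to another storage
qm move_disk 100 scsi0 local2
# 2. scrub again and check whether the new copy is clean
zpool scrub rpool
zpool status -v rpool
# 3. only if the moved disk is healthy: remove the damaged dump file and the old zvol
#    (also drop the leftover 'unusedX' entry from the VM config), then clear the pool's error state
rm /var/lib/vz/dump/vzdump-qemu-100-2020_01_02-03_00_02.vma.lzo
zfs destroy rpool/data/vm-100-disk-0
zpool clear rpool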
 
