Backup fails for one specific VM only, always at the same point

Hi,

Recently we replaced our network storage with a new server and connected the new storage via SMB/CIFS.

Now when running the backup schedule, or a manual backup, it always fails at around 15% with the same error as below. We tried local storage and different storage targets, but it keeps failing with the same error, even if we move the disk to a different storage.

The error is:
INFO: starting new backup job: vzdump 107 --compress zstd --mode snapshot --node inegielc1-proxa --remove 0 --notes-template '{{guestname}}' --storage VM-Library-PRD
INFO: Starting Backup of VM 107 (qemu)
INFO: Backup started at 2023-10-12 16:34:16
INFO: status = running
INFO: include disk 'scsi0' 'data2:vm-107-disk-0' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/VM-Library-PRD/dump/vzdump-qemu-107-2023_10_12-16_34_16.vma.zst'
INFO: started backup task '04a05974-7183-436a-8ae7-ce17a4aee2be'
INFO: resuming VM again
INFO: 2% (2.2 GiB of 100.0 GiB) in 3s, read: 767.3 MiB/s, write: 218.2 MiB/s
|...
INFO: 15% (15.2 GiB of 100.0 GiB) in 36s, read: 276.4 MiB/s, write: 276.2 MiB/s
INFO: 15% (15.4 GiB of 100.0 GiB) in 37s, read: 183.1 MiB/s, write: 181.3 MiB/s
ERROR: job failed with err -61 - No data available
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 107 failed - job failed with err -61 - No data available
INFO: Failed at 2023-10-12 16:35:02
INFO: Backup job finished with errors
TASK ERROR: job errors

In the logs we see:
[screenshot of the host log entries attached]

The config of the VM is:
root@inegielc1-proxa:~# qm config 107
boot: order=scsi0;ide2;net0
cores: 2
ide2: none,media=cdrom
memory: 4092
meta: creation-qemu=6.1.0,ctime=1662992740
name: inegielc1-zabpb
net0: virtio=16:EF:8C:C8:6D:72,bridge=vmbr1,firewall=1,tag=255
numa: 0
onboot: 1
ostype: l26
scsi0: data2:vm-107-disk-0,cache=none,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=da1ea533-60a3-484c-b6d0-d410ae2923b7
sockets: 1
vmgenid: 9b453976-7037-46c0-90a8-f47b97a7b3c8

Our version is:
root@inegielc1-proxa:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

What we tried:
  • Restarting or shutting down the VM and the Proxmox server.
  • Running a manual backup instead of the schedule (other VMs have no problems).
  • Selecting a different target storage, even a locally attached one.
  • Moving the disk --> this also fails with: qemu-img: error while reading at byte 16496194048: Input/output error (screenshot attached)
  • Setting Async IO to threads.
  • Stopping the VM and running a manual backup in stop mode with GZIP compression (instead of ZSTD).
Nothing seems to work; I'm a bit stuck on what to do. Any help would be highly appreciated.
 
The errors indicate a hardware issue with your NVMe disk. I'd recommend getting a replacement and evacuating data off the disk ASAP. You may need to use data recovery tools that can ignore I/O errors to salvage the available data, e.g. ddrescue.
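A minimal sketch of such a rescue run, assuming the failing disk is /dev/nvme0n1 and /mnt/rescue is healthy storage with enough free space (both paths are placeholders, adjust to your setup):

# ddrescue -d -n /dev/nvme0n1 /mnt/rescue/nvme0n1.img /mnt/rescue/nvme0n1.map
# ddrescue -d -r3 /dev/nvme0n1 /mnt/rescue/nvme0n1.img /mnt/rescue/nvme0n1.map

The first pass copies everything that reads cleanly and skips bad areas quickly; the second pass retries the remaining bad areas a few times. The map file lets ddrescue resume and keeps track of which sectors are still unread.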

good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
When looking at the S.M.A.R.T. values I don't see any problems, and all other VMs work fine. The VMs themselves run properly; only the backup is failing (screenshot of the SMART values attached).
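For reference, those values can also be read on the host with smartmontools (already installed per the version list above); the device name below is an assumption:

# smartctl -a /dev/nvme0n1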

Is this really the case here?
 
https://serverfault.com/questions/519726/how-reliable-is-hdd-smart-data

You have I/O read errors, they always happen in the same spot according to you, and you can't move the disk image (file) to another location due to I/O read errors. That's all a pretty solid indication of a bad disk. Use dd on the host to read the disk and write the output to /dev/null - does the disk read fail?
https://unix.stackexchange.com/questions/651444/resuming-dd-with-read-errors-skip-seek-numbers

The error is reported by the Linux kernel; there is nothing PVE/QEMU can do to fix it.
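For example, something along these lines would read the whole device sequentially (the device name is an assumption, adjust to your disk):

# dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress

By default dd aborts with "Input/output error" as soon as it hits an unreadable sector, which is exactly what you want to check for here.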


 
Just so that I'm making the correct assumptions, I have done the below:

Using lsblk to check what the drive name is (screenshot of the output attached):
[screenshot: lsblk output]

Executing dd to read the disk:
# dd if=/dev/nvme0n1 of=/dev/null bs=1G count=1 oflag=dsync
0+1 records in
0+1 records out
4370432 bytes (4.4 MB, 4.2 MiB) copied, 0.309916 s, 14.1 MB/s

Check 2:
# dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.00037509 s, 1.4 GB/s

I'm not seeing issues; am I missing something here?
 

Well, you've "read" 1 count of 1 GB (1 GB total) of a 1 TB disk.
The error message you presented said it happened after reading 15 GB of a 100 GB virtual disk. We don't know where all the pieces are on the physical disk; chances are it's not in the first 1 GB.


 
Thanks for your very quick responses. So the second check I did is not sufficient either? I did the single count just to see whether I get errors; later I tried 1000 counts of 512 bytes.
 
512000 bytes (512 kB, 500 KiB) copied, 0.00037509 s, 1.4 GB/s
The program tells you how much data was copied, i.e. 500 KiB, which is exactly what you asked it to do: 1000 counts of 512-byte blocks.

Why not just: # dd if=/dev/nvme0n1 of=/dev/null bs=32k
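A variant that keeps reading past the first bad spot and reports progress while it runs (standard GNU dd options; purely optional for this test):

# dd if=/dev/nvme0n1 of=/dev/null bs=32k conv=noerror status=progress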


 
Apologies, now I understand, thanks for your patience.

I can see the error now:
# dd if=/dev/nvme0n1 of=/dev/null bs=32k
dd: error reading '/dev/nvme0n1': Input/output error
133+1 records in
133+1 records out
4370432 bytes (4.4 MB, 4.2 MiB) copied, 0.460827 s, 9.5 MB/s

I managed to move the other VMs to a different drive, and to recover this one by restoring a backup onto the other drive. So thanks for your tips so far.

Is there any command I can use to try to repair this drive, or do you recommend just replacing it?
 
If the internal logic of the disk did not catch this, there is nothing you can do to repair it. Check your warranty, maybe you can make use of that. If not, get a new disk.
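If you want to document the failure for a warranty claim, the drive's own error counters can be dumped on the host; the device names below are assumptions, and nvme-cli may need to be installed first:

# smartctl -a /dev/nvme0n1
# nvme error-log /dev/nvme0

In the smartctl output, look at the "Media and Data Integrity Errors" counter; whether it actually reflects the failing LBAs depends on the drive's firmware.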


 
