Backup error. Disk error? Frequent issues but never a clear reason

Iacov

Hey,

I have two PVE machines (pve1 and pve2).
pve1 is rock solid... pve2 has often had issues with backup errors, usually because the disk switched into read-only mode for no apparent reason.
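When it happens, I guess something like this should show whether the kernel remounted the disk read-only (the grep pattern is just my attempt, not an authoritative check):

Code:
# kernel messages around a read-only remount or I/O errors
journalctl -k | grep -iE 'remount|read-only|i/o error'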

The backup failed again tonight and one of my VMs seems to be corrupt.
I have never been able to run a SMART test because the installed SMART tool did not support NVMe drives (at least as of a month ago when I tried; I use a WD Red 500 GB NVMe SSD).
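(In case it helps: from what I have read, newer smartmontools releases do support NVMe, and nvme-cli should work as well; /dev/nvme0 is simply the device path on my machine.)

Code:
# SMART/health data for an NVMe drive; needs a reasonably recent smartmontools
smartctl -a /dev/nvme0
# alternative via nvme-cli (apt install nvme-cli)
nvme smart-log /dev/nvme0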

The log says:
Code:
Aug 24 03:35:00 pve2 kernel: kvm: kvm [3116309]: ignored rdmsr: 0xc0011029 data 0x0
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x7bdfc91000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x364565000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x5e4e2d3000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: dmar_fault: 2123 callbacks suppressed
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x6ddfc81000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x1dd810d000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xe1529000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: dmar_fault: 668 callbacks suppressed
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x19f579f000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x57a8245000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x40d2d69000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:11 pve2 kernel: device-mapper: btree spine: node_check failed: blocknr 0 != wanted 2369
Aug 24 03:35:11 pve2 kernel: device-mapper: block manager: btree_node validator check failed for block 2369
Aug 24 03:35:11 pve2 kernel: device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -15
Aug 24 03:35:12 pve2 pvescheduler[3115269]: ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error
Aug 24 03:35:12 pve2 pvescheduler[3115269]: INFO: Backup job finished with errors
Aug 24 03:35:12 pve2 pvescheduler[3115269]: job errors

The Gotify log says:
Code:
210: 2024-08-24 03:34:51 INFO: Starting Backup of VM 210 (qemu)
210: 2024-08-24 03:34:51 INFO: status = running
210: 2024-08-24 03:34:51 INFO: backup mode: stop
210: 2024-08-24 03:34:51 INFO: ionice priority: 7
210: 2024-08-24 03:34:51 INFO: VM Name: PiHole2
210: 2024-08-24 03:34:51 INFO: include disk 'scsi0' 'local-lvm:vm-210-disk-0' 10G
210: 2024-08-24 03:34:51 INFO: stopping virtual guest
210: 2024-08-24 03:34:53 INFO: creating vzdump archive '/mnt/bu_ssd/dump/vzdump-qemu-210-2024_08_24-03_34_51.vma.zst'
210: 2024-08-24 03:34:53 INFO: starting kvm to execute backup task
210: 2024-08-24 03:34:54 INFO: started backup task '40b95e10-168f-490c-a7e5-af29c443c27a'
210: 2024-08-24 03:34:54 INFO: resuming VM again after 3 seconds
210: 2024-08-24 03:34:57 INFO:   7% (783.0 MiB of 10.0 GiB) in 3s, read: 261.0 MiB/s, write: 205.9 MiB/s
210: 2024-08-24 03:35:00 INFO:  13% (1.4 GiB of 10.0 GiB) in 6s, read: 205.4 MiB/s, write: 181.1 MiB/s
210: 2024-08-24 03:35:03 INFO:  19% (2.0 GiB of 10.0 GiB) in 9s, read: 201.8 MiB/s, write: 174.2 MiB/s
210: 2024-08-24 03:35:06 INFO:  27% (2.7 GiB of 10.0 GiB) in 12s, read: 253.8 MiB/s, write: 213.0 MiB/s
210: 2024-08-24 03:35:09 INFO:  32% (3.3 GiB of 10.0 GiB) in 15s, read: 201.8 MiB/s, write: 175.6 MiB/s
210: 2024-08-24 03:35:12 INFO:  50% (5.0 GiB of 10.0 GiB) in 18s, read: 598.2 MiB/s, write: 79.5 MiB/s
210: 2024-08-24 03:35:12 ERROR: job failed with err -5 - Input/output error
210: 2024-08-24 03:35:12 INFO: aborting backup job
210: 2024-08-24 03:35:12 INFO: resuming VM again
210: 2024-08-24 03:35:12 ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error

I have no idea why this keeps happening, and I don't know how to debug it further.
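From searching around, the device-mapper messages look like LVM-thin metadata trouble, so I assume a check along these lines would be the next step, though I would appreciate confirmation (pve/data is only my guess at the default thin pool name):

Code:
# overview including the hidden _tmeta/_tdata volumes and metadata usage (Meta%)
lvs -a
# offline metadata repair; all guests on the pool must be stopped
# and the pool deactivated first
lvconvert --repair pve/data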

Can anyone help me, please?
 
Hello,

Same issue here. For the last two months or so, I have also sometimes been getting the same error:

Code:
2024-10-27T11:55:54.928977+01:00 myproxmoxhost1 pvedaemon[1155651]: ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error
2024-10-27T11:55:54.929808+01:00 myproxmoxhost1 pvedaemon[1155651]: INFO: Backup job finished with errors
2024-10-27T11:55:54.986583+01:00 myproxmoxhost1 perl[1155651]: notified via target `mail-to-root`
2024-10-27T11:55:54.987231+01:00 myproxmoxhost1 pvedaemon[1155651]: job errors

But I cannot find anything suspicious around that time in the syslog (neither on the Proxmox host running this VM nor on the host running the PBS VM). My ZFS file system also reports no errors.
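To be precise, this is all I checked on the ZFS side:

Code:
# pool health plus per-device read/write/checksum error counters
zpool status -v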

My first questions are:
  • Where should I search for the input/output error? On the side of the VM being backed up, or on the backup target (PBS in my case)?
  • Any advice on how to narrow down the problem?
At least, rebooting the VM did not solve anything. The issue is easily reproducible and still present at the moment.
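For reference, I can trigger it on demand with a manual run like this (the storage name mypbs stands in for my actual PBS storage entry):

Code:
# manual backup of VM 210 to the PBS storage; fails the same way every time
vzdump 210 --storage mypbs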

On the PBS side, the syslog shows:

Code:
2024-10-27T12:11:55.039209+01:00 pbshost proxmox-backup-proxy[154]: starting new backup on datastore 'mybackup' from ::ffff:10.16.113.249: "vm/210/2024-10-27T11:11:54Z"
2024-10-27T12:11:55.039720+01:00 pbshost proxmox-backup-proxy[154]: download 'index.json.blob' from previous backup.
2024-10-27T12:11:55.041354+01:00 pbshost proxmox-backup-proxy[154]: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-10-27T12:11:55.068576+01:00 pbshost proxmox-backup-proxy[154]: download 'drive-scsi0.img.fidx' from previous backup.
2024-10-27T12:11:55.386476+01:00 pbshost proxmox-backup-proxy[154]: created new fixed index 1 ("vm/210/2024-10-27T11:11:54Z/drive-scsi0.img.fidx")
2024-10-27T12:11:55.453931+01:00 pbshost proxmox-backup-proxy[154]: add blob "/mnt/pbsbackups/vm/210/2024-10-27T11:11:54Z/qemu-server.conf.blob" (410 bytes, comp: 410)
2024-10-27T12:12:57.350714+01:00 pbshost proxmox-backup-proxy[154]: backup ended and finish failed: backup ended but finished flag is not set.
2024-10-27T12:12:57.350966+01:00 pbshost proxmox-backup-proxy[154]: removing unfinished backup
2024-10-27T12:12:57.351046+01:00 pbshost proxmox-backup-proxy[154]: TASK ERROR: backup ended but finished flag is not set.
2024-10-27T12:12:57.351145+01:00 pbshost proxmox-backup-proxy[154]: removing backup snapshot "/mnt/pbsbackups/vm/210/2024-10-27T11:11:54Z"
2024-10-27T12:12:57.355495+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: stream closed because of a broken pipe
2024-10-27T12:12:57.363306+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: backup already marked as finished.
2024-10-27T12:12:57.374639+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: backup already marked as finished.

I'm using the latest versions of PVE and PBS everywhere.
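To be precise, "latest" meaning whatever these report on each host:

Code:
# on the PVE node
pveversion -v
# on the PBS host
proxmox-backup-manager versions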

BR,
Jens
 
I had a similar issue where the backup failed because the backup storage had no space (not PBS, just a regular drive mounted over NFS), and the VM filesystem (ext4) was then left in an invalid state and went read-only.
I had to fsck it and reboot, and this has happened twice in the last two weeks.
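Roughly what the recovery looked like inside the guest (the device name /dev/vda1 is only an example from my setup):

Code:
# from a rescue/live environment, with the filesystem unmounted
e2fsck -f -y /dev/vda1   # force a full check and auto-fix what it finds
# then reboot the guest normally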

Edit: I cleared some space on the backup location and tried re-running the backup of the system, but it started exhibiting similar behavior: it had done only 4% of the backup 20 minutes in, reading at 20 Mbit/s from an enterprise SSD in hardware RAID 1. At that point I stopped the backup to avoid any filesystem damage again.
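For anyone who wants to watch the disk throughput during a run, I believe something like this works (iostat comes with the sysstat package):

Code:
# extended per-device statistics every 5 seconds; watch rMB/s, wMB/s and %util
iostat -xm 5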
 
