Backup error. Disk error? Frequent issues but never a clear reason

Iacov

hey

I have two PVE machines (pve1 and pve2).
pve1 is rock solid... pve2 has often had backup errors, usually because the disk switched into read-only mode for no apparent reason.

The backup failed again tonight and one of my VMs seems to be corrupt.
I have never been able to run a SMART test because the currently installed SMART tool did not support NVMe drives (at least a month ago when I tried; I use a WD Red 500 GB NVMe SSD).
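For reference, this is roughly what I plan to run once a SMART tool with NVMe support is installed (just a sketch, assuming the drive shows up as /dev/nvme0):
Code:
# install NVMe-capable tooling (Debian/PVE)
apt install smartmontools nvme-cli

# SMART/health overview for the NVMe drive
smartctl -a /dev/nvme0

# NVMe-native health log (media errors, temperature, spare capacity)
nvme smart-log /dev/nvme0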

the log says
Code:
Aug 24 03:35:00 pve2 kernel: kvm: kvm [3116309]: ignored rdmsr: 0xc0011029 data 0x0
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x7bdfc91000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x364565000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:00 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x5e4e2d3000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:00 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: dmar_fault: 2123 callbacks suppressed
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x6ddfc81000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x1dd810d000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:05 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xe1529000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:05 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: dmar_fault: 668 callbacks suppressed
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x19f579f000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x57a8245000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:10 pve2 kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0x40d2d69000 [fault reason 0x05] PTE Write access is not set
Aug 24 03:35:10 pve2 kernel: DMAR: DRHD: handling fault status reg 3
Aug 24 03:35:11 pve2 kernel: device-mapper: btree spine: node_check failed: blocknr 0 != wanted 2369
Aug 24 03:35:11 pve2 kernel: device-mapper: block manager: btree_node validator check failed for block 2369
Aug 24 03:35:11 pve2 kernel: device-mapper: thin: process_cell: dm_thin_find_block() failed: error = -15
Aug 24 03:35:12 pve2 pvescheduler[3115269]: ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error
Aug 24 03:35:12 pve2 pvescheduler[3115269]: INFO: Backup job finished with errors
Aug 24 03:35:12 pve2 pvescheduler[3115269]: job errors

The Gotify log says:
Code:
210: 2024-08-24 03:34:51 INFO: Starting Backup of VM 210 (qemu)
210: 2024-08-24 03:34:51 INFO: status = running
210: 2024-08-24 03:34:51 INFO: backup mode: stop
210: 2024-08-24 03:34:51 INFO: ionice priority: 7
210: 2024-08-24 03:34:51 INFO: VM Name: PiHole2
210: 2024-08-24 03:34:51 INFO: include disk 'scsi0' 'local-lvm:vm-210-disk-0' 10G
210: 2024-08-24 03:34:51 INFO: stopping virtual guest
210: 2024-08-24 03:34:53 INFO: creating vzdump archive '/mnt/bu_ssd/dump/vzdump-qemu-210-2024_08_24-03_34_51.vma.zst'
210: 2024-08-24 03:34:53 INFO: starting kvm to execute backup task
210: 2024-08-24 03:34:54 INFO: started backup task '40b95e10-168f-490c-a7e5-af29c443c27a'
210: 2024-08-24 03:34:54 INFO: resuming VM again after 3 seconds
210: 2024-08-24 03:34:57 INFO:   7% (783.0 MiB of 10.0 GiB) in 3s, read: 261.0 MiB/s, write: 205.9 MiB/s
210: 2024-08-24 03:35:00 INFO:  13% (1.4 GiB of 10.0 GiB) in 6s, read: 205.4 MiB/s, write: 181.1 MiB/s
210: 2024-08-24 03:35:03 INFO:  19% (2.0 GiB of 10.0 GiB) in 9s, read: 201.8 MiB/s, write: 174.2 MiB/s
210: 2024-08-24 03:35:06 INFO:  27% (2.7 GiB of 10.0 GiB) in 12s, read: 253.8 MiB/s, write: 213.0 MiB/s
210: 2024-08-24 03:35:09 INFO:  32% (3.3 GiB of 10.0 GiB) in 15s, read: 201.8 MiB/s, write: 175.6 MiB/s
210: 2024-08-24 03:35:12 INFO:  50% (5.0 GiB of 10.0 GiB) in 18s, read: 598.2 MiB/s, write: 79.5 MiB/s
210: 2024-08-24 03:35:12 ERROR: job failed with err -5 - Input/output error
210: 2024-08-24 03:35:12 INFO: aborting backup job
210: 2024-08-24 03:35:12 INFO: resuming VM again
210: 2024-08-24 03:35:12 ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error

I have no idea why this keeps happening and I don't know how to debug it further.

Can anyone help me, please?
 
Hello,

Same issue here. For the last two months or so I have occasionally been getting the same error:

Code:
2024-10-27T11:55:54.928977+01:00 myproxmoxhost1 pvedaemon[1155651]: ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error
2024-10-27T11:55:54.929808+01:00 myproxmoxhost1 pvedaemon[1155651]: INFO: Backup job finished with errors
2024-10-27T11:55:54.986583+01:00 myproxmoxhost1 perl[1155651]: notified via target `mail-to-root`
2024-10-27T11:55:54.987231+01:00 myproxmoxhost1 pvedaemon[1155651]: job errors

But I cannot find anything suspicious around that time in the syslog (neither on the Proxmox host running this VM nor on the host running the PBS VM). My ZFS file system also reports no errors.

My first questions are:
  • Where should I search for the input/output error? On the side of the VM being backed up or on the backup target (PBS in my case)?
  • Any advice on how to narrow down the problem?
A reboot of the VM did not solve anything. The issue is easily reproducible and present at the moment.

On the PBS side the syslog shows:

Code:
2024-10-27T12:11:55.039209+01:00 pbshost proxmox-backup-proxy[154]: starting new backup on datastore 'mybackup' from ::ffff:10.16.113.249: "vm/210/2024-10-27T11:11:54Z"
2024-10-27T12:11:55.039720+01:00 pbshost proxmox-backup-proxy[154]: download 'index.json.blob' from previous backup.
2024-10-27T12:11:55.041354+01:00 pbshost proxmox-backup-proxy[154]: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-10-27T12:11:55.068576+01:00 pbshost proxmox-backup-proxy[154]: download 'drive-scsi0.img.fidx' from previous backup.
2024-10-27T12:11:55.386476+01:00 pbshost proxmox-backup-proxy[154]: created new fixed index 1 ("vm/210/2024-10-27T11:11:54Z/drive-scsi0.img.fidx")
2024-10-27T12:11:55.453931+01:00 pbshost proxmox-backup-proxy[154]: add blob "/mnt/pbsbackups/vm/210/2024-10-27T11:11:54Z/qemu-server.conf.blob" (410 bytes, comp: 410)
2024-10-27T12:12:57.350714+01:00 pbshost proxmox-backup-proxy[154]: backup ended and finish failed: backup ended but finished flag is not set.
2024-10-27T12:12:57.350966+01:00 pbshost proxmox-backup-proxy[154]: removing unfinished backup
2024-10-27T12:12:57.351046+01:00 pbshost proxmox-backup-proxy[154]: TASK ERROR: backup ended but finished flag is not set.
2024-10-27T12:12:57.351145+01:00 pbshost proxmox-backup-proxy[154]: removing backup snapshot "/mnt/pbsbackups/vm/210/2024-10-27T11:11:54Z"
2024-10-27T12:12:57.355495+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: error reading a body from connection: stream closed because of a broken pipe
2024-10-27T12:12:57.363306+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: backup already marked as finished.
2024-10-27T12:12:57.374639+01:00 pbshost proxmox-backup-proxy[154]: POST /fixed_chunk: 400 Bad Request: backup already marked as finished.

I'm using the latest versions of PVE and PBS everywhere.

BR,
Jens
 
I had a similar issue where the backup failed when the backup storage ran out of space (not PBS, just a regular drive mounted over NFS), and the VM's file system (ext4) was then left in an invalid state and went read-only.
I had to fsck and reboot it, and this has happened twice in the last two weeks.
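In case it helps anyone hitting the same thing, this is roughly how I force a full check of the guest file system on the next boot (a sketch only, assuming an ext4 root on /dev/sda1 and a systemd-based distro inside the VM):
Code:
# option 1: boot once with these kernel parameters (e.g. edit the entry in the GRUB menu)
#   fsck.mode=force fsck.repair=yes

# option 2: lower the max mount count so e2fsck runs at the next mount, then reboot
tune2fs -c 1 /dev/sda1
reboot
# afterwards, disable the per-mount check again
tune2fs -c -1 /dev/sda1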

Edit: I cleared some space on the backup location and tried re-running the backup, but it started exhibiting similar behavior: only 4% of the backup done after 20 minutes, reading at 20 Mbit/s from an enterprise SSD in hardware RAID 1. At that point I stopped the backup to avoid any further file system damage.
 
I have tried running the job in Stop mode instead of Snapshot mode but it made no difference:

Code:
INFO:   0% (1.1 GiB of 1.2 TiB) in 3s, read: 385.5 MiB/s, write: 372.1 MiB/s
INFO:   1% (12.8 GiB of 1.2 TiB) in 33s, read: 398.1 MiB/s, write: 390.2 MiB/s
INFO:   2% (26.1 GiB of 1.2 TiB) in 1m 4s, read: 440.1 MiB/s, write: 424.5 MiB/s
INFO:   3% (38.2 GiB of 1.2 TiB) in 3m 34s, read: 82.7 MiB/s, write: 82.1 MiB/s

And now it has just been stuck at 3% for 15 minutes. iotop on the node shows absolutely nothing is going on, and there are no errors in the log.
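For anyone retracing this, these are the things I was watching on the node while the job was running (a sketch; iotop needs to be installed, and the unit names are the standard PVE ones):
Code:
iotop -o                                      # only show processes that are actually doing I/O
dmesg -wT                                     # follow kernel messages (DMAR, dm-thin, disk errors)
journalctl -f -u pvescheduler -u pvedaemon    # follow the backup job's own log lines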
 
Hi,

At the moment my problem is persistent: the ZFS file system on the source Proxmox PVE host is fine (zpool scrub rpool reports no errors). The ZFS file system on the target Proxmox PVE host (with the PBS LXC) is also fine. Both the source VM (to be backed up) and the PBS VM have been rebooted.

The ext4 file system of the source VM was checked during reboot and did not report any errors.

But: a backup always fails at 70% with an I/O error.

Any advice?

BR, Jens
 
Hi,
Hi,

At the moment my problem is persistent: the ZFS file system on the source Proxmox PVE host is fine (zpool scrub rpool reports no errors). The ZFS file system on the target Proxmox PVE host (with the PBS LXC) is also fine. Both the source VM (to be backed up) and the PBS VM have been rebooted.

The ext4 file system of the source VM was checked during reboot and did not report any errors.

But: a backup always fails at 70% with an I/O error.

Any advice?

BR, Jens
Does reading the ZFS volume from start to finish work? E.g. you can use `pvesm path <storage>:vm-<ID>-disk-<N>` to get the path and then `dd if=/path/to/volume of=/dev/null bs=1M`. If not, you might want to use `qemu-img convert` with the `--salvage` option or similar to save the readable parts.
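Put together, a minimal sketch (the storage name, VM ID and the resolved zvol path below are only examples, use whatever `pvesm path` prints for your setup):
Code:
# resolve the volume to a device path
pvesm path local-zfs:vm-210-disk-0
# e.g. /dev/zvol/rpool/data/vm-210-disk-0

# read the whole volume once; a failing sector will surface as an I/O error here
dd if=/dev/zvol/rpool/data/vm-210-disk-0 of=/dev/null bs=1M status=progress

# if reads fail, copy out whatever is still readable
qemu-img convert --salvage -O raw /dev/zvol/rpool/data/vm-210-disk-0 /path/to/rescue.raw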
 
Hi,

Reading the ZFS volume from start to finish did work as advised. And the problem disappears by itself in some cases.

Example from today: the scheduled backup job failed tonight. This morning I checked the mounted drive within the VM (there are five drives mounted) with `fsck /dev/sdb` without any error reports. Trying to back up the VM afterwards worked again. Also `zpool scrub rpool` did not show any issues.

Are you sure there is no problem on PVE/PBS somewhere causing this error message?

BR, Jens
 
Are you sure there is no problem on PVE/PBS somewhere causing this error message?
No, I can't be sure. I'd just expect many more reports if it was a software issue. Sometimes disk errors are transient (depending on load/temperature/etc.).
 
