Consistent Backup Failure with just one Virtual Machine

alanjohnwilliams

New Member
Feb 2, 2022
4
0
1
53
I'm running a home lab single instance proxmox server with 6 containers and 3 VMs. PVE 7.4-16. Over the last month one of my VMs (homeassistant) is consistently failing backup with:
INFO: 36% (72.7 GiB of 200.0 GiB) in 7m 39s, read: 127.0 MiB/s, write: 126.9 MiB/s
ERROR: job failed with err -5 - Input/output error

This occurs at the same "36%" checkpoint every time and regardless of storage target (local, NAS, PBS). I'm not seeing anything of note to help troubleshoot in the system logs:
Jul 27 10:04:57 pve zed[1355831]: eid=72 class=data pool='rpool' priority=0 err=52 flags=0x808881 bookmark=70744:1:0:9527277
Jul 27 10:04:58 pve kernel: zd48: p1 p2 p3 p4 p5 p6 p7 p8
Jul 27 10:04:58 pve kernel: vmbr0v100: port 2(tap101i0) entered disabled state
Jul 27 10:04:58 pve qmeventd[1963]: read: Connection reset by peer
Jul 27 10:04:58 pve systemd[1]: 101.scope: Succeeded.
Jul 27 10:04:58 pve systemd[1]: 101.scope: Consumed 1min 10.577s CPU time.
Jul 27 10:04:59 pve pvedaemon[619978]: ERROR: Backup of VM 101 failed - job failed with err -5 - Input/output error
Jul 27 10:04:59 pve pvedaemon[619978]: INFO: Backup job finished with errors
Jul 27 10:04:59 pve pvedaemon[619978]: job errors
Jul 27 10:04:59 pve pvedaemon[615168]: <root@pam> end task UPID:pve:000975CA:0081BC9B:64C293DE:vzdump:101:root@pam: job errors

The backup job also fails if the VM is fully stopped.

Any suggestions on where to look for troubleshooting?
 
Here is the full backup log.

root@pve:/var/log/pve/tasks# cat ./3/UPID:pve:00290B24:08727564:64D6E7A3:vzdump::root@pam:
INFO: starting new backup job: vzdump --quiet 1 --mode snapshot --storage local --mailnotification failure --notes-template '{{guestname}}' --prune-backups 'keep-last=2,keep-weekly=2' --all 1 --mailto alan+pve@thewilliamsfamily.org --compress zstd
INFO: filesystem type on dumpdir is 'zfs' -using /var/tmp/vzdumptmp2689828_100 for temporary files
INFO: Starting Backup of VM 100 (lxc)
INFO: Backup started at 2023-08-11 20:00:03
INFO: status = running
INFO: CT Name: docker-mcs
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-lxc-100-2023_08_11-20_00_03.tar.zst'
INFO: Total bytes written: 43493601280 (41GiB, 71MiB/s)
INFO: archive file size: 19.43GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=2, keep-weekly=2
INFO: removing backup 'local:backup/vzdump-lxc-100-2023_08_09-20_00_03.tar.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: cleanup temporary 'vzdump' snapshot
INFO: Finished Backup of VM 100 (00:09:54)
INFO: Backup finished at 2023-08-11 20:09:57
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2023-08-11 20:09:57
INFO: status = running
INFO: VM Name: homeassistant
INFO: include disk 'scsi0' 'local-zfs:vm-101-disk-1' 200G
INFO: include disk 'efidisk0' 'local-zfs:vm-101-disk-0' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-101-2023_08_11-20_09_57.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '4e6f4dc4-6b3f-435c-853d-22e9952a07f2'
INFO: resuming VM again
INFO: 0% (888.1 MiB of 200.0 GiB) in 3s, read: 296.0 MiB/s, write: 219.8 MiB/s
INFO: 1% (2.0 GiB of 200.0 GiB) in 9s, read: 197.6 MiB/s, write: 159.9 MiB/s
INFO: 2% (4.0 GiB of 200.0 GiB) in 20s, read: 186.2 MiB/s, write: 145.3 MiB/s
INFO: 3% (6.1 GiB of 200.0 GiB) in 35s, read: 139.2 MiB/s, write: 135.4 MiB/s
INFO: 4% (8.2 GiB of 200.0 GiB) in 49s, read: 154.3 MiB/s, write: 147.4 MiB/s
INFO: 5% (10.0 GiB of 200.0 GiB) in 1m 2s, read: 144.7 MiB/s, write: 137.0 MiB/s
INFO: 6% (12.1 GiB of 200.0 GiB) in 1m 16s, read: 149.9 MiB/s, write: 125.3 MiB/s
INFO: 7% (14.1 GiB of 200.0 GiB) in 1m 31s, read: 135.8 MiB/s, write: 130.4 MiB/s
INFO: 8% (16.0 GiB of 200.0 GiB) in 1m 43s, read: 169.8 MiB/s, write: 160.4 MiB/s
INFO: 9% (18.1 GiB of 200.0 GiB) in 1m 52s, read: 239.2 MiB/s, write: 139.6 MiB/s
INFO: 10% (20.0 GiB of 200.0 GiB) in 2m 4s, read: 158.6 MiB/s, write: 136.4 MiB/s
INFO: 11% (22.1 GiB of 200.0 GiB) in 2m 22s, read: 116.8 MiB/s, write: 108.3 MiB/s
INFO: 12% (24.1 GiB of 200.0 GiB) in 2m 38s, read: 132.8 MiB/s, write: 128.1 MiB/s
INFO: 13% (26.2 GiB of 200.0 GiB) in 2m 47s, read: 237.2 MiB/s, write: 226.7 MiB/s
INFO: 14% (28.0 GiB of 200.0 GiB) in 2m 55s, read: 232.5 MiB/s, write: 220.7 MiB/s
INFO: 15% (30.1 GiB of 200.0 GiB) in 3m 8s, read: 160.3 MiB/s, write: 153.1 MiB/s
INFO: 16% (32.0 GiB of 200.0 GiB) in 3m 20s, read: 168.3 MiB/s, write: 160.4 MiB/s
INFO: 17% (34.1 GiB of 200.0 GiB) in 3m 41s, read: 102.0 MiB/s, write: 100.2 MiB/s
INFO: 18% (36.0 GiB of 200.0 GiB) in 3m 56s, read: 128.1 MiB/s, write: 125.1 MiB/s
INFO: 19% (38.1 GiB of 200.0 GiB) in 4m 10s, read: 150.4 MiB/s, write: 146.8 MiB/s
INFO: 20% (40.1 GiB of 200.0 GiB) in 4m 24s, read: 149.7 MiB/s, write: 146.7 MiB/s
INFO: 21% (42.0 GiB of 200.0 GiB) in 4m 41s, read: 115.1 MiB/s, write: 113.6 MiB/s
INFO: 22% (44.1 GiB of 200.0 GiB) in 4m 56s, read: 144.0 MiB/s, write: 139.9 MiB/s
INFO: 23% (46.0 GiB of 200.0 GiB) in 5m 11s, read: 131.0 MiB/s, write: 129.2 MiB/s
INFO: 24% (48.1 GiB of 200.0 GiB) in 5m 28s, read: 123.7 MiB/s, write: 120.7 MiB/s
INFO: 25% (50.0 GiB of 200.0 GiB) in 5m 43s, read: 130.3 MiB/s, write: 123.6 MiB/s
INFO: 26% (52.1 GiB of 200.0 GiB) in 6m 2s, read: 110.7 MiB/s, write: 105.5 MiB/s
INFO: 27% (54.1 GiB of 200.0 GiB) in 6m 17s, read: 136.1 MiB/s, write: 130.0 MiB/s
INFO: 28% (56.1 GiB of 200.0 GiB) in 6m 33s, read: 132.0 MiB/s, write: 126.0 MiB/s
INFO: 29% (58.1 GiB of 200.0 GiB) in 6m 48s, read: 133.6 MiB/s, write: 127.2 MiB/s
INFO: 30% (60.0 GiB of 200.0 GiB) in 7m 8s, read: 98.4 MiB/s, write: 93.4 MiB/s
INFO: 31% (62.0 GiB of 200.0 GiB) in 7m 22s, read: 148.9 MiB/s, write: 142.1 MiB/s
INFO: 32% (64.1 GiB of 200.0 GiB) in 7m 31s, read: 232.0 MiB/s, write: 221.0 MiB/s
INFO: 33% (66.1 GiB of 200.0 GiB) in 7m 45s, read: 147.6 MiB/s, write: 140.6 MiB/s
INFO: 34% (68.0 GiB of 200.0 GiB) in 8m, read: 131.7 MiB/s, write: 125.1 MiB/s
INFO: 35% (70.1 GiB of 200.0 GiB) in 8m 16s, read: 133.2 MiB/s, write: 126.6 MiB/s
INFO: 36% (72.1 GiB of 200.0 GiB) in 8m 26s, read: 199.4 MiB/s, write: 189.4 MiB/s
INFO: 36% (72.7 GiB of 200.0 GiB) in 8m 32s, read: 110.2 MiB/s, write: 110.1 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 101 failed - job failed with err -5 - Input/output error
INFO: Failed at 2023-08-11 20:18:29

... logs for other VMs and Containers ...

INFO: Backup finished at 2023-08-11 21:02:16
INFO: Backup job finished with errors
TASK ERROR: job errors
 
pool status -v shows:

root@pve:~# zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:54:06 with 1 errors on Sun Jul 9 01:18:07 2023
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-eui.0025385a11906b34-part3 ONLINE 0 0 146
ata-WDC_WDS200T2B0A-00SM50_214202A0013C-part3 ONLINE 0 0 146

errors: Permanent errors have been detected in the following files:

rpool/data/vm-101-disk-1:<0x1>

So looks like I have a zfs issue. I have an older backup of the VM and it looks like I can restore that to a new VM and hopefully remove the corrupt vm-101-disk-1. Before doing that I'm running badblocks on both drives.

Smart info on both drives looks fine (running mirrored).

root@pve:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: WD Blue / Red / Green SSDs
Device Model: WDC WDS200T2B0A-00SM50
Serial Number: 214202A0013C
LU WWN Device Id: 5 001b44 8bc7320d9
Firmware Version: 415020WD
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Aug 12 13:30:14 2023 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 --- Old_age Always - 11546
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 17
165 Block_Erase_Count 0x0032 100 100 --- Old_age Always - 7317268823244
166 Minimum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 267
167 Max_Bad_Blocks_per_Die 0x0032 100 100 --- Old_age Always - 173
168 Maximum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 496
169 Total_Bad_Blocks 0x0032 100 100 --- Old_age Always - 1021
170 Grown_Bad_Blocks 0x0032 100 100 --- Old_age Always - 0
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Average_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 334
174 Unexpected_Power_Loss 0x0032 100 100 --- Old_age Always - 8
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 033 076 --- Old_age Always - 67 (Min/Max 19/76)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
230 Media_Wearout_Indicator 0x0032 063 063 --- Old_age Always - 0x3f5921283f59
232 Available_Reservd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 NAND_GB_Written_TLC 0x0032 100 100 --- Old_age Always - 690973
234 NAND_GB_Written_SLC 0x0032 100 100 --- Old_age Always - 712819
241 Host_Writes_GiB 0x0030 253 253 --- Old_age Offline - 97681
242 Host_Reads_GiB 0x0030 253 253 --- Old_age Offline - 131409
244 Temp_Throttle_Status 0x0032 000 100 --- Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

root@pve:~# smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNM0RA08955F
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,993,407,467,520 [1.99 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5a11906b34
Local Time is: Sat Aug 12 13:31:15 2023 MDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 79 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 20%
Data Units Read: 791,216,367 [405 TB]
Data Units Written: 506,840,965 [259 TB]
Host Read Commands: 15,723,492,940
Host Write Commands: 6,355,848,820
Controller Busy Time: 22,541
Power Cycles: 22
Power On Hours: 6,297
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 100
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 79 Celsius
Temperature Sensor 2: 85 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 100 0 0x001d 0x4004 - 0 0 -