Consistent Backup Failure with just one Virtual Machine

alanjohnwilliams

New Member
Feb 2, 2022
4
0
1
52
I'm running a home lab single instance proxmox server with 6 containers and 3 VMs. PVE 7.4-16. Over the last month one of my VMs (homeassistant) is consistently failing backup with:
INFO: 36% (72.7 GiB of 200.0 GiB) in 7m 39s, read: 127.0 MiB/s, write: 126.9 MiB/s
ERROR: job failed with err -5 - Input/output error

This occurs at the same "36%" checkpoint every time and regardless of storage target (local, NAS, PBS). I'm not seeing anything of note to help troubleshoot in the system logs:
Jul 27 10:04:57 pve zed[1355831]: eid=72 class=data pool='rpool' priority=0 err=52 flags=0x808881 bookmark=70744:1:0:9527277
Jul 27 10:04:58 pve kernel: zd48: p1 p2 p3 p4 p5 p6 p7 p8
Jul 27 10:04:58 pve kernel: vmbr0v100: port 2(tap101i0) entered disabled state
Jul 27 10:04:58 pve qmeventd[1963]: read: Connection reset by peer
Jul 27 10:04:58 pve systemd[1]: 101.scope: Succeeded.
Jul 27 10:04:58 pve systemd[1]: 101.scope: Consumed 1min 10.577s CPU time.
Jul 27 10:04:59 pve pvedaemon[619978]: ERROR: Backup of VM 101 failed - job failed with err -5 - Input/output error
Jul 27 10:04:59 pve pvedaemon[619978]: INFO: Backup job finished with errors
Jul 27 10:04:59 pve pvedaemon[619978]: job errors
Jul 27 10:04:59 pve pvedaemon[615168]: <root@pam> end task UPID:pve:000975CA:0081BC9B:64C293DE:vzdump:101:root@pam: job errors

The backup job also fails if the VM is fully stopped.

Any suggestions on where to look for troubleshooting?
 
Here is the full backup log.

root@pve:/var/log/pve/tasks# cat ./3/UPID:pve:00290B24:08727564:64D6E7A3:vzdump::root@pam:
INFO: starting new backup job: vzdump --quiet 1 --mode snapshot --storage local --mailnotification failure --notes-template '{{guestname}}' --prune-backups 'keep-last=2,keep-weekly=2' --all 1 --mailto alan+pve@thewilliamsfamily.org --compress zstd
INFO: filesystem type on dumpdir is 'zfs' -using /var/tmp/vzdumptmp2689828_100 for temporary files
INFO: Starting Backup of VM 100 (lxc)
INFO: Backup started at 2023-08-11 20:00:03
INFO: status = running
INFO: CT Name: docker-mcs
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-lxc-100-2023_08_11-20_00_03.tar.zst'
INFO: Total bytes written: 43493601280 (41GiB, 71MiB/s)
INFO: archive file size: 19.43GB
INFO: adding notes to backup
INFO: prune older backups with retention: keep-last=2, keep-weekly=2
INFO: removing backup 'local:backup/vzdump-lxc-100-2023_08_09-20_00_03.tar.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: cleanup temporary 'vzdump' snapshot
INFO: Finished Backup of VM 100 (00:09:54)
INFO: Backup finished at 2023-08-11 20:09:57
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2023-08-11 20:09:57
INFO: status = running
INFO: VM Name: homeassistant
INFO: include disk 'scsi0' 'local-zfs:vm-101-disk-1' 200G
INFO: include disk 'efidisk0' 'local-zfs:vm-101-disk-0' 1M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/var/lib/vz/dump/vzdump-qemu-101-2023_08_11-20_09_57.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '4e6f4dc4-6b3f-435c-853d-22e9952a07f2'
INFO: resuming VM again
INFO: 0% (888.1 MiB of 200.0 GiB) in 3s, read: 296.0 MiB/s, write: 219.8 MiB/s
INFO: 1% (2.0 GiB of 200.0 GiB) in 9s, read: 197.6 MiB/s, write: 159.9 MiB/s
INFO: 2% (4.0 GiB of 200.0 GiB) in 20s, read: 186.2 MiB/s, write: 145.3 MiB/s
INFO: 3% (6.1 GiB of 200.0 GiB) in 35s, read: 139.2 MiB/s, write: 135.4 MiB/s
INFO: 4% (8.2 GiB of 200.0 GiB) in 49s, read: 154.3 MiB/s, write: 147.4 MiB/s
INFO: 5% (10.0 GiB of 200.0 GiB) in 1m 2s, read: 144.7 MiB/s, write: 137.0 MiB/s
INFO: 6% (12.1 GiB of 200.0 GiB) in 1m 16s, read: 149.9 MiB/s, write: 125.3 MiB/s
INFO: 7% (14.1 GiB of 200.0 GiB) in 1m 31s, read: 135.8 MiB/s, write: 130.4 MiB/s
INFO: 8% (16.0 GiB of 200.0 GiB) in 1m 43s, read: 169.8 MiB/s, write: 160.4 MiB/s
INFO: 9% (18.1 GiB of 200.0 GiB) in 1m 52s, read: 239.2 MiB/s, write: 139.6 MiB/s
INFO: 10% (20.0 GiB of 200.0 GiB) in 2m 4s, read: 158.6 MiB/s, write: 136.4 MiB/s
INFO: 11% (22.1 GiB of 200.0 GiB) in 2m 22s, read: 116.8 MiB/s, write: 108.3 MiB/s
INFO: 12% (24.1 GiB of 200.0 GiB) in 2m 38s, read: 132.8 MiB/s, write: 128.1 MiB/s
INFO: 13% (26.2 GiB of 200.0 GiB) in 2m 47s, read: 237.2 MiB/s, write: 226.7 MiB/s
INFO: 14% (28.0 GiB of 200.0 GiB) in 2m 55s, read: 232.5 MiB/s, write: 220.7 MiB/s
INFO: 15% (30.1 GiB of 200.0 GiB) in 3m 8s, read: 160.3 MiB/s, write: 153.1 MiB/s
INFO: 16% (32.0 GiB of 200.0 GiB) in 3m 20s, read: 168.3 MiB/s, write: 160.4 MiB/s
INFO: 17% (34.1 GiB of 200.0 GiB) in 3m 41s, read: 102.0 MiB/s, write: 100.2 MiB/s
INFO: 18% (36.0 GiB of 200.0 GiB) in 3m 56s, read: 128.1 MiB/s, write: 125.1 MiB/s
INFO: 19% (38.1 GiB of 200.0 GiB) in 4m 10s, read: 150.4 MiB/s, write: 146.8 MiB/s
INFO: 20% (40.1 GiB of 200.0 GiB) in 4m 24s, read: 149.7 MiB/s, write: 146.7 MiB/s
INFO: 21% (42.0 GiB of 200.0 GiB) in 4m 41s, read: 115.1 MiB/s, write: 113.6 MiB/s
INFO: 22% (44.1 GiB of 200.0 GiB) in 4m 56s, read: 144.0 MiB/s, write: 139.9 MiB/s
INFO: 23% (46.0 GiB of 200.0 GiB) in 5m 11s, read: 131.0 MiB/s, write: 129.2 MiB/s
INFO: 24% (48.1 GiB of 200.0 GiB) in 5m 28s, read: 123.7 MiB/s, write: 120.7 MiB/s
INFO: 25% (50.0 GiB of 200.0 GiB) in 5m 43s, read: 130.3 MiB/s, write: 123.6 MiB/s
INFO: 26% (52.1 GiB of 200.0 GiB) in 6m 2s, read: 110.7 MiB/s, write: 105.5 MiB/s
INFO: 27% (54.1 GiB of 200.0 GiB) in 6m 17s, read: 136.1 MiB/s, write: 130.0 MiB/s
INFO: 28% (56.1 GiB of 200.0 GiB) in 6m 33s, read: 132.0 MiB/s, write: 126.0 MiB/s
INFO: 29% (58.1 GiB of 200.0 GiB) in 6m 48s, read: 133.6 MiB/s, write: 127.2 MiB/s
INFO: 30% (60.0 GiB of 200.0 GiB) in 7m 8s, read: 98.4 MiB/s, write: 93.4 MiB/s
INFO: 31% (62.0 GiB of 200.0 GiB) in 7m 22s, read: 148.9 MiB/s, write: 142.1 MiB/s
INFO: 32% (64.1 GiB of 200.0 GiB) in 7m 31s, read: 232.0 MiB/s, write: 221.0 MiB/s
INFO: 33% (66.1 GiB of 200.0 GiB) in 7m 45s, read: 147.6 MiB/s, write: 140.6 MiB/s
INFO: 34% (68.0 GiB of 200.0 GiB) in 8m, read: 131.7 MiB/s, write: 125.1 MiB/s
INFO: 35% (70.1 GiB of 200.0 GiB) in 8m 16s, read: 133.2 MiB/s, write: 126.6 MiB/s
INFO: 36% (72.1 GiB of 200.0 GiB) in 8m 26s, read: 199.4 MiB/s, write: 189.4 MiB/s
INFO: 36% (72.7 GiB of 200.0 GiB) in 8m 32s, read: 110.2 MiB/s, write: 110.1 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 101 failed - job failed with err -5 - Input/output error
INFO: Failed at 2023-08-11 20:18:29

... logs for other VMs and Containers ...

INFO: Backup finished at 2023-08-11 21:02:16
INFO: Backup job finished with errors
TASK ERROR: job errors
 
pool status -v shows:

root@pve:~# zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:54:06 with 1 errors on Sun Jul 9 01:18:07 2023
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-eui.0025385a11906b34-part3 ONLINE 0 0 146
ata-WDC_WDS200T2B0A-00SM50_214202A0013C-part3 ONLINE 0 0 146

errors: Permanent errors have been detected in the following files:

rpool/data/vm-101-disk-1:<0x1>

So looks like I have a zfs issue. I have an older backup of the VM and it looks like I can restore that to a new VM and hopefully remove the corrupt vm-101-disk-1. Before doing that I'm running badblocks on both drives.

Smart info on both drives looks fine (running mirrored).

root@pve:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: WD Blue / Red / Green SSDs
Device Model: WDC WDS200T2B0A-00SM50
Serial Number: 214202A0013C
LU WWN Device Id: 5 001b44 8bc7320d9
Firmware Version: 415020WD
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Aug 12 13:30:14 2023 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 --- Old_age Always - 11546
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 17
165 Block_Erase_Count 0x0032 100 100 --- Old_age Always - 7317268823244
166 Minimum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 267
167 Max_Bad_Blocks_per_Die 0x0032 100 100 --- Old_age Always - 173
168 Maximum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 496
169 Total_Bad_Blocks 0x0032 100 100 --- Old_age Always - 1021
170 Grown_Bad_Blocks 0x0032 100 100 --- Old_age Always - 0
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Average_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 334
174 Unexpected_Power_Loss 0x0032 100 100 --- Old_age Always - 8
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 033 076 --- Old_age Always - 67 (Min/Max 19/76)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
230 Media_Wearout_Indicator 0x0032 063 063 --- Old_age Always - 0x3f5921283f59
232 Available_Reservd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 NAND_GB_Written_TLC 0x0032 100 100 --- Old_age Always - 690973
234 NAND_GB_Written_SLC 0x0032 100 100 --- Old_age Always - 712819
241 Host_Writes_GiB 0x0030 253 253 --- Old_age Offline - 97681
242 Host_Reads_GiB 0x0030 253 253 --- Old_age Offline - 131409
244 Temp_Throttle_Status 0x0032 000 100 --- Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

root@pve:~# smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNM0RA08955F
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,993,407,467,520 [1.99 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5a11906b34
Local Time is: Sat Aug 12 13:31:15 2023 MDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 79 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 20%
Data Units Read: 791,216,367 [405 TB]
Data Units Written: 506,840,965 [259 TB]
Host Read Commands: 15,723,492,940
Host Write Commands: 6,355,848,820
Controller Busy Time: 22,541
Power Cycles: 22
Power On Hours: 6,297
Unsafe Shutdowns: 10
Media and Data Integrity Errors: 0
Error Information Log Entries: 100
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 79 Celsius
Temperature Sensor 2: 85 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 100 0 0x001d 0x4004 - 0 0 -
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!