Backup issues - NVMe drive failing?

RNab

Member
Jun 20, 2021
Hi all,

I hope I can get some help here. Recently, the backup of one of my VMs has been failing at 78% every time with error -125. I checked multiple times, and it doesn't seem to be related to resources (OOM or similar). I restored an old backup, and everything was fine for a month or two, until the same issue started happening again.

In the syslog, after the backup, I see these lines:


Code:
Jul 24 22:41:53 proxmox kernel: [25604.027744] blk_update_request: critical medium error, dev nvme0n1, sector 404283648 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Jul 24 22:41:53 proxmox kernel: [25604.029399] blk_update_request: critical medium error, dev nvme0n1, sector 404283520 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Jul 24 22:41:53 proxmox kernel: [25604.029473] blk_update_request: critical medium error, dev nvme0n1, sector 404283392 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Jul 24 22:41:53 proxmox kernel: [25604.031500] blk_update_request: critical medium error, dev nvme0n1, sector 404327936 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Jul 24 22:41:53 proxmox kernel: [25604.033304] blk_update_request: critical medium error, dev nvme0n1, sector 404327808 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Jul 24 22:41:53 proxmox kernel: [25604.033769] blk_update_request: critical medium error, dev nvme0n1, sector 404361600 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0

Which doesn't sound like good news.

However, when I run SMART from the GUI, I get this:
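
To see how widely the bad reads are spread, the sector numbers can be pulled out of kernel lines like the ones above. This is only a minimal Python sketch: the regular expression is based solely on the lines quoted here, the function names `failing_sectors` and `byte_range` are made up for illustration, and the 512-byte sector unit is an assumption about how the kernel reports these numbers.

```python
import re

# Matches the "critical medium error" lines quoted above.
# The field layout is inferred from those lines only.
LINE_RE = re.compile(
    r"critical medium error, dev (?P<dev>\S+), sector (?P<sector>\d+)"
)


def failing_sectors(log_text):
    """Return a sorted list of unique (device, sector) pairs found in log_text."""
    found = set()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            found.add((match.group("dev"), int(match.group("sector"))))
    return sorted(found)


def byte_range(sectors, sector_size=512):
    """Return (lowest, highest) byte offset for a list of sector numbers.

    sector_size=512 is an assumption (kernel block-layer sector unit).
    """
    return (min(sectors) * sector_size, max(sectors) * sector_size)


if __name__ == "__main__":
    sample = (
        "Jul 24 22:41:53 proxmox kernel: [25604.027744] blk_update_request: "
        "critical medium error, dev nvme0n1, sector 404283648 op 0x0:(READ) "
        "flags 0x0 phys_seg 16 prio class 0\n"
        "Jul 24 22:41:53 proxmox kernel: [25604.033769] blk_update_request: "
        "critical medium error, dev nvme0n1, sector 404361600 op 0x0:(READ) "
        "flags 0x0 phys_seg 9 prio class 0\n"
    )
    pairs = failing_sectors(sample)
    low, high = byte_range([sector for _, sector in pairs])
    print(pairs)
    print(f"affected byte offsets: {low} .. {high}")
```

Run against the full syslog, this would show whether the errors stay in one small region of the disk or are scattered across it.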


Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        52 Celsius
Available Spare:                    95%
Available Spare Threshold:          5%
Percentage Used:                    4%
Data Units Read:                    324,217,443 [165 TB]
Data Units Written:                 59,578,220 [30.5 TB]
Host Read Commands:                 2,711,635,111
Host Write Commands:                2,019,567,071
Controller Busy Time:               26,677
Power Cycles:                       61
Power On Hours:                     18,327
Unsafe Shutdowns:                   16
Media and Data Integrity Errors:    612
Error Information Log Entries:      671
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

I'm fairly new to all this, but it looks reasonably OK?
How can I fix this issue?
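
For reference, here is a rough Python sketch of how the SMART fields above could be checked automatically. The parsing assumes the plain "Name: value" layout of the output pasted above. The three checks (non-zero critical warning, spare at or below its threshold, non-zero media error count) are my own assumptions about what is worth flagging, not an official rule, and the names `parse_smart`, `to_int` and `concerns` are made up for illustration.

```python
def parse_smart(text):
    """Turn 'Name:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def to_int(value):
    """Convert values such as '95%', '612', '0x00' or '59,578,220 [30.5 TB]'."""
    token = value.split()[0].rstrip("%").replace(",", "")
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token)


def concerns(fields):
    """Return a list of fields that look worth a closer look (assumed checks)."""
    found = []
    if to_int(fields.get("Critical Warning", "0")) != 0:
        found.append("Critical Warning is set")
    spare = to_int(fields.get("Available Spare", "100%"))
    threshold = to_int(fields.get("Available Spare Threshold", "0%"))
    if spare <= threshold:
        found.append("Available Spare at or below threshold")
    media_errors = to_int(fields.get("Media and Data Integrity Errors", "0"))
    if media_errors > 0:
        found.append(f"Media and Data Integrity Errors = {media_errors}")
    return found


if __name__ == "__main__":
    report = """\
Critical Warning:                   0x00
Available Spare:                    95%
Available Spare Threshold:          5%
Media and Data Integrity Errors:    612
"""
    for item in concerns(parse_smart(report)):
        print(item)
```

With the values from my drive, the only field this flags is the media error counter, which would fit the read errors in the syslog.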

I've already started (continued) backing up everything in case of a hard failure, but I'm hoping I can still use this drive (it's barely 2 years old).

One thing I just recalled: my Proxmox server suffered one or two hard resets without warning, thanks to my 1-year-old daughter, who thought pressing the blinking button would be fun. The issues may or may not have started after that.

Thanks