PBS crashes SSD

Scrubber6927

Member
Aug 14, 2023
Hello, I hope someone can help me with my problem.
Since I installed version 3.3 of PBS, my backups no longer work.

In the task log, it looks as follows:

Code:
INFO:  33% (343.1 GiB of 1.0 TiB) in 55m 9s, read: 22.0 MiB/s, write: 21.6 MiB/s 
ERROR: job failed with err -5 - Input/output error 
INFO: aborting backup job 
INFO: resuming VM again 
ERROR: Backup of VM 106 failed - job failed with err -5 - Input/output error 
INFO: Failed at 2025-03-17 11:57:37 
INFO: Backup job finished with errors 
TASK ERROR: job errors

The error itself makes sense, because the NVMe SSD where my VMs are stored simply "disappears." The NVMe storage is shown with a ? in PVE and is no longer accessible.

I am using a 13th-generation Intel NUC, and when I restart the system and go directly into the BIOS, the SSD is not detected there either. I have to disconnect the device from power, and only then does the SSD reappear.

There are no SMART errors on the SSD, and with only 2% wear, it is still quite new.

The error is 100% reproducible, no matter which VM I try to back up. At some point during the backup, the error occurs as described.

The NVMe storage is set up with LVM, and all my disk images are "raw."
In the backup settings, it does not matter which method I choose; the error occurs with all of them.
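For reference, this layout can be checked with the standard LVM tools and PVE's storage commands (the storage ID below is only an example):

Code:
# physical volumes, volume groups and logical volumes
pvs
vgs
lvs -a -o +devices

# PVE's own view of the storage
pvesm status
pvesm list local-lvm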

My PBS is running as a VM on the PVE host. However, I have also installed PBS in an LXC container using the community script, and the same issue occurs there. As a backup target, I have tried both NFS and an iSCSI device. Each time, the backup fails.
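For completeness: the NFS target was mounted and registered as a PBS datastore roughly like this (server address, export path and datastore name are placeholders):

Code:
# mount the NFS export on the PBS side
mount -t nfs 192.168.1.50:/export/pbs /mnt/pbs-datastore

# register the mount point as a datastore
proxmox-backup-manager datastore create nfs-store /mnt/pbs-datastore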


I am slowly running out of ideas. Does anyone have any suggestions for me?
 
The error itself makes sense, because the NVMe SSD where my VMs are stored simply "disappears." The NVMe storage is shown with a ? in PVE and is no longer accessible.
However, this sounds unrelated to the upgrade to PBS 3.3 and more like a hardware issue. Check the systemd journal on your PVE host for errors.
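For example:

Code:
# all errors since the last boot
journalctl -p err -b

# or follow the kernel log live while a backup runs
journalctl -kf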

I am using a 13th-generation Intel NUC, and when I restart the system and go directly into the BIOS, the SSD is not detected there either. I have to disconnect the device from power, and only then does the SSD reappear.
Possibly a bad SSD. What type and manufacturer is it? Maybe also check for thermal issues, given the constraints of the NUC enclosure.
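Both can be read directly from the drive, e.g. with smartctl or nvme-cli (the device path is an example):

Code:
smartctl -a /dev/nvme0
# or
nvme smart-log /dev/nvme0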
 
Thanks for the quick response. Here are the answers to your questions:

However, this sounds unrelated to the upgrade to PBS 3.3 and more like a hardware issue. Check the systemd journal on your PVE host for errors.
I used journalctl -p err to get the errors. Here is the output:

Code:
Mar 17 11:57:16 pve01 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Mar 17 11:57:33 pve01 kernel: INFO: task jbd2/dm-6-8:149378 blocked for more than 122 seconds.
Mar 17 11:57:33 pve01 kernel:       Tainted: P           O       6.8.12-8-pve #1
Mar 17 11:57:33 pve01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 11:57:33 pve01 kernel: INFO: task systemd-journal:149494 blocked for more than 122 seconds.
Mar 17 11:57:33 pve01 kernel:       Tainted: P           O       6.8.12-8-pve #1
Mar 17 11:57:33 pve01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 11:57:33 pve01 kernel: INFO: task tokio-runtime-w:149654 blocked for more than 122 seconds.
Mar 17 11:57:33 pve01 kernel:       Tainted: P           O       6.8.12-8-pve #1
Mar 17 11:57:33 pve01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 11:57:36 pve01 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Mar 17 11:57:36 pve01 kernel: I/O error, dev nvme0n1, sector 3777317216 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 20226092
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 9265, lost sync page write
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6): kmmpd:185: comm kmmpd-dm-6: Error writing to MMP block
Mar 17 11:57:36 pve01 kernel: Aborting journal on device dm-6-8.
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10081
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 9513
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 9265, lost sync page write
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 10518528, lost sync page write
Mar 17 11:57:36 pve01 kernel: JBD2: I/O error when updating journal superblock for dm-6-8.
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10176
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10177
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10178
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10179
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10193
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 4227185, lost async page write
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 0, lost async page write
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10202
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): previous I/O error to superblock detected
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on device dm-6, logical block 10220
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 0, lost sync page write
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): I/O error while writing superblock
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 10, lost async page write
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): Delayed block allocation failed for inode 1050829 at logical offset 1438 with max blocks 1 with error 30
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6) in ext4_reserve_inode_write:5771: Journal has aborted
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): This should not happen!! Data will be lost
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6) in ext4_do_writepages:2720: Journal has aborted
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6) in ext4_reserve_inode_write:5771: Journal has aborted
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6): ext4_dirty_inode:5975: inode #1050830: comm tokio-runtime-w: mark_inode_dirty error
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6) in ext4_dirty_inode:5976: Journal has aborted
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6): ext4_journal_check_start:84: comm tokio-runtime-w: Detected aborted journal
Mar 17 11:57:36 pve01 kernel: EXT4-fs error (device dm-6): ext4_dirty_inode:5975: inode #1050107: comm systemd-journal: mark_inode_dirty error
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 0, lost sync page write
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): I/O error while writing superblock
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): Remounting filesystem read-only
Mar 17 11:57:36 pve01 kernel: Buffer I/O error on dev dm-6, logical block 0, lost sync page write
Mar 17 11:57:36 pve01 kernel: EXT4-fs (dm-6): I/O error while writing superblock
Mar 17 11:57:37 pve01 pvedaemon[565203]: ERROR: Backup of VM 106 failed - job failed with err -5 - Input/output error
Mar 17 11:57:37 pve01 pvedaemon[565203]: job errors
Mar 17 11:57:41 pve01 kernel: Buffer I/O error on dev dm-6, logical block 9265, lost sync page write

Possibly a bad SSD. What type and manufacturer is it? Maybe also check for thermal issues, given the constraints of the NUC enclosure.

Hmm, I bought the SSD a few weeks ago. It is a Samsung 990 Pro 4 TB. The temperature seems okay. Output of the SMART values:
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    45,357,631 [23.2 TB]
Data Units Written:                 18,309,821 [9.37 TB]
Host Read Commands:                 310,020,541
Host Write Commands:                384,194,107
Controller Busy Time:               2,073
Power Cycles:                       16
Power On Hours:                     3,245
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               50 Celsius

It may be a coincidence, but this behaviour started right after upgrading to 3.3.
 
Not even 10 TB written and already 2% of the rated endurance used. Definitely a problem with that SSD!
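As a rough sanity check: Samsung rates the 990 Pro 4 TB at about 2,400 TBW, so 9.37 TB written should amount to well under 1% of the rated endurance (9.37 / 2,400 ≈ 0.4%). A reported 2% at that write volume is roughly five times the expected wear, so either the drive is misreporting or it is degrading unusually fast.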
 
Hi guys,

as promised, a little update from my side: changing the SSD resolved the issue. Thanks to all for the help.