NVMe Disappearing

ikecomp

Member
Apr 9, 2020
Hi Guys -

I'm running into a weird issue where every week or two my NVMe drive disappears from Proxmox, and only a reboot brings it back until the next time it drops out. The NVMe drive is mounted as a regular directory on an ext4 filesystem. The only thing on the drive is my Windows 10 disk image (qcow2 format); nothing else uses it. I previously had a smaller, cheaper Sabrent drive in this machine that didn't have this issue, so I'm wondering if it's a compatibility issue with Proxmox or if the drive might just be bad.

SYSLOG
Code:
Sep 27 02:54:01 proxmox systemd[1]: Started Proxmox VE replication runner.
Sep 27 02:54:04 proxmox kernel: [2908335.387198] nvme nvme0: Removing after probe failure status: -19
Sep 27 02:54:04 proxmox kernel: [2908335.407127] blk_update_request: I/O error, dev nvme0n1, sector 262865288 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Sep 27 02:54:04 proxmox kernel: [2908335.407142] blk_update_request: I/O error, dev nvme0n1, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
Sep 27 02:54:04 proxmox kernel: [2908335.407145] Aborting journal on device nvme0n1p1-8.
Sep 27 02:54:04 proxmox kernel: [2908335.407154] JBD2: Error -5 detected when updating journal superblock for nvme0n1p1-8.
Sep 27 02:54:04 proxmox systemd[1]: Stopped target Local File Systems.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24549]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24550]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24551]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24552]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24553]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24554]: umount: /mnt/NVMEDISK: target is busy.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Mount process exited, code=exited, status=32/n/a
Sep 27 02:54:04 proxmox systemd[1]: Failed unmounting /mnt/NVMEDISK.
Sep 27 02:54:04 proxmox systemd[1]: mnt-NVMEDISK.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-d31acd69\x2df784\x2d4ba9\x2dbfef\x2d40036196c03e.device. Stopping, too.
Sep 27 02:54:04 proxmox systemd[1]: Unmounting /mnt/NVMEDISK...
Sep 27 02:54:04 proxmox umount[24555]: umount: /mnt/NVMEDISK: target is busy.

DMESG
Code:
root@proxmox:/var/log# dmesg
[3031575.451463] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031575.451595] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031575.451695] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031584.723394] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031584.723533] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031584.723635] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031595.220270] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031595.220400] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031595.220500] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031605.439413] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031605.439548] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031605.439650] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031614.767457] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031614.767589] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031614.767689] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031625.222601] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031625.222733] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031625.222834] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031635.269611] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031635.269746] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031635.269848] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031644.784342] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
[3031644.784472] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1532: inode #2: comm pvestatd: reading directory lblock 0
 
Additional information

NVMe Smart Info
Code:
root@proxmox:~# smartctl -a /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.44-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDS500G3X0C-00SJG0
Serial Number:                      XXXXXXXXX
Firmware Version:                   111110WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      8215
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b460a9bc2
Local Time is:                      Mon Sep 28 15:09:55 2020 EDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.50W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     4000   10000
 4 -   0.0025W       -        -    4  4  4  4     4000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    8,674,316 [4.44 TB]
Data Units Written:                 3,636,363 [1.86 TB]
Host Read Commands:                 74,444,755
Host Write Commands:                75,063,273
Controller Busy Time:               179
Power Cycles:                       23
Power On Hours:                     3,172
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

Below are the mount options in use and the host build info.

--fstab--
UUID=d31acd69-f784-4ba9-bfef-40036196c03e /mnt/NVMEDISK ext4 defaults 0 0

--Build--
CPU(s) 16 x AMD Ryzen 7 3700X 8-Core Processor (1 Socket)
Kernel Version Linux 5.4.44-2-pve #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200)
PVE Manager Version pve-manager/6.2-10/a20769ed
NVME Drive: WD Black 500 GB

Hopefully it's just something I'm doing wrong.
 
Sep 27 02:54:04 proxmox kernel: [2908335.407127] blk_update_request: I/O error, dev nvme0n1, sector 262865288 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Sep 27 02:54:04 proxmox kernel: [2908335.407142] blk_update_request: I/O error, dev nvme0n1, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
It seems the disk has some problems.
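
If it helps to read those kernel lines, here is a rough Python sketch that decodes them. It assumes the log format shown in this thread (other kernel versions word these messages differently), and the function names are made up for illustration. The block layer reports sectors in 512-byte units, and the negative status codes in the syslog (-19 from the probe failure, -5 from JBD2) are standard Linux errno values.

```python
import errno
import re

# The two kernel messages quoted above, copied from the syslog in this thread.
LINES = [
    "blk_update_request: I/O error, dev nvme0n1, sector 262865288 "
    "op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0",
    "blk_update_request: I/O error, dev nvme0n1, sector 2048 "
    "op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0",
]

# Pattern tailored to the message format seen here (an assumption, not a
# general kernel log parser).
IO_ERR = re.compile(
    r"I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-f]+:\((?P<op>\w+)\)"
)

KERNEL_SECTOR_BYTES = 512  # block-layer sectors are always 512 bytes


def parse_io_error(line):
    """Return device, operation, sector and byte offset, or None if no match."""
    m = IO_ERR.search(line)
    if m is None:
        return None
    sector = int(m.group("sector"))
    return {
        "dev": m.group("dev"),
        "op": m.group("op"),
        "sector": sector,
        "byte_offset": sector * KERNEL_SECTOR_BYTES,
    }


def errno_name(code):
    """Map a negative kernel status such as -19 to its symbolic errno name."""
    return errno.errorcode.get(abs(code), "unknown")


for line in LINES:
    print(parse_io_error(line))

# -19 is from "Removing after probe failure status: -19",
# -5 is from "JBD2: Error -5 detected when updating journal superblock".
print(errno_name(-19), errno_name(-5))
```

The status -19 maps to ENODEV and -5 to EIO, which fits the controller being removed first and the I/O errors following from that.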

Error Information Log Entries: 1

I'd try running a SMART self-test to see if it unearths something.

Also check the output of 'dmesg' for anything related to your NVMe drive.
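
To compare the SMART counters before and after the next dropout, something like the Python sketch below could pull the interesting fields out of the `smartctl -a` text. It is only a sketch: the helper names are hypothetical, the sample text is copied from the output posted earlier in the thread, and it assumes the "Key: value" layout that smartctl 7.1 prints for NVMe devices.

```python
# A few lines copied from the smartctl output posted earlier in the thread.
SAMPLE = """\
Critical Warning:                   0x00
Power Cycles:                       23
Power On Hours:                     3,172
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
"""

# Counters worth watching between dropouts (this selection is a judgment call).
WATCH = [
    "Critical Warning",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
    "Unsafe Shutdowns",
    "Power Cycles",
]


def parse_smart_fields(text):
    """Split 'Key:   value' lines into a dict; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def as_int(value):
    """Convert values like '3,172' or '0x00' to an int."""
    token = value.split()[0].replace(",", "")
    return int(token, 0)  # base 0 accepts both decimal and 0x-prefixed hex


fields = parse_smart_fields(SAMPLE)
summary = {key: as_int(fields[key]) for key in WATCH}
print(summary)

# To run against the real drive instead of SAMPLE, the text could come from:
#   subprocess.run(["smartctl", "-a", "/dev/nvme0n1"],
#                  capture_output=True, text=True).stdout
```

Saving the printed summary after each reboot would show whether the unsafe shutdown count (12 out of 23 power cycles in the output above) or the error log entry count goes up with each disappearance.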
 
I also had some problems with an NVMe drive, an A-DATA XPG.
It caused complete freezes at random times without anything in the log; a reboot fixed it temporarily (it also disappeared from the BIOS a few times).

Switched to Samsung and all problems went away.