Lexar NM790 2TB

senkis

New Member
Jul 22, 2023
Hi,

Proxmox keeps crashing.
Single drive installation, running the Lexar NM790 2TB.
The partition layout consists of the default LVM section followed by an ext4 partition that I created.
The computer is an HP ProDesk 600 G4 Mini.
A reboot "fixes" the issue for another few hours.
Partition layout of the drive and crash logs are listed below.

Partition layout
Disk /dev/nvme0n1: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: Lexar SSD NM790 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Device Start End Sectors Size Type
/dev/nvme0n1p1 34 262177 262144 128M Microsoft reserved
/dev/nvme0n1p2 264192 266239 2048 1M BIOS boot
/dev/nvme0n1p3 266240 1314815 1048576 512M EFI System
/dev/nvme0n1p4 1314816 500383743 499068928 238G Linux LVM
/dev/nvme0n1p6 976775168 4000796671 3024021504 1.4T Linux filesystem
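
As a quick cross-check of the table above, the sizes follow from the sector counts at 512 bytes per sector, and the same arithmetic shows how much space sits unallocated between p4 and p6. This is only a minimal sketch: the numbers are copied from the fdisk listing, and the helper names `size_gib` and `gaps` are made up for illustration.

```python
SECTOR_BYTES = 512  # logical sector size reported by fdisk

# (name, start sector, end sector) copied from the fdisk listing
partitions = [
    ("nvme0n1p1", 34, 262177),
    ("nvme0n1p2", 264192, 266239),
    ("nvme0n1p3", 266240, 1314815),
    ("nvme0n1p4", 1314816, 500383743),
    ("nvme0n1p6", 976775168, 4000796671),
]

def size_gib(start, end):
    """Size of an inclusive sector range in GiB."""
    return (end - start + 1) * SECTOR_BYTES / 2**30

def gaps(parts):
    """Return (previous, next, gap in GiB) for unallocated space between neighbours."""
    out = []
    for (a, _, a_end), (b, b_start, _) in zip(parts, parts[1:]):
        free = b_start - a_end - 1
        if free > 0:
            out.append((a, b, free * SECTOR_BYTES / 2**30))
    return out

for name, start, end in partitions:
    print(f"{name}: {size_gib(start, end):.2f} GiB")
for prev, nxt, gib in gaps(partitions):
    print(f"unallocated between {prev} and {nxt}: {gib:.2f} GiB")
```

The largest gap it reports is the region between the LVM partition and the ext4 partition.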

Logs

[Sun Oct 1 09:55:18 2023] nvme nvme0: I/O 539 (I/O Cmd) QID 3 timeout, aborting
[Sun Oct 1 09:55:18 2023] nvme nvme0: I/O 129 (I/O Cmd) QID 6 timeout, aborting
[Sun Oct 1 09:55:18 2023] nvme nvme0: I/O 655 (I/O Cmd) QID 5 timeout, aborting
[Sun Oct 1 09:55:48 2023] nvme nvme0: I/O 539 QID 3 timeout, reset controller
[Sun Oct 1 09:56:18 2023] nvme nvme0: I/O 16 QID 0 timeout, reset controller
[Sun Oct 1 09:56:49 2023] nvme nvme0: Abort status: 0x371
[Sun Oct 1 09:56:49 2023] nvme nvme0: Abort status: 0x371
[Sun Oct 1 09:56:49 2023] nvme nvme0: Abort status: 0x371
[Sun Oct 1 09:57:34 2023] INFO: task jbd2/dm-1-8:309 blocked for more than 120 seconds.
[Sun Oct 1 09:57:34 2023] Tainted: P O 6.2.16-14-pve #1
[Sun Oct 1 09:57:34 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Oct 1 09:57:34 2023] task:jbd2/dm-1-8 state:D stack:0 pid:309 ppid:2 flags:0x00004000
[Sun Oct 1 09:57:34 2023] Call Trace:
[Sun Oct 1 09:57:34 2023] <TASK>
[Sun Oct 1 09:57:34 2023] __schedule+0x402/0x1510
[Sun Oct 1 09:57:34 2023] schedule+0x63/0x110
[Sun Oct 1 09:57:34 2023] io_schedule+0x46/0x80
[Sun Oct 1 09:57:34 2023] bit_wait_io+0x11/0x90
[Sun Oct 1 09:57:34 2023] __wait_on_bit+0x4a/0x120
[Sun Oct 1 09:57:34 2023] ? __pfx_bit_wait_io+0x10/0x10
[Sun Oct 1 09:57:34 2023] out_of_line_wait_on_bit+0x8c/0xb0
[Sun Oct 1 09:57:34 2023] ? __pfx_wake_bit_function+0x10/0x10
[Sun Oct 1 09:57:34 2023] __wait_on_buffer+0x30/0x50
[Sun Oct 1 09:57:34 2023] jbd2_journal_commit_transaction+0x1156/0x1a30
[Sun Oct 1 09:57:34 2023] ? lock_timer_base+0x3b/0xe0
[Sun Oct 1 09:57:34 2023] kjournald2+0xab/0x280
[Sun Oct 1 09:57:34 2023] ? __pfx_autoremove_wake_function+0x10/0x10
[Sun Oct 1 09:57:34 2023] ? __pfx_kjournald2+0x10/0x10
[Sun Oct 1 09:57:34 2023] kthread+0xe6/0x110
[Sun Oct 1 09:57:34 2023] ? __pfx_kthread+0x10/0x10
[Sun Oct 1 09:57:34 2023] ret_from_fork+0x29/0x50
[Sun Oct 1 09:57:34 2023] </TASK>
[Sun Oct 1 09:57:34 2023] INFO: task systemd-journal:360 blocked for more than 120 seconds.
[Sun Oct 1 09:57:34 2023] Tainted: P O 6.2.16-14-pve #1
[Sun Oct 1 09:57:34 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Oct 1 09:57:34 2023] task:systemd-journal state:D stack:0 pid:360 ppid:1 flags:0x00000006
[Sun Oct 1 09:57:34 2023] Call Trace:
[Sun Oct 1 09:57:34 2023] <TASK>
[Sun Oct 1 09:57:34 2023] __schedule+0x402/0x1510
[Sun Oct 1 09:57:34 2023] schedule+0x63/0x110
[Sun Oct 1 09:57:34 2023] io_schedule+0x46/0x80
[Sun Oct 1 09:57:34 2023] folio_wait_bit_common+0x136/0x330
[Sun Oct 1 09:57:34 2023] ? __pfx_wake_page_function+0x10/0x10
[Sun Oct 1 09:57:34 2023] folio_wait_bit+0x18/0x30
[Sun Oct 1 09:57:34 2023] folio_wait_writeback+0x2c/0xa0
[Sun Oct 1 09:57:34 2023] wait_on_page_writeback+0x18/0x60
[Sun Oct 1 09:57:34 2023] __filemap_fdatawait_range+0x98/0x150
[Sun Oct 1 09:57:34 2023] file_write_and_wait_range+0x96/0xc0
[Sun Oct 1 09:57:34 2023] ext4_sync_file+0x11f/0x3a0
[Sun Oct 1 09:57:34 2023] vfs_fsync_range+0x45/0xa0
[Sun Oct 1 09:57:34 2023] __x64_sys_fsync+0x3c/0x70
[Sun Oct 1 09:57:34 2023] do_syscall_64+0x58/0x90
[Sun Oct 1 09:57:34 2023] ? handle_mm_fault+0x119/0x330
[Sun Oct 1 09:57:34 2023] ? lock_mm_and_find_vma+0x43/0x230
[Sun Oct 1 09:57:34 2023] ? exit_to_user_mode_prepare+0x39/0x190
[Sun Oct 1 09:57:34 2023] ? irqentry_exit_to_user_mode+0x17/0x20
[Sun Oct 1 09:57:34 2023] ? irqentry_exit+0x43/0x50
[Sun Oct 1 09:57:34 2023] ? exc_page_fault+0x91/0x1b0
[Sun Oct 1 09:57:34 2023] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[Sun Oct 1 09:57:34 2023] RIP: 0033:0x7f9a906efa1a
[Sun Oct 1 09:57:34 2023] RSP: 002b:00007ffe1d00cca0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[Sun Oct 1 09:57:34 2023] RAX: ffffffffffffffda RBX: 000055be7a96bfe0 RCX: 00007f9a906efa1a
[Sun Oct 1 09:57:34 2023] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000019
[Sun Oct 1 09:57:34 2023] RBP: 000000000000006f R08: 0000000000000001 R09: 00007ffe1d00cfb8
[Sun Oct 1 09:57:34 2023] R10: 5992bb0479b1bab8 R11: 0000000000000293 R12: 0000000000000001
[Sun Oct 1 09:57:34 2023] R13: 00007ffe1d00cde8 R14: 00007ffe1d00cde0 R15: 000055be7a96bfe0
[Sun Oct 1 09:57:34 2023] </TASK>
...
[Sun Oct 1 09:58:15 2023] systemd[1]: systemd-journald.service: State 'stop-watchdog' timed out. Killing.
[Sun Oct 1 09:58:15 2023] systemd[1]: systemd-journald.service: Killing process 360 (systemd-journal) with signal SIGKILL.
[Sun Oct 1 09:58:57 2023] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
[Sun Oct 1 09:58:57 2023] nvme nvme0: Disabling device after reset failure: -19
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 80065056 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 168626984 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 80521544 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 35897576 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 176763960 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 168643152 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-1): ext4_end_bio:343: I/O error 10 writing to inode 1967645 starting block 7909929)
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev dm-12, logical block 9255, lost sync page write
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-8): ext4_end_bio:343: I/O error 10 writing to inode 405728 starting block 1968202)
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-1, logical block 7909929
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 168716368 op 0x1:(WRITE) flags 0x800 phys_seg 20 prio class 2
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-8, logical block 1968202
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-12): kmmpd:185: comm kmmpd-dm-12: Error writing to MMP block
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 168636240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-8): ext4_end_bio:343: I/O error 10 writing to inode 405722 starting block 1967338)
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-8, logical block 1967338
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 168690232 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev dm-8, logical block 9255, lost sync page write
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8): kmmpd:185: comm kmmpd-dm-8: Error writing to MMP block
[Sun Oct 1 09:58:57 2023] I/O error, dev nvme0n1, sector 89382160 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 2
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8): ext4_check_bdev_write_error:223: comm kworker/u12:0: Error while async write back metadata
[Sun Oct 1 09:58:57 2023] Aborting journal on device dm-8-8.
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8) in ext4_reserve_inode_write:5940: Journal has aborted
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8): mpage_map_and_submit_extent:2535: inode #405667: comm kworker/u12:0: mark_inode_dirty error
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8): mpage_map_and_submit_extent:2537: comm kworker/u12:0: Failed to mark inode 405667 dirty
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-8) in ext4_do_writepages:2890: Journal has aborted
[Sun Oct 1 09:58:57 2023] Aborting journal on device dm-1-8.
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm kworker/u12:0: Detected aborted journal
[Sun Oct 1 09:58:57 2023] Aborting journal on device nvme0n1p6-8.
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device nvme0n1p6) in ext4_reserve_inode_write:5940: Journal has aborted
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device nvme0n1p6): ext4_dirty_inode:6144: inode #45350916: comm kvm: mark_inode_dirty error
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device nvme0n1p6) in ext4_dirty_inode:6145: Journal has aborted
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm pmxcfs: Detected aborted journal
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-1): ext4_end_bio:343: I/O error 10 writing to inode 1967742 starting block 7852868)
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-8): ext4_end_bio:343: I/O error 10 writing to inode 405895 starting block 1966181)
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-1, logical block 7852868
[Sun Oct 1 09:58:57 2023] nvme0n1: detected capacity change from 4000797360 to 0
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev nvme0n1p6, logical block 188776448, lost sync page write
[Sun Oct 1 09:58:57 2023] JBD2: I/O error when updating journal superblock for nvme0n1p6-8.
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device nvme0n1p6): ext4_journal_check_start:83: comm kvm: Detected aborted journal
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev dm-1, logical block 8945664, lost sync page write
[Sun Oct 1 09:58:57 2023] JBD2: I/O error when updating journal superblock for dm-1-8.
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-1): ext4_end_bio:343: I/O error 10 writing to inode 1967827 starting block 7856480)
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm kworker/u12:5: Detected aborted journal
[Sun Oct 1 09:58:57 2023] EXT4-fs warning (device dm-1): ext4_end_bio:343: I/O error 10 writing to inode 1967488 starting block 7178208)
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[Sun Oct 1 09:58:57 2023] EXT4-fs error (device dm-1): ext4_journal_check_start:83: comm kworker/u12:4: Detected aborted journal
[Sun Oct 1 09:58:57 2023] EXT4-fs (dm-1): I/O error while writing superblock
[Sun Oct 1 09:58:57 2023] EXT4-fs (dm-1): Remounting filesystem read-only
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-1, logical block 7856480
[Sun Oct 1 09:58:57 2023] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[Sun Oct 1 09:58:57 2023] EXT4-fs (dm-1): I/O error while writing superblock
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-1, logical block 7856481
[Sun Oct 1 09:58:57 2023] Buffer I/O error on device dm-1, logical block 7856482
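
The log above reads as a sequence: command timeouts, controller resets, the device being disabled, and then the filesystems going read-only. A rough sketch of how that timeline could be pulled out of `dmesg -T` style output follows. The stage labels are illustrative, the match strings are copied from the log, and `sample` holds only four of the lines above.

```python
import re
from datetime import datetime

# Stage labels are illustrative; the substrings are taken from the log above.
STAGES = [
    ("command timeout", "timeout, aborting"),
    ("controller reset", "timeout, reset controller"),
    ("device disabled", "Disabling device after reset failure"),
    ("filesystem read-only", "Remounting filesystem read-only"),
]

LINE = re.compile(r"^\[(\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4})\] (.*)$")

def first_seen(lines):
    """Map each stage to the timestamp of its first matching log line."""
    seen = {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        stamp = datetime.strptime(" ".join(m.group(1).split()), "%a %b %d %H:%M:%S %Y")
        for stage, needle in STAGES:
            if needle in m.group(2) and stage not in seen:
                seen[stage] = stamp
    return seen

sample = [
    "[Sun Oct 1 09:55:18 2023] nvme nvme0: I/O 539 (I/O Cmd) QID 3 timeout, aborting",
    "[Sun Oct 1 09:55:48 2023] nvme nvme0: I/O 539 QID 3 timeout, reset controller",
    "[Sun Oct 1 09:58:57 2023] nvme nvme0: Disabling device after reset failure: -19",
    "[Sun Oct 1 09:58:57 2023] EXT4-fs (dm-1): Remounting filesystem read-only",
]

timeline = first_seen(sample)
elapsed = (timeline["device disabled"] - timeline["command timeout"]).total_seconds()
print(f"{elapsed:.0f} s from first timeout to device disabled")
```

Read this way, the ext4 errors look like a consequence of the controller dropping off the bus rather than a separate filesystem problem.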
 
What is smartctl or nvme-cli reporting about the disk health? Maybe the SSD's NAND or controller is just failing?
 
I will attach the output from the above once it happens again, probably in less than a day.
Output of nvme smart-log while the drive is not crashing:
root@pve:~# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 35°C (308 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0
Data Units Read : 181,798 (93.08 GB)
Data Units Written : 1,253,100 (641.59 GB)
host_read_commands : 964,485
host_write_commands : 2,779,006
controller_busy_time : 3
power_cycles : 18
power_on_hours : 28
unsafe_shutdowns : 9
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 35°C (308 Kelvin)
Temperature Sensor 2 : 25°C (298 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
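
For reading the counters above: NVMe reports data units in thousands of 512-byte blocks, which is how nvme-cli arrives at the GB figures in parentheses. The sketch below redoes that conversion and compares unsafe shutdowns to power cycles. The dictionary keys are illustrative names; the values are copied from the output.

```python
DATA_UNIT_BYTES = 1000 * 512  # one NVMe data unit = 1000 blocks of 512 bytes

def data_units_to_gb(units):
    """Convert an NVMe data-unit counter to decimal gigabytes."""
    return units * DATA_UNIT_BYTES / 1e9

smart = {  # values copied from the smart-log output above
    "data_units_read": 181_798,
    "data_units_written": 1_253_100,
    "power_cycles": 18,
    "unsafe_shutdowns": 9,
    "media_errors": 0,
    "num_err_log_entries": 0,
}

written_gb = data_units_to_gb(smart["data_units_written"])
unsafe_ratio = smart["unsafe_shutdowns"] / smart["power_cycles"]
print(f"written: {written_gb:.2f} GB, unsafe shutdowns: {unsafe_ratio:.0%} of power cycles")
```

With zero media errors and an empty error log, the counters do not point at worn NAND. The high share of unsafe shutdowns would fit with the hangs and forced reboots described above, although that is an interpretation and not something the log states.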
 
@Dunuin

root@pve:~# nvme error-log /dev/nvme0n1
identify controller: Input/output error

root@pve:~# nvme smart-log /dev/nvme0n1
smart log: Input/output error

root@pve:/mnt/ssd# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
 
I also bought a Lexar NM790 4 TB (SSD, PCIe 4.0 x4, NVMe 1.4, M.2 2280),
because 8 GB/s for $200 seemed quite promising.

I get the same error, and nvme_core.default_ps_max_latency_us=0 does not help.

Code:
[    0.125800] Kernel command line: initrd=\EFI\proxmox\6.2.16-3-pve\initrd.img-6.2.16-3-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0
[    3.508977] nvme nvme0: pci function 0000:01:00.0
[    3.512409] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
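
The boot log shows that the option did reach the kernel, so the failure is not down to a mistyped parameter. A small sketch of that check follows, with the command line copied from the log. `kernel_param` is a hypothetical helper; on a running system the same string can be read from /proc/cmdline.

```python
def kernel_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None

# Command line copied from the boot log above.
cmdline = (r"initrd=\EFI\proxmox\6.2.16-3-pve\initrd.img-6.2.16-3-pve "
           "root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0")

print(kernel_param(cmdline, "nvme_core.default_ps_max_latency_us"))
```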

So I managed to apply the following kernel patch ( https://forum.proxmox.com/threads/building-the-pve-kernel-on-proxmox-ve-6-x.76137/post-598481 ):

Code:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c61173be4..758a4ca60 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2408,6 +2408,7 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
     } else {
         timeout = NVME_CAP_TIMEOUT(ctrl->cap);
     }
+    dev_info(ctrl->device, "[PATCH] nvme core got timeout %u\n",timeout);
 
     ctrl->ctrl_config |= (NVME_CTRL_PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT;
     ctrl->ctrl_config |= NVME_CC_AMS_RR | NVME_CC_SHN_NONE;
@@ -2425,8 +2426,9 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
     ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
     if (ret)
         return ret;
+    dev_info(ctrl->device, "[PATCH] nvme_wait_ready now wait for %u, previously %u\n",(timeout + 1) * 2, (timeout + 1)/2);
     return nvme_wait_ready(ctrl, NVME_CSTS_RDY, NVME_CSTS_RDY,
-                   (timeout + 1) / 2, "initialisation");
+                   (timeout + 1) * 2, "initialisation");
 }
 EXPORT_SYMBOL_GPL(nvme_enable_ctrl);

And now it works:
Code:
[    1.637107] nvme nvme0: pci function 0000:01:00.0
[    1.638416] nvme nvme0: [PATCH] nvme core got timeout 0
[    1.638421] nvme nvme0: [PATCH] nvme_wait_ready now wait for 2, previously 0
[    1.651530] nvme nvme0: allocated 40 MiB host memory buffer.
[    1.690536] nvme nvme0: 16/0/0 default/read/poll queues
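
The "got timeout 0" line suggests why the unpatched kernel gave up immediately. With a reported timeout of 0, the original expression `(timeout + 1) / 2` rounds down to 0 under C integer division, while the patched `(timeout + 1) * 2` gives 2. The sketch below only replays that arithmetic. The unit `nvme_wait_ready` expects is not stated in this thread (seconds is an assumption), and the extra inputs 1 and 20 are made-up values for comparison.

```python
def old_wait(cap_timeout):
    # Mirrors the removed line: (timeout + 1) / 2 with C integer division
    return (cap_timeout + 1) // 2

def patched_wait(cap_timeout):
    # Mirrors the added line: (timeout + 1) * 2
    return (cap_timeout + 1) * 2

# 0 is the value from the boot log; 1 and 20 are hypothetical controllers.
for cap_timeout in (0, 1, 20):
    print(cap_timeout, old_wait(cap_timeout), patched_wait(cap_timeout))
```

For the value in the log this reproduces the patch's own message: a wait of 2 where it was previously 0.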

Here are my PCIe 4.0 benchmarks on ZFS:

Code:
lspci -vv
LnkSta: Speed 16GT/s, Width x4

zfs set primarycache=metadata zfs

fio --name=test  --ioengine=libaio --fallocate=none --refill_buffers --direct=1 -size=50G  -bs=1M  -iodepth=16 --rw=write
write: IOPS=3107, BW=3108MiB/s
fio --name=test  --ioengine=libaio --fallocate=none --refill_buffers --direct=1 -bs=1M  -iodepth=16 --rw=read --runtime=10
read: IOPS=3213, BW=3213MiB/s

fio --name=test  --ioengine=libaio --fallocate=none --refill_buffers --direct=1 -size=1G  -bs=4k  -iodepth=16 --rw=randwrite
write: IOPS=8196, BW=32.0MiB/s
fio --name=test  --ioengine=libaio --fallocate=none --refill_buffers --direct=1 -bs=4k  -iodepth=16 --rw=randread  --runtime=10
read: IOPS=5468, BW=21.4MiB/s
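
As a sanity check on the numbers above, fio's bandwidth is IOPS times block size, and the negotiated link (16 GT/s, x4) sets a theoretical ceiling. A minimal sketch with the values copied from the runs above: the 128b/130b encoding factor is the standard one for PCIe 4.0, and the ceiling ignores protocol overhead.

```python
def mib_per_s(iops, block_kib):
    """Bandwidth in MiB/s from IOPS and block size in KiB."""
    return iops * block_kib / 1024

# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding, x4 link as reported by lspci
link_gb_per_s = 16e9 * 4 * (128 / 130) / 8 / 1e9

results = {  # (IOPS, block size in KiB) from the fio runs above
    "seq write 1M": (3107, 1024),
    "seq read 1M": (3213, 1024),
    "rand write 4k": (8196, 4),
    "rand read 4k": (5468, 4),
}
for name, (iops, bs) in results.items():
    print(f"{name}: {mib_per_s(iops, bs):.1f} MiB/s")
print(f"x4 link ceiling: about {link_gb_per_s:.2f} GB/s before protocol overhead")
```

The sequential results sit well below the link ceiling. That is plausible for a test through ZFS with primarycache=metadata, so these figures should not be read as the raw speed of the drive.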
 
Same here: Lexar NM790 4 TB

The patch file didn't apply cleanly, so I made the change directly in the source file.

Kernel 6.2.16-19 seems to work now.
 
My Lexar NM790 4TB SSD is now working with kernel 6.2.16-19-pve. No modifications needed. The "Device not ready; aborting initialisation, CSTS=0x0" error is gone.
 
Unfortunately, as pointed out by @Instantus in the other thread, the mentioned fix is not included in the current build of 6.5 yet. The fix is in mainline 6.5.5, but the current build is based on 6.5.3. It seems that there were multiple issues and some are already addressed, but not yet that one.
Thank you for this info.
Is it planned to include this patch _LINK_ in the next build?
I have the same problem with a new Lexar SSD.
 
Unfortunately, as pointed out by @Instantus in the other thread, the mentioned fix is not included in the current build of 6.5 yet. The fix is in mainline 6.5.5, but the current build is based on 6.5.3. It seems that there were multiple issues and some are already addressed, but not yet that one.
This seems to be the reason why my Proxmox host crashes about every 1-2 days, both with kernel versions 6.2.16-19 and 6.5.3.
 
