[SOLVED] NVMe storage issue

gmazoyer

New Member
Jan 1, 2017
Hello,

I recently installed Proxmox 4 on a personal device at home to play with some VMs.
The device uses an NVMe SSD, an Intel 600P Series M.2 to be precise.

It looks like the kernel loses the device randomly. I managed to gather some logs.

The file system becomes read-only and most basic commands segfault, so the system has to be rebooted quite brutally with:

echo 1 > /proc/sys/kernel/sysrq   # enable the magic SysRq key
echo b > /proc/sysrq-trigger      # reboot immediately, without syncing or unmounting
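
One way to still capture the kernel messages once the file system has gone read-only is to stream them live to another machine; the host name below is just a placeholder:

Code:
# follow the kernel ring buffer and mirror it on a remote host
dmesg -w | ssh user@otherhost 'cat >> nvme-crash.log'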


It seems to be an issue with the kernel.
Has anyone experienced the same thing?

Edit (some info that might help):
  • Fresh install with the ISO installer (version 4.4-eb2d6f1e-2)
  • pveversion: pve-manager/4.4.5/c43015a5 (running kernel: 4.4.35-1-pve)
  • Related Ubuntu bug
 
Maybe look at the Ubuntu LTS kernel bug tracker. Proxmox VE uses the Ubuntu LTS kernel, so maybe someone there knows something.
 
The fix is supposed to be in the Ubuntu LTS kernel (apparently), but only in the version that is in the "proposed" repository.
So I guess it depends on how the Ubuntu LTS kernel is synced into Proxmox VE.
Are the patches cherry-picked or is it a full sync?
 
according to the linked bug tracker entry, the issue still affects Ubuntu's current -58 kernel, which our kernel is based on (or affects it again?) - so there is nothing yet to cherry-pick. the original report and fix were for older versions of the 4.4 kernel in Ubuntu, so the "original fix" is already contained in our kernel as well.
 
So the only thing to do is to wait for another fix (from Ubuntu) and a kernel update including that fix?
 
yes. all our NVMe devices work without problems with the 4.4.35 kernel, so this seems to be limited to certain HW. if you find commits that you'd like to test, feel free to ping here and I will see what we can do about building a test kernel. if you know your way around git and have a powerful enough build server, you can also try bisecting the kernel yourself to find the offending commit(s).
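
For anyone attempting the bisect, a rough sketch of the procedure (run inside a checkout of the kernel tree being tested; the tags are placeholders, not confirmed good/bad points):

Code:
git bisect start
git bisect bad <known-bad-tag>     # placeholder: a kernel that loses the NVMe device
git bisect good <known-good-tag>   # placeholder: a kernel that works
# build and boot the revision git checks out, test the NVMe device, then run
#   git bisect good    or    git bisect bad
# and repeat until git prints the first bad commit
git bisect reset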
 
I have done some research, and it is difficult for a kernel newbie like me to isolate where things started to go wrong.

The only thing I know is that the code from which the error is logged is in the drivers/nvme/host/pci.c file.
It originates from the function nvme_remove_dead_ctrl(struct nvme_dev *dev).
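
For reference, a quick way to confirm where such a log message comes from, assuming a checkout of the matching kernel source, is to search the driver directly:

Code:
# search the NVMe PCI driver for the function named above
grep -n "nvme_remove_dead_ctrl" drivers/nvme/host/pci.c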

Comparing the code of the running kernel to the latest kernel source (assuming the latter works properly), we can see that the file has changed. Even in the quoted function: in the latest source the logged error contains the status of the probe failure, which is not the case with the current kernel.

Since it is a probe failure, I have turned my attention to the nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) function, which, I suppose, is responsible for probing the NVMe device.

This is the latest version of the function in the kernel Git repository, and this is the function in the currently running kernel.
Again, I'm a kernel newbie, so all my guesses may be wrong, but maybe the issue is here.

As you can see on GitHub, the change history is quite long. The only commit that caught my attention in the history was this one, mostly because of the commit message, I admit. But again, newbie here; I never imagined having to dig into a kernel driver for Intel hardware, the driver being written by Intel... Seeing the history of changes and how old they are, I suppose I actually need a kernel newer than 4.4, but again, that's a guess.
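
If anyone wants to inspect that history locally rather than on GitHub, something like this should work inside a kernel checkout (the -L form needs a reasonably recent git):

Code:
# commits that touched the NVMe PCI driver
git log --oneline -- drivers/nvme/host/pci.c
# or follow only the nvme_probe() function itself
git log -L :nvme_probe:drivers/nvme/host/pci.c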
 
Hi,
my short info is: for about a month now I have been running an M.2 2280 NVMe 1.2 PCIe Gen3x4 128GB SSD from ADATA, the XPG SX8000, on the small Fujitsu D3417-B Skylake mainboard with the newest BIOS 1.17, and it has been working 24/7 in production without a problem. Have a look at this thread:
https://forum.proxmox.com/threads/4-4-install-to-zraid10.27699/page-2#post-156142
With this hardware I have so far had no problems like yours...
regards, maxprox
 
Thanks for the feedback.
The BIOS on my Intel motherboard is already up-to-date (actually that was the first thing I did).
 
Only a little bit off topic (it will not help you, sorry): I found a good solution for the TRIM command if you work with SSDs, on the German Ubuntu wiki
https://wiki.ubuntuusers.de/SSD/TRIM/
for /etc/cron.weekly:
Code:
#!/bin/sh
# trim all mounted file systems which support it
# /sbin/fstrim --all || true
LOG=/var/log/batched_discard.log
echo "*** $(date -R) ***" >> $LOG
/sbin/fstrim -v / >> $LOG
## /sbin/fstrim -v /home >> $LOG

Most notably, I found the log part a good and easy way to have a look at what happened.
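
One small addition: for cron.weekly to actually pick the script up, it has to be executable and, on Debian-based systems, its name must not contain a dot; the file name below is only an example:

Code:
chmod +x /etc/cron.weekly/batched_discard
run-parts --test /etc/cron.weekly    # lists the scripts cron would actually run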
 

one thing you can try which should be rather straightforward is installing the vanilla 4.5 kernel (http://kernel.ubuntu.com/~kernel-pp...0-generic_4.5.0-040500.201603140130_amd64.deb) and check whether the issue occurs there as well. if it does not, we can diff our current 4.4 kernel and the vanilla 4.5 kernel to check for changes. don't forget to remove the vanilla kernel again after testing!

note that the vanilla 4.5 kernel does not contain a ZFS module, so if you use ZFS don't try the above - it can't work ;)
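
If it helps, the test could look roughly like this; the package name is reconstructed from the link above, so double-check it with dpkg -l before removing anything:

Code:
# install the downloaded linux-image .deb and reboot into it
dpkg -i linux-image-4.5.0-040500-generic_*.deb
reboot
# after testing, find the exact package name and remove the test kernel again
dpkg -l | grep 4.5.0-040500
apt-get remove linux-image-4.5.0-040500-generic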
 
Not a bad idea, but the machine does not boot with this kernel. It seems it cannot mount the root LVM logical volume.
 
I'm also having the same issue with the same Intel 600p NVMe drive. Did you manage to fix the problem?
 
Well, if you do get to the bottom of it, please be sure to share it here. I'll do the same...
 
This affects me as well. I'm going to try the vanilla kernel as suggested in this thread, and please let's post any findings/solutions here. It would be awesome if we could find a solution.

Cheers,
Simon
 
Looks like Ubuntu has re-opened the bug.
So I guess they consider that it is not fixed yet.
 
Hi,

I have to correct myself: now I see that I have the same "NVMe storage issue" as you.
Because I did not check this out immediately, I had opened a new thread:
https://forum.proxmox.com/threads/nvme-ssd-driver-or-kernel-problem.31845/
 
