[SOLVED] NVMe SSD driver or kernel problem

maxprox

Well-Known Member
My System and my complete setup is described in this thread:
https://forum.proxmox.com/threads/4-4-install-to-zraid10.27699/page-2#post-157742
(posts 22, 25 and 35)
Here is a short recap:
It is based on a Skylake Fujitsu D3417-B mainboard with 64 GB RAM and one Xeon E3-1245 v5.
Proxmox is installed on a new 128 GB NVMe SSD with ext4 and thin LVM (the Proxmox default setup).
Currently only one VM, a Windows 2008 R2 server with 16 GB of virtual RAM, is running on this system.


To me this looks like an NVMe driver or kernel problem, as described in this bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1626894
and here:
http://lists.infradead.org/pipermail/linux-nvme/2016-February/004096.html

I get the same NVMe "Removing after probe failure" message;
here is my dmesg output:
Code:
$ dmesg
...
[ 5266.767090] vmbr0: port 2(tap208i0) entered forwarding state
[ 5266.767094] vmbr0: port 2(tap208i0) entered forwarding state
[ 5267.794127] kvm: zapping shadow pages for mmio generation wraparound
[ 5270.363892] kvm: zapping shadow pages for mmio generation wraparound
=> [45512.825928] nvme 0000:01:00.0: Failed status: 0xffffffff, reset controller.
=> [45513.276990] nvme 0000:01:00.0: Removing after probe failure
=> [45513.276997] nvme0n1: detected capacity change from 128035676160 to 0
[45513.507206] Aborting journal on device dm-0-8.
[45513.507226] Buffer I/O error on dev dm-0, logical block 3702784, lost sync page write
[45513.507248] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[45513.507555] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45513.507585] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[45513.507619] EXT4-fs (dm-0): Remounting filesystem read-only
[45513.507643] EXT4-fs (dm-0): previous I/O error to superblock detected
[45513.507656] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45519.236744] device-mapper: thin: 251:4: metadata operation 'dm_pool_commit_metadata' failed: error = -5
[45519.236766] device-mapper: thin: 251:4: aborting current metadata transaction
[45519.236949] device-mapper: thin: 251:4: failed to abort metadata transaction
[45519.236977] device-mapper: thin: 251:4: switching pool to failure mode
[45519.236978] device-mapper: thin metadata: couldn't read superblock
[45519.236989] device-mapper: thin: 251:4: failed to set 'needs_check' flag in metadata
[45519.237004] device-mapper: thin: 251:4: dm_pool_get_metadata_transaction_id returned -22
[46805.070494] rrdcached[2458]: segfault at c0 ip 00007fb12ab3b1ed sp 00007fb126e376b0 error 4 in libc-2.19.so[7fb12aaf5000+1a1000]
[71326.376928] perf interrupt took too long (2567 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

This happens every 24 to 48 hours, sometimes ending in a kernel panic.
A screenshot of the kernel panic is shown in this post:
https://forum.proxmox.com/threads/4-4-install-to-zraid10.27699/page-2#post-157742
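Since the failure recurs on a rough schedule, it can help to record exactly when the controller drops, before the root filesystem goes read-only. The following is only a sketch, not something from this thread: the `nvme-watch` log tag and the idea of running it from cron are assumptions.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: scan kernel messages for the NVMe
# probe-failure line and log the last occurrence with a syslog timestamp.
# Intended to be run periodically, e.g. from cron every few minutes.

PATTERN='Removing after probe failure'

# check_log: reads kernel log lines on stdin, prints the last matching line.
check_log() {
    grep "$PATTERN" | tail -n 1
}

last=$(dmesg 2>/dev/null | check_log)
if [ -n "$last" ]; then
    # logger may be unavailable on minimal systems; ignore failures.
    logger -t nvme-watch "NVMe controller dropped: $last" 2>/dev/null || :
fi
```

Having the drop time in syslog makes it easier to correlate the failure with VM activity or power-state changes in the 24-48 h window.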

Code:
# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80

I am not sure what the best solution would be. Should I replace the NVMe drive with a SATA drive, or is there another way to solve the problem?

regards,
maxprox
 

fabian

Proxmox Staff Member

There has been another Ubuntu kernel release with an NVMe bug fix (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1651602) which will be included in the PVE kernel soon. I'd recommend either building a test kernel yourself or waiting a few days until it hits pvetest. From the description it sounds like it could fix your issue.
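For the "wait for pvetest" route, the test repository can be enabled temporarily. This is a sketch only, assuming PVE 4.x on Debian jessie; verify the repository line against the current Proxmox documentation, and remove the entry again once you are done testing.

```shell
# Sketch (assumption: PVE 4.x on Debian jessie). Enables the pvetest
# repository, upgrades to pick up the new pve-kernel once it is published,
# then disables the repository again.
PVETEST_LINE='deb http://download.proxmox.com/debian jessie pvetest'
echo "$PVETEST_LINE" > /etc/apt/sources.list.d/pvetest.list
apt-get update
apt-get dist-upgrade    # pulls in the new pve-kernel when available
rm /etc/apt/sources.list.d/pvetest.list
```

Keeping pvetest enabled permanently is not recommended on a production host; it is meant for exactly this kind of short-lived test of a pending fix.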
 
