[SOLVED] NVMe SSD driver or kernel problem

maxprox

Well-Known Member
My System and my complete setup is described in this thread:
https://forum.proxmox.com/threads/4-4-install-to-zraid10.27699/page-2#post-157742
(posts 22, 25 and 35)
Here is a short recap:
It is based on a Skylake Fujitsu D3417-B mainboard with 64 GB RAM and one Xeon E3-1245 v5.
Proxmox is installed on a new 128 GB NVMe SSD with ext4 and thin LVM (the Proxmox default setup).
Currently only one VM, a Windows 2008 R2 server with 16 GB of virtual RAM, is running on this system.


To me this looks like an NVMe driver or kernel problem, as described in this bug report:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1626894
and here:
http://lists.infradead.org/pipermail/linux-nvme/2016-February/004096.html

I get the same NVMe "Removing after probe failure" message;
here is my dmesg output:
Code:
$ dmesg
...
[ 5266.767090] vmbr0: port 2(tap208i0) entered forwarding state
[ 5266.767094] vmbr0: port 2(tap208i0) entered forwarding state
[ 5267.794127] kvm: zapping shadow pages for mmio generation wraparound
[ 5270.363892] kvm: zapping shadow pages for mmio generation wraparound
=> [45512.825928] nvme 0000:01:00.0: Failed status: 0xffffffff, reset controller.
=> [45513.276990] nvme 0000:01:00.0: Removing after probe failure
=> [45513.276997] nvme0n1: detected capacity change from 128035676160 to 0
[45513.507206] Aborting journal on device dm-0-8.
[45513.507226] Buffer I/O error on dev dm-0, logical block 3702784, lost sync page write
[45513.507248] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[45513.507555] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45513.507585] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[45513.507619] EXT4-fs (dm-0): Remounting filesystem read-only
[45513.507643] EXT4-fs (dm-0): previous I/O error to superblock detected
[45513.507656] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45519.236744] device-mapper: thin: 251:4: metadata operation 'dm_pool_commit_metadata' failed: error = -5
[45519.236766] device-mapper: thin: 251:4: aborting current metadata transaction
[45519.236949] device-mapper: thin: 251:4: failed to abort metadata transaction
[45519.236977] device-mapper: thin: 251:4: switching pool to failure mode
[45519.236978] device-mapper: thin metadata: couldn't read superblock
[45519.236989] device-mapper: thin: 251:4: failed to set 'needs_check' flag in metadata
[45519.237004] device-mapper: thin: 251:4: dm_pool_get_metadata_transaction_id returned -22
[46805.070494] rrdcached[2458]: segfault at c0 ip 00007fb12ab3b1ed sp 00007fb126e376b0 error 4 in libc-2.19.so[7fb12aaf5000+1a1000]
[71326.376928] perf interrupt took too long (2567 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

This happens every 24 to 48 hours, sometimes ending in a kernel panic.
A screenshot of the kernel panic is shown in this post:
https://forum.proxmox.com/threads/4-4-install-to-zraid10.27699/page-2#post-157742
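Since the failure recurs on a rough schedule, it can help to record exactly when the controller drops, before the root filesystem goes read-only. The following is only a sketch, not something from this thread: the `nvme-watch` log tag and the idea of running it from cron are assumptions.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: scan kernel messages for the NVMe
# probe-failure line and log the last occurrence with a syslog timestamp.
# Intended to be run periodically, e.g. from cron every few minutes.

PATTERN='Removing after probe failure'

# check_log: reads kernel log lines on stdin, prints the last matching line.
check_log() {
    grep "$PATTERN" | tail -n 1
}

last=$(dmesg 2>/dev/null | check_log)
if [ -n "$last" ]; then
    # logger may be unavailable on minimal systems; ignore failures.
    logger -t nvme-watch "NVMe controller dropped: $last" 2>/dev/null || :
fi
```

Having the drop time in syslog makes it easier to correlate the failure with VM activity or power-state changes in the 24-48 h window.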

Code:
# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80

I am not sure what the best solution would be. Should I replace the NVMe drive with a SATA drive, or is there another way to solve the problem?

regards,
maxprox
 

fabian

Proxmox Staff Member

There has been another Ubuntu kernel release with an NVMe bug fix (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1651602) which will be included in the PVE kernel soon. I'd recommend either building a test kernel yourself or waiting a few days until it hits pvetest. From the description it sounds like it could fix your issue.
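For the "wait for pvetest" route, the test repository can be enabled temporarily. This is a sketch only, assuming PVE 4.x on Debian jessie; verify the repository line against the current Proxmox documentation, and remove the entry again once you are done testing.

```shell
# Sketch (assumption: PVE 4.x on Debian jessie). Enables the pvetest
# repository, upgrades to pick up the new pve-kernel once it is published,
# then disables the repository again.
PVETEST_LINE='deb http://download.proxmox.com/debian jessie pvetest'
echo "$PVETEST_LINE" > /etc/apt/sources.list.d/pvetest.list
apt-get update
apt-get dist-upgrade    # pulls in the new pve-kernel when available
rm /etc/apt/sources.list.d/pvetest.list
```

Keeping pvetest enabled permanently is not recommended on a production host; it is meant for exactly this kind of short-lived test of a pending fix.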
 
