Proxmox 6 regression: nvme-rdma: kernel NULL pointer dereference

lukas12342

With Proxmox 6 (Linux 5.0) the same command that used to work under Proxmox 5.4 is broken: connecting to an NVMe-oF RDMA target provided by SPDK fails.

lukas@pve-master:~$ sudo modprobe nvme-rdma
lukas@pve-master:~$ sudo nvme discover -t rdma -a 10.0.0.20 -s 4420
Discovery Log Number of Records 1, Generation counter 3
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
treq: not specified
portid: 0
trsvcid: 4420
subnqn: nqn.2016-06.io.spdk:cnode1
traddr: 10.0.0.20
rdma_prtype: not specified
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
lukas@pve-master:~$ sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420
Killed

Dmesg:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000190

Probable cause:
If "nr-io-queues" is not set via nvme-cli, the kernel uses the number of logical cores instead (the core count also defines the upper limit of nr-io-queues inside the kernel). This becomes a problem when the NVMe-oF target provides fewer I/O queues than the initiator (PVE) system has logical cores.
In my case the target was an Intel i7 6700 system, which has only 8 logical cores; for best performance the number of queues is usually kept at or below this number (parameter p in nvmf_create_transport with SPDK).

With an old 4.15.18 kernel this is not a problem, as the "nr-io-queues" chosen by the kernel ends up matching the number of queues assigned to SPDK (the target) via the p parameter, minus one (nr-io-queues = p-1).
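To make the numbers concrete, here is a tiny standalone C model of that negotiation as I understand it (this is only an illustration with made-up names, not kernel or SPDK code): the initiator asks for min(logical cores, nr-io-queues) and the target hands out at most p-1 queues, so even without -i the old kernels end up at nr-io-queues = p-1.

Code:
/* Toy model of the pre-5.0 queue-count negotiation - NOT kernel or SPDK code,
 * every name below is made up for illustration only. */
#include <stdio.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

int main(void)
{
    unsigned int initiator_cores  = 32; /* logical cores on the PVE host    */
    unsigned int target_p         = 8;  /* spdk nvmf_create_transport -p 8  */
    unsigned int opt_nr_io_queues = 0;  /* 0 = -i not given to nvme-cli     */

    /* without -i the kernel falls back to the core count */
    unsigned int requested = opt_nr_io_queues
            ? min_u(initiator_cores, opt_nr_io_queues)
            : initiator_cores;

    /* the target only has p-1 I/O queues to hand out (nr-io-queues = p-1) */
    unsigned int granted = min_u(requested, target_p - 1);

    printf("requested=%u granted=%u\n", requested, granted); /* 32 and 7 */
    return 0;
}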

On Proxmox 6 this is not the case. Somehow a larger number than the one provided by SPDK must be chosen internally.
Even though the logging prints the right number (with SPDK at p=2 the kernel would need to choose nr-io-queues=1), this happens:

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420
[ 4381.574059] nvme nvme0: creating 1 I/O queues.
[ 4381.624318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000


sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 1
[ 4348.394350] nvme nvme0: creating 1 I/O queues.
[ 4348.444315] nvme nvme0: new ctrl: NQN "nqn.2016-06.io.spdk:cnode1", addr 10.0.0.20:4420

Judging by the logs, both commands should be equivalent, but on a 5.0 kernel they are not.
The first one causes a kernel NULL pointer dereference and leaves the network interface in use in an undefined state, which prevents a normal shutdown.
This can only be reproduced when logical_cores_initiator > p-1 (i.e. target_queues - 1).



opts->nr_io_queues = min_t(unsigned int, num_online_cpus(), token);
https://elixir.bootlin.com/linux/v5.0.21/source/drivers/nvme/host/fabrics.c#L730
https://elixir.bootlin.com/linux/v5.0.21/source/drivers/nvme/host/fabrics.c#L636
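As far as I can tell, both linked locations only clamp the value against num_online_cpus(); at parse time nothing knows yet how many queues the target will actually grant. A minimal standalone sketch of that parse-time behaviour (my own model, not the actual fabrics.c code; token stands for the value passed with -i):

Code:
/* Sketch of the parse-time clamping as I read the two linked locations -
 * standalone model, not the actual fabrics.c code. */
#include <stdio.h>

#define NUM_ONLINE_CPUS 32u  /* stand-in for num_online_cpus() on the initiator */

static unsigned int parse_nr_io_queues(int token)
{
    if (token < 0)                      /* -i not given: default to core count */
        return NUM_ONLINE_CPUS;
    /* -i given: the only upper bound applied here is the core count */
    return (unsigned int)token < NUM_ONLINE_CPUS ? (unsigned int)token
                                                 : NUM_ONLINE_CPUS;
}

int main(void)
{
    printf("no -i  -> %u\n", parse_nr_io_queues(-1));  /* 32 */
    printf("-i 7   -> %u\n", parse_nr_io_queues(7));   /*  7 */
    printf("-i 100 -> %u\n", parse_nr_io_queues(100)); /* 32 */
    return 0;
}

The target's limit can only be applied later during connect, and that is where I suspect the 5.0 kernel loses track of the granted count.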
 
4.19.66-1-MANJARO works fine, just like PVE 5.4 with 4.18, but Manjaro's 5.0 kernel shows the same regression.
This also affects the latest kernel of Ubuntu 19.04 (5.0.0-25-generic).

[ 975.585153] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.0.20:4420
[ 975.585311] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 992.578102] nvme nvme0: creating 3 I/O queues.
[ 992.758010] BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
[ 992.758012] #PF error: [normal kernel read fault]
 
---------------------------------------------------------------------------------------------------------------------------------------------
[ 348.665272] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.0.20:4420
[ 348.665471] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 361.604114] nvme nvme1: creating 3 I/O queues.
[ 361.664462] BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
[ 361.664474] #PF error: [normal kernel read fault]
[ 361.664477] PGD 0 P4D 0
[ 361.664481] Oops: 0000 [#1] SMP NOPTI
[ 361.664485] CPU: 15 PID: 4192 Comm: nvme Tainted: P O 5.0.18-1-pve #1
[ 361.664489] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.30 08/14/2018
[ 361.664497] RIP: 0010:blk_mq_init_allocated_queue+0x2f5/0x460
[ 361.664501] Code: 8b 4f 30 45 85 c9 74 87 41 83 fd 01 76 38 89 ca 48 c1 e2 04 48 03 93 10 06 00 00 48 8b 12 44 8b 0c 3a 48 8b 53 58 4e 8b 0c ca <41> 83 b9 90 01 00 00 ff 75 12 4e 8b 14 c6 4c 89 f2 41 8b 14 12 41
[ 361.664506] RSP: 0018:ffffa47586257ce0 EFLAGS: 00010282
[ 361.664513] RAX: 0000000000000003 RBX: ffff98d0cae81818 RCX: 0000000000000000
[ 361.664516] RDX: ffff98d0caea5800 RSI: ffffffff8cbbf820 RDI: 000000000000000c
[ 361.664519] RBP: ffffa47586257d08 R08: 0000000000000003 R09: 0000000000000000
[ 361.664523] R10: ffff98d15c800cc0 R11: 0000000000000003 R12: ffff98d14b69a008
[ 361.664526] R13: 0000000000000003 R14: 000000000001eec8 R15: ffff98d14b69a008
[ 361.664529] FS: 00007f64cca49580(0000) GS:ffff98d15d1c0000(0000) knlGS:0000000000000000
[ 361.664533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.664536] CR2: 0000000000000190 CR3: 000000084ee78000 CR4: 00000000003406e0
[ 361.664539] Call Trace:
[ 361.664544] blk_mq_init_queue+0x3a/0x70
[ 361.664552] ? nvme_rdma_alloc_tagset+0x194/0x210 [nvme_rdma]
[ 361.664556] nvme_rdma_setup_ctrl+0x415/0x830 [nvme_rdma]
[ 361.664560] nvme_rdma_create_ctrl+0x2aa/0x3e8 [nvme_rdma]
[ 361.664564] nvmf_dev_write+0x7cc/0xa28 [nvme_fabrics]
[ 361.664569] ? apparmor_file_permission+0x1a/0x20
[ 361.664574] ? security_file_permission+0x33/0xf0
[ 361.664579] __vfs_write+0x1b/0x40
[ 361.664582] vfs_write+0xab/0x1b0
[ 361.664585] ksys_write+0x5c/0xd0
[ 361.664588] __x64_sys_write+0x1a/0x20
[ 361.664593] do_syscall_64+0x5a/0x110
[ 361.664599] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 361.664602] RIP: 0033:0x7f64cc971504
[ 361.664605] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[ 361.664611] RSP: 002b:00007ffe68974c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 361.664614] RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f64cc971504
[ 361.664617] RDX: 000000000000004b RSI: 00007ffe68976140 RDI: 0000000000000003
[ 361.664621] RBP: 0000000000000003 R08: 000000000000021f R09: 000055eec665d690
[ 361.664624] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe68976140
[ 361.664627] R13: 0000000000000009 R14: 00007ffe689774f0 R15: 000055eec4f5c360
[ 361.664630] Modules linked in: nvme_rdma nvme_fabrics ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd kvm_amd kvm nls_iso8859_1 ib_srpt target_core_mod ib_umad zfs(PO) zunicode(PO) zlua(PO) rpcrdma ib_ipoib rdma_ucm arc4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nouveau iwlmvm video aesni_intel ttm snd_pcm aes_x86_64 mac80211 crypto_simd drm_kms_helper snd_timer cryptd glue_helper drm snd iwlwifi fb_sys_fops soundcore btusb syscopyarea sysfillrect btrtl wmi_bmof pcspkr sysimgblt btbcm btintel mxm_wmi k10temp bluetooth cfg80211 ccp ecdh_generic mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm sunrpc iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c amd_iommu_v2 vfio_pci irqbypass vfio_virqfd vfio_iommu_type1
[ 361.664667] vfio mlx4_ib mlx4_en ib_uverbs ib_core mlx4_core devlink igb i2c_piix4 i2c_algo_bit ahci dca libahci gpio_amdpt wmi gpio_generic
[ 361.664697] CR2: 0000000000000190
[ 361.664700] ---[ end trace 1ce5da7747b2cdfc ]---
[ 361.664703] RIP: 0010:blk_mq_init_allocated_queue+0x2f5/0x460
[ 361.664706] Code: 8b 4f 30 45 85 c9 74 87 41 83 fd 01 76 38 89 ca 48 c1 e2 04 48 03 93 10 06 00 00 48 8b 12 44 8b 0c 3a 48 8b 53 58 4e 8b 0c ca <41> 83 b9 90 01 00 00 ff 75 12 4e 8b 14 c6 4c 89 f2 41 8b 14 12 41
[ 361.664711] RSP: 0018:ffffa47586257ce0 EFLAGS: 00010282
[ 361.664715] RAX: 0000000000000003 RBX: ffff98d0cae81818 RCX: 0000000000000000
[ 361.664721] RDX: ffff98d0caea5800 RSI: ffffffff8cbbf820 RDI: 000000000000000c
[ 361.664724] RBP: ffffa47586257d08 R08: 0000000000000003 R09: 0000000000000000
[ 361.664727] R10: ffff98d15c800cc0 R11: 0000000000000003 R12: ffff98d14b69a008
[ 361.664729] R13: 0000000000000003 R14: 000000000001eec8 R15: ffff98d14b69a008
[ 361.664732] FS: 00007f64cca49580(0000) GS:ffff98d15d1c0000(0000) knlGS:0000000000000000
[ 361.665712] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.666712] CR2: 0000000000000190 CR3: 000000084ee78000 CR4: 00000000003406e0


lukas@pve-master:~$ pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
Before kernel 5.0, even if you manually choose a value for nr-io-queues that is larger than what SPDK provides, for example with the target configured as:

target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 2 -c 0

the p parameter decremented by one is the maximum number the kernel nvme-rdma/nvme-fabrics modules may choose for nr-io-queues. The two commands

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 1

are equivalent, at least on 4.18 (PVE 5.4) and 4.19.66-1-MANJARO, but not on 5.0! Only the second command succeeds.

On 4.1x, if you manually set nr-io-queues=2 and therefore exceed the maximum of 2-1, the kernel logs:
target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 2 -c 0
sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 2
[ 2909.305848] nvme nvme0: creating 1 I/O queues.
[ 2909.365968] nvme nvme0: new ctrl: NQN "nqn.2016-06.io.spdk:cnode1", addr 10.0.0.20:4420

In this case it falls back to the largest possible value and connects to the target with only 1 I/O queue (2-1=1).
Doing the same on 5.0 causes the kernel NULL pointer dereference!
 
target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 8 -c 0

logical_cores_initiator > p-1
On Proxmox 6 running kernel 5.0, whenever this condition is true the connect crashes. In all cases dmesg logs "creating 7 I/O queues", which would be correct, but not explicitly adding -i 7 on a 5.0 kernel still leads to a NULL pointer dereference. This is a regression, as it was not the case with kernels prior to 5.0.

On 5.0 it is also possible to request more nr-io-queues than the target provides; this leads to the same bug. Prior to 5.0, even when setting nr-io-queues=10 while SPDK is still set to p=8, the kernel would only create 7 I/O queues and succeed. 5.0 also logs "creating 7 I/O queues", but this time it fails.

So the logging is always correct, but internally I suspect more nr-io-queues are set up than are available on the target.
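To illustrate what I mean, here is a toy userspace model of the mismatch I suspect (made-up structures, nothing to do with the real blk-mq internals): if some table is still sized and mapped from the requested queue count while only the granted number of queue structures was actually created, the first lookup past the granted count hits a NULL pointer, which would fit a read at offset 0x190 of a NULL struct pointer in blk_mq_init_allocated_queue.

Code:
/* Toy model of the suspected requested/granted mismatch.
 * Made-up structures - this is NOT the real blk-mq code. */
#include <stdio.h>
#include <stdlib.h>

struct fake_hctx { int nr_ctx; };

int main(void)
{
    unsigned int mapped  = 32; /* sized from the initiator's core count */
    unsigned int granted = 7;  /* queues the target actually provided   */

    /* only 'granted' queue structures get created ... */
    struct fake_hctx **hctxs = calloc(mapped, sizeof(*hctxs));
    if (!hctxs)
        return 1;
    for (unsigned int i = 0; i < granted; i++)
        hctxs[i] = calloc(1, sizeof(struct fake_hctx));

    /* ... but if a mapping still spans all 'mapped' entries, every slot
     * past 'granted' is NULL, and dereferencing a field of it would oops,
     * much like reading offset 0x190 of a NULL pointer in the kernel. */
    for (unsigned int i = 0; i < mapped; i++) {
        if (!hctxs[i])
            printf("queue %2u: NULL -> this is where the kernel would crash\n", i);
        else
            printf("queue %2u: ok (nr_ctx=%d)\n", i, hctxs[i]->nr_ctx);
    }
    return 0;
}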


Code:
       +---------------------------+                      +----------------------------------+
       | target server running spdk|                      |  initiator (pve  5.4 )           |
       |                           |                      |                                  |
       | 8 logical cores           |                      |  32 logical cores                |
       |                           |  true but 4.18 kernel|                                  |
       | spdk p=8                  <----------------------+  without specifying Io-queues    |
       |                           |                      |                                  |
        |                           |                      |  dmesg creating 7 I/O queues.    |
       |                           |                      |                                  |
       |                           |                      |  nvme disk added                 |
       +---------------------------+                      +----------------------------------+
                      ^        ^
                      |        |                          +----------------------------------+
                      |        |                          |  initiator (pve  6 )             |
                      |        |                          |                                  |
                      |        |                          |  32 logical cores                |
                      |        |                          |                                  |
                      |        |      true                |  without specifying Io-queues    |
                      |        +--------------------------+                                  |
                      |                                   |  dmesg creating 7 I/O queues.    |
                      |                                   |                                  |
                      |                                   |  dmesg BUG Kernel null pointer   |
                      |                                   +----------------------------------+
                      |
                      |                                   +----------------------------------+
                      |                                   |  initiator (pve  6 )             |
                      |                                   |                                  |
                      |                                   |  32 logical cores                |
                      |                                   |                                  |
                      |               false               |  Io-queues=7                     |
                      +-----------------------------------+                                  |
                                                          |  dmesg creating 7 I/O queues.    |
                                                          |                                  |
                                                          |  nvme disk added                 |
                                                          +----------------------------------+
 