Proxmox 6 regression: nvme-rdma: kernel NULL pointer dereference

lukas12342

Member
With Proxmox 6 (Linux 5.0) the same command that used to work under Proxmox 5.4 is broken: trying to connect to an NVMe-oF RDMA target provided by SPDK fails.

lukas@pve-master:~$ sudo modprobe nvme-rdma
lukas@pve-master:~$ sudo nvme discover -t rdma -a 10.0.0.20 -s 4420
Discovery Log Number of Records 1, Generation counter 3
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
treq: not specified
portid: 0
trsvcid: 4420
subnqn: nqn.2016-06.io.spdk:cnode1
traddr: 10.0.0.20
rdma_prtype: not specified
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
lukas@pve-master:~$ sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420
Killed

Dmesg:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000190

Probable cause:
If "nr-io-queues" is not set via the nvme CLI, the kernel uses the logical core count instead (the core count also defines the upper limit of nr-io-queues inside the kernel). This becomes a problem when the NVMe-oF target provides fewer I/O queues than there are logical cores on the initiator (PVE) system.
In my case the target was an Intel i7 6700 system, which has only 8 logical cores; for best performance the number of queues is usually kept at or below this number (parameter p in nvmf_create_transport with SPDK).

With an old 4.15.18 kernel this is not a problem, as the "nr-io-queues" chosen by the kernel matches the number of queues minus one assigned to SPDK (the target) via the p parameter (nr-io-queues = p-1).
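
As an illustration only (this is not the kernel's actual code, just the arithmetic described above; TARGET_P is an assumed example value matching the SPDK -p setting):

Code:
#!/bin/bash
# Sketch of the pre-5.0 behaviour: without -i the kernel requests one queue per
# logical core, and the target then caps that at p-1.
INITIATOR_CORES=$(nproc)      # default request = logical core count
TARGET_P=2                    # example: spdk nvmf_create_transport -p 2

REQUESTED=$INITIATOR_CORES
TARGET_MAX=$(( TARGET_P - 1 ))
NR_IO_QUEUES=$(( REQUESTED < TARGET_MAX ? REQUESTED : TARGET_MAX ))
echo "expected dmesg on 4.x: creating $NR_IO_QUEUES I/O queues"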

On Proxmox 6 this is not the case. Somehow a larger number than the target provides must be chosen internally by the kernel.
Even though the logging prints the right number (with SPDK set to p=2 the kernel would need to choose nr-io-queues=1), this happens:

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420
[ 4381.574059] nvme nvme0: creating 1 I/O queues.
[ 4381.624318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000


sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 1
[ 4348.394350] nvme nvme0: creating 1 I/O queues.
[ 4348.444315] nvme nvme0: new ctrl: NQN "nqn.2016-06.io.spdk:cnode1", addr 10.0.0.20:4420

Judging from the logs, both commands should theoretically be equivalent.
But they are not with a 5.0 kernel.
The first one causes a kernel NULL pointer dereference, and the network interface used is left in an undefined state, which prevents a normal shutdown.
This only reproduces when logical_cores_initiator > p-1 (i.e. target_queues - 1).



opts->nr_io_queues = min_t(unsigned int, num_online_cpus(), token);
https://elixir.bootlin.com/linux/v5.0.21/source/drivers/nvme/host/fabrics.c#L730
https://elixir.bootlin.com/linux/v5.0.21/source/drivers/nvme/host/fabrics.c#L636
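
Until this is fixed, a minimal workaround sketch matching what the logs above show (explicitly passing -i avoids the crash for me); the NQN, address and queue count are from my setup, adjust them to yours and keep the queue count at or below the target's p-1:

Code:
#!/bin/bash
# Run as root: connect with an explicit I/O queue count instead of the 5.0 kernel default.
TARGET_ADDR=10.0.0.20
TARGET_NQN="nqn.2016-06.io.spdk:cnode1"
QUEUES=1    # must be <= (spdk -p value) - 1

modprobe nvme-rdma
nvme connect -t rdma -n "$TARGET_NQN" -a "$TARGET_ADDR" -s 4420 -i "$QUEUES"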
 
Kernel 4.19.66-1-MANJARO works fine, just like PVE 5.4 with 4.18, but its 5.0 kernel also has this regression.
This also affects the latest kernel of Ubuntu 19.04, 5.0.0-25-generic:

[ 975.585153] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.0.20:4420
[ 975.585311] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 992.578102] nvme nvme0: creating 3 I/O queues.
[ 992.758010] BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
[ 992.758012] #PF error: [normal kernel read fault]
 
---------------------------------------------------------------------------------------------------------------------------------------------
[ 348.665272] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.0.20:4420
[ 348.665471] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 361.604114] nvme nvme1: creating 3 I/O queues.
[ 361.664462] BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
[ 361.664474] #PF error: [normal kernel read fault]
[ 361.664477] PGD 0 P4D 0
[ 361.664481] Oops: 0000 [#1] SMP NOPTI
[ 361.664485] CPU: 15 PID: 4192 Comm: nvme Tainted: P O 5.0.18-1-pve #1
[ 361.664489] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.30 08/14/2018
[ 361.664497] RIP: 0010:blk_mq_init_allocated_queue+0x2f5/0x460
[ 361.664501] Code: 8b 4f 30 45 85 c9 74 87 41 83 fd 01 76 38 89 ca 48 c1 e2 04 48 03 93 10 06 00 00 48 8b 12 44 8b 0c 3a 48 8b 53 58 4e 8b 0c ca <41> 83 b9 90 01 00 00 ff 75 12 4e 8b 14 c6 4c 89 f2 41 8b 14 12 41
[ 361.664506] RSP: 0018:ffffa47586257ce0 EFLAGS: 00010282
[ 361.664513] RAX: 0000000000000003 RBX: ffff98d0cae81818 RCX: 0000000000000000
[ 361.664516] RDX: ffff98d0caea5800 RSI: ffffffff8cbbf820 RDI: 000000000000000c
[ 361.664519] RBP: ffffa47586257d08 R08: 0000000000000003 R09: 0000000000000000
[ 361.664523] R10: ffff98d15c800cc0 R11: 0000000000000003 R12: ffff98d14b69a008
[ 361.664526] R13: 0000000000000003 R14: 000000000001eec8 R15: ffff98d14b69a008
[ 361.664529] FS: 00007f64cca49580(0000) GS:ffff98d15d1c0000(0000) knlGS:0000000000000000
[ 361.664533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.664536] CR2: 0000000000000190 CR3: 000000084ee78000 CR4: 00000000003406e0
[ 361.664539] Call Trace:
[ 361.664544] blk_mq_init_queue+0x3a/0x70
[ 361.664552] ? nvme_rdma_alloc_tagset+0x194/0x210 [nvme_rdma]
[ 361.664556] nvme_rdma_setup_ctrl+0x415/0x830 [nvme_rdma]
[ 361.664560] nvme_rdma_create_ctrl+0x2aa/0x3e8 [nvme_rdma]
[ 361.664564] nvmf_dev_write+0x7cc/0xa28 [nvme_fabrics]
[ 361.664569] ? apparmor_file_permission+0x1a/0x20
[ 361.664574] ? security_file_permission+0x33/0xf0
[ 361.664579] __vfs_write+0x1b/0x40
[ 361.664582] vfs_write+0xab/0x1b0
[ 361.664585] ksys_write+0x5c/0xd0
[ 361.664588] __x64_sys_write+0x1a/0x20
[ 361.664593] do_syscall_64+0x5a/0x110
[ 361.664599] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 361.664602] RIP: 0033:0x7f64cc971504
[ 361.664605] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[ 361.664611] RSP: 002b:00007ffe68974c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 361.664614] RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f64cc971504
[ 361.664617] RDX: 000000000000004b RSI: 00007ffe68976140 RDI: 0000000000000003
[ 361.664621] RBP: 0000000000000003 R08: 000000000000021f R09: 000055eec665d690
[ 361.664624] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe68976140
[ 361.664627] R13: 0000000000000009 R14: 00007ffe689774f0 R15: 000055eec4f5c360
[ 361.664630] Modules linked in: nvme_rdma nvme_fabrics ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd kvm_amd kvm nls_iso8859_1 ib_srpt target_core_mod ib_umad zfs(PO) zunicode(PO) zlua(PO) rpcrdma ib_ipoib rdma_ucm arc4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nouveau iwlmvm video aesni_intel ttm snd_pcm aes_x86_64 mac80211 crypto_simd drm_kms_helper snd_timer cryptd glue_helper drm snd iwlwifi fb_sys_fops soundcore btusb syscopyarea sysfillrect btrtl wmi_bmof pcspkr sysimgblt btbcm btintel mxm_wmi k10temp bluetooth cfg80211 ccp ecdh_generic mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm sunrpc iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c amd_iommu_v2 vfio_pci irqbypass vfio_virqfd vfio_iommu_type1
[ 361.664667] vfio mlx4_ib mlx4_en ib_uverbs ib_core mlx4_core devlink igb i2c_piix4 i2c_algo_bit ahci dca libahci gpio_amdpt wmi gpio_generic
[ 361.664697] CR2: 0000000000000190
[ 361.664700] ---[ end trace 1ce5da7747b2cdfc ]---
[ 361.664703] RIP: 0010:blk_mq_init_allocated_queue+0x2f5/0x460
[ 361.664706] Code: 8b 4f 30 45 85 c9 74 87 41 83 fd 01 76 38 89 ca 48 c1 e2 04 48 03 93 10 06 00 00 48 8b 12 44 8b 0c 3a 48 8b 53 58 4e 8b 0c ca <41> 83 b9 90 01 00 00 ff 75 12 4e 8b 14 c6 4c 89 f2 41 8b 14 12 41
[ 361.664711] RSP: 0018:ffffa47586257ce0 EFLAGS: 00010282
[ 361.664715] RAX: 0000000000000003 RBX: ffff98d0cae81818 RCX: 0000000000000000
[ 361.664721] RDX: ffff98d0caea5800 RSI: ffffffff8cbbf820 RDI: 000000000000000c
[ 361.664724] RBP: ffffa47586257d08 R08: 0000000000000003 R09: 0000000000000000
[ 361.664727] R10: ffff98d15c800cc0 R11: 0000000000000003 R12: ffff98d14b69a008
[ 361.664729] R13: 0000000000000003 R14: 000000000001eec8 R15: ffff98d14b69a008
[ 361.664732] FS: 00007f64cca49580(0000) GS:ffff98d15d1c0000(0000) knlGS:0000000000000000
[ 361.665712] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.666712] CR2: 0000000000000190 CR3: 000000084ee78000 CR4: 00000000003406e0


lukas@pve-master:~$ pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
Before kernel 5.0, even if you dared to manually choose a value for nr-io-queues larger than what SPDK provides, for example:

target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 2 -c 0

the p parameter decremented by one was the maximum number the kernel nvme-rdma/nvme-fabrics module would choose for nr-io-queues.

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420

sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 1

are equivalent, at least on 4.18 (PVE 5.4) and 4.19.66-1-MANJARO, but they are not on 5.0! Only the second command succeeds.

On 4.1x, if manually setting nr-io-queues=2 and therefore exceeding the maximum of 2-1=1, the kernel logs:
target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 2 -c 0
sudo nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 10.0.0.20 -s 4420 -i 2
[ 2909.305848] nvme nvme0: creating 1 I/O queues.
[ 2909.365968] nvme nvme0: new ctrl: NQN "nqn.2016-06.io.spdk:cnode1", addr 10.0.0.20:4420

In this case it falls back to the last possible value to connect to the target, which is only 1 I/O queue (2-1=1).
Doing the same on 5.0 causes the kernel NULL pointer dereference!
 
target: scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -p 8 -c 0

logical_cores_initiator > p-1
On Proxmox 6 running kernel 5.0, the crash occurs whenever this condition is true. In all cases dmesg logs "creating 7 I/O queues", which would be correct, but somehow not explicitly adding -i 7 on a 5.0 kernel causes a NULL pointer dereference. This is a regression, as it was not the case with kernels prior to 5.0.

On 5.0 it is also possible to request more nr-io-queues than the target provides, and this also triggers the bug. Prior to 5.0, even when setting nr-io-queues=10 while SPDK is still set to p=8, the kernel would only create 7 I/O queues and succeed; 5.0 also logs "creating 7 I/O queues", but this time it fails.

So the logging is always correct, but internally I suspect more I/O queues are created than are available on the target.
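
To make the trigger condition explicit, a small sketch that simply restates the inequality above (TARGET_P is an assumed example value; you need to know the -p setting used on the target):

Code:
#!/bin/bash
# Check whether this initiator hits the 5.0 crash condition described above.
TARGET_P=8                    # example: spdk nvmf_create_transport -p 8
CORES=$(nproc)

if (( CORES > TARGET_P - 1 )); then
    echo "crash condition met: $CORES cores > $((TARGET_P - 1)) target I/O queues"
    echo "on a 5.0 kernel, pass -i $((TARGET_P - 1)) (or lower) to nvme connect"
else
    echo "default nr-io-queues should be safe here"
fi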


Code:
       +---------------------------+                      +----------------------------------+
       | target server running spdk|                      |  initiator (pve  5.4 )           |
       |                           |                      |                                  |
       | 8 logical cores           |                      |  32 logical cores                |
       |                           |  true but 4.18 kernel|                                  |
       | spdk p=8                  <----------------------+  without specifying Io-queues    |
       |                           |                      |                                  |
        |                           |                      |  dmesg creating 7 I/O queues.    |
       |                           |                      |                                  |
       |                           |                      |  nvme disk added                 |
       +---------------------------+                      +----------------------------------+
                      ^        ^
                      |        |                          +----------------------------------+
                      |        |                          |  initiator (pve  6 )             |
                      |        |                          |                                  |
                      |        |                          |  32 logical cores                |
                      |        |                          |                                  |
                      |        |      true                |  without specifying Io-queues    |
                      |        +--------------------------+                                  |
                      |                                   |  dmesg creating 7 I/O queues.    |
                      |                                   |                                  |
                      |                                   |  dmesg BUG Kernel null pointer   |
                      |                                   +----------------------------------+
                      |
                      |                                   +----------------------------------+
                      |                                   |  initiator (pve  6 )             |
                      |                                   |                                  |
                      |                                   |  32 logical cores                |
                      |                                   |                                  |
                      |               false               |  Io-queues=7                     |
                      +-----------------------------------+                                  |
                                                          |  dmesg creating 7 I/O queues.    |
                                                          |                                  |
                                                          |  nvme disk added                 |
                                                          +----------------------------------+
 