Proxmox + NVMe-oF: is that possible?

Hello community,

We have recently bought a SAN that supports NVMe-oF (NVMe over Fabrics), and we can now export "LUNs" as NVMe disks.

The SAN and the Proxmox nodes are linked through Mellanox Technologies MT27800 Family [ConnectX-5] Ethernet controllers.

After installing all the required drivers, we can see the NVMe-oF disk on the nodes.

root@sandbox:~# nvme list
Node          SN                    Model                      Namespace  Usage                  Format       FW Rev
------------  --------------------  -------------------------  ---------  ---------------------  -----------  --------
/dev/nvme0n1  S463NF0M717776K       Samsung SSD 970 PRO 512GB  1          55.09 GB / 512.11 GB   512 B + 0 B  1B2QEXP7
/dev/nvme1n1  2102353SYS9WM5800002  Huawei-XSG1                1          46.75 GB /   3.30 TB   512 B + 0 B  1000001
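
For reference, the namespace is discovered and connected with nvme-cli, roughly like this (the target IP and NQN below are placeholders, not the real values):

# discover the subsystems exported by the SAN
nvme discover -t rdma -a <target IP> -s 4420

# connect to one of the discovered subsystems over RDMA
nvme connect -t rdma -a <target IP> -s 4420 -n <subsystem NQN>

# the new namespace then shows up in
nvme list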

The idea is to share one "big" LUN between the Proxmox servers and use it as shared storage between the nodes.

Is that even possible? What would the issues be? Can this be used with Proxmox? Are there any known success stories with such a setup?

Thanks a lot for your thoughts, ideas, warnings...

Kind regards.
 
This should be doable by using LVM, similar to LVM over iSCSI.
In the GUI select your Node -> Disks -> LVM; there you can create a new volume group. Choose the right device and make sure 'Add storage' is enabled.
Once that is done, go to Datacenter -> Storage and select the new storage. Edit it, enable the 'Shared' flag, and make sure the storage is not limited to this node but is available on all nodes that have access to the LUN.
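
Roughly the CLI equivalent, in case you prefer the shell (device path, VG and storage names below are just examples):

# initialize the NVMe-oF namespace and create a volume group on it
pvcreate /dev/nvme1n1
vgcreate NVME_LVM /dev/nvme1n1

# add it as a shared LVM storage, available on all nodes that see the LUN
pvesm add lvm NVME_LVM --vgname NVME_LVM --shared 1 --content images,rootdir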
 
Hello Mira,

Thank you for your feedback. That's exactly what we were about to test.

So we went ahead with this setup and everything looked good, until our test VM started to show corruption on the file system.

One of the disks of the VM was part of a ZFS pool (a ZFS pool inside the VM), and running a zpool scrub quickly showed corruption of the pool.
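
(The check inside the VM was essentially the following; "tank" is just a placeholder pool name.)

# start a scrub, then list checksum errors and any files flagged as corrupted
zpool scrub tank
zpool status -v tank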

We first thought it was an issue with running ZFS inside the VM, so we did some tests on XFS-formatted partitions to rule that out.

For example, we rsynced some files, using the --checksum option, from a disk on another storage (iSCSI) to the disk on the NVMe-oF storage.
Re-running the rsync showed that some files from the previous run had a different checksum than the source and were retransferred, meaning the files on the destination disk were corrupted.
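
Roughly what the test looked like (the paths are placeholders):

# first pass: copy with checksum comparison forced
rsync -av --checksum /mnt/iscsi-disk/data/ /mnt/nvmeof-disk/data/

# second pass: should transfer nothing; anything listed again means the
# destination copy no longer matches the source
rsync -av --checksum --itemize-changes /mnt/iscsi-disk/data/ /mnt/nvmeof-disk/data/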

The weird thing is that after shutting down the VM, mounting the same LV used by the VM directly on the host and running the same tests, no corruption occurs.

This should mean the corruption only occurs when the volume is accessed from the VM, and that the NVMe-oF setup + LVM itself is working correctly.

I ran some more tests from the VM using different disk options (writeback, no cache, SSD emulation on and off, discard on and off, etc.) and both VirtIO SCSI and VirtIO Block. The result from the VM is always the same: data gets corrupted.
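
Between runs the disk was re-attached with different options, along these lines (the VMID and volume name are just examples):

# VirtIO SCSI with writeback cache, discard and SSD emulation
qm set 109 --scsi1 NVME_LVM:vm-109-disk-0,cache=writeback,discard=on,ssd=1

# same volume as VirtIO Block with no cache (after detaching it from scsi1)
qm set 109 --virtio1 NVME_LVM:vm-109-disk-0,cache=none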

We also mounted the LV on the host, rsynced some data onto it, and then attached the disk with that data to the VM again to verify the checksums.

This showed no corruption, so it would mean the corruption occurs when writing to the disk from the VM, but not when reading.

We also upgraded one node of the cluster to Proxmox 7 in case QEMU 6.0 would make a difference, but the results are the same.

We're out of ideas at this point, as everything seems OK at the host level but starts doing weird things from inside a VM.

Any idea what the issue could be?

We also ran all the same tests with LVM on iSCSI (exported from the same filer) and saw no corruption in any case, so the issue seems to be linked to the use of NVMe-oF with VMs on Proxmox.

I've also bypassed LVM and attached the raw disk to the VM directly, and corruption happens there too.
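
(That test was simply attaching the device path to the VM, something like the following, with the VMID being an example.)

qm set 109 --scsi2 /dev/nvme1n1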

The only NVMe-oF setup I was able to run reliably was mounting the disk on the host and using it as a directory storage for qcow2 files (like with NFS); no corruption happens there.
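
Roughly what that working setup looks like (device, mount point and storage name are examples):

# format and mount the NVMe-oF namespace on the host
mkfs.xfs /dev/nvme1n1
mkdir -p /mnt/nvmeof
mount /dev/nvme1n1 /mnt/nvmeof

# add it as a directory storage holding qcow2 images
pvesm add dir nvmeof-dir --path /mnt/nvmeof --content images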

Thanks a lot for any ideas :)

Kind regards.
 
That sounds strange. Is there anything in the logs (syslog, dmesg) on the host and in the guest?
Which exact storage are you using and which kernel driver is it using?
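
For example (generic commands, nothing specific to your setup):

# NVMe / RDMA related messages on the host
dmesg | grep -iE 'nvme|rdma'
journalctl -k | grep -iE 'nvme|rdma'

# shows the subsystem and which transport (rdma/tcp) the connection uses
nvme list-subsys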
 
Hello,

Find below the information:

Storage: OceanStor Dorado 5000 V6, version 6.1.0.SPH7

Proxmox kernel: 5.4.119-1-pve (node pve08) and 5.11.22-1-pve (pve07) - same problem

Network card:
# lspci | grep Mell
02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
02:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

# mstflint -d 02:00.0 q
Image type: FS4
FW Version: 16.30.1004
FW Release Date: 29.3.2021
Product Version: 16.30.1004
Rom Info: type=UEFI version=14.23.17 cpu=AMD64
type=PXE version=3.6.301 cpu=AMD64
Description: UID GuidsNumber
Base GUID: b8cef60300078eea 4
Base MAC: b8cef6078eea 4
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000080
Security Attributes: N/A

# lsblk
...
nvme1n1 259:5 0 3T 0 disk
├─NVME_LVM-vm--109--disk--0 253:2 0 200G 0 lvm
├─NVME_LVM-vm--109--disk--1 253:3 0 200G 0 lvm
├─NVME_LVM-vm--109--disk--2 253:4 0 200G 0 lvm
└─NVME_LVM-vm--109--disk--3 253:6 0 500G 0 lvm

~# nvme list
Node          SN                    Model        Namespace  Usage                Format       FW Rev
------------  --------------------  -----------  ---------  -------------------  -----------  --------
/dev/nvme1n1  2102353SYS9WM5800002  Huawei-XSG1  1          90.31 GB / 3.30 TB   512 B + 0 B  1000001

root@pve07:~# nvme list-subsys
nvme-subsys1 - NQN=nqn.2020-02.huawei.nvme:nvm-subsystem-sn-2102353SYS9WM5800002
\
+- nvme1 rdma traddr=10.10.78.11 trsvcid=4420 live
+- nvme2 rdma traddr=10.10.79.11 trsvcid=4420 live

Thank you,
Nicolas
 
Sorry to bring this thread back from the grave, but I am seeing the exact same behavior. I've created the storage as per Mira's instructions in post 2. Either the storage fails immediately, or I get errors within the VM. I'm running NVMe-oF over TCP.
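
For reference, the target is connected with nvme-cli over TCP, roughly like this (IP redacted as below, NQN as shown in the dmesg output):

nvme connect -t tcp -a <target IP> -s 4420 -n nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0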

Storage: Whitebox Ubuntu Storage Node - GRAID Powered

Proxmox kernel: 6.1.0-1-pve, Proxmox VE version 7.3-4. All nodes are at the same level.

lspci:
64:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
64:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)



Relevant dmesg output:
[ 9994.257136] nvme nvme1: creating 128 I/O queues.
[ 9994.275112] nvme nvme1: mapped 128/0/0 default/read/poll queues.
[ 9994.293721] nvme nvme1: new ctrl: NQN "nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0", addr x.x.x.x:4420


lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
nvme1n1 259:6 0 36.4T 0 disk


Node          SN                    Model  Namespace  Usage                Format       FW Rev
------------  --------------------  -----  ---------  -------------------  -----------  --------
...
/dev/nvme1n1  927ef21ac98b4f1026dc  Linux  1          40.00 TB / 40.00 TB  4 KiB + 0 B  5.15.0-5

nvme list-subsys
nvme-subsys1 - NQN=nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0
\
+- nvme1 tcp traddr=x.x.x.x trsvcid=4420 src_addr=x.x.x.x live


Any thoughts on what I can try?
 
