Proxmox + NVMe-oF: is that possible?

Hello community,

We have recently bought a SAN that supports NVMe-oF (NVMe over Fabrics), and we can now export "LUNs" as NVMe disks.

The SAN and the Proxmox nodes are linked through Mellanox Technologies MT27800 Family [ConnectX-5] Ethernet controllers.

After installing all the required drivers, we can see the NVMe-oF disk on the nodes.

root@sandbox:~# nvme list
Node          SN                    Model                      Namespace  Usage                  Format       FW Rev
------------  --------------------  -------------------------  ---------  ---------------------  -----------  --------
/dev/nvme0n1  S463NF0M717776K       Samsung SSD 970 PRO 512GB  1          55.09 GB / 512.11 GB   512 B + 0 B  1B2QEXP7
/dev/nvme1n1  2102353SYS9WM5800002  Huawei-XSG1                1          46.75 GB /   3.30 TB   512 B + 0 B  1000001
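
For reference, the namespace is discovered and connected with nvme-cli, roughly like this (the target IP and NQN below are placeholders, not the real values):

# discover the subsystems exported by the SAN
nvme discover -t rdma -a <target IP> -s 4420

# connect to one of the discovered subsystems over RDMA
nvme connect -t rdma -a <target IP> -s 4420 -n <subsystem NQN>

# the new namespace then shows up in
nvme list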

The idea is to share one "big" LUN between the Proxmox servers and use it as shared storage between the nodes.

Is that even possible? What would the issues be? Can this be used with Proxmox? Are there any known success stories with such a setup?

Thanks a lot for your thoughts, ideas, warnings...

Kind regards.
 
This should be doable by using LVM, similar to LVM over iSCSI.
In the GUI select your Node -> Disks -> LVM; there you can create a new volume group. Choose the right device and make sure 'Add storage' is enabled.
Once that is done, go to Datacenter -> Storage and select the new storage. Edit it, enable the 'Shared' flag, and make sure the storage is not limited to this node but is available on all nodes that have access to the LUN.
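
Roughly the CLI equivalent, in case you prefer the shell (device path, VG and storage names below are just examples):

# initialize the NVMe-oF namespace and create a volume group on it
pvcreate /dev/nvme1n1
vgcreate NVME_LVM /dev/nvme1n1

# add it as a shared LVM storage, available on all nodes that see the LUN
pvesm add lvm NVME_LVM --vgname NVME_LVM --shared 1 --content images,rootdir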
 
Hello Mira,

Thank you for your feedback. That's exactly what we were about to test.

So we went ahead with this setup and everything looked good, until our test VM started to show corruption on the file system.

One of the disks of the VM was part of a ZFS pool (a ZFS pool inside the VM), and running a zpool scrub quickly showed corruption of the pool.
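
(The check inside the VM was essentially the following; "tank" is just a placeholder pool name.)

# start a scrub, then list checksum errors and any files flagged as corrupted
zpool scrub tank
zpool status -v tank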

We first thought it was an issue with running ZFS inside the VM, so we did some tests on XFS-formatted partitions to rule that out.

For example, we rsynced some files, using the --checksum option, from a disk on another storage (iSCSI) to the disk on the NVMe-oF storage.
Re-running the rsync showed that some files from the previous run had a different checksum than the source and were retransferred, meaning the files on the destination disk were corrupted.
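
Roughly what the test looked like (the paths are placeholders):

# first pass: copy with checksum comparison forced
rsync -av --checksum /mnt/iscsi-disk/data/ /mnt/nvmeof-disk/data/

# second pass: should transfer nothing; anything listed again means the
# destination copy no longer matches the source
rsync -av --checksum --itemize-changes /mnt/iscsi-disk/data/ /mnt/nvmeof-disk/data/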

The weird thing is that after shutting down the VM, mounting the same LV used by the VM directly on the host and running the same tests, no corruption occurs.

This should mean the corruption only occurs when the volume is accessed from the VM, and that the NVMe-oF setup + LVM itself is working correctly.

I ran some more tests from the VM using different disk options (writeback, no cache, SSD emulation on and off, discard on and off, etc.) and both VirtIO SCSI and VirtIO Block. The result from the VM is always the same: data gets corrupted.
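
Between runs the disk was re-attached with different options, along these lines (the VMID and volume name are just examples):

# VirtIO SCSI with writeback cache, discard and SSD emulation
qm set 109 --scsi1 NVME_LVM:vm-109-disk-0,cache=writeback,discard=on,ssd=1

# same volume as VirtIO Block with no cache (after detaching it from scsi1)
qm set 109 --virtio1 NVME_LVM:vm-109-disk-0,cache=none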

We also mounted the LV on the host, rsynced some data onto it, and then attached the disk with that data to the VM again to verify the checksums.

This showed no corruption, so it would mean the corruption occurs when writing to the disk from the VM, but not when reading.

We also upgraded one node of the cluster to Proxmox 7 in case QEMU 6.0 would make a difference, but the results are the same.

We're out of ideas at this point, as everything seems OK at the host level but starts doing weird things from inside a VM.

Any idea what the issue could be?

We also ran all the same tests with LVM on iSCSI (exported from the same filer) and saw no corruption in any case, so the issue seems to be linked to the use of NVMe-oF with VMs on Proxmox.

I've also bypassed LVM and attached the raw disk to the VM directly, and corruption happens there too.
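
(That test was simply attaching the device path to the VM, something like the following, with the VMID being an example.)

qm set 109 --scsi2 /dev/nvme1n1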

The only NVMe-oF setup I was able to run reliably was mounting the disk on the host and using it as a directory storage for qcow2 files (like with NFS); no corruption happens there.
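
Roughly what that working setup looks like (device, mount point and storage name are examples):

# format and mount the NVMe-oF namespace on the host
mkfs.xfs /dev/nvme1n1
mkdir -p /mnt/nvmeof
mount /dev/nvme1n1 /mnt/nvmeof

# add it as a directory storage holding qcow2 images
pvesm add dir nvmeof-dir --path /mnt/nvmeof --content images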

Thanks a lot for any ideas :)

Kind regards.
 
That sounds strange. Is there anything in the logs (syslog, dmesg) on the host and in the guest?
Which exact storage are you using and which kernel driver is it using?
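
For example (generic commands, nothing specific to your setup):

# NVMe / RDMA related messages on the host
dmesg | grep -iE 'nvme|rdma'
journalctl -k | grep -iE 'nvme|rdma'

# shows the subsystem and which transport (rdma/tcp) the connection uses
nvme list-subsys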
 
Hello,

Find below the information:

Storage: OceanStor Dorado 5000 V6, version 6.1.0.SPH7

Proxmox kernel: 5.4.119-1-pve (node pve08) and 5.11.22-1-pve (pve07) - same problem

Network card:
# lspci | grep Mell
02:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
02:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

# mstflint -d 02:00.0 q
Image type: FS4
FW Version: 16.30.1004
FW Release Date: 29.3.2021
Product Version: 16.30.1004
Rom Info: type=UEFI version=14.23.17 cpu=AMD64
type=PXE version=3.6.301 cpu=AMD64
Description: UID GuidsNumber
Base GUID: b8cef60300078eea 4
Base MAC: b8cef6078eea 4
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000080
Security Attributes: N/A

# lsblk
...
nvme1n1 259:5 0 3T 0 disk
├─NVME_LVM-vm--109--disk--0 253:2 0 200G 0 lvm
├─NVME_LVM-vm--109--disk--1 253:3 0 200G 0 lvm
├─NVME_LVM-vm--109--disk--2 253:4 0 200G 0 lvm
└─NVME_LVM-vm--109--disk--3 253:6 0 500G 0 lvm

~# nvme list
Node          SN                    Model        Namespace  Usage                Format       FW Rev
------------  --------------------  -----------  ---------  -------------------  -----------  --------
/dev/nvme1n1  2102353SYS9WM5800002  Huawei-XSG1  1          90.31 GB / 3.30 TB   512 B + 0 B  1000001

root@pve07:~# nvme list-subsys
nvme-subsys1 - NQN=nqn.2020-02.huawei.nvme:nvm-subsystem-sn-2102353SYS9WM5800002
\
+- nvme1 rdma traddr=10.10.78.11 trsvcid=4420 live
+- nvme2 rdma traddr=10.10.79.11 trsvcid=4420 live

Thank you,
Nicolas
 
Sorry to bring this thread back from the grave, but I am seeing the exact same behavior. I've created the storage as per Mira's instructions in post 2. Either the storage fails immediately, or I get errors within the VM. I'm running NVMe-oF over TCP.
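
For reference, the target is connected with nvme-cli over TCP, roughly like this (IP redacted as below, NQN as shown in the dmesg output):

nvme connect -t tcp -a <target IP> -s 4420 -n nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0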

Storage: Whitebox Ubuntu Storage Node - GRAID Powered

Proxmox kernel: 6.1.0-1-pve, Proxmox VE version 7.3-4. All nodes are at the same level.

lspci:
64:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
64:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)



Relevant dmesg output:
[ 9994.257136] nvme nvme1: creating 128 I/O queues.
[ 9994.275112] nvme nvme1: mapped 128/0/0 default/read/poll queues.
[ 9994.293721] nvme nvme1: new ctrl: NQN "nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0", addr x.x.x.x:4420


lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
nvme1n1 259:6 0 36.4T 0 disk


Node          SN                    Model  Namespace  Usage                Format       FW Rev
------------  --------------------  -----  ---------  -------------------  -----------  --------
...
/dev/nvme1n1  927ef21ac98b4f1026dc  Linux  1          40.00 TB / 40.00 TB  4 KiB + 0 B  5.15.0-5

nvme list-subsys
nvme-subsys1 - NQN=nqn.2020-05.com.graidtech:GRAID-SR478EBF34AD4FA142:dg0vd0
\
+- nvme1 tcp traddr=x.x.x.x trsvcid=4420 src_addr=x.x.x.x live


Any thoughts on what I can try?
 
