Proxmox and Intel Optane DCPMM

Please take the following with a big grain of salt as I do not have any hands-on experience with persistent memory.

What do you want to achieve? I took a quick look, and AFAICT the memory is exposed as a block device (/dev/pmem0), which they format with XFS in order to place a file on it; that file is then attached to the VM as a simulated NVDIMM device.

I am not sure why they do it like that. If all you want is a disk in the VM that is stored on the fast pmem device, I would try the usual PVE road. If you want to pass the pmem device through to the VM directly and don't care about migrating that VM to another node, you can do so by creating a new VM disk and, instead of defining a storage on which to place it, assigning the device directly:
Code:
qm set <vmid> -scsi5 file=/dev/pmem0

This will create a new disk with bus type SCSI and bus ID 5 that uses the /dev/pmem0 device directly.

If you want to store multiple disk images, or have a cluster where each node is configured similarly, you can also format the pmem device with a file system, make sure it is mounted (/etc/fstab), and create a directory storage on top of it. This lets you use the regular PVE tools to create disk images there, and if you migrate to another node, the disk image will also be transferred.
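
As a rough sketch (I have no pmem hardware to test on, and the device and mount point names here are just examples - adjust them to your setup), it could look something like this:
Code:
# format the pmem block device and mount it persistently
mkfs.xfs /dev/pmem0
mkdir -p /mnt/pmem0
echo '/dev/pmem0 /mnt/pmem0 xfs defaults 0 0' >> /etc/fstab
mount /mnt/pmem0

# register the mount point as a directory storage for VM disk images
pvesm add dir pmem0 --path /mnt/pmem0 --content images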

I hope that helps :)
 
Hey @aaron ,

Thank you for your insight. I can give a bit more details here.

PMem (NVDIMM) in App Direct mode (storage) can either be wrapped in a regular block device with a filesystem on top of it, or exposed as an NVDIMM and then formatted/mounted with a "PMem-aware" FS.

1) The benefit of exposing it as a regular block device rather than using a PMem-aware FS is compatibility - the OS doesn't need to know it's working with PMem. The downside is a performance hit, as you still go through the OS storage stack and the page cache. While faster than many SSDs, your latency is an order of magnitude higher than it could be with a PMem-aware FS, or with applications using it directly as an NVDIMM device.

2) If you expose it as an NVDIMM, your PMem configuration tools, such as ndctl, will recognize it as such and extra management options become available. Then, in the VM itself, the device is recognized as an NVDIMM and allows the "dax" mount option, which makes the FS PMem-aware and uses DAX instead of the traditional storage stack and page cache. This lets you achieve PMem-level I/O latency, in many cases 10x lower than with a regular block device, which dramatically improves IOPS for random 4K read/write access. It also allows you to use libpmem directly in your apps.
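
To make that concrete: on the host, ndctl lists the App Direct namespaces, and inside a guest that sees a virtual NVDIMM (assumed here to show up as /dev/pmem0) a DAX mount looks roughly like this - just a sketch, device names will differ:
Code:
# host: show the PMem namespaces ndctl knows about
ndctl list -N

# guest: format the NVDIMM-backed device and mount it with DAX
mkfs.xfs /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
mount | grep dax   # confirm the dax option actually took effect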

Now, #1 is implemented in VMware as the "Persistent memory storage type" (vPMemDisk), so you can use it as the main root disk for your VM. It only performs slightly better than fast NVMe SSDs, depending on configuration. I suspect what you suggested in your post may work for this scenario.

#2 is implemented in VMware as vPMem, which is what that Intel article describes - passing through a virtual NVDIMM.

I actually made it work yesterday with Proxmox; it's not fancy, but it works. I used the "args" option in the qemu-server config and it appears to work just fine, but "memory hotplug" needs to be disabled:

Code:
args: -machine nvdimm=on -m slots=2,maxmem=1T -object memory-backend-file,id=mem1,share,mem-path=/pmemfs0/pmem0,size=100G,align=2M -device nvdimm,memdev=mem1,id=nv1,label-size=2M
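
For reference, the mem-path above is just a file on the DAX-mounted XFS that sits on the pmem namespace, and memory hotplug is kept off via the hotplug option. Roughly (the VMID and exact file layout are placeholders):
Code:
# backing file on the DAX-mounted XFS (/pmemfs0 = pmem namespace mounted with -o dax)
truncate -s 100G /pmemfs0/pmem0

# make sure "memory" is not in the VM's hotplug list
qm set <vmid> --hotplug disk,network,usb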

So, if you are comfortable with manual edits and with having no sane way of migrating NVDIMM PMem VMs, you can certainly make it work with Proxmox. VMware allows migration of vPMemDisk via Storage vMotion, and of vPMem as usual between machines that have PMem installed.

Hope it helps
 
Thank you for the explanation, I am definitely a bit wiser now :)

The only thing that bugs me about that Intel guide is that the NVDIMM device you configure for the VM is backed by a file on an XFS file system with which the /dev/pmem0 device is formatted.

I did read a bit through the QEMU docs about NVDIMMs, and what that does is simulate NVDIMMs to the guest. I am not sure if there are any saner ways to pass NVDIMMs through to the guest more directly so that the guest knows about them.
 
Hi,

did you get any further with this? Any luck with compiling with --enable-libpmem and running the VM with pmem=on in the args so that migration works?

Thanks
 
An update for future me and anyone else interested in Optane PMem: pve-qemu is compiled with --enable-libpmem by default.
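
Since libpmem is available, the memory backend from the earlier args line can presumably also be marked as persistent memory with pmem=on - an untested sketch, same IDs, paths and sizes as before:
Code:
args: -machine nvdimm=on -m slots=2,maxmem=1T -object memory-backend-file,id=mem1,share=on,mem-path=/pmemfs0/pmem0,size=100G,align=2M,pmem=on -device nvdimm,memdev=mem1,id=nv1,label-size=2M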

If the namespaces are fsdax or devdax, 2M aligned, and passed to the VM aligned to 128M, native performance can be achieved.
I followed https://nvdimm.docs.kernel.org/2mib_fs_dax.html.
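
Roughly the steps from that guide (device name and mount point are from my setup, double-check against the guide itself):
Code:
# create a 2M-aligned fsdax namespace
ndctl create-namespace --mode=fsdax --map=dev

# XFS with a 2M stripe unit so file extents stay 2M aligned, mounted with DAX
# (reflink=0 may be needed on older kernels where DAX and reflink conflict)
mkfs.xfs -f -d su=2m,sw=1 -m reflink=0 /dev/pmem0
mount -o dax /dev/pmem0 /pmemfs0
xfs_io -c "extsize 2m" /pmemfs0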

Rocky Linux (and I assume Red Hat) can be booted from a passed-through NVDIMM if the passed-through namespace type is changed to sector. It may also work with type raw, as that was also detected in the Rocky Linux GUI installer. I chose to put the boot and EFI partitions on a regular QEMU disk, as I couldn't get OVMF to boot the NVDIMM directly.
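
Changing an existing namespace to sector mode looks roughly like this (the namespace name is an example, check ndctl list -N for yours):
Code:
# reconfigure the namespace to sector (BTT) mode so the guest installer sees a normal block device
ndctl create-namespace -f -e namespace0.0 --mode=sector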


Benchmarks from FIO using this bash script: https://gist.github.com/dullage/7e7f7669ade208885314f83b1b3d6999

Real hardware:
HPE DL380 Gen10
Xeon Gold 6230
2x 256GB Optane PMem (DCPMM) 100 series, interleaved, App Direct

fsdax namespace with sector size 512, XFS mounted with DAX

Sequential Read: 2929MB/s IOPS=11
Sequential Write: 3359MB/s IOPS=13

512KB Read: 3333MB/s IOPS=6666
512KB Write: 3798MB/s IOPS=7596

Sequential Q32T1 Read: 2929MB/s IOPS=366
Sequential Q32T1 Write: 3386MB/s IOPS=423

4KB Read: 1534MB/s IOPS=392901
4KB Write: 1462MB/s IOPS=374491

4KB Q32T1 Read: 1527MB/s IOPS=391026
4KB Q32T1 Write: 1064MB/s IOPS=272385

4KB Q8T8 Read: 9306MB/s IOPS=2382414
4KB Q8T8 Write: 1350MB/s IOPS=345745

fsdax namespace passed to Fedora VM with DAX mount

Sequential Read: 548MB/s IOPS=4
Sequential Write: 219MB/s IOPS=1

512KB Read: 891MB/s IOPS=1782
512KB Write: 217MB/s IOPS=434

Sequential Q32T1 Read: 1178MB/s IOPS=294
Sequential Q32T1 Write: 222MB/s IOPS=55

4KB Read: 79MB/s IOPS=20467
4KB Write: 84MB/s IOPS=21504

4KB Q32T1 Read: 230MB/s IOPS=58977
4KB Q32T1 Write: 149MB/s IOPS=38235

4KB Q8T8 Read: 697MB/s IOPS=178685
4KB Q8T8 Write: 174MB/s IOPS=44784

fsdax namespace passed to Fedora VM with DAX mount, 2M aligned

Sequential Read: 2723MB/s IOPS=10
Sequential Write: 3192MB/s IOPS=12

512KB Read: 3820MB/s IOPS=7641
512KB Write: 3565MB/s IOPS=7130

Sequential Q32T1 Read: 2770MB/s IOPS=346
Sequential Q32T1 Write: 3176MB/s IOPS=397

4KB Read: 1534MB/s IOPS=392901
4KB Write: 1348MB/s IOPS=345289

4KB Q32T1 Read: 1557MB/s IOPS=398637
4KB Q32T1 Write: 975MB/s IOPS=249756

4KB Q8T8 Read: 9433MB/s IOPS=2414965
4KB Q8T8 Write: 1268MB/s IOPS=324658

devdax namespace passed to Fedora VM with DAX mount, 2M aligned

Sequential Read: 2869MB/s IOPS=11
Sequential Write: 3224MB/s IOPS=12

512KB Read: 3798MB/s IOPS=7596
512KB Write: 3595MB/s IOPS=7191

Sequential Q32T1 Read: 2949MB/s IOPS=368
Sequential Q32T1 Write: 3240MB/s IOPS=405

4KB Read: 1559MB/s IOPS=399123
4KB Write: 1551MB/s IOPS=397187

4KB Q32T1 Read: 1568MB/s IOPS=401568
4KB Q32T1 Write: 1012MB/s IOPS=259240

4KB Q8T8 Read: 9366MB/s IOPS=2397834
4KB Q8T8 Write: 1326MB/s IOPS=339499

Namespace changed to sector mode and passed to Rocky VM; Rocky installed on it as root, with EFI and /boot on a QEMU disk

Sequential Read: 2869MB/s IOPS=11
Sequential Write: 925MB/s IOPS=3

512KB Read: 2863MB/s IOPS=5727
512KB Write: 930MB/s IOPS=1861

Sequential Q32T1 Read: 2929MB/s IOPS=366
Sequential Q32T1 Write: 928MB/s IOPS=116

4KB Read: 1025MB/s IOPS=262564
4KB Write: 663MB/s IOPS=169870

4KB Q32T1 Read: 1207MB/s IOPS=309132
4KB Q32T1 Write: 655MB/s IOPS=167782

4KB Q8T8 Read: 7107MB/s IOPS=1819621
4KB Q8T8 Write: 3130MB/s IOPS=801293
 
Thank you for reporting back. I would be interested in the IO delay. Can you check with ioping?

No problem. Here are a few that I think are the most relevant. I can do some more if there's anything specific you want to see. Although I haven't tried it, I expect it's possible to install Rocky onto an fsdax or raw namespace for increased performance.

Please take the results with a huge dose of salt, as I don't really know what I am doing.

The BIOS memory controller setting for the Optane DIMMs is optimized for low latency rather than bandwidth.

fsdax namespace with sector size 512, XFS mounted with DAX, 2M aligned, on the Proxmox host

Code:
--- . (xfs /dev/pmem0.2 147.6 GiB) ioping statistics ---
23 requests completed in 338.2 us, 92 KiB read, 68.0 k iops, 265.6 MiB/s
generated 24 requests in 23.8 s, 96 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 6.42 us / 14.7 us / 25.7 us / 3.08 us

fsdax namespace passed to Rocky VM with DAX mount, 2M aligned

Code:
--- . (xfs /dev/pmem0 98.2 GiB) ioping statistics ---
23 requests completed in 415.8 us, 92 KiB read, 55.3 k iops, 216.1 MiB/s
generated 24 requests in 23.9 s, 96 KiB, 1 iops, 4.02 KiB/s
min/avg/max/mdev = 7.88 us / 18.1 us / 28.3 us / 5.93 us

Namespace changed to sector mode and passed to Rocky VM; Rocky installed on it as root, with EFI and /boot on a QEMU disk

Code:
--- . (xfs /dev/pmem1s1 49.7 GiB) ioping statistics ---
28 requests completed in 2.03 ms, 112 KiB read, 13.8 k iops, 53.8 MiB/s
generated 29 requests in 28.6 s, 116 KiB, 1 iops, 4.05 KiB/s
min/avg/max/mdev = 30.6 us / 72.6 us / 89.5 us / 16.8 us
 
Thank you. 15µs is fast. The throughput was not that good in comparison to modern NVMe, but I had hoped that the latency would be good, which it is.
 
Yeah, I was surprised the throughput is as low as it is, though with higher queue depths and more threads it seems good. I will have to try the bandwidth-optimized setting in the BIOS too.

Also, this is only 2 interleaved DIMMs. I made the mistake of buying 256GB DIMMs before I realised I would need an M or L CPU to get maximum bandwidth. I haven't found a cheap M or L SKU with similar single-core performance to the 6230. If I can get passthrough, or better yet booting, of a Windows VM to work, and backing up is possible, then I will probably get 6x (or possibly 12x with dual CPUs) 128GB DIMMs for increased bandwidth.
 
Changing the BIOS setting from latency-optimized to bandwidth-optimized had quite a big effect, although I don't understand why the 256M sequential read got worse. ioping is the same.


Code:
256m Sequential Read: 1344MB/s IOPS=5
256m Sequential Write: 5493MB/s IOPS=21

512KB Read: 8258MB/s IOPS=16516
512KB Write: 10240MB/s IOPS=20480

8m Sequential Q32T1 Read: 5355MB/s IOPS=669
8m Sequential Q32T1 Write: 6037MB/s IOPS=754

4KB Read: 1212MB/s IOPS=310303
4KB Write: 777MB/s IOPS=199076

4KB Q32T1 Read: 1430MB/s IOPS=366122
4KB Q32T1 Write: 784MB/s IOPS=200907

4KB Q8T8 Read: 3852MB/s IOPS=986201
4KB Q8T8 Write: 1691MB/s IOPS=432994
 
