I want to like VirtIOFS, but...

I am not informed enough to recommend this, and I wouldn't use it on data I'm not ready to lose. I am still using cache=auto, which I set up half a year ago with hook scripts. Compared to cache=always, auto loses about 21% performance with OP's fio command.
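In case anyone wonders what that hook-script setup roughly looks like, here is a minimal sketch (not my exact script): the VMID, share directory and socket path are made-up examples, and a full setup also needs the matching vhost-user-fs chardev/device args in the VM config.
Code:
#!/bin/bash
# Proxmox calls hookscripts as: <script> <vmid> <phase>
vmid="$1"
phase="$2"

# Example values only; adjust to your share and VM.
share_dir="/srv/share"
socket="/run/virtiofsd-${vmid}.sock"

if [ "$phase" = "pre-start" ] && [ "$vmid" = "101" ]; then
    # Start the virtiofs daemon for this VM with the desired cache policy.
    /usr/libexec/virtiofsd \
        --socket-path="$socket" \
        --shared-dir="$share_dir" \
        --cache=auto &
fi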

If NFS has no issues for you, then I don't see any reason to migrate, as you won't get better performance with virtiofs.
Thanks for this. I was just about to post something similar.

I've never messed with the default (no cache) for my VirtIO SCSI disks, so I'm not really clear on what it means to switch VirtIO FS to cache=always. What is that actually doing at the filesystem level? Does it actually change whether ZFS is working in async/sync mode?
 
Does it actually change whether ZFS is working in async/sync mode?
I did a bit more research; it seems the cache policy mode is about metadata and paths... not data? As such, I assume it just affects reads of metadata and paths. Not sure how that would affect sync/async (which I thought was about writes), but I am still new to ZFS on my TrueNAS server and just have CephRBD and LVM on my Proxmox.

  • cache=always: Metadata, data, and pathname lookup are cached in the guest and never expire.
  • cache=auto: Metadata and pathname lookup cache expires after a configured amount of time (default is 1 second).
  • cache=none: Forbids the FUSE client from caching to achieve the best coherency at the cost of performance.
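The cache policy ultimately ends up as a flag on the virtiofsd command line on the host, so if you're unsure what a share is actually running with, something like this should show it (at least that's how I'd check):
Code:
# On the PVE host: list running virtiofsd instances with their full arguments;
# look for the --cache=... flag of the share in question.
ps -ww -C virtiofsd -o pid,args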

Looking at https://github.com/kata-containers/runtime/issues/2748, it seems to depend on what is writing to the file system (host vs. guest): if the host changes metadata or a path, virtioFS gets 'funky'. So always is only safe where it is guaranteed that only the guest will write to the virtioFS share, i.e. that is the only way the paths and metadata on the storage backing the virtioFS can be changed.

So for my cephFS-backed virtioFS it would seem 'always' could be an issue in case another node changes the metadata or paths in the cephFS, but the only scenario where I could see this being a problem is the following; this is how I interpret it based on the GitHub link above:

  1. Swarm container FOO is running on vm-docker01 on pve1; everything works fine with always.
  2. Swarm container FOO moves from vm-docker01 on pve1 to vm-docker02 on pve2 and makes changes to the metadata or paths there; things are still fine at this point (the virtioFS instance on pve1 just still has the old entries cached).
  3. Swarm container FOO moves back from vm-docker02 on pve2 to vm-docker01 on pve1; now, because the virtioFS instance on pve1 never saw the VM write those changes to the metadata/paths, it will serve the wrong (stale) metadata to the VM.
I don't know how the processes in the container would respond at that point... I won't bother testing; I will just always leave it at auto based on this!
 
I think one major benefit of virtiofs over NFS is that you don't have to worry about doing your writes sync (and using a SLOG for security)?
 
Having the same performance issues.

Mount on host gets 36.k Read Iops/19.6K Write Iops with OP's command.
In the guest mounted with virtiofs I only get 3576 Read Iops/1928 Write Iops.

Big oof if you ask me.
 
VirtioFS is intended as a shared user-space file system between multiple containers. I believe it is still mostly single threaded, and thus using it in VMs has significant overhead from context switching and synchronizing file system semantics. There are improvements being built, such as shared memory pages between host and guest and multi-queue support in Linux, but that will also take a lot of work to percolate down.
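If you want to see that for yourself, you can count the threads of a running virtiofsd instance on the host, for example (assuming a single instance; adjust the PID selection if you have several shares attached):
Code:
# Show the threads of the newest virtiofsd process on the host.
ps -L -p "$(pgrep -n virtiofsd)" -o pid,tid,psr,comm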

Whether it is “better” than NFS or CephFS or a block device depends on your needs.
 
Hi,
a user mentions that setting a custom option --thread-pool-size improved the situation for them: https://bugzilla.proxmox.com/show_bug.cgi?id=6370

If you want to test whether that helps with your performance issues too, and you're not afraid to get your hands dirty, you'll need to set it manually for now: either in the Perl code (in /usr/share/perl5/PVE/QemuServer/Virtiofs.pm, then run systemctl reload-or-restart pvedaemon.service pveproxy.service pvescheduler.service) or by replacing the binary with a wrapper. Note that messing up the code will lead to errors; you can reinstall with apt install --reinstall qemu-server virtiofsd. Both kinds of change will be lost after updates, so this is just for testing. Check with ps aux that the virtiofsd process is then actually started with the additional parameter.
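For the wrapper route, the rough idea is to move the real binary aside and drop a small shell script in its place that injects the extra option. A sketch (the virtiofsd.org name is just a convention, and a package update will overwrite the wrapper again):
Code:
# Test setup only: move the real binary aside and replace it with a wrapper.
mv /usr/libexec/virtiofsd /usr/libexec/virtiofsd.org
cat > /usr/libexec/virtiofsd <<'EOF'
#!/bin/bash
# Pass through all arguments supplied by Proxmox, plus the option under test.
exec /usr/libexec/virtiofsd.org --thread-pool-size=64 "${@}"
EOF
chmod +x /usr/libexec/virtiofsd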
 
It was actually me who filed that bug :).
It only reduced the number of hangs in the VMs (note reduced, as it still craps the bed a bit too often :/).
Sadly it didn't improve performance.
I did try different thread counts (1/4 of total cores, 1/2, 1/1, ...) but there wasn't much of a difference.
 
My experience is that virtiofs bridging of a cephFS mount is significantly faster than mounting a cephFS volume in the VM (I haven't tried RBD yet).

Now it could be that I have messed up using the cephFS kernel driver to mount the volume across the loopback network; I'm not sure. I am in the middle of a post to ask for help on that, but here is a preview: this is the same volume mounted two different ways.

virtioFS passing cephFS from the host mount into a guest mount
Code:
| Test           | Read MB/s  | Write MB/s | Read IOPS | Write IOPS |
|----------------|------------|------------|-----------|------------|
| seqwrite-1M    | 0          | 2145MiB/s  | -         | 2145       |
| seqread-1M     | 3488MiB/s  | 0          | 3488      | -          |
| randrw-4k      | 118MiB/s   | 50.6MiB/s  | 30.2k     | 13.0k      |

libceph in the vm accessing the same cephFS volume across the loopback network (in theory the same way the host is accessing the cephFS)
Code:
| Test           | Read MB/s  | Write MB/s | Read IOPS | Write IOPS |
|----------------|------------|------------|-----------|------------|
| seqwrite-1M    | 0          | 984MiB/s   | -         | 983        |
| seqread-1M     | 1601MiB/s  | 0          | 1601      | -          |
| randrw-4k      | 5899KiB/s  | 2518KiB/s  | 1474      | 629        |

(sudo mount -t ceph :/ /mnt/docker-libceph -o name=docker-cephfs,secretfile=/etc/ceph/docker-cephFS.secret,conf=/etc/ceph/ceph.conf,fs=docker; not sure if there are params I should set to make it faster?)

I suspect that in the first test I am seeing some of the metadata and other caching provided by virtioFS and possibly QEMU. The output is from a script I wrote to wrap fio so I can run repeatable tests and summarize the output, so it's possible I also messed that up and the tests are bad.
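For what it's worth, the wrapper is just a few fio invocations plus some output parsing; the sketch below shows the shape of it (the real script isn't posted here, so the target path, sizes and iodepths are only illustrative):
Code:
#!/bin/bash
# Rough sketch of the fio jobs behind the three table rows above; parameters are examples.
TARGET="${1:-/mnt/test/fio.dat}"

fio --name=seqwrite-1M --filename="$TARGET" --rw=write  --bs=1M --size=4G \
    --ioengine=libaio --direct=1 --iodepth=16 --group_reporting
fio --name=seqread-1M  --filename="$TARGET" --rw=read   --bs=1M --size=4G \
    --ioengine=libaio --direct=1 --iodepth=16 --group_reporting
fio --name=randrw-4k   --filename="$TARGET" --rw=randrw --bs=4k --size=1G \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=30 --time_based --group_reporting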

I continue to investigate the loopback network I have for the VM to see if I made some dumb mistake there...

Haven't seen any issues with hangs yet.

--edit--
No mystery after all: the first result is definitely due to virtiofs metadata caching and whatever QEMU does. This is the same test run on the Proxmox host (why I didn't think to do that before... sigh), i.e. the host connecting to cephFS, with results similar to the VM.


Code:
| Test           | Read MB/s  | Write MB/s | Read IOPS | Write IOPS |
|----------------|------------|------------|-----------|------------|
| seqwrite-1M    | 0          | 853MiB/s   | -         | 852        |
| seqread-1M     | 1755MiB/s  | 0          | 1755      | -          |
| randrw-4k      | 6401KiB/s  | 2739KiB/s  | 1600      | 684        |
 
A filesystem always does write buffering before flushing, plus read-ahead, which fetches the next file blocks before the application requests them.
It isn't that way when using block storage, as you cannot foresee which blocks will be needed next.
 
ElectronicsWizardry did some storage testing that really helped me get a handle on what to expect out of VirtIOFS right now.
See: https://www.youtube.com/watch?v=d_zlMxkattE

I actually see this as very useful for getting small files from the host to a container without needing to set up NFS or SMB. That's a bit overkill just to copy in an 8KB configuration file or even a whole home directory's worth of config dotfiles. And moving a ton of small files around doesn't need the best sustained throughput.
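The guest side is pleasantly simple for that use case; assuming the share is exported with the tag hostshare (made-up name), it's just:
Code:
# Inside the guest: mount the virtiofs share by its tag and copy files in.
mkdir -p /mnt/hostshare
mount -t virtiofs hostshare /mnt/hostshare
cp /mnt/hostshare/config.yaml /etc/myapp/   # example paths only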
 
Did some more testing and I'm getting way better performance now.
Updated both Proxmox and the Ubuntu VMs to kernel 6.11.

Running mounts with cache=never,direct-io=1,expose_acl=1

Custom options set via script (/usr/libexec/virtiofsd) like so:
Code:
#!/bin/bash
/usr/libexec/virtiofsd.org --thread-pool-size=64  --inode-file-handles=never "${@}"

I now get 25.4K read IOPS / 13.7K write IOPS in the guest, compared to 29.8K read IOPS / 16.1K write IOPS on the host.

Seems that cache=never does the trick funnily enough.
 
Seems that cache=never does the trick funnily enough.
or it was the other 4 settings in combination...

keep all the other settings you modified, turn the metadata cache back on, and see if you get a regression in perf. If you do, then it means the virtiofs metadata caching is less efficient than the native caching QEMU does. Though IIRC, having direct-io and caching on together caused me issues.

also, direct-io only matters if your app supports the O_DIRECT flag; if it doesn't, then it will have no effect. I am still confused about whether one can use DAX or not in the Proxmox implementation, and in which scenarios that would help.
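A quick way to check whether O_DIRECT works at all on a given share is something like dd with oflag=direct (example path, small size on purpose); an error here usually means direct I/O isn't supported on that mount:
Code:
# Inside the guest: write a small file with O_DIRECT to the shared mount.
dd if=/dev/zero of=/mnt/hostshare/directio-test bs=1M count=64 oflag=direct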
 
Did some more testing and I'm getting way better performance now.
Updated both Proxmox and the Ubuntu VMs to kernel 6.11.

Running mounts with cache=never,direct-io=1,expose_acl=1

Custom options set via script (/usr/libexec/virtiofsd) like so:
Code:
#!/bin/bash
/usr/libexec/virtiofsd.org --thread-pool-size=64  --inode-file-handles=never "${@}"

I now get 25.4K read IOPS / 13.7K write IOPS in the guest, compared to 29.8K read IOPS / 16.1K write IOPS on the host.

Seems that cache=never does the trick funnily enough.
Nice. Glad it improved so much.

What does the inode related flag and the "${@}" argument do here? /usr/libexec/virtiofsd.org --thread-pool-size=64 --inode-file-handles=never "${@}"
 
or it was the other 4 settings in combination...

keep all the other settings you modified, turn the metadata cache back on, and see if you get a regression in perf. If you do, then it means the virtiofs metadata caching is less efficient than the native caching QEMU does. Though IIRC, having direct-io and caching on together caused me issues.

also, direct-io only matters if your app supports the O_DIRECT flag; if it doesn't, then it will have no effect. I am still confused about whether one can use DAX or not in the Proxmox implementation, and in which scenarios that would help.
I did actually test with one guest first, with the only difference being cache set to auto vs. none. With it set to auto I was getting 6K read / 3K write as before.
I tried the dax option, which gave me an error saying the fs didn't support it, but I didn't look much into it.
I left direct-io enabled just in case something can make use of it.
 
Nice. Glad it improved so much.

What does the inode related flag and the "${@}" argument do here? /usr/libexec/virtiofsd.org --thread-pool-size=64 --inode-file-handles=never "${@}"
never is the default, IIRC. Played around with it a bit but left it in there in case I need it again (pretty sure Windows needs it set to mandatory).
"${@}" just contains the arguments that Proxmox calls it with.
 
I did actually test with one guest first, with the only difference being cache set to auto vs. none. With it set to auto I was getting 6K read / 3K write as before.
I tried the dax option, which gave me an error saying the fs didn't support it, but I didn't look much into it.
I left direct-io enabled just in case something can make use of it.

Cool, interesting that it made that much of a difference, as auto only caches metadata for 1 second. What benchmark did you use?
Also, I found that the caching QEMU does in general had a far bigger impact, irrespective of none vs. auto, as the speed test was faster than the underlying filesystem (Ceph in my case) could possibly deliver... Remember, just because the guest is not caching doesn't mean the host isn't still caching... I wonder if that's the reason...
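One way to take the host page cache out of the equation between runs is to drop it on the PVE host before each test:
Code:
# On the host: flush dirty data, then drop the page cache, dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches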

- cache=none
metadata, data and pathname lookup are not cached in guest. They are always
fetched from host and any changes are immediately pushed to host.

- cache=auto
metadata and pathname lookup cache expires after a configured amount of time
(default is 1 second). Data is cached while the file is open (close to open
consistency).


I will do another run this week. The one conclusion I have is: *NEVER* set the cache to always... unless you are absolutely sure only the guest can ever write to the underlying filesystem...

- cache=always
metadata, data and pathname lookup are cached in guest and never expire.
 
My results:


Code:
cache=auto
| Test           | Read MB/s  | Write MB/s | Read IOPS | Write IOPS |
|----------------|------------|------------|-----------|------------|
| seqwrite-1M    | 0          | 2166MiB/s  | -         | 2166       |
| seqread-1M     | 3548MiB/s  | 0          | 3547      | -          |
| randrw-4k      | 121MiB/s   | 52.1MiB/s  | 31.1k     | 13.3k      |

cache=none
| Test           | Read MB/s  | Write MB/s | Read IOPS | Write IOPS |
|----------------|------------|------------|-----------|------------|
| seqwrite-1M    | 0          | 4862MiB/s  | -         | 4862       |
| seqread-1M     | 3119MiB/s  | 0          | 3118      | -          |
| randrw-4k      | 142MiB/s   | 60.7MiB/s  | 36.2k     | 15.5k      |

This matches what I read: cache=none can help writes but hurt reads, depending on the workload.
 