[SOLVED] Windows VM Poor Performance on Ceph

BadAtIT

New Member
Jan 15, 2024
I am doing some testing to scope out future deployments for clients. We are trying to use Proxmox and Ceph for an HCI environment that will run critical infrastructure systems, so HA and live migration for maintenance are very important. The smallest clients we have would be a 3-host system, which I know is not ideal for Ceph, but this is enterprise equipment, all with plenty of RAM and 10G+ networks, so I am hoping to standardize even for the smaller deployments.

I have spun up Proxmox on 3 hosts for testing and created a Ceph storage pool running HA, and everything looked great. I did full network testing and system tuning in an attempt to optimize Ceph, e.g. using tuned with the profile set to network-latency, but running a Windows VM on the cluster I was surprised by the low performance considering the equipment. Per CrystalDiskMark I am only getting 187 MB/s sequential write and 107 MB/s sequential read on a test VM running Windows Server 2022. The Ceph pool is composed of only 3 OSDs, all NVMe drives that were giving approx. 3.5 GB/s sequential write throughput measured via fio before being added to the Ceph pool. The Ceph cluster network is 40G with jumbo frames, and I was getting a consistent 34 Gb/s throughput on it measured via iperf3.

I am new to Proxmox and Ceph; almost all of my professional experience is with VMware, but with the Broadcom acquisition we are looking at options. Is there anything I can do to increase my read/write speed on Windows, or is this an issue with Windows not playing well with RBD? The test build I have to play with is below.

Node 1: AMD EPYC 7764, 64 cores, 256 GB RAM, Mellanox ConnectX 40G card
Node 2: Dual Intel Xeon E5-2680, 28 cores total, 128 GB RAM, Mellanox ConnectX 40G card
Node 3: Dual Intel Xeon E5-2680, 28 cores total, 128 GB RAM, Mellanox ConnectX 40G card
Cisco Nexus switch, QSFP+ 40G
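If it helps anyone narrow this down, I can also benchmark the pool directly from one of the nodes to take the guest and virtio layer out of the picture; something like this should work (the pool name vmpool is just a placeholder for my actual RBD pool):

Code:
    # 60-second write test against the pool, keeping the objects for the read tests
    rados bench -p vmpool 60 write --no-cleanup
    # sequential and random reads of the objects written above
    rados bench -p vmpool 60 seq
    rados bench -p vmpool 60 rand
    # remove the benchmark objects afterwards
    rados -p vmpool cleanup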
 
What disks are you using? How many pools do you have? How many PGs do they have? What switches are you using? Did you configure layer 3+4 hashing? Did you configure performance mode in the BIOS?

Please post your VM config (qm config VMID).
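For reference, something like this will dump it (assuming your test VM's ID is 100):

Code:
    qm config 100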
 
That is slow sequential read/write, but Ceph is really slow unless you adhere to the recommended standards, i.e. five nodes with multiple OSDs per node.
I'm new to Ceph myself, but my limited testing yields similar results, though not that bad.

I have only used very old HGST 200 GB SAS SSDs, 6 OSDs per node in a 3-node cluster.
I only did benchmarks with fio. You can find my results in the forum.

I would suggest that you look at the VM settings: VirtIO SCSI, writeback caching, etc. Verify the jumbo frames just in case.
Did you run iperf between the same two Windows VMs that you are running benchmarks on?

BTW, using different CPU architectures in the same HA cluster is not a good idea.
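For the jumbo frame check, something like this is a quick sanity test (the addresses are placeholders):

Code:
    # 8972 = 9000 MTU minus 28 bytes of ICMP/IP overhead; run between nodes on the Ceph network
    ping -M do -s 8972 <other-node-ceph-ip>
    # raw throughput between the two endpoints you actually benchmark from
    iperf3 -s                          # on one side
    iperf3 -c <other-side-ip> -t 30    # on the other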
 
What disks are you using? How many pools do you have? How many PGs do they have? What switches are you using? Did you configure layer 3+4 hashing? Did you configure performance mode in the BIOS?

Please post your VM config (qm config VMID).
The disks are consumer-grade Samsung 980 Pros and 990 Pros. Keep in mind this is not going into production and is POC testing only (I know a POC should represent the real world, but it's what I have lying around). There is only one pool with 32 PGs set to auto-scale. The switch is a Cisco Nexus 3132Q. I did not configure layer 3+4 hashing; I thought Ceph performed fine on layer 2+3? Turbo mode has been configured in the BIOS, all CPU cores are active and never sleep, energy-saving features are disabled, etc. I can almost guarantee the issue is due to user error; like I said, I'm new to Ceph and am learning as I go. Thank you very much for your help, qm config below.
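For reference, this is how I have been checking the PG/autoscaler side of things (just standard Ceph commands, nothing exotic):

Code:
    # current PG counts and what the autoscaler wants them to be
    ceph osd pool autoscale-status
    # per-OSD utilization and PG distribution
    ceph osd df tree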
 

Attachments

  • Qme Config.png
    Qme Config.png
    35.6 KB · Views: 19
That is slow sequential read/write, but Ceph is really slow unless you adhere to the recommended standards, i.e. five nodes with multiple OSDs per node.
I'm new to Ceph myself, but my limited testing yields similar results, though not that bad.

I have only used very old HGST 200 GB SAS SSDs, 6 OSDs per node in a 3-node cluster.
I only did benchmarks with fio. You can find my results in the forum.

I would suggest that you look at the VM settings: VirtIO SCSI, writeback caching, etc. Verify the jumbo frames just in case.
Did you run iperf between the same two Windows VMs that you are running benchmarks on?

BTW, using different CPU architectures in the same HA cluster is not a good idea.
I know 5+ nodes would be ideal, but three is what I have to test with, and as I stated, I am hoping to standardize on Ceph even for our smaller three-node deployments if I can get halfway decent performance out of it. I do have a bunch of extra disks around and can increase the OSDs per node.

I'll check the VirtIO SCSI settings; I have not done that. The Windows VMs are on a different network than the Ceph cluster, so iperf3 would only show the throughput of the vmbr attached to the VMs, right? Am I supposed to attach a vmbr from the Ceph cluster network to the VM? I was under the impression the cluster network did not need direct access to the VM via a vmbr, since the VM was created on the Ceph storage pool.

Yeah, mixing processor families in an HA cluster is a big no-no. It's what I had to work with for testing; the AMD node is just handling quorum.

Thank you for your reply, I'm sure my issues are due to my ignorance of the platform. I'll check out the VirtIO SCSI settings. Do you think I completely botched the network setup for Ceph? I set the public network to my 1G management network and the cluster network on the 40G. The Windows VM of course was created on the Ceph pool but has a separate vmbr for the domain network (separate from the Ceph public and cluster networks).
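In case it matters, the relevant part of my ceph.conf looks roughly like this (the subnets here are placeholders, not my real ones):

Code:
    # /etc/pve/ceph.conf
    [global]
        public_network  = 10.10.10.0/24    # currently the 1G management subnet
        cluster_network = 10.10.40.0/24    # the 40G subnet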
 
Per CrystalDiskMark I am only getting 187 MB/s sequential write and 107 MB/s sequential read on a test VM running Windows Server 2022.
That may or may not indicate a problem; post the actual screencap. Also, be aware that guest CPU allotment has a direct impact on this performance, so see if bumping the core count up to 8 helps. Also, ballooning performance sucks on Windows guests; I'd disable it.

The disks are consumer grade Samsung 980 pros and 990 pro.
How many? Please post the output of ceph osd tree. To also eliminate other possible issues/bottlenecks, please post /etc/pve/ceph.conf and /etc/network/interfaces.
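Something along these lines should do it (VMID 100 is just an example):

Code:
    # bump the core count and turn off ballooning for the test VM
    qm set 100 --cores 8 --balloon 0
    # info requested above
    ceph osd tree
    cat /etc/pve/ceph.conf
    cat /etc/network/interfaces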
 
That may or may not indicate a problem; post the actual screencap. Also, be aware that guest CPU allotment has a direct impact on this performance, so see if bumping the core count up to 8 helps. Also, ballooning performance sucks on Windows guests; I'd disable it.


How many? Please post the output of ceph osd tree. To also eliminate other possible issues/bottlenecks, please post /etc/pve/ceph.conf and /etc/network/interfaces.
Yeah, CrystalDiskMark (or any benchmark test I know of) really doesn't handle paravirtualization well. I am comparing to testing I've done with VMware vSAN on a similar setup.

I'm currently reconfiguring some VM settings; I'll adjust the core count as well and see how it does.

Ceph only has 3 OSDs, one per node. I know that's not ideal; I have some other identical drives lying around and will increase the OSD count after tweaking some other settings. I like to do one thing at a time to properly gauge performance gains.

Let me make some changes based on your feedback and I'll post some results as well as the Ceph and network configs. Thanks for your reply; this is my first time posting to the Proxmox community and you have all been very helpful, thank you.
 
Okay, I found my issue. As much as I preach RTFM, I didn't thoroughly review some basic settings myself, shame on me. Embarrassingly enough, I thought that the public network was used for administration only and that the cluster network carried the OSD and VM traffic for RBD reads/writes, which is not the case. The public network was on a measly 1G, causing the bottleneck. I moved everything over to a separate 40G port subnetted out from the cluster network and voilà!

I had to reconfigure the /etc/pve/ceph.conf file and reconstruct my managers and mons. I had some trouble doing this; I found it was best to edit ceph.conf to change the public network, then delete the mons and add them back one by one through the GUI. They all came back with the exception of one. I could not delete it from the GUI as it had been left orphaned and still existed in /var/lib/ceph/mon/. Once I found it, I moved it out of there and was able to create the mon and manager from the GUI. Impressively, the Windows VM stayed running without a hiccup, which is great to know.

Attached is the speed I am getting from my Windows Server 2022 VM after properly configuring things; keep in mind my sequential read was only ~180 MB/s before setting it up correctly. The random read/write is still pretty bad, if anyone has some ideas? Thanks for all the help!
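For anyone hitting the same thing, the CLI equivalent of what I did through the GUI is roughly this (the node name pve1 is an example):

Code:
    # per node: remove the monitor still bound to the old 1G public network, then recreate it
    pveceph mon destroy pve1
    pveceph mon create
    pveceph mgr create
    # the stuck one had a leftover directory under here
    ls /var/lib/ceph/mon/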
 

Attachments

  • Ceph Server 2022 40G.png
    Ceph Server 2022 40G.png
    113.9 KB · Views: 28
Also note that was a poor screenshot; for sequential write I have seen it go as high as 500-550 MB/s. The lack of consistency is probably due to the used consumer-grade disks.
 
Enable writeback cache and reduce network latency.

And read the testing and optimization guides and tests that have been done by others and published in the forum.
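For the writeback part, something like this works, e.g. (the VMID, storage, and disk names are placeholders):

Code:
    # re-specify the existing disk with writeback cache enabled
    qm set 100 --scsi0 cephpool:vm-100-disk-0,cache=writeback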
 
Any ideas to increase write performance? I haven't found any smoking guns.
Don't use consumer drives... (really, you'll get around 400 IOPS with 4K writes versus 10,000-20,000 IOPS with SSD/NVMe drives that have supercapacitors).

Ceph uses sync writes for its journal, so performance will be a disaster with consumer SSDs.
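You can see this for yourself with a sync-write fio test; something like this (the device path is a placeholder, and the test writes to the raw device, so only use a spare/empty drive):

Code:
    fio --name=synctest --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 --sync=1 \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based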
 
Any luck with your troubleshooting?

I'm experiencing very similar performance. I have a 5-node cluster with all SSDs and NVMe.

1706031246404.png



Okay, I found my issue. As much as I preach RTFM, I didn't thoroughly review some basic settings myself, shame on me. Embarrassingly enough, I thought that the public network was used for administration only and that the cluster network carried the OSD and VM traffic for RBD reads/writes, which is not the case. The public network was on a measly 1G, causing the bottleneck. I moved everything over to a separate 40G port subnetted out from the cluster network and voilà!

I had to reconfigure the /etc/pve/ceph.conf file and reconstruct my managers and mons. I had some trouble doing this; I found it was best to edit ceph.conf to change the public network, then delete the mons and add them back one by one through the GUI. They all came back with the exception of one. I could not delete it from the GUI as it had been left orphaned and still existed in /var/lib/ceph/mon/. Once I found it, I moved it out of there and was able to create the mon and manager from the GUI. Impressively, the Windows VM stayed running without a hiccup, which is great to know.

Attached is the speed I am getting from my Windows Server 2022 VM after properly configuring things; keep in mind my sequential read was only ~180 MB/s before setting it up correctly. The random read/write is still pretty bad, if anyone has some ideas? Thanks for all the help!
 

Attachments

  • 1706031219339.png
    1706031219339.png
    83.7 KB · Views: 17
I don't run any production Ceph clusters on solid-state drives; yup, it's all 15K SAS HDDs. I don't run Windows either, just Linux VMs. If you are going to use solid-state drives, you want enterprise solid-state storage with power-loss protection (PLP).

With that being said, I use the following VM optimization learned through trial-and-error. YMMV.

Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to Writeback
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type to 'Host'
    Set VM CPU NUMA on servers with 2 or more physical CPU sockets
    Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
    Set VM Qemu-Guest-Agent software installed
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option
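If it helps, rough CLI equivalents for a few of the VM-side items above (VMID 100 and the storage name cephpool are examples; adjust to your setup):

Code:
    qm set 100 --scsihw virtio-scsi-single
    qm set 100 --scsi0 cephpool:vm-100-disk-0,cache=writeback,iothread=1,discard=on
    qm set 100 --cpu host --numa 1
    qm set 100 --net0 virtio,bridge=vmbr0,queues=4
    qm set 100 --agent enabled=1
    pvesm set cephpool --krbd 1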
 
Any luck with your troubleshooting?
Enabling writeback cache helped slightly, but I think I have hit the limitations of these consumer drives. Here is the last benchmark after enabling writeback cache.
1706231073514.png

As others have stated, running consumer drives is a bad idea; they have nowhere near the IOPS of enterprise drives. I only had this set up for testing and was able to accomplish what I wanted with the equipment on hand. Here are the steps I took configuring Ceph; if anyone thinks there's something I should have done differently, please chime in.

1. Created separate networks for the public and cluster traffic (possibly overkill with this few OSDs on 40G) and enabled jumbo frames on both networks
2. Installed "tuned" and set the profile to network-latency, rebooted the servers, and made sure the changes took in the BIOS
3. Confirmed network speed with iperf3, averaging around 36 Gb/s
4. Added all of my OSDs and created separate device classes for NVMe, SSD, and HDD
5. Created CRUSH rules for each drive class (see the sketch below)
6. Created pools with the proper CRUSH rule for each pool
7. Enabled writeback cache on RBD

I ended up with 18 OSDs in total. Each node has 2 NVMe, 2 SSD, and 2 HDD drives. I did this to test performance with different drive types and CRUSH rules. The results posted above are from the NVMe pool.
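For steps 5 and 6, the commands were along these lines (the rule and pool names here are just examples):

Code:
    # one replicated rule per device class, host as the failure domain
    ceph osd crush rule create-replicated replicated_nvme default host nvme
    ceph osd crush rule create-replicated replicated_ssd  default host ssd
    ceph osd crush rule create-replicated replicated_hdd  default host hdd
    # assign a rule to the matching pool
    ceph osd pool set nvme-pool crush_rule replicated_nvme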
 
