Ceph performance with simple hardware. Slow writing.

adriano_da_silva · Sep 22, 2021

I have read on Ceph's official website that their proposal is a distributed storage system with common hardware. He goes further, saying that it is recommended to use SSD and networks starting at 10 Gbps, but that it can work normally with HD's and Gigabit networks when the load is small.

Well, I'm using 7 old nodes with HD's and 4 gigabit ports (LACP). Intel Xeon X5650 processors, about 64GB of Ram. Some with 32GB of RAM;

Connected by stacked HP A5120 switches.

VM's run with serious write performance problem.

Windows Server VMs run with paravirtualized VIrtIO SCSI KVM disk drivers, which has improved performance a little bit, but is still quite difficult to use.

In disk benchmark software (CrystalDisk), inside the VMs, I managed to receive a report indicating good read performance (230 MB/s sequential and 22 and 2.5 random 4K). For my reality it would be enough. But all write results were much worse, showing 21 MB/s of sequential writes, 1.22 and 0.2 MB/s of 4K random writes.

Can this difference between read and write result be attributed to network or disks?

Would it be that if I invest in some small 128 or 256 GB SSD or NVMe disks exclusively for DB/Wall, putting one in each OSD, would it help speed up sequential and random writes to bring them closer to the results I have today in reading?

Grateful.

AlexLup · Sep 22, 2021

Check ceph osd df and look at the standard deviation, should be as close to zero as possible. This indicates whether your disks are evenly distributed. Check how many PGs you have.

Also check this out:
https://www.youtube.com/watch?v=LlLLJxNcVOY

adriano_da_silva · Sep 23, 2021

Tanks !

In my configuration I have 512 PGs.

Let's see what is returned when I type the suggested command:

Code:

root@pve-11:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS
 0   hdd 0.90959  1.00000 931 GiB  64 GiB  63 GiB 148 KiB 1024 MiB 867 GiB 6.90 0.94 219     up
 1   hdd 0.90959  1.00000 931 GiB  63 GiB  62 GiB  20 KiB 1024 MiB 869 GiB 6.74 0.92 213     up
 3   hdd 0.90919  1.00000 931 GiB  65 GiB  64 GiB 112 KiB 1024 MiB 866 GiB 6.97 0.95 201     up
 6   hdd 0.90970  1.00000 932 GiB  65 GiB  64 GiB  64 KiB 1024 MiB 866 GiB 7.00 0.96 217     up
 2   hdd 0.90959  1.00000 931 GiB  81 GiB  80 GiB   4 KiB 1024 MiB 850 GiB 8.70 1.19 246     up
 4   hdd 0.90970  1.00000 932 GiB  70 GiB  69 GiB  68 KiB 1024 MiB 861 GiB 7.52 1.03 227     up
 5   hdd 0.90970  1.00000 932 GiB  69 GiB  68 GiB 160 KiB 1024 MiB 863 GiB 7.41 1.01 213     up
                    TOTAL 6.4 TiB 477 GiB 470 GiB 577 KiB  7.0 GiB 5.9 TiB 7.32                 
MIN/MAX VAR: 0.92/1.19  STDDEV: 0.62

Any suggestion?

AlexLup · Sep 23, 2021

That looks pretty evenly, are the disks evenly spread over the hosts you have?

adriano_da_silva · Sep 23, 2021

Yes, I currently have one OSD per Host. In this case, one HD for each host. There are 7 hosts in total. I plan to increase this in the future. I plan to put at least three disks per host. I thought of putting maybe one flash drive per host, just for DB and Wall of OSDs.

Klaus Steinberger · Sep 27, 2021

adriano_da_silva said:
I have read on Ceph's official website that their proposal is a distributed storage system with common hardware. He goes further, saying that it is recommended to use SSD and networks starting at 10 Gbps, but that it can work normally with HD's and Gigabit networks when the load is small.

Well, I'm using 7 old nodes with HD's and 4 gigabit ports (LACP). Intel Xeon X5650 processors, about 64GB of Ram. Some with 32GB of RAM;

Connected by stacked HP A5120 switches.

VM's run with serious write performance problem.

Windows Server VMs run with paravirtualized VIrtIO SCSI KVM disk drivers, which has improved performance a little bit, but is still quite difficult to use.

In disk benchmark software (CrystalDisk), inside the VMs, I managed to receive a report indicating good read performance (230 MB/s sequential and 22 and 2.5 random 4K). For my reality it would be enough. But all write results were much worse, showing 21 MB/s of sequential writes, 1.22 and 0.2 MB/s of 4K random writes.

Can this difference between read and write result be attributed to network or disks?

Would it be that if I invest in some small 128 or 256 GB SSD or NVMe disks exclusively for DB/Wall, putting one in each OSD, would it help speed up sequential and random writes to bring them closer to the results I have today in reading?

Grateful.

Spinning HD's and a 1 GIgabit network are just not fast enough for the specific workload of VM's
CEPH is very latency dependant.
A setup with spinning disks and a 1 Gigabit/s network is just good for a read intensive buld storage or a experimental setup.

don't ever try it with VM's

adriano_da_silva · Sep 27, 2021

Klaus Steinberger said:
Spinning HD's and a 1 GIgabit network are just not fast enough for the specific workload of VM's
CEPH is very latency dependant.
A setup with spinning disks and a 1 Gigabit/s network is just good for a read intensive buld storage or a experimental setup.

don't ever try it with VM's

Tanks for reply!

The only thing that is officially understood on the Ceph website is that with large loads you should not use Gbps. And I mean by "large" loads, many VMs running in a cluster, many clients accessing or large volumes of data (many terabytes) which is definitely not my case.

However, this information you bring seems true, but it is not common to find it on the internet. You can only find it when you specifically look for it, that is, when the problem is already occurring.

Turns out I can't get away from it right now. I have the setup in production and I need to tune it to get the best I can with this networking hardware. Even why, 10 Gbps networks or more are prohibitively expensive for small or medium businesses here in my country. I like the robustness of Ceph, the high availability it provides. But I need to find a more cost-effective way to improve that performance. After all, isn't that Ceph's proposal from the start? I wonder when the first stable versions of Ceph, in 2012, how expensive and unusual was a 10 Gbps network??? I believe that only large companies or data centers used networks at this level at that time.

What I want is not great performance, for many VMs or for many clients. I want acceptable performance.

Do you think that improving the disks with flash in DB/Wall, I should improve the writing performance? Can improve latency?

Why in my current case reading performs so much better than writing? Do not both reading and writing use the same network?
Doesn't the same data travel through the gigabit tube?

ness1602 · Sep 27, 2021

Since this is Gigabit, max speed that you can get per Host is 1000mbit/3 = 333mbit/8= 41MB/s . And this is it.
Maybe you can get 2.5gbit or 5gbit?

adriano_da_silva · Sep 28, 2021

ness1602 said:
Since this is Gigabit, max speed that you can get per Host is 1000mbit/3 = 333mbit/8= 41MB/s . And this is it.
Maybe you can get 2.5gbit or 5gbit?

I do LACP with 4 gigabit ports with jumbo frame, exclusive to Ceph. In theory, 4 gigabit ways. Having seven OSD nodes with identical setup sharing the load, I can have something close to 3900 gigabit (full dulpex). I did the test with iperf and got roughly that in simultaneous traffic across ports.

Assuming you can achieve something close to 3.5 gbps in Ceph, in the communication between nodes, then it would have something around 437 MB/s at full load. It will be great if this is confirmed. But today, I can already read 230 MB/s, which is good.. The problem is when I need to WRITE... And my question is this: Why is it so slow to WRITE, but to read it is fast? It doesn't make sense, since for the network, it doesn't matter if the data is coming or going, the speed is the same. The biggest slowdown then is where? In writing spinning disks?

Klaus Steinberger · Sep 29, 2021

adriano_da_silva said:
adriano_da_silva said:

Tanks for reply!

The only thing that is officially understood on the Ceph website is that with large loads you should not use Gbps. And I mean by "large" loads, many VMs running in a cluster, many clients accessing or large volumes of data (many terabytes) which is definitely not my case.

However, this information you bring seems true, but it is not common to find it on the internet. You can only find it when you specifically look for it, that is, when the problem is already occurring.

Turns out I can't get away from it right now. I have the setup in production and I need to tune it to get the best I can with this networking hardware. Even why, 10 Gbps networks or more are prohibitively expensive for small or medium businesses here in my country. I like the robustness of Ceph, the high availability it provides. But I need to find a more cost-effective way to improve that performance. After all, isn't that Ceph's proposal from the start? I wonder when the first stable versions of Ceph, in 2012, how expensive and unusual was a 10 Gbps network??? I believe that only large companies or data centers used networks at this level at that time.

What I want is not great performance, for many VMs or for many clients. I want acceptable performance.

Do you think that improving the disks with flash in DB/Wall, I should improve the writing performance? Can improve latency?

Why in my current case reading performs so much better than writing? Do not both reading and writing use the same network?
Doesn't the same data travel through the gigabit tube?

Click to expand...

I do LACP with 4 gigabit ports with jumbo frame, exclusive to Ceph. In theory, 4 gigabit ways. Having seven OSD nodes with identical setup sharing the load, I can have something close to 3900 gigabit (full dulpex). I did the test with iperf and got roughly that in simultaneous traffic across ports.

Assuming you can achieve something close to 3.5 gbps in Ceph, in the communication between nodes, then it would have something around 437 MB/s at full load. It will be great if this is confirmed. But today, I can already read 230 MB/s, which is good.. The problem is when I need to WRITE... And my question is this: Why is it so slow to WRITE, but to read it is fast? It doesn't make sense, since for the network, it doesn't matter if the data is coming or going, the speed is the same. The biggest slowdown then is where? In writing spinning

adriano_da_silva said:
Tanks for reply!

The only thing that is officially understood on the Ceph website is that with large loads you should not use Gbps. And I mean by "large" loads, many VMs running in a cluster, many clients accessing or large volumes of data (many terabytes) which is definitely not my case.

However, this information you bring seems true, but it is not common to find it on the internet. You can only find it when you specifically look for it, that is, when the problem is already occurring.

Turns out I can't get away from it right now. I have the setup in production and I need to tune it to get the best I can with this networking hardware. Even why, 10 Gbps networks or more are prohibitively expensive for small or medium businesses here in my country. I like the robustness of Ceph, the high availability it provides. But I need to find a more cost-effective way to improve that performance. After all, isn't that Ceph's proposal from the start? I wonder when the first stable versions of Ceph, in 2012, how expensive and unusual was a 10 Gbps network??? I believe that only large companies or data centers used networks at this level at that time.

What I want is not great performance, for many VMs or for many clients. I want acceptable performance.

Do you think that improving the disks with flash in DB/Wall, I should improve the writing performance? Can improve latency?

Why in my current case reading performs so much better than writing? Do not both reading and writing use the same network?
Doesn't the same data travel through the gigabit tube?

Reading is in CEPH always ways faster, as it can be read from the nearest OSD Node, in best case this is the local node.

Writing is a different thing, as CEPH has to mirror it to replicas in the background (on the backend network) and acknowledges to the client after the last replica acknowledge the write.

You can use flash disk for the DB/Wal to speed up writing massively, but be careful which SSD's you choose.
Cheapest way is to use on SSD for all OSD's on the same node, but this setup is little bit mad to handle in case of SSD / HD failures.
But definitly use server grade SSD's, as these are the only ones who guarantee short latencies and will not be written to the death too fast.
Server Grade SATA/SAS SSD's should not be too expensive as you need only small devices.

For the network: a switch with 8 10 GBit/s RJ45 ports should be available under 1000 € , a single port 10G Base-T card is under 110 € so you can build up the CEPH Backend network with 10 GBit/s for under 2000 €

ness1602 · Sep 29, 2021

I guess LACP doesn't count when CEPH needs to sync writes(to all 3 replicas i guess, or two replicas and one master)

adriano_da_silva · Sep 29, 2021

Klaus Steinberger said:
Reading is in CEPH always ways faster, as it can be read from the nearest OSD Node, in best case this is the local node.

Writing is a different thing, as CEPH has to mirror it to replicas in the background (on the backend network) and acknowledges to the client after the last replica acknowledge the write.

You can use flash disk for the DB/Wal to speed up writing massively, but be careful which SSD's you choose.
Cheapest way is to use on SSD for all OSD's on the same node, but this setup is little bit mad to handle in case of SSD / HD failures.
But definitly use server grade SSD's, as these are the only ones who guarantee short latencies and will not be written to the death too fast.
Server Grade SATA/SAS SSD's should not be too expensive as you need only small devices.

For the network: a switch with 8 10 GBit/s RJ45 ports should be available under 1000 € , a single port 10G Base-T card is under 110 € so you can build up the CEPH Backend network with 10 GBit/s for under 2000 €

Thank you very much for your reply.

The attention we are giving to this subject has been very important to me.

But I believe that if I really need a 10Gbps network on the backend, I'll need to go to another storage solution. And I will explain why.

A while ago I used two computers mirroring data via DRBD and providing it via ISCSI, all on a Gigabit network. And it worked relatively well, with mandatory attention only to the power supply. And even so, it's a model that needs more attention, because it's more delicate.

I thought about switching to Ceph, as it seems like a more robust technology, as I could have failed more than one node without worry. I thought about increased availability. But I didn't imagine it would have this whole bottleneck.

Well, a 10Gbps switch can cost "only" €1000 somewhere in Europe, or something similar in the US, Japan. But can we put such a switch in Brazil for that price?

Even if we can, with the exchange rate today, 1000 € is not that cheap.

Remembering that I'm looking for high availability, since I'm leaving an architecture where I have duplicate switches (in stacks), I would need two switches, it would already be 2000 € and then 14 network cards or at least 7 cards with double ports. we are talking about something around 1000-1400 € more, totaling something around 3400 €. If we are going to import the equipment to Brazil, we have to pay taxes, something around 60%. There, we are already at € 5440, speaking simply here, without observing shipping, any inaccuracies or additional expenses.

Putting this in the exchange rate, it would represent something around R$34,435.00 (in local currency). Not to mention disks, just networking.

So, this is an investment out of my budget for my reality right now.

It would only be possible, perhaps, if he found used equipment at a good price. However, here in Brazil, it is almost impossible to find this type of equipment used today.

Maybe I have to go back to my old model, using DRBD and ISCII, or something similar. I don't want to be forced to do this.. :-(

Again, thank you very much for the responses.

PS.

I don't see a lot of network usage on my Proxmox dashboard. Is latency really much better on the 10Gbps network compared to 1Gbps?

AlexLup · Sep 30, 2021

Hi Adriano,
I have invested in 10gb networking, 3 hosts (not even your 7 ones),Samsung NVME PLPs, bigger HDDs, NVME DBWAL, Cache pools etc and wound up getting disappointed with the numbers at 100mb/sec read and 30mb/sec write cold/warm data with my expensive 10gb networking barely breaking a sweat. I tried all different options, making the pools making bigger PG numbers (to spread out the r/w even more) etc etc and while that did squeze out a bit more - the whole cluster went haywire for a couple of days leaving me without access to my data until I finally got to access my files by shrinking the replica from 3 to 2 and nuking and paving the monitors one by one. So all in all not a very pleasant ride of 2 years.

I am right now playing with ZFS on the HDDs, 32gb of RAM (default) and NVME for readcache (L2ARC) and a kind of a writecache (SLOG+cache) and I got to 4000mb/sec reads directly from RAM (ARC) and 600mb/sec for warm data and about 600mb/sec + 400mb/sec for cold data ON THE SAME HARDWARE in local tests.

I then added Gluster on top of this to get the HA - which brings down the reads/writes to a third but still gives reasonable speeds. I am further looking into getting HA to work properly (dead machine breaks HA) for now, and am also eyeing Drive45's autocache on top of ZFS which should give me a whole lot of write performance as well. An added bonus is that even if Gluster breaks, I still see my files on there unlike ceph! Thats not even taking into account that Gluster is able to do real georeplication as well on two nodes without metadataservers.
I am paying for this by giving up the nice GUI proxmox has for ceph and possibly reliability (although I know many larger companies using gluster).

My advice therefore is to try the above route as well and try out different scenarios for HA alternatively to try TrueNAS SCALE (FreeNAS on Linux for docker images) or - just build a monster proxmox on top of ZFS plain and simple which should be rock solid judging by the amount of forum posts here swearing by ZFS.

Good luck!

need2gcm · Sep 30, 2021

@adriano_da_silva

From my experiences with Ceph, I prefer it for the rebalancing and reliability it has offered. I have put a lab ceph setup through hell with a mix of various drives, host capabilities, and even uneven node networking capabilities. The only thing it did not handle well (and nothing does) is SMR disks, which will also give you atrocious write speeds so watch out for any SMR disks. They are only suitable for reads and slow sequential writes. That cluster never lost any data, even with full on multi-hardware failure events. That being said, I am certain ZFS will perform faster on your setup regardless of how you tune Ceph.

That being said, I saw significant improvements in Ceph when I implemented SSDs for cache (DB/WAL). I suggest adding some kind of flash cache for your OSDs as soon as you can since I believe you will see the most immediate performance gain per cost. All flash storage would be better, but when wouldn't it? And of course 10G or above networking will help any setup, but the performance gains you would see from that with your current hardware would be minimal at best. Certainly not a bad investment and a good direction to move to, but it will not solve your immediate issues.

Lastly, what storage controllers are in use? Depending on what controllers are driving those disks, you could actually have bottlenecks there.

Klaus Steinberger · Sep 30, 2021

adriano_da_silva said:
Thank you very much for your reply.

The attention we are giving to this subject has been very important to me.

But I believe that if I really need a 10Gbps network on the backend, I'll need to go to another storage solution. And I will explain why.

A while ago I used two computers mirroring data via DRBD and providing it via ISCSI, all on a Gigabit network. And it worked relatively well, with mandatory attention only to the power supply. And even so, it's a model that needs more attention, because it's more delicate.

I thought about switching to Ceph, as it seems like a more robust technology, as I could have failed more than one node without worry. I thought about increased availability. But I didn't imagine it would have this whole bottleneck.

Well, a 10Gbps switch can cost "only" €1000 somewhere in Europe, or something similar in the US, Japan. But can we put such a switch in Brazil for that price?

Even if we can, with the exchange rate today, 1000 € is not that cheap.

Remembering that I'm looking for high availability, since I'm leaving an architecture where I have duplicate switches (in stacks), I would need two switches, it would already be 2000 € and then 14 network cards or at least 7 cards with double ports. we are talking about something around 1000-1400 € more, totaling something around 3400 €. If we are going to import the equipment to Brazil, we have to pay taxes, something around 60%. There, we are already at € 5440, speaking simply here, without observing shipping, any inaccuracies or additional expenses.

Putting this in the exchange rate, it would represent something around R$34,435.00 (in local currency). Not to mention disks, just networking.

So, this is an investment out of my budget for my reality right now.

It would only be possible, perhaps, if he found used equipment at a good price. However, here in Brazil, it is almost impossible to find this type of equipment used today.

Maybe I have to go back to my old model, using DRBD and ISCII, or something similar. I don't want to be forced to do this.. :-(

Again, thank you very much for the responses.

PS.

I don't see a lot of network usage on my Proxmox dashboard. Is latency really much better on the 10Gbps network compared to 1Gbps?

If you want HA, things always get more expensive, and yes Latency is much better with 10G. But I bet that you can at least gain some speed with DB/WAL on SSD. Bad that hardware availability is so bad in Brazil. Hope you can find something used.
Also try jumbo frames on you gig Network (hope your switch supports it?), this should be slightly better.

Another thing you can do, but it limits the number of nodes:

you can build up a full mesh network for the CEPH backend. With 3 nodes you just need two fast ports on every node. There is an example setup with linux bridge ( I didn it with OpenVswitch):

https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

you do not need to connect this to outside, so no switch is needed, and it is redundant.

you need "number of nodes - 1" interfaces so on a 5 node cluster you need 4 ports

a option which is cheaper but way slower is to build a ring (node 1 - node 2 - node 3 ....... node n back to node 1) but with 7 nodes this adds too much hops and so the latency is not good especially with jumbo frames. Maybe a cheapy option for max. 4 nodes, but not more.

I have one 3 node cluster with SAS SSD's and a 25 G backend network as a full mesh, and it performs like a charm. We will build up our next cluster with 5 nodes and a 25G full mesh for CEPH.

A hint for hardware: sometimes SFP+ 10G cards are easier to get refurbished. DAC Cables are not too expensive, but of course more expensive than Cat 6A, but definitly cheaper than SFP+ Transceiver and Fibercables.

adriano_da_silva · Oct 1, 2021

need2gcm said:
@adriano_da_silva

From my experiences with Ceph, I prefer it for the rebalancing and reliability it has offered. I have put a lab ceph setup through hell with a mix of various drives, host capabilities, and even uneven node networking capabilities. The only thing it did not handle well (and nothing does) is SMR disks, which will also give you atrocious write speeds so watch out for any SMR disks. They are only suitable for reads and slow sequential writes. That cluster never lost any data, even with full on multi-hardware failure events. That being said, I am certain ZFS will perform faster on your setup regardless of how you tune Ceph.

That being said, I saw significant improvements in Ceph when I implemented SSDs for cache (DB/WAL). I suggest adding some kind of flash cache for your OSDs as soon as you can since I believe you will see the most immediate performance gain per cost. All flash storage would be better, but when wouldn't it? And of course 10G or above networking will help any setup, but the performance gains you would see from that with your current hardware would be minimal at best. Certainly not a bad investment and a good direction to move to, but it will not solve your immediate issues.

Lastly, what storage controllers are in use? Depending on what controllers are driving those disks, you could actually have bottlenecks there.

Guys, thanks so much for all the tips. This debate is very valuable.

I'm testing two consumer NVMes (two very cheap Chinese brands) for performance with "fio". One is Netac, the other Xray. With a size of 256GB, the goal would be to buy 7 to use one on each node to use in DB/Wall disks. The performance of each NVME looks a little better than the HD (SATA or SAS) when writing files with direct writing, with only 1 job, without the use of cache. Of course when using the NVME cache, this difference grows a lot, but we know that Ceph does not use cache on disks, so better not consider this feature.

When increasing the number of jobs (from 1 to 5 or 10 jobs), the tendency is that the difference in performance of NVMe improves a lot compared to disks. In all scenarios, random 512 Bytes, 4K, 64K, or sequential read/write. I don't know if Ceph can benefit from this (number of jobs). Will be?

NVMe reads are always many times better than HD's, in any scenario. Only in tiny file reads scenarios, really very small (512 bytes), using only 1 job, did one of the NVMe perform poorly. In fact, it performed poorly both in reading and writing these files in this situation. But files of only 512 bytes, it seems something to be disregarded in Ceph, correct? In contrast, this same NVMe performed better in all other scenarios, especially with more jobs.

Guys, I know that enterprise SSDs with supercapacitor are recommended for having much better writing performance, however, we come back to the question of prices again, because here they are very expensive drivers. I need to find ways to improve performance without too much cost.

Thank you all!!

adriano_da_silva · Oct 4, 2021

ness1602 said:
I guess LACP doesn't count when CEPH needs to sync writes(to all 3 replicas i guess, or two replicas and one master)

Tankyou for your reply!

I find it difficult for CEPH to have control over this. I believe that the decision of "where" to send or receive data is in the Operating System, outside CEPH's control. I'm not sure, but I believe it. I did tests with simultaneous iperf from one node to several other nodes, and it used all ports, significantly increasing communication.

By the way, I do a reading test from inside the VM, using benchmark software it indicates 230 MB/s of disk transfer and if at the same time I monitor the node's communication from the outside with the dstat software, it shows that the node uses more of 240 MB/s of network communication, which means 1920 that far exceeds a single lane of 1 Gbps. This is of course, because if the node communicates with several nodes through a LACP interface, the negotiation between the operating system and the switch takes care of communicating through different ports for each different node that communicates, according to the distribution algorithm. I can also see simultaneously with dstat in the other nodes, the communication flowing in a distributed way, both in the reads and in the writes. At least I believe I could see it.

Why would writing be different? Does the client send the write to a single node? And if it sends, even so, it would be more than 120MB/s the limit with this node. Then, considering this single node, I believe it would communicate with the other nodes to distribute the write across all OSDs. Am I right? If so, why wouldn't he do it through the various LACP ports?

adriano_da_silva · Oct 8, 2021

Guys, thanks so much for all the tips.

I made a change to my system yesterday and it looks like I've gotten acceptable results so far!

Let's see:

In the configuration of the VM, in the part where the SCSI controller is indicated, I was using VirtIO SCSI, I had already installed the drivers in the VM according to the old guidelines, everything as mentioned above. No performance improvement.

A couple of months ago, I noticed that the default VM hard drive cache setting in Proxmox is "(Default) No cache". Then I realized that there are several caching configuration options. Two caught my attention, one described as "writeback" and the second as "writeback (unsafe)". I made a change to "writeback", which seemed sensible for anyone who wants to preserve data. But this modification, incredibly, didn't bring me any performance improvement. So I tried to change the Cache setting to "Writeback (unsafe)", and it gave me a bit more performance, but when the VM crashed, I lost data. As the option itself describes, this option is "unsafe". So I went back to "(Default) in cache" and gave up on changing this parameter.

After that, I opened this post and continued my investigations.

But, now looking at some articles on the internet, and comparing my settings, I realized that in the hard disk configuration, in the bus/device type part, I was using SATA! Then I realized that there was at least one incongruity, using VirtIO SCSI controller and SATA disk. Strange that Proxmox accepted. But then I changed this setting to VirtIO Block and immediately noticed a slight improvement in writing and reading (mostly random 4K).

But now, I read somewhere saying that VitIO Block setting on hard drive, coupled with "writeback" cache would give good performance. So I decided to also change the VM hard disk configuration (proxmox GUI configuration), the hard disk cache going to "writeback" (but of course, not what indicates "unsafe").

The result was a surprising improvement in writing and even reading performance. So far it seems to me an acceptable performance, reaching 237MB/s read, 78MB/s sequential write and 5.5MB/s random 4K write. To me it seems enough. At least for now.

The question is: Is it safe to use this writeback cache? What is the real implication of this?

tconnors · Nov 9, 2022

adriano_da_silva said:
But, now looking at some articles on the internet, and comparing my settings, I realized that in the hard disk configuration, in the bus/device type part, I was using SATA! Then I realized that there was at least one incongruity, using VirtIO SCSI controller and SATA disk. Strange that Proxmox accepted. But then I changed this setting to VirtIO Block and immediately noticed a slight improvement in writing and reading (mostly random 4K)

Ah, very interesting. Something to try tonight...

The question is: Is it safe to use this writeback cache? What is the real implication of this?

From this: https://forum.proxmox.com/threads/writeback-vs-writeback-unsafe.24824/
I get that so long as your guest implements barriers correctly (surely everything these days), then yes, it's safe to use.

Otter7721 · Jan 30, 2023

adriano_da_silva said:
5.5MB/s random 4K write

5.5MB is about 5.5*1024/4 = 1408 IOPS
This does look like it's doing all the HDD work, but what about NVME? This could still be a bottleneck. At least I am.

Code:

(base) [root@localhost fio-cdm]# ./fio-cdm
tests: 5, size: 1.0GiB, target: /root/fio-cdm 2.0GiB/27.8GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1143.50|        4.81|
|SEQ1M Q1 T1 |      504.79|      193.58|
|RND4K Q32T16|      111.63|       28.25|
|. IOPS      |    27254.04|     6896.46|
|. latency us|    18699.01|    73813.12|
|RND4K Q1 T1 |        6.18|        2.10|
|. IOPS      |     1508.50|      512.79|
|. latency us|      658.65|     1943.99|

https://forum.proxmox.com/threads/bad-rand-read-write-i-o-proxmox-ceph.68404/#post-529486

Ceph performance with simple hardware. Slow writing.

Member

Well-Known Member

Member

Well-Known Member

Member

Renowned Member

Member

Renowned Member

Member

Renowned Member

Renowned Member

Member

Well-Known Member

Member

Renowned Member

Member

Member

Member

Member

New Member