Hello guys.
I'm trying to set up a very fast enterprise NVMe device (a 960GB datacenter model with tantalum capacitors) as a cache for two or three isolated spinning disks that I have on a Proxmox node (I will use 2TB disks, but in these tests I used a 1TB one).
The goal, depending on the results of these benchmarks, would be to roll out an identical configuration to all ten of my hyperconverged Ceph nodes, running the OSDs on top of bcache with the DB/WAL on the same NVMe, but on a separate partition.
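For reference, the OSD layout I have in mind would be created roughly like this (just a sketch of the plan, not a command I have already run; the partition used for the DB/WAL is only an example name):
Code:
# data on the bcache device, DB/WAL on a separate NVMe partition (example names)
ceph-volume lvm create --bluestore --data /dev/bcache0 --block.db /dev/nvme0n1p3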
I know there's a risk in making two or three spinning disks depend on a single NVMe, but since this risk is shared across the other nodes (I use three copies), I think it's acceptable.
The fact is that I already have ten nodes, several HDDs, and one NVMe per node, and I need to leverage this hardware rationally to get better performance in a safe way.
Testing with fio directly on the NVMe, it performs well at 4K random writes, even with the direct and fsync flags.
Code:
root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
lat (usec) : 500=0.02%, 750=0.11%, 1000=0.01%
cpu : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec
But when I run the same test on bcache in writeback mode, performance drops a lot. It is of course better than the spinning disks on their own, but much worse than accessing the NVMe device directly.
Code:
root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
lat (usec) : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
lat (usec) : 1000=0.15%
lat (msec) : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
cpu : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12
Run status group 0 (all jobs):
WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec
Disk stats (read/write):
bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%
As we can see, the same test on the bcache0 device reached only 1548 IOPS, which yielded only about 6.3 MB/s.
That is far more than any spinning HDD could give me, but roughly 20 times less than the NVMe achieves directly.
I've noticed in several tests, varying the number of jobs or increasing the block size, that the larger the block size, the closer the bcache device gets to the physical device's performance. But IOPS always seem to be capped somewhere around 1500-1800. Increasing the number of jobs gives better totals and more IOPS, but if you divide the total IOPS by the number of jobs, you can see that each job is still limited to the 1500-1800 range.
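The multi-job variations were essentially the same fio command with a higher --numjobs, for example (a representative invocation, not a pasted result):
Code:
# same 4K randwrite test on bcache0, only with more jobs; total IOPS grows,
# but total IOPS divided by numjobs stays around 1500-1800 per job
fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K \
    --numjobs=4 --iodepth=1 --runtime=10 --time_based --group_reporting \
    --name=journal-test --ioengine=libaio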
The commands used to configure bcache were:
Code:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I tried everything also with the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
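To make sure these settings actually took effect, they can be read back from sysfs (a simple verification sketch; <cache set> is the cache set UUID under /sys/fs/bcache):
Code:
cat /sys/block/bcache0/bcache/cache_mode          # active mode is shown in brackets, e.g. [writeback]
cat /sys/block/bcache0/bcache/sequential_cutoff   # 0.0k after disabling the cutoff
cat /sys/fs/bcache/<cache set>/congested_read_threshold_us
cat /sys/fs/bcache/<cache set>/congested_write_threshold_us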
Monitoring with dstat, you can see that while the fio command runs, all writes go to the cache device (the second NVMe partition) until the end of the test. Only afterwards is the spinning disk written to, when reads from the NVMe and writes to the spinning disk become visible (the background flush of dirty data).
Code:
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
read writ: read writ: read writ| read writ: read writ: read writ| recv send| 1m 5m 15m |usr sys idl wai stl| int csw | time | #aio
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |8462B 8000B|0.03 0.15 0.31| 1 0 99 0 0| 250 383 |09-05 15:19:47| 0
0 0 :4096B 454k: 0 336k| 0 0 :1.00 184 : 0 170 |4566B 4852B|0.03 0.15 0.31| 2 2 94 1 0|1277 3470 |09-05 15:19:48| 1B
0 8192B: 0 8022k: 0 6512k| 0 2.00 : 0 3388 : 0 3254 |3261B 2827B|0.11 0.16 0.32| 0 2 93 5 0|4397 16k|09-05 15:19:49| 1B
0 0 : 0 7310k: 0 6460k| 0 0 : 0 3240 : 0 3231 |6773B 6428B|0.11 0.16 0.32| 0 1 93 6 0|4190 16k|09-05 15:19:50| 1B
0 0 : 0 7313k: 0 6504k| 0 0 : 0 3252 : 0 3251 |6719B 6201B|0.11 0.16 0.32| 0 2 92 6 0|4482 16k|09-05 15:19:51| 1B
0 0 : 0 7313k: 0 6496k| 0 0 : 0 3251 : 0 3250 |4743B 4016B|0.11 0.16 0.32| 0 1 93 6 0|4243 16k|09-05 15:19:52| 1B
0 0 : 0 7329k: 0 6496k| 0 0 : 0 3289 : 0 3245 |6107B 6062B|0.11 0.16 0.32| 1 1 90 8 0|4706 18k|09-05 15:19:53| 1B
0 0 : 0 5373k: 0 4184k| 0 0 : 0 2946 : 0 2095 |6387B 6062B|0.26 0.19 0.33| 0 2 95 4 0|3774 12k|09-05 15:19:54| 1B
0 0 : 0 6966k: 0 5668k| 0 0 : 0 3270 : 0 2834 |7264B 7546B|0.26 0.19 0.33| 0 1 93 5 0|4214 15k|09-05 15:19:55| 1B
0 0 : 0 7271k: 0 6252k| 0 0 : 0 3258 : 0 3126 |5928B 4584B|0.26 0.19 0.33| 0 2 93 5 0|4156 16k|09-05 15:19:56| 1B
0 0 : 0 7419k: 0 6504k| 0 0 : 0 3308 : 0 3251 |5226B 5650B|0.26 0.19 0.33| 2 1 91 6 0|4433 16k|09-05 15:19:57| 1B
0 0 : 0 6444k: 0 5704k| 0 0 : 0 2873 : 0 2851 |6494B 8021B|0.26 0.19 0.33| 1 1 91 7 0|4352 16k|09-05 15:19:58| 0
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |6030B 7204B|0.24 0.19 0.32| 0 0 100 0 0| 209 279 |09-05 15:19:59| 0
This means that the writeback cache mechanism appears to be working as it should, except for the performance limitation.
With ioping it is also possible to see the limitation: the latency of the bcache0 device is around 1.5 ms, while the same test on the raw device (an NVMe partition) takes only 82.1 us.
Code:
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=4 time=1.59 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=5 time=1.52 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=6 time=1.44 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=7 time=1.01 ms (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=8 time=968.6 us (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=9 time=1.12 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=10 time=1.12 ms
--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us
-------------------------------------------------------------------
root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
read writ: read writ: read writ| read writ: read writ: read writ| recv send| 1m 5m 15m |usr sys idl wai stl| int csw | time | #aio
332B 181k: 167k 937k: 20B 303k|0.01 11.2 :11.9 42.1 :0.00 5.98 | 0 0 |0.10 0.31 0.36| 0 0 99 0 0| 392 904 |09-05 15:26:35| 0
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |2200B 2506B|0.09 0.31 0.36| 0 0 99 0 0| 437 538 |09-05 15:26:40| 0
0 0 : 0 5632B: 0 4096B| 0 0 : 0 4.00 : 0 3.00 |8868B 8136B|0.09 0.31 0.36| 0 0 100 0 0| 247 339 |09-05 15:26:41| 0
0 0 : 0 5632B: 0 4096B| 0 0 : 0 4.00 : 0 3.00 |7318B 7372B|0.09 0.31 0.36| 0 0 99 0 0| 520 2153 |09-05 15:26:42| 0
0 0 : 0 5632B: 0 4096B| 0 0 : 0 4.00 : 0 3.00 |3315B 2768B|0.09 0.31 0.36| 1 0 97 2 0|1130 2214 |09-05 15:26:43| 0
0 0 : 0 5632B: 0 4096B| 0 0 : 0 4.00 : 0 3.00 |9526B 12k|0.09 0.31 0.36| 1 0 99 0 0| 339 564 |09-05 15:26:44| 0
0 4096B:4096B 6656B: 0 4096B| 0 1.00 :1.00 6.00 : 0 3.00 |6142B 6536B|0.08 0.30 0.36| 0 1 98 0 0| 316 375 |09-05 15:26:45| 0
0 4096B:4096B 5632B: 0 4096B| 0 1.00 :1.00 4.00 : 0 3.00 |3378B 3714B|0.08 0.30 0.36| 0 0 100 0 0| 191 328 |09-05 15:26:46| 0
0 4096B:4096B 6656B: 0 4096B| 0 1.00 :1.00 6.00 : 0 3.00 | 10k 21k|0.08 0.30 0.36| 1 0 99 0 0| 387 468 |09-05 15:26:47| 0
0 4096B:4096B 5632B: 0 4096B| 0 1.00 :1.00 4.00 : 0 3.00 |7650B 8602B|0.08 0.30 0.36| 0 0 97 2 0| 737 2627 |09-05 15:26:48| 0
0 4096B:4096B 6144B: 0 4096B| 0 1.00 :1.00 5.00 : 0 3.00 |9025B 8083B|0.08 0.30 0.36| 0 0 100 0 0| 335 510 |09-05 15:26:49| 0
0 4096B:4096B 5632B: 0 4096B| 0 1.00 :1.00 4.00 : 0 3.00 | 12k 11k|0.08 0.30 0.35| 0 0 100 0 0| 290 496 |09-05 15:26:50| 0
0 4096B:4096B 0 : 0 0 | 0 1.00 :1.00 0 : 0 0 |5467B 5365B|0.08 0.30 0.35| 0 0 100 0 0| 404 300 |09-05 15:26:51| 0
0 4096B:4096B 0 : 0 0 | 0 1.00 :1.00 0 : 0 0 |7973B 7315B|0.08 0.30 0.35| 0 0 100 0 0| 195 304 |09-05 15:26:52| 0
0 4096B:4096B 0 : 0 0 | 0 1.00 :1.00 0 : 0 0 |6183B 4929B|0.08 0.30 0.35| 0 0 99 1 0| 683 2542 |09-05 15:26:53| 0
0 4096B:4096B 12k: 0 0 | 0 1.00 :1.00 2.00 : 0 0 |4995B 4998B|0.08 0.30 0.35| 0 0 100 0 0| 199 422 |09-05 15:26:54| 0
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |8353B 8059B|0.07 0.29 0.35| 0 0 100 0 0| 164 217 |09-05 15:26:55| 0
=====================================================================================================
root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=4 time=94.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=5 time=95.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=6 time=67.5 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=7 time=85.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=8 time=63.5 us (fast)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=9 time=82.2 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=10 time=86.1 us
--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us
-----------------------------------------------------------------------------------------
root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
read writ: read writ: read writ| read writ: read writ: read writ| recv send| 1m 5m 15m |usr sys idl wai stl| int csw | time | #aio
332B 181k: 167k 935k: 20B 302k|0.01 11.2 :11.9 42.0 :0.00 5.96 | 0 0 |0.18 0.25 0.32| 0 0 99 0 0| 392 904 |09-05 15:30:49| 0
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |4443B 4548B|0.16 0.25 0.32| 0 0 100 0 0| 108 209 |09-05 15:30:55| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |3526B 3844B|0.16 0.25 0.32| 1 0 99 0 0| 316 434 |09-05 15:30:56| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |5855B 4707B|0.16 0.25 0.32| 0 0 100 0 0| 146 277 |09-05 15:30:57| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |8897B 7349B|0.16 0.25 0.32| 0 0 99 1 0| 740 2323 |09-05 15:30:58| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |7802B 7280B|0.15 0.24 0.32| 0 0 100 0 0| 118 235 |09-05 15:30:59| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |5610B 4593B|0.15 0.24 0.32| 2 0 98 0 0| 667 682 |09-05 15:31:00| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |9046B 8254B|0.15 0.24 0.32| 4 0 96 0 0| 515 707 |09-05 15:31:01| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |5323B 5129B|0.15 0.24 0.32| 0 0 100 0 0| 191 247 |09-05 15:31:02| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |4249B 3549B|0.15 0.24 0.32| 0 0 98 2 0| 708 2565 |09-05 15:31:03| 0
0 0 : 0 4096B: 0 0 | 0 0 : 0 1.00 : 0 0 |7577B 7351B|0.14 0.24 0.32| 0 0 100 0 0| 291 350 |09-05 15:31:04| 0
0 0 :2080k 4096B: 0 0 | 0 0 :62.0 1.00 : 0 0 |5731B 5692B|0.14 0.24 0.32| 0 0 100 0 0| 330 462 |09-05 15:31:05| 0
0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |7347B 5852B|0.14 0.24 0.32| 1 0 99 0 0| 306 419 |09-05 15:31:06| 0
The cache was configured directly on one of the NVMe partitions (in this case, the first partition). I ran several tests with fio and ioping: on an NVMe partition, on the raw unpartitioned block device, on the first partition, on the second, with and without bcache configured. I did all this to rule out any doubt about the method. Tests performed directly on the hardware device, without going through bcache, are always fast and consistent.
But tests through bcache are always slower. With writethrough it is of course much worse, because performance drops to that of the raw spinning disk.
Using writeback improves things a lot, but it still doesn't reach the full speed of the NVMe.
I've also noticed a limit on sequential writes: a little more than half of the maximum write rate the NVMe device shows in direct tests.
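The sequential comparison was with tests along these lines, run against /dev/bcache0 and against the raw NVMe partition (a representative example, not the exact command I used):
Code:
# sequential write test used for the comparison
fio --filename=/dev/bcache0 --direct=1 --rw=write --bs=1M --numjobs=1 \
    --iodepth=16 --runtime=10 --time_based --group_reporting \
    --name=seq-test --ioengine=libaio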
CPU usage doesn't seem to rise along with the tests either.
Would anyone know what could be causing these limits?
Thanks!