Hello guys.
I'm trying to set up a very fast enterprise NVMe drive (a 960GB datacenter device with tantalum capacitors) as a cache for two or three separate spinning disks on a Proxmox node. I plan to use 2TB disks, but for these tests I used a 1TB one.
Depending on the benchmark results, the goal is to set up an identical configuration on all ten of my hyperconverged Ceph nodes, running the OSDs on top of bcache with the DB/WAL on the same NVMe, but on a separate partition.
I know there's a risk in having two or three spinning disks depend on a single NVMe, but since the data is replicated across other nodes (I use three copies), I think the risk is acceptable.
The fact is that I already have ten nodes, several HDDs, and one NVMe per node, and I need to use this hardware sensibly to get better performance safely.
Testing the NVMe with fio, it performs well enough on 4K random writes, even with the direct and fsync flags.
		Code:
	
	root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.11%, 1000=0.01%
  cpu          : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec
But when I run the same test on bcache in writeback mode, performance drops a lot. It's still better than the spinning disks, of course, but much worse than accessing the NVMe device directly.
		Code:
	
	root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
  lat (usec)   : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
  lat (usec)   : 1000=0.15%
  lat (msec)   : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
  cpu          : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12
Run status group 0 (all jobs):
  WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec
Disk stats (read/write):
    bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
  sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
  nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%
As we can see, the same test on the bcache0 device reached only 1548 IOPS, which works out to only about 6.3 MB/s.
This is much more than any spinning HDD could give me, but many times less than the NVMe result.
Across several tests, varying the number of jobs and the block size, I noticed that the larger the block size, the closer the bcache device gets to the performance of the physical device. But IOPS always seem to be capped at around 1500-1800. Increasing the number of jobs gives better results and more total IOPS, but if you divide the total IOPS by the number of jobs, each job is still limited to the 1500-1800 range.
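As a rough sanity check on these figures, the arithmetic can be sketched in shell with awk. The helper name `iops_summary` is made up for illustration. It assumes a single job at iodepth=1, so that one write-plus-fsync cycle takes roughly 1/IOPS seconds; that is an approximation, not something fio reports directly.

```shell
#!/bin/sh
# Convert an IOPS figure and a block size (in KiB) into throughput and
# the average time per I/O cycle, assuming one job at queue depth 1.
# iops_summary is a hypothetical helper, not part of fio or bcache.
iops_summary() {
    # $1 = IOPS, $2 = block size in KiB
    awk -v iops="$1" -v bs_kib="$2" 'BEGIN {
        kib_s = iops * bs_kib      # KiB per second
        mib_s = kib_s / 1024       # MiB per second
        us    = 1000000 / iops     # microseconds per cycle at iodepth=1
        printf "%d KiB/s  %.2f MiB/s  %.1f us/op\n", kib_s, mib_s, us
    }'
}

iops_summary 1548 4     # bcache0 run
iops_summary 32900 4    # raw NVMe partition run
```

For the bcache0 run this gives about 6192 KiB/s, in line with the 6193KiB/s fio reported, and roughly 646 us per cycle, against about 30 us per cycle on the raw partition. A per-job cap of 1500-1800 IOPS therefore looks like a fixed per-operation latency rather than a bandwidth limit.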
The commands used to configure bcache were:
		Code:
	
	# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I tried everything also with the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
Monitoring with dstat, I can see that when the fio command runs, all writes go to the cache device (a second partition of the NVMe) until the end of the test. The spinning disk is only written to some time later, when reads appear on the NVMe and writes on the spinning disk (that is, the data being transferred in the background).
		Code:
	
	--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8462B 8000B|0.03 0.15 0.31|  1   0  99   0   0| 250   383 |09-05 15:19:47|   0
   0     0 :4096B  454k:   0   336k|   0     0 :1.00   184 :   0   170 |4566B 4852B|0.03 0.15 0.31|  2   2  94   1   0|1277  3470 |09-05 15:19:48|   1B
   0  8192B:   0  8022k:   0  6512k|   0  2.00 :   0  3388 :   0  3254 |3261B 2827B|0.11 0.16 0.32|  0   2  93   5   0|4397    16k|09-05 15:19:49|   1B
   0     0 :   0  7310k:   0  6460k|   0     0 :   0  3240 :   0  3231 |6773B 6428B|0.11 0.16 0.32|  0   1  93   6   0|4190    16k|09-05 15:19:50|   1B
   0     0 :   0  7313k:   0  6504k|   0     0 :   0  3252 :   0  3251 |6719B 6201B|0.11 0.16 0.32|  0   2  92   6   0|4482    16k|09-05 15:19:51|   1B
   0     0 :   0  7313k:   0  6496k|   0     0 :   0  3251 :   0  3250 |4743B 4016B|0.11 0.16 0.32|  0   1  93   6   0|4243    16k|09-05 15:19:52|   1B
   0     0 :   0  7329k:   0  6496k|   0     0 :   0  3289 :   0  3245 |6107B 6062B|0.11 0.16 0.32|  1   1  90   8   0|4706    18k|09-05 15:19:53|   1B
   0     0 :   0  5373k:   0  4184k|   0     0 :   0  2946 :   0  2095 |6387B 6062B|0.26 0.19 0.33|  0   2  95   4   0|3774    12k|09-05 15:19:54|   1B
   0     0 :   0  6966k:   0  5668k|   0     0 :   0  3270 :   0  2834 |7264B 7546B|0.26 0.19 0.33|  0   1  93   5   0|4214    15k|09-05 15:19:55|   1B
   0     0 :   0  7271k:   0  6252k|   0     0 :   0  3258 :   0  3126 |5928B 4584B|0.26 0.19 0.33|  0   2  93   5   0|4156    16k|09-05 15:19:56|   1B
   0     0 :   0  7419k:   0  6504k|   0     0 :   0  3308 :   0  3251 |5226B 5650B|0.26 0.19 0.33|  2   1  91   6   0|4433    16k|09-05 15:19:57|   1B
   0     0 :   0  6444k:   0  5704k|   0     0 :   0  2873 :   0  2851 |6494B 8021B|0.26 0.19 0.33|  1   1  91   7   0|4352    16k|09-05 15:19:58|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6030B 7204B|0.24 0.19 0.32|  0   0 100   0   0| 209   279 |09-05 15:19:59|   0
This means that the writeback cache mechanism appears to be working as it should, apart from the performance limitation.
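Since all of this depends on the settings having actually been applied, here is a small sketch for reading them back from sysfs. `cache_mode` and `sequential_cutoff` are the attributes set earlier; `writeback_percent`, `state` and `dirty_data` are listed in the kernel's bcache documentation, but I'm assuming they are exposed on this kernel. The function name `show_bcache_settings` and the default path are only for illustration.

```shell
#!/bin/sh
# Print selected bcache attributes from a backing device's sysfs directory.
# show_bcache_settings is a hypothetical helper; the directory defaults to
# the bcache0 device used in this post.
show_bcache_settings() {
    dir="${1:-/sys/block/bcache0/bcache}"
    for attr in cache_mode sequential_cutoff writeback_percent state dirty_data; do
        if [ -r "$dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Attribute missing or unreadable (e.g. no bcache device here).
            printf '%s: (not readable)\n' "$attr"
        fi
    done
}

show_bcache_settings
```

In `cache_mode`, the active mode is the one shown in square brackets, so writeback being selected should be visible at a glance.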
ioping also shows the limitation: latency on the bcache0 device is around 1.3-1.5ms (1.33ms on average), while the same test on the raw device (an NVMe partition) averages only 82.1us.
		Code:
	
	root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=4 time=1.59 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=5 time=1.52 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=6 time=1.44 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=7 time=1.01 ms (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=8 time=968.6 us (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=9 time=1.12 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=10 time=1.12 ms
--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us
-------------------------------------------------------------------
root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  937k:  20B  303k|0.01  11.2 :11.9  42.1 :0.00  5.98 |   0     0 |0.10 0.31 0.36|  0   0  99   0   0| 392   904 |09-05 15:26:35|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |2200B 2506B|0.09 0.31 0.36|  0   0  99   0   0| 437   538 |09-05 15:26:40|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |8868B 8136B|0.09 0.31 0.36|  0   0 100   0   0| 247   339 |09-05 15:26:41|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |7318B 7372B|0.09 0.31 0.36|  0   0  99   0   0| 520  2153 |09-05 15:26:42|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |3315B 2768B|0.09 0.31 0.36|  1   0  97   2   0|1130  2214 |09-05 15:26:43|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |9526B   12k|0.09 0.31 0.36|  1   0  99   0   0| 339   564 |09-05 15:26:44|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |6142B 6536B|0.08 0.30 0.36|  0   1  98   0   0| 316   375 |09-05 15:26:45|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |3378B 3714B|0.08 0.30 0.36|  0   0 100   0   0| 191   328 |09-05 15:26:46|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |  10k   21k|0.08 0.30 0.36|  1   0  99   0   0| 387   468 |09-05 15:26:47|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |7650B 8602B|0.08 0.30 0.36|  0   0  97   2   0| 737  2627 |09-05 15:26:48|   0
   0  4096B:4096B 6144B:   0  4096B|   0  1.00 :1.00  5.00 :   0  3.00 |9025B 8083B|0.08 0.30 0.36|  0   0 100   0   0| 335   510 |09-05 15:26:49|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |  12k   11k|0.08 0.30 0.35|  0   0 100   0   0| 290   496 |09-05 15:26:50|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |5467B 5365B|0.08 0.30 0.35|  0   0 100   0   0| 404   300 |09-05 15:26:51|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |7973B 7315B|0.08 0.30 0.35|  0   0 100   0   0| 195   304 |09-05 15:26:52|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |6183B 4929B|0.08 0.30 0.35|  0   0  99   1   0| 683  2542 |09-05 15:26:53|   0
   0  4096B:4096B   12k:   0     0 |   0  1.00 :1.00  2.00 :   0     0 |4995B 4998B|0.08 0.30 0.35|  0   0 100   0   0| 199   422 |09-05 15:26:54|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8353B 8059B|0.07 0.29 0.35|  0   0 100   0   0| 164   217 |09-05 15:26:55|   0
=====================================================================================================
root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=4 time=94.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=5 time=95.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=6 time=67.5 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=7 time=85.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=8 time=63.5 us (fast)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=9 time=82.2 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=10 time=86.1 us
--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us
-----------------------------------------------------------------------------------------
root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  935k:  20B  302k|0.01  11.2 :11.9  42.0 :0.00  5.96 |   0     0 |0.18 0.25 0.32|  0   0  99   0   0| 392   904 |09-05 15:30:49|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4443B 4548B|0.16 0.25 0.32|  0   0 100   0   0| 108   209 |09-05 15:30:55|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |3526B 3844B|0.16 0.25 0.32|  1   0  99   0   0| 316   434 |09-05 15:30:56|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5855B 4707B|0.16 0.25 0.32|  0   0 100   0   0| 146   277 |09-05 15:30:57|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |8897B 7349B|0.16 0.25 0.32|  0   0  99   1   0| 740  2323 |09-05 15:30:58|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7802B 7280B|0.15 0.24 0.32|  0   0 100   0   0| 118   235 |09-05 15:30:59|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5610B 4593B|0.15 0.24 0.32|  2   0  98   0   0| 667   682 |09-05 15:31:00|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |9046B 8254B|0.15 0.24 0.32|  4   0  96   0   0| 515   707 |09-05 15:31:01|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5323B 5129B|0.15 0.24 0.32|  0   0 100   0   0| 191   247 |09-05 15:31:02|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |4249B 3549B|0.15 0.24 0.32|  0   0  98   2   0| 708  2565 |09-05 15:31:03|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7577B 7351B|0.14 0.24 0.32|  0   0 100   0   0| 291   350 |09-05 15:31:04|   0
   0     0 :2080k 4096B:   0     0 |   0     0 :62.0  1.00 :   0     0 |5731B 5692B|0.14 0.24 0.32|  0   0 100   0   0| 330   462 |09-05 15:31:05|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |7347B 5852B|0.14 0.24 0.32|  1   0  99   0   0| 306   419 |09-05 15:31:06|   0
The cache was configured directly on one of the NVMe partitions (in this case, the first partition). I ran several tests with fio and ioping: on an NVMe partition, on the raw block device without partitions, on the first partition, on the second, and with or without bcache configured. I did all this to rule out any doubt about the method. Tests run directly on the hardware device, without going through bcache, are always fast and give similar results.
But tests on bcache are always slower. With writethrough, of course, it gets much worse, because performance equals that of the raw spinning disk.
Writeback improves things a lot, but still doesn't use the full speed of the NVMe.
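To put a number on that gap, the ratios can be worked out from the ioping averages (1.33 ms against 82.1 us) and the fio IOPS (32.9k against 1548) quoted earlier. This is only arithmetic on those figures; `ratio` is a throwaway helper name.

```shell
#!/bin/sh
# Divide the first argument by the second and print one decimal place.
# ratio is a hypothetical helper used only for this comparison.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 1330 82.1     # ioping average latency in us: bcache0 vs NVMe partition
ratio 32900 1548    # fio IOPS: NVMe partition vs bcache0
```

So bcache0 is roughly 16 times slower per request in ioping and delivers about 21 times fewer IOPS in fio, at the same 4K block size.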
I've also noticed a limit on sequential writes, at a little more than half of the maximum write rate the NVMe shows in direct tests.
CPU usage doesn't seem to go up during the tests, either.
Does anyone know what could be causing these limits?
Thanks
			
