Bcache on NVMe for 4K writes with fsync. Are IOPS limited?

Sep 14, 2020
Hello guys.

I'm trying to set up a very fast enterprise NVMe drive (a 960GB datacenter device with tantalum capacitors) as a cache for two or three isolated spinning disks that I have on a Proxmox node (I will use 2TB disks, but in these tests I used a 1TB one).

The goal, depending on the results of these benchmarks, would be to set up an identical configuration on all ten of my hyperconverged Ceph nodes, running the OSDs on top of bcache with the DB/WAL on the same NVMe, but on a separate partition.

I know there's a risk in having two or three spinning disks depend on a single NVMe, but since this risk is shared with the other nodes (I use three replicas), I think it's acceptable.

The fact is that I already have ten nodes, several HDDs, and one NVMe per node, and I need to leverage this hardware in a rational way to get better performance without giving up safety.
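
For context, this is roughly how I would expect to lay out each node and create the OSDs on top of bcache. The device names, partition numbers, and the ceph-volume invocation below are placeholders for illustration, not a tested recipe (on Proxmox the pveceph tooling could be used instead):

Code:
## Planned per-node layout (illustrative sketch only)
# /dev/nvme0n1p1 -> bcache cache device, shared by the HDD-backed bcache devices
# /dev/nvme0n1p3 -> DB/WAL partition for the OSD on /dev/bcache0
# /dev/sdb       -> spinning disk, backing device of /dev/bcache0
ceph-volume lvm create --bluestore --data /dev/bcache0 --block.db /dev/nvme0n1p3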

Testing the NVMe with fio, it performs well enough at 4K random writes, even with the direct and fsync flags.

Code:
root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.11%, 1000=0.01%
  cpu          : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec

But when I run the same test on bcache in writeback mode, performance drops a lot. Of course, it's still better than the spinning disks, but much worse than when accessing the NVMe device directly.

Code:
root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
  lat (usec)   : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
  lat (usec)   : 1000=0.15%
  lat (msec)   : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
  cpu          : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12
Run status group 0 (all jobs):
  WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec

Disk stats (read/write):
    bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
  sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
  nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%

As we can see, the same test run on the bcache0 device only reached 1548 IOPS, which yielded only about 6.3 MB/s.

This is much more than any spinning HDD could give me, but many times less than the result obtained by NVMe.

In several tests, varying the number of jobs or the block size, I've noticed that the larger the block size, the closer the bcache device gets to the performance of the physical device. But the IOPS always seem capped somewhere around 1500-1800 (maximum). Increasing the number of jobs gives better totals, but if you divide the total IOPS by the number of jobs, you can see that they are still limited to the 1500-1800 range per job.
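
For example, a run like the one below (the same parameters as before, just with numjobs=4; the exact job count is only illustrative) is where the per-job cap shows up:

Code:
# Same 4K randwrite test with fsync, but with 4 parallel jobs; the total IOPS grow,
# yet dividing by the job count leaves each job near the 1500-1800 ceiling.
fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K \
    --numjobs=4 --iodepth=1 --runtime=10 --time_based --group_reporting \
    --name=journal-test --ioengine=libaio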

The commands used to configure bcache were:

Code:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I tried everything also with the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
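
For completeness, <cache set> above is the cache set UUID. A sketch of how the devices were put together and how that UUID can be found (using the same device names as in these tests) would be roughly:

Code:
# NVMe partition as the cache device (-C), spinning disk as the backing device (-B)
make-bcache -C /dev/nvme0n1p1 -B /dev/sdb
# The cache set UUID used in the sysfs paths above:
ls /sys/fs/bcache/
# ...or read it from the cache device's superblock:
bcache-super-show /dev/nvme0n1p1 | grep cset.uuid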

Monitoring with dstat, it is possible to see that once the fio command starts, all the writing goes to the cache device (the NVMe partition used as cache) until the end of the test. Only some time later does the spinning disk get written: you can then see reads on the NVMe and writes on the spinning disk, which is the dirty data being transferred in the background.

Code:
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8462B 8000B|0.03 0.15 0.31|  1   0  99   0   0| 250   383 |09-05 15:19:47|   0
   0     0 :4096B  454k:   0   336k|   0     0 :1.00   184 :   0   170 |4566B 4852B|0.03 0.15 0.31|  2   2  94   1   0|1277  3470 |09-05 15:19:48|   1B
   0  8192B:   0  8022k:   0  6512k|   0  2.00 :   0  3388 :   0  3254 |3261B 2827B|0.11 0.16 0.32|  0   2  93   5   0|4397    16k|09-05 15:19:49|   1B
   0     0 :   0  7310k:   0  6460k|   0     0 :   0  3240 :   0  3231 |6773B 6428B|0.11 0.16 0.32|  0   1  93   6   0|4190    16k|09-05 15:19:50|   1B
   0     0 :   0  7313k:   0  6504k|   0     0 :   0  3252 :   0  3251 |6719B 6201B|0.11 0.16 0.32|  0   2  92   6   0|4482    16k|09-05 15:19:51|   1B
   0     0 :   0  7313k:   0  6496k|   0     0 :   0  3251 :   0  3250 |4743B 4016B|0.11 0.16 0.32|  0   1  93   6   0|4243    16k|09-05 15:19:52|   1B
   0     0 :   0  7329k:   0  6496k|   0     0 :   0  3289 :   0  3245 |6107B 6062B|0.11 0.16 0.32|  1   1  90   8   0|4706    18k|09-05 15:19:53|   1B
   0     0 :   0  5373k:   0  4184k|   0     0 :   0  2946 :   0  2095 |6387B 6062B|0.26 0.19 0.33|  0   2  95   4   0|3774    12k|09-05 15:19:54|   1B
   0     0 :   0  6966k:   0  5668k|   0     0 :   0  3270 :   0  2834 |7264B 7546B|0.26 0.19 0.33|  0   1  93   5   0|4214    15k|09-05 15:19:55|   1B
   0     0 :   0  7271k:   0  6252k|   0     0 :   0  3258 :   0  3126 |5928B 4584B|0.26 0.19 0.33|  0   2  93   5   0|4156    16k|09-05 15:19:56|   1B
   0     0 :   0  7419k:   0  6504k|   0     0 :   0  3308 :   0  3251 |5226B 5650B|0.26 0.19 0.33|  2   1  91   6   0|4433    16k|09-05 15:19:57|   1B
   0     0 :   0  6444k:   0  5704k|   0     0 :   0  2873 :   0  2851 |6494B 8021B|0.26 0.19 0.33|  1   1  91   7   0|4352    16k|09-05 15:19:58|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6030B 7204B|0.24 0.19 0.32|  0   0 100   0   0| 209   279 |09-05 15:19:59|   0

This means that the writeback cache mechanism appears to be working as it should, except for the performance limitation.
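
As a sanity check on the writeback behaviour, these sysfs counters can also be watched (attribute names as in the mainline bcache documentation; availability may vary slightly between kernel versions):

Code:
# Dirty data still waiting to be flushed to the backing disk
cat /sys/block/bcache0/bcache/dirty_data
# Cache hit/miss counters for this bcache device
cat /sys/block/bcache0/bcache/stats_total/cache_hits
cat /sys/block/bcache0/bcache/stats_total/cache_misses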

With ioping it is also possible to see the limitation: the latency of the bcache0 device is around 1.5 ms, while on the raw device (an NVMe partition) the same test shows only 82.1 us.

Code:
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=4 time=1.59 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=5 time=1.52 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=6 time=1.44 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=7 time=1.01 ms (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=8 time=968.6 us (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=9 time=1.12 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=10 time=1.12 ms

--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us

-------------------------------------------------------------------

root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  937k:  20B  303k|0.01  11.2 :11.9  42.1 :0.00  5.98 |   0     0 |0.10 0.31 0.36|  0   0  99   0   0| 392   904 |09-05 15:26:35|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |2200B 2506B|0.09 0.31 0.36|  0   0  99   0   0| 437   538 |09-05 15:26:40|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |8868B 8136B|0.09 0.31 0.36|  0   0 100   0   0| 247   339 |09-05 15:26:41|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |7318B 7372B|0.09 0.31 0.36|  0   0  99   0   0| 520  2153 |09-05 15:26:42|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |3315B 2768B|0.09 0.31 0.36|  1   0  97   2   0|1130  2214 |09-05 15:26:43|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |9526B   12k|0.09 0.31 0.36|  1   0  99   0   0| 339   564 |09-05 15:26:44|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |6142B 6536B|0.08 0.30 0.36|  0   1  98   0   0| 316   375 |09-05 15:26:45|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |3378B 3714B|0.08 0.30 0.36|  0   0 100   0   0| 191   328 |09-05 15:26:46|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |  10k   21k|0.08 0.30 0.36|  1   0  99   0   0| 387   468 |09-05 15:26:47|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |7650B 8602B|0.08 0.30 0.36|  0   0  97   2   0| 737  2627 |09-05 15:26:48|   0
   0  4096B:4096B 6144B:   0  4096B|   0  1.00 :1.00  5.00 :   0  3.00 |9025B 8083B|0.08 0.30 0.36|  0   0 100   0   0| 335   510 |09-05 15:26:49|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |  12k   11k|0.08 0.30 0.35|  0   0 100   0   0| 290   496 |09-05 15:26:50|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |5467B 5365B|0.08 0.30 0.35|  0   0 100   0   0| 404   300 |09-05 15:26:51|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |7973B 7315B|0.08 0.30 0.35|  0   0 100   0   0| 195   304 |09-05 15:26:52|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |6183B 4929B|0.08 0.30 0.35|  0   0  99   1   0| 683  2542 |09-05 15:26:53|   0
   0  4096B:4096B   12k:   0     0 |   0  1.00 :1.00  2.00 :   0     0 |4995B 4998B|0.08 0.30 0.35|  0   0 100   0   0| 199   422 |09-05 15:26:54|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8353B 8059B|0.07 0.29 0.35|  0   0 100   0   0| 164   217 |09-05 15:26:55|   0
=====================================================================================================


root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=4 time=94.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=5 time=95.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=6 time=67.5 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=7 time=85.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=8 time=63.5 us (fast)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=9 time=82.2 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=10 time=86.1 us

--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us

-----------------------------------------------------------------------------------------

root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  935k:  20B  302k|0.01  11.2 :11.9  42.0 :0.00  5.96 |   0     0 |0.18 0.25 0.32|  0   0  99   0   0| 392   904 |09-05 15:30:49|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4443B 4548B|0.16 0.25 0.32|  0   0 100   0   0| 108   209 |09-05 15:30:55|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |3526B 3844B|0.16 0.25 0.32|  1   0  99   0   0| 316   434 |09-05 15:30:56|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5855B 4707B|0.16 0.25 0.32|  0   0 100   0   0| 146   277 |09-05 15:30:57|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |8897B 7349B|0.16 0.25 0.32|  0   0  99   1   0| 740  2323 |09-05 15:30:58|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7802B 7280B|0.15 0.24 0.32|  0   0 100   0   0| 118   235 |09-05 15:30:59|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5610B 4593B|0.15 0.24 0.32|  2   0  98   0   0| 667   682 |09-05 15:31:00|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |9046B 8254B|0.15 0.24 0.32|  4   0  96   0   0| 515   707 |09-05 15:31:01|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5323B 5129B|0.15 0.24 0.32|  0   0 100   0   0| 191   247 |09-05 15:31:02|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |4249B 3549B|0.15 0.24 0.32|  0   0  98   2   0| 708  2565 |09-05 15:31:03|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7577B 7351B|0.14 0.24 0.32|  0   0 100   0   0| 291   350 |09-05 15:31:04|   0
   0     0 :2080k 4096B:   0     0 |   0     0 :62.0  1.00 :   0     0 |5731B 5692B|0.14 0.24 0.32|  0   0 100   0   0| 330   462 |09-05 15:31:05|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |7347B 5852B|0.14 0.24 0.32|  1   0  99   0   0| 306   419 |09-05 15:31:06|   0

The cache was configured directly on one of the NVMe partitions (in this case, the first partition). I ran several tests with fio and ioping: on an NVMe partition, on the raw block device without a partition table, on the first partition, on the second, with and without bcache configured, all to remove any doubt about the method. The results of tests performed directly on the hardware device, without going through bcache, are always fast and similar.

But tests through bcache are always slower. With writethrough it is, of course, much worse, because performance is equal to that of the raw spinning disk.

Using writeback improves things a lot, but it still doesn't reach the full speed of the NVMe. I've also noticed a limit on sequential writes, which top out at a little more than half of the maximum write rate the NVMe device shows in direct tests.
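
The sequential comparison was done with something along these lines (block size and queue depth here are only illustrative choices, not necessarily exactly what I ran):

Code:
# Sequential write test to compare bcache0 against the raw NVMe partition
fio --filename=/dev/bcache0 --direct=1 --rw=write --bs=1M --numjobs=1 \
    --iodepth=16 --runtime=10 --time_based --group_reporting \
    --name=seq-test --ioengine=libaio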

CPU usage doesn't seem to climb along with the tests either.

Would anyone know what could be causing these limits?

Thanks
 
I think it must be a matter of fine tuning.

One curious thing I noticed is that the writing always takes place on the flash, never on the spinning disk. This is expected and should give the same fast response as the flash device itself. However, that is not what happens when going through bcache.

But when I remove the fsync flag from the fio test (the flag that makes the application wait for the write to be acknowledged), the 4K writes happen much faster, reaching 73.6 MB/s and 17k IOPS. This is half the device's performance, but it's more than enough for my case. The fsync flag makes no significant difference to the performance of my flash disk when testing directly on it. The fact that bcache speeds up when fsync is removed makes me believe that bcache is not slow to write, but that for some reason it takes a while to acknowledge that the write is complete. I think that must be the point!
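
The command was the same as before, just without the --fsync=1 flag:

Code:
# Same 4K random write test on bcache0, without waiting for fsync on each write
fio --filename=/dev/bcache0 --direct=1 --rw=randwrite --bs=4K --numjobs=1 \
    --iodepth=1 --runtime=10 --time_based --group_reporting \
    --name=journal-test --ioengine=libaio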

And without fsync, ioping tests also speed up, albeit less. In this case, I can see that the latency drops to something around 600~700us.

Nothing compared to the ~84 us (4K ioping write) obtained when writing directly to the flash device (with or without fsync), but still much better than the 1.5 ms you get on the same bcache device when the sync flag is added to wait for the write acknowledgment.
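
For the ioping case, that just means dropping the -Y sync flag from the command used above:

Code:
# Same 4K direct-write ioping test on bcache0, but without synchronous (-Y) writes
ioping -c10 /dev/bcache0 -D -WWW -s4k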

In other words, it looks like the bcache layer inserts a wait between receiving the write, waiting for the disk's response, and then acknowledging it back to the application. This increases latency and consequently reduces performance. I think it must be some fine tuning (or not?). Can someone help?
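
If it really is a tuning issue, these are the kinds of knobs I would expect to be worth inspecting; I have not confirmed that any of them explains the behaviour, and their availability may depend on the kernel version:

Code:
# Writeback tuning on the bcache device
cat /sys/block/bcache0/bcache/writeback_percent
cat /sys/block/bcache0/bcache/writeback_delay
# Journal write delay on the cache set (flush/fsync requests interact with the journal)
cat /sys/fs/bcache/<cache set>/journal_delay_ms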

Thanks,
 
Curious where you ended up on this, @adriano_da_silva - I'm considering migrating one of my clusters from NVMe DB/WAL + spinners to bcache NVMe + spinners. Curious what six months of experience have done to your views.
 
As a follow-up -- did just that, running Ceph on top of bcache for the last 12 months, zero issues. The access times are great (thank you, NVMe) and the rebuild times are much faster.
 
I've been running like this for more than a year.

It's okay for now. Safe and performs much better than using only spinning disks.
 
Agreed.

And much safer than moving the DB/WAL over to another device. With bcache, if the cache device fails, it just falls through to the disks. In Ceph, if the DB/WAL device fails, you lose all the OSDs it was serving. :(
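
For what it's worth, that relationship can be inspected and the cache even detached by hand; a minimal sketch, assuming the standard bcache sysfs attributes (note that a manual detach flushes dirty data first, unlike a sudden cache failure):

Code:
# Shows how bcache0 relates to its cache set: "clean", "dirty", "no cache", ...
cat /sys/block/bcache0/bcache/state
# Detach the cache set; bcache0 then runs directly on the backing disk
echo 1 > /sys/block/bcache0/bcache/detach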
 
