The problem with dd as a benchmark is that it behaves like iodepth=1 with sequential I/O,
so you'll be limited by latency (network + CPU frequency).
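
To make the latency limit concrete, here is a rough back-of-the-envelope sketch. It assumes each request takes a fixed round-trip time and that throughput scales linearly with the number of requests in flight, which only holds until something else (CPU, network, the OSDs) saturates. The 0.5 ms latency is a made-up illustrative number, not a measurement, and max_iops is just a helper name for this example.

```python
def max_iops(iodepth: int, latency_s: float) -> float:
    """Upper bound on IOPS with `iodepth` requests in flight,
    each taking `latency_s` seconds to complete."""
    return iodepth / latency_s


# Hypothetical per-request latency (network round trip + CPU time): 0.5 ms.
latency_s = 0.0005

# iodepth=1 is roughly what dd does; 64 is a typical deep-queue fio setting.
for depth in (1, 64):
    print(f"iodepth={depth}: at most {max_iops(depth, latency_s):,.0f} IOPS")
```

With one request in flight the ceiling is simply 1/latency, no matter how fast the disks are. A deeper queue hides that latency by keeping many requests outstanding at once.
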
With 18 SSD OSDs, 3x replication, and big 2x12-core 3.1 GHz CPUs, I'm able to reach around 700k IOPS on 4K randread and 150-200k IOPS on 4K randwrite.
(fio, iodepth=64...