Single-threaded performance, which is what you get with one disk inside one VM, will always be disappointing, especially with small block sizes.
The only real remedy is to throw faster hardware at the problem.
BTW: Your rados bench run uses 16 parallel threads and a block size of 4 MB. This will show you nearly the maximum...
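
To put rough numbers on why the two cases differ so much, here is a minimal sketch of a latency-bound estimate (requests in flight times block size, divided by per-request latency). The function name `estimated_throughput` and both latency figures are my own assumptions for illustration, not measurements from your cluster, and the model ignores network and disk bandwidth limits.

```python
MIB = 1024 * 1024


def estimated_throughput(block_size_bytes: int, in_flight: int, latency_s: float) -> float:
    """Latency-bound estimate in bytes/s: in-flight requests * block size / latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return in_flight * block_size_bytes / latency_s


if __name__ == "__main__":
    # Hypothetical per-request latencies, chosen only for illustration.
    single = estimated_throughput(4 * 1024, in_flight=1, latency_s=0.001)
    bench = estimated_throughput(4 * MIB, in_flight=16, latency_s=0.050)
    print(f"1 thread, 4 KiB blocks:   {single / MIB:.1f} MiB/s")
    print(f"16 threads, 4 MiB blocks: {bench / MIB:.1f} MiB/s")
```

With one request in flight, every write has to wait for the full round trip before the next one starts, so latency sets the ceiling. With 16 large requests in flight, the same latency is spread over far more data, and in practice the network or the disks become the limit instead.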