I didn't really know how to title this issue. I couldn't find any similar post on the forums either.
Two identical DELL servers, same specs. I am running PVE 8.0.4, updated to latest version. Servers are in a cluster (no QDevice yet, it's being prepared and is in testing at the moment). I have HW RAID, on top of that I have LVM. Then on top of that I have ZFS.
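For clarity, the layering looks roughly like this (device, VG, LV and pool names here are illustrative placeholders, not my actual ones):

# PERC H740P exposes the RAID10 array as a single virtual disk, e.g. /dev/sdb
pvcreate /dev/sdb
vgcreate vg_data /dev/sdb
lvcreate -n zfs -l 100%FREE vg_data
# single-"disk" zpool on top of the LV (ashift value shown only as an example)
zpool create -o ashift=12 tank /dev/vg_data/zfs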
Now, before some ZFS purist starts posting opinions learned from FreeNAS forums and similar places -> this is actually the best-performing way to run ZFS on my servers. Yes, I did my due diligence and spent literally tens of hours testing various configurations (testing is hard), including of course ZFS on an HBA with the ZIL on SSD. These servers come with a PERC H740P with 8 GB BBWC cache... HW RAID10 + LVM + ZFS consistently gives better performance than any "give the disks to ZFS directly" option I could try. Even with the ZIL on SSD, HW RAID + LVM + ZFS on top was still faster. I just wanted to get that out of the way first, before someone even starts. I can post results if someone is interested.
So, on to the problem. On both servers I have a ZFS datastore (yes, ZFS on LVM on HW RAID). When I test IOPS in a VM (using fio) I get around 3500 IOPS for 4k sync random writes - which is fine (this is basic RAID10 on SATA drives). Since the servers are in a cluster, Proxmox lets me migrate the VM from one node to the other, and that works fine. When I check IOPS on the node I migrated the VM to, the numbers inside the VM stay the same, around 3500 IOPS for 4k sync random writes.
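For reference, the test I run inside the VM looks something like this (file name, size and runtime are examples; the key part is 4k sync random writes at queue depth 1):

rm -f /root/fio.test
fio --name=sync4k --filename=/root/fio.test --size=2G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --sync=1 --runtime=60 --time_based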
Now I turn on replication of this VM, from node pve1 to node pve2. That's the reason I had to go with ZFS in the first place: storage replication doesn't work without it. OK, that works. Now I can migrate the VM much more quickly. Great, it's sort of a poor man's DR plan. Now I have my VM on pve1 and that VM is "storage replicated" to pve2. I run another fio test on pve1 and still get around 3500 IOPS. The ZFS snapshots have no visible cost. Life is great.
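(I set the replication job up via the GUI; the CLI equivalent would be something like this, with 100 as an example VMID and an example schedule:)

# replicate VM 100 to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule '*/15'
pvesr status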
Then I migrate this (storage replicated) VM from pve1 to pve2. It goes very fast, because it is already replicated, and that is great.
But here is what happens: after I replicate the VM (from node pve1 to node pve2) and then migrate it (from pve1 to pve2), a fio test inside the VM on pve2 gives me about 200-400 IOPS! Instead of the ~3500 I normally get if I migrate without replicating first.
That's more than 10 times lower IOPS!
I have spent the last ~12 hours trying to pinpoint the root cause of this.
If I remove replication from the VM and then migrate the VM to another node -> I get 3500 IOPS inside the VM on the other node. Always.
If replication is active and I then migrate -> on the other node I get 200-400 IOPS in the VM. It doesn't matter which node I migrate from - if there was replication and then migration, IOPS on the target will suck hard. Always.
Now get this: if I replicate, then migrate (and get the bad IOPS in the VM on the other node), I can do this: on the target node, move the VM disk to another local storage (on the same node) and then move it back to the (local) ZFS storage -> and I get 3500 IOPS again inside the VM.
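In CLI terms the disk-move workaround is just this (100, scsi0 and the storage IDs 'local' and 'local-zfs' are placeholders, adapt as needed):

# on the target node: bounce the disk off the ZFS storage and back
qm move-disk 100 scsi0 local --delete 1
qm move-disk 100 scsi0 local-zfs --delete 1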
For some time I thought this was caused by the ZFS replication snapshots, but it isn't. I can replicate, then migrate, then remove replication (which removes all the snapshots), and I still get the same low IOPS.
Just to answer in advance: yes, I always remove the fio test file before performing another test, so ZFS writes all the test data anew for each run. In my understanding of how ZFS works, these are always new blocks, and (provided space isn't heavily fragmented and there is enough free space - I have tens of TB free and my VM is only 20 GB) they should be written to contiguous storage and, as such, not cause a problem. Even if I had snapshots, which I don't (see the commands after this list). I still get very low IOPS until I either:
- migrate to another node (with replication removed) -> then I get full IOPS again, or
- move the VM storage to another local datastore and then move it back to ZFS datastore -> then again I get full IOPS back
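For completeness, this is how I verify that there really are no snapshots left and that the pool isn't fragmented (pool and volume names are placeholders):

# should list no snapshots after the replication job is removed
zfs list -t snapshot -r tank/vm-100-disk-0
# overall free space and fragmentation on the pool
zpool list -o name,size,allocated,free,fragmentation,capacity tank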
I am at a loss here. This should not be happening and I have no clue why it happens.
If I get 10x lower IOPS after replicating to another node, what good is this for a DR scenario? I have a VM ready to start, but... with 1/10 of the IOPS if I do it this way. So I can't use storage replication. I'm stuck...
Does anyone have any idea what might be the cause of this? At this point I'm thinking it's some sort of bug, somewhere. But I have no idea where to look.
I tried: changing the cache mode, the aio mode, and in-VM mount options (someone somewhere found something about atime -> turned atime off), but nothing helped.
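Concretely, these are the kinds of knobs I tried (VMID, disk and volume names are placeholders):

# different cache / async-IO combinations on the VM disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none,aio=native
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,aio=io_uring
# inside the VM: disable atime on the filesystem under test
mount -o remount,noatime /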
I can reproduce this problem consistently. And the (not very practical) workarounds above as well.
Very grateful if anyone has an idea of what's going on here.