I didn't really know how to title this issue. I couldn't find any similar post on the forums either.
Two identical DELL servers, same specs. I am running PVE 8.0.4, updated to latest version. Servers are in a cluster (no QDevice yet, it's being prepared and is in testing at the moment). I have HW RAID, on top of that I have LVM. Then on top of that I have ZFS.
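For clarity, the layering looks roughly like this (device, VG, LV and pool names here are illustrative placeholders, not my actual ones):

# PERC H740P exposes the RAID10 array as a single virtual disk, e.g. /dev/sdb
pvcreate /dev/sdb
vgcreate vg_data /dev/sdb
lvcreate -n zfs -l 100%FREE vg_data
# single-"disk" zpool on top of the LV (ashift value shown only as an example)
zpool create -o ashift=12 tank /dev/vg_data/zfs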
Now, before some ZFS purist starts posting opinions learned from FreeNAS forums and similar places -> this is actually the best-performing way to run ZFS on my servers. Yes, I did my due diligence and spent literally tens of hours testing various configurations (testing is hard), including of course ZFS on an HBA with the ZIL on SSD. These servers come with a PERC H740P with 8 GB BBWC cache... HW RAID10 + LVM + ZFS consistently gives better performance than any "give the disks to ZFS directly" option I could try. Even with the ZIL on SSD, HW RAID + LVM + ZFS on top was still faster. I just wanted to get that out of the way first, before someone even starts. I can post results if someone is interested.
So, on to the problem. On both servers I have a ZFS datastore (yes, ZFS on LVM on HW RAID). When I test IOPS in a VM (using fio) I get around 3500 IOPS for 4k sync random writes - which is fine (this is basic RAID10 on SATA drives). Since the servers are in a cluster, Proxmox lets me migrate the VM from one node to the other, and that works fine. When I check IOPS on the node I migrated the VM to, the numbers inside the VM stay the same, around 3500 IOPS for 4k sync random writes.
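For reference, the test I run inside the VM looks something like this (file name, size and runtime are examples; the key part is 4k sync random writes at queue depth 1):

rm -f /root/fio.test
fio --name=sync4k --filename=/root/fio.test --size=2G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --sync=1 --runtime=60 --time_based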
Now I turn on replication of this VM, from node pve1 to node pve2. That's the reason I had to go with ZFS in the first place: storage replication doesn't work without it. OK, that works. Now I can migrate the VM much more quickly. Great, it's sort of a poor man's DR plan. Now I have my VM on pve1 and that VM is "storage replicated" to pve2. I run another fio test on pve1 and still get around 3500 IOPS. The ZFS snapshots have no visible cost. Life is great.
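(I set the replication job up via the GUI; the CLI equivalent would be something like this, with 100 as an example VMID and an example schedule:)

# replicate VM 100 to node pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule '*/15'
pvesr status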
Then I migrate this (storage replicated) VM from pve1 to pve2. It goes very fast, because it is already replicated, and that is great.
But here is what happens: after I replicate the VM (from node pve1 to node pve2) and then migrate it (from pve1 to pve2), a fio test inside the VM on pve2 gives me about 200-400 IOPS! Instead of the ~3500 I normally get if I migrate without replicating first.
That's more than 10 times lower IOPS!
I have spent the last ~12 hours trying to pinpoint the root cause of this.
If I remove replication from the VM and then migrate the VM to another node -> I get 3500 IOPS inside the VM on the other node. Always.
If replication is active and I then migrate -> on the other node I get 200-400 IOPS in the VM. It doesn't matter which node I migrate from - if there was replication and then migration, IOPS on the target will suck hard. Always.
Now get this: if I replicate, then migrate (and get the bad IOPS in the VM on the other node), I can do this: on the target node, move the VM disk to another local storage (on the same node) and then move it back to the (local) ZFS storage -> and I get 3500 IOPS again inside the VM.
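In CLI terms the disk-move workaround is just this (100, scsi0 and the storage IDs 'local' and 'local-zfs' are placeholders, adapt as needed):

# on the target node: bounce the disk off the ZFS storage and back
qm move-disk 100 scsi0 local --delete 1
qm move-disk 100 scsi0 local-zfs --delete 1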
For some time I thought this was caused by the ZFS replication snapshots, but it isn't. I can replicate, then migrate, then remove replication (which removes all the snapshots), and I still get the same low IOPS.
Just to answer in advance: yes, I always remove the fio test file before performing another test, so ZFS writes all the test data anew for each run. In my understanding of how ZFS works, these are always new blocks, and (provided space isn't heavily fragmented and there is enough free space - I have tens of TB free and my VM is only 20 GB) they should be written to contiguous storage and, as such, not cause a problem. Even if I had snapshots, which I don't (see the commands after this list). I still get very low IOPS until I either:
- migrate to another node (with replication removed) -> then I get full IOPS again, or
- move the VM storage to another local datastore and then move it back to ZFS datastore -> then again I get full IOPS back
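For completeness, this is how I verify that there really are no snapshots left and that the pool isn't fragmented (pool and volume names are placeholders):

# should list no snapshots after the replication job is removed
zfs list -t snapshot -r tank/vm-100-disk-0
# overall free space and fragmentation on the pool
zpool list -o name,size,allocated,free,fragmentation,capacity tank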
I am at a loss here. This should not be happening and I have no clue why it happens.
If I get 10x lower IOPS after replicating to another node, what good is this for a DR scenario? I have a VM ready to start, but... with 1/10 of the IOPS if I do it this way. So I can't use storage replication. I'm stuck...
Does anyone have any idea what might be the cause of this? At this point I'm thinking it's some sort of bug, somewhere. But I have no idea where to look.
I tried: changing the cache mode, the aio mode, and in-VM mount options (someone somewhere found something about atime -> turned atime off), but nothing helped.
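Concretely, these are the kinds of knobs I tried (VMID, disk and volume names are placeholders):

# different cache / async-IO combinations on the VM disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none,aio=native
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,aio=io_uring
# inside the VM: disable atime on the filesystem under test
mount -o remount,noatime /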
I can reproduce this problem consistently. And the (not very practical) workarounds above as well.
Very grateful if anyone has an idea of what's going on here.