Failed Fail Over

Thanks, I mostly wanted to know whether zvol_use_blk_mq was set (it's not), but I may find something more later on. I have just experienced more weird behaviour with ZVOLs over time. Please let us know later in case the "fix" was not real, but for now I will assume that thick provisioning the volume did the trick for you.
I need to fail it over a few more times to know for sure; I have only done it once so far, but it worked immediately, which it never did before.
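For reference, on recent OpenZFS releases (where the parameter exists) the setting can be read straight from sysfs, so anyone following along can check their own node:

Code:
# 0 = the blk-mq path for zvols is disabled (the default), 1 = enabled
cat /sys/module/zfs/parameters/zvol_use_blk_mq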
 
@ballybob FWIW, I personally avoid ZVOLs for VMs. Rather than offer anecdotal evidence, if I pull up quick search results, this is still near the top:

https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/
Interesting. I like QCOW2 on $anything because of its tree-like snapshot structure, yet the argument I always get when I say that I use QCOW2 for some machines on ZFS (datasets) is that you get COW on COW, and with snapshots the performance becomes unpredictable and slow. I have to concur with that. All VMs feel much more sluggish and slower, and their backups take much more time. If I clone a VM to a ZVOL, it is noticeably faster, including the backup. I did not run artificial benchmarks; I ran real-world tasks, such as creating Oracle databases via script so that you can "see" the runtime, and installing OSes. The article does not mention this at all, just non-snapshot states, and nothing about backup speed, which would be (as much as I love fio) a real-world, perfectly comparable test.

QCOW2 is also not as easy to trim as a ZVOL, so you will waste much more space on your machines. You need to compact the file offline in order to get the space back.
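By "compact offline" I mean rewriting the image while the VM is shut down, something along these lines (the file names are just examples):

Code:
# rewrite the image; blocks the guest has zeroed/discarded are not copied, so the new file shrinks
qemu-img convert -O qcow2 vm-100-disk-0.qcow2 vm-100-disk-0.compact.qcow2
mv vm-100-disk-0.compact.qcow2 vm-100-disk-0.qcow2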

I run a server with 40+ machines on 6 enterprise SSDs in a RAID10 on ZVOLs, and it has worked great for all machines with only linear snapshots; for the tree-like ones, I used a dataset.
 
Interesting. I like QCOW2 on $anything because of its tree-like snapshot structure, yet the argument I always get when I say that I use QCOW2 for some machines on ZFS (datasets) is that you get COW on COW, and with snapshots the performance becomes unpredictable and slow. I have to concur with that. All VMs feel much more sluggish and slower, and their backups take much more time.

Yes, but another way of looking at it is that ZFS (or any COW) is simply unsuitable for it, not the other way around.

If I clone a VM to a ZVOL, it is noticeably faster, including the backup.

Yes, but again, that simply removes one of the two layers; in this case you removed the QCOW2 one.

I did not run artificial benchmarks; I ran real-world tasks, such as creating Oracle databases via script so that you can "see" the runtime, and installing OSes. The article does not mention this at all, just non-snapshot states, and nothing about backup speed, which would be (as much as I love fio) a real-world, perfectly comparable test.

It's not my post; the second link even came up with the opposite results, which is why I included both on purpose.

QCOW2 is also not as easy to trim as a ZVOL, so you will waste much more space on your machines. You need to compact the file offline in order to get the space back.

Yet ZVOLs fail at that too, and they are literally not recommended to be run thin...

I run a server with 40+ machines on 6 enterprise SSDs in a RAID10 on ZVOLs, and it has worked great for all machines with only linear snapshots; for the tree-like ones, I used a dataset.

That's anecdotal evidence, but yes, it's possible. My concern is with the quality of the ZVOL implementation. E.g. I do not try to convince anyone to prefer BTRFS just because it has never failed me personally.

The question here really is ... why thick provisioning that ZVOL with the efidisk on a mostly empty dataset solved anything at all.
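For anyone who wants to replicate the workaround: on ZFS, "thick provisioning" an existing ZVOL comes down to giving it a full refreservation. A minimal sketch with placeholder pool/VMID names (adjust to your setup):

Code:
# inspect the current state of the efidisk zvol
zfs get volsize,refreservation rpool/data/vm-100-disk-0
# 'auto' reserves space equal to the volume size, i.e. makes the zvol thick
zfs set refreservation=auto rpool/data/vm-100-disk-0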
 
First: good discussion ...

Yet ZVOLs fail at that too, and they are literally not recommended to be run thin...
You mentioned that already, yet I don't know why. Where did you get this? I've never stumbled upon this. I've run everything in ZFS thin for almost a decade.

Yes, but another way of looking at it is that ZFS (or any COW) is simply unsuitable for it, not the other way around.
Which type of storage is on par feature-wise with this? I don't know of any.

That's anecdotal evidence, but yes, it's possible. My concern is with the quality of the ZVOL implementation.
I don't have any other information, so of course it is anecdotal. It would be great if one could reproduce the problem.

The question here really is ... why thick provisioning that ZVOL with the efidisk on a mostly empty dataset solved anything at all.
That I don't know.
 
You mentioned that already, yet I don't know why. Where did you get this? I've never stumbled upon this. I've run everything in ZFS thin for almost a decade.

ZVOLs have been riddled with issues for a long time, e.g.:

https://github.com/openzfs/zfs/issues/7631
https://github.com/openzfs/zfs/issues/10095

There are others, but you might say those were not the defaults for PVE:

https://github.com/openzfs/zfs/issues/15351

If you ask me about thin provisioning specifically, sparse ZVOLs are literally "not recommended" (without further reasoning given):
https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html

Obviously, I can't easily tell when a problem arises because a ZVOL is sparse and when it is a general ZVOL issue, but anecdotal evidence supports the logical conclusion that there's more to break when a ZVOL is sparse.
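If you want to see which of your ZVOLs are currently sparse, a quick check (with no dataset given it lists all volumes on the system):

Code:
# sparse zvols typically show refreservation=none; thick ones roughly match volsize
zfs get -t volume refreservation,volsize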

Which type of storage is on par feature-wise with this? I don't know of any.

If I need a VM, I just run it on RAW. I then get my "features" within the VM.

I don't have any other information, so of course it is anecdotal. It would be great if one could reproduce the problem.

Can you reproduce e.g. this:

https://forum.proxmox.com/threads/w...reported-avail-and-refreserv-on-zvols.151874/

That I don't know.

I find the OP's case extreme - it's a tiny volume with almost nothing changing on it.
 
Sorry for necroing this post.

I'm in the same situation. I tried setting the reservation as @esi_y suggested, but no luck. The migration runs all the way to the end with "successful" messages, but when it is time to commit the change, it fails with an I/O error on the EFI disk. I'm using the exact same Home Assistant VM, which was created using the community script.

If I shut down the VM and do a cold migration, it works. It's just the live one that is failing.
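(By cold vs. live I mean roughly the following; the VMID and node name are placeholders:)

Code:
# live migration, VM keeps running
qm migrate 100 pve-target --online
# cold migration, VM is shut down first
qm shutdown 100 && qm migrate 100 pve-target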

However, if I create a new VM with the exact same config, no OS installed, it migrates just fine. Something weird is happening with this HAOS VM...

Any clues on what to look for?

Thanks in advance!
 
Adding to it - I was able to narrow down the problem. The reason the other VM was working and the HAOS one wasn't is that the HAOS VM was using the "host" CPU type, while the new one used the default "x86-64-v2-AES". The former always failed; the latter always worked. I changed HAOS to the default CPU type and it just works.

I wonder how that would be related to an "I/O" error on the UEFI disk...
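For reference, the same change can be made from the CLI as well (VMID 100 here is just an example):

Code:
# switch the vCPU model away from 'host' to the default baseline type
qm set 100 --cpu x86-64-v2-AES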
 
Hi,
Adding to it - I was able to narrow down the problem. The reason the other VM was working and the HAOS one wasn't is that the HAOS VM was using the "host" CPU type, while the new one used the default "x86-64-v2-AES". The former always failed; the latter always worked. I changed HAOS to the default CPU type and it just works.
FYI, you can only use the host CPU type when you have the exact same CPU model on source and target; see: https://pve.proxmox.com/pve-docs/chapter-qm.html#_cpu_type
I wonder how that would be related to an "I/O" error on the UEFI disk...
Sounds a bit like the target QEMU process might not have been running anymore, leading to the I/O error. You can check the system logs/journal on the target system.
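For example, on the target node shortly after the failed migration (the VMID is a placeholder):

Code:
# look for QEMU/migration related messages around the time of the failure
journalctl --since "1 hour ago" | grep -i -e qemu -e migrat
# check whether the target VM process is still running
qm status 100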
 
Hi,

FYI, you can only use the host CPU type when you have the exact same CPU model on source and target; see: https://pve.proxmox.com/pve-docs/chapter-qm.html#_cpu_type

Sounds a bit like the target QEMU process might not have been running anymore, leading to the I/O error. You can check the system logs/journal on the target system.
Thanks for the reply.

I've played a bit with the options, and it was indeed related to the CPU. When I changed it to "x86-64-v3", it worked just fine. The CPU on the main machine is a TRP 7995WX, while the target is an Intel i9-13900K that I use as a temporary host when the TRP is under maintenance. "v4" works on the TRP but not on the i9, so switching to "v3" did the trick.

Thank you!
 
I've played a bit with the options, and it was indeed related to the CPU. When I changed it to "x86-64-v3", it worked just fine. The CPU on the main machine is a TRP 7995WX, while the target is an Intel i9-13900K that I use as a temporary host when the TRP is under maintenance. "v4" works on the TRP but not on the i9, so switching to "v3" did the trick.
Please test it rigorously. I also had such a problem years ago: some VMs worked, others failed a couple of hours later, and some failed right away.

Best to stick to the recommendation to NOT mix CPUs.
 
