Failed Fail Over

Thanks, I mostly wanted to know whether zvol_use_blk_mq was set (it's not), but I may find something more later on. I have just experienced more weird behaviour with ZVOLs over time. Please let us know later in case the "fix" was not real, but for now I will assume that thick provisioning the volume did the trick for you.
I need to fail it over a few more times to know for sure; I have only done it once so far, but it worked immediately, which it never did before.
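For reference, on recent OpenZFS releases (where the parameter exists) the setting can be read straight from sysfs, so anyone following along can check their own node:

Code:
# 0 = the blk-mq path for zvols is disabled (the default), 1 = enabled
cat /sys/module/zfs/parameters/zvol_use_blk_mq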
 
@ballybob FWIW, I personally avoid ZVOLs for VMs. Rather than offer anecdotal evidence, if I pull up quick search results, this is still near the top:

https://jrs-s.net/2018/03/13/zvol-vs-qcow2-with-kvm/
Interesting. I like QCOW2 on $anything because of its tree-like snapshot structure, yet the argument I always get when I say that I use QCOW2 for some machines on ZFS (datasets) is that you get COW on COW, and with snapshots the performance becomes unpredictable and slow. I have to concur with that. All VMs feel much more sluggish and slower, and their backups take much more time. If I clone a VM to a ZVOL, it is noticeably faster, including the backup. I did not run artificial benchmarks; I ran real-world tasks, such as creating Oracle databases via script so that you can "see" the runtime, and installing OSes. The article does not mention this at all, just non-snapshot states, and nothing about backup speed, which would be (as much as I love fio) a real-world, perfectly comparable test.

QCOW2 is also not as easy to trim as a ZVOL, so you will waste much more space on your machines. You need to compact the file offline in order to get the space back.
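By "compact offline" I mean rewriting the image while the VM is shut down, something along these lines (the file names are just examples):

Code:
# rewrite the image; blocks the guest has zeroed/discarded are not copied, so the new file shrinks
qemu-img convert -O qcow2 vm-100-disk-0.qcow2 vm-100-disk-0.compact.qcow2
mv vm-100-disk-0.compact.qcow2 vm-100-disk-0.qcow2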

I run a server with 40+ machines on 6 enterprise SSDs in a RAID10 on ZVOLs, and it has worked great for all machines with only linear snapshots; for the tree-like ones, I used a dataset.
 
Interesting. I like QCOW2 on $anything because of its tree-like snapshot structure, yet the argument I always get when I say that I use QCOW2 for some machines on ZFS (datasets) is that you get COW on COW, and with snapshots the performance becomes unpredictable and slow. I have to concur with that. All VMs feel much more sluggish and slower, and their backups take much more time.

Yes, but another way of looking at it is that ZFS (or any COW) is simply unsuitable for it, not the other way around.

If I clone a VM to a ZVOL, it is noticeably faster, including the backup.

Yes, but again, that simply removes one of the two layers; in this case you removed the QCOW2 one.

I did not run artificial benchmarks; I ran real-world tasks, such as creating Oracle databases via script so that you can "see" the runtime, and installing OSes. The article does not mention this at all, just non-snapshot states, and nothing about backup speed, which would be (as much as I love fio) a real-world, perfectly comparable test.

It's not my post; the second link even came up with the opposite results, which is why I included both on purpose.

QCOW2 is also not as easy to trim as a ZVOL, so you will waste much more space on your machines. You need to compact the file offline in order to get the space back.

Yet ZVOLs fail at that too, and they are literally not recommended to be run thin...

I run a server with 40+ machines on 6 enterprise SSDs in a RAID10 on ZVOLs, and it has worked great for all machines with only linear snapshots; for the tree-like ones, I used a dataset.

That's anecdotal evidence, but yes, it's possible. My concern is with the quality of the ZVOL implementation. E.g. I do not try to convince anyone to prefer BTRFS just because it has never failed me personally.

The question here really is ... why thick provisioning that ZVOL with the efidisk on a mostly empty dataset solved anything at all.
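For anyone who wants to replicate the workaround: on ZFS, "thick provisioning" an existing ZVOL comes down to giving it a full refreservation. A minimal sketch with placeholder pool/VMID names (adjust to your setup):

Code:
# inspect the current state of the efidisk zvol
zfs get volsize,refreservation rpool/data/vm-100-disk-0
# 'auto' reserves space equal to the volume size, i.e. makes the zvol thick
zfs set refreservation=auto rpool/data/vm-100-disk-0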
 
First: good discussion ...

Yet ZVOLs fail at that too, and they are literally not recommended to be run thin...
You mentioned that already, yet I don't know why. Where did you get this? I've never stumbled upon this. I've run everything in ZFS thin for almost a decade.

Yes, but another way of looking at it is that ZFS (or any COW) is simply unsuitable for it, not the other way around.
Which type of storage is on par feature-wise with this? I don't know of any.

That's anecdotal evidence, but yes, it's possible. My concern is with the quality of the ZVOL implementation.
I don't have any other information, so of course it is anecdotal. It would be great if one could reproduce the problem.

The question here really is ... why thick provisioning that ZVOL with the efidisk on a mostly empty dataset solved anything at all.
That I don't know.
 
You mentioned that already, yet I don't know why. Where did you get this? I've never stumbled upon this. I've run everything in ZFS thin for almost a decade.

ZVOLs have been riddled with issues for a long time, e.g.:

https://github.com/openzfs/zfs/issues/7631
https://github.com/openzfs/zfs/issues/10095

There are others, but you might say those were not the defaults for PVE:

https://github.com/openzfs/zfs/issues/15351

If you ask me about thin provisioning specifically, sparse ZVOLs are literally "not recommended" (without further reasoning given):
https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html

Obviously, I can't easily tell when a problem arises because a ZVOL is sparse and when it is a general ZVOL issue, but anecdotal evidence supports the logical conclusion that there's more to break when a ZVOL is sparse.
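If you want to see which of your ZVOLs are currently sparse, a quick check (with no dataset given it lists all volumes on the system):

Code:
# sparse zvols typically show refreservation=none; thick ones roughly match volsize
zfs get -t volume refreservation,volsize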

Which type of storage is on par feature-wise with this? I don't know of any.

If I need a VM, I just run it on RAW. I then get my "features" within the VM.

I don't have any other information, so of course it is anecdotal. It would be great if one could reproduce the problem.

Can you reproduce e.g. this:

https://forum.proxmox.com/threads/w...reported-avail-and-refreserv-on-zvols.151874/

That I don't know.

I find the OP's case extreme - it's a tiny volume with almost nothing changing on it.
 
Sorry for necroing this post.

I'm in the same situation. I tried setting the reservation as @esi_y suggested, but no luck. The migration runs all the way to the end with "successful" messages, but when it is time to commit the change, it fails with an I/O error on the EFI disk. I'm using the exact same Home Assistant VM, which was created using the community script.

If I shut down the VM and do a cold migration, it works. It's just the live one that is failing.
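(By cold vs. live I mean roughly the following; the VMID and node name are placeholders:)

Code:
# live migration, VM keeps running
qm migrate 100 pve-target --online
# cold migration, VM is shut down first
qm shutdown 100 && qm migrate 100 pve-target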

However, if I create a new VM with the exact same config, no OS installed, it migrates just fine. Something weird is happening with this HAOS VM...

Any clues on what to look for?

Thanks in advance!
 
Adding to it - I was able to narrow down the problem. The reason the other VM was working and the HAOS one wasn't is that the HAOS VM was using the "host" CPU type, while the new one used the default "x86-64-v2-AES". The former always failed; the latter always worked. I changed HAOS to the default CPU type and it just works.

I wonder how that would be related to an "I/O" error on the UEFI disk...
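For reference, the same change can be made from the CLI as well (VMID 100 here is just an example):

Code:
# switch the vCPU model away from 'host' to the default baseline type
qm set 100 --cpu x86-64-v2-AES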
 
Hi,
Adding to it - I was able to narrow down the problem. The reason the other VM was working and the HAOS one wasn't is that the HAOS VM was using the "host" CPU type, while the new one used the default "x86-64-v2-AES". The former always failed; the latter always worked. I changed HAOS to the default CPU type and it just works.
FYI, you can only use the host CPU type when you have the exact same CPU model on source and target; see: https://pve.proxmox.com/pve-docs/chapter-qm.html#_cpu_type
I wonder how that would be related to an "I/O" error on the UEFI disk...
Sounds a bit like the target QEMU process might not have been running anymore, leading to the I/O error. You can check the system logs/journal on the target system.
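For example, on the target node shortly after the failed migration (the VMID is a placeholder):

Code:
# look for QEMU/migration related messages around the time of the failure
journalctl --since "1 hour ago" | grep -i -e qemu -e migrat
# check whether the target VM process is still running
qm status 100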
 
Hi,

FYI, you can only use the host CPU type when you have the exact same CPU model on source and target; see: https://pve.proxmox.com/pve-docs/chapter-qm.html#_cpu_type

Sounds a bit like the target QEMU process might not have been running anymore, leading to the I/O error. You can check the system logs/journal on the target system.
Thanks for the reply.

I've played a bit with the options, and it was indeed related to the CPU. When I changed it to "x86-64-v3", it worked just fine. The CPU on the main machine is a TRP 7995WX, while the target is an Intel i9-13900K that I use as a temporary host when the TRP is under maintenance. "v4" works on the TRP but not on the i9, so switching to "v3" did the trick.

Thank you!
 
I've played a bit with the options, and it was indeed related to the CPU. When I changed it to "x86-64-v3", it worked just fine. The CPU on the main machine is a TRP 7995WX, while the target is an Intel i9-13900K that I use as a temporary host when the TRP is under maintenance. "v4" works on the TRP but not on the i9, so switching to "v3" did the trick.
Please test it rigorously. I also had such a problem years ago: some VMs worked, others failed a couple of hours later, and some failed right away.

Best to stick to the recommendation to NOT mix CPUs.
 
