Proxmox always pre-allocates when migrating? (LVM to LVM)

Moonrocks

Hello,

I have a 4-node cluster; each node has an LVM-thin pool with 4 drives.
Whenever I create a new VM, the space is not pre-allocated (which is great), but when I migrate a VM from one node to another, the full disk size gets used up on the target node. Is pre-allocation done by default when migrating, and if so, how do I avoid it?

TIA! :)
 


Thank you for pointing these out. The issue I have is that the QEMU guest agent isn't installed on the VMs, and these are customer VMs, so I can't really go in and run the trim. I have set up a new node where I want to migrate these VMs (and avoid any downtime). I can set up a different type of storage on this new node (ZFS, etc.). I need a way to live migrate without having to run trim on the guest machines and without pre-allocating; I am open to using a different type of storage on the new node. Do I have any options?
 
Why not use Ceph?
Ceph sounds really attractive and I'm seriously considering it. However, I am a bit concerned about its performance compared to raw NVMe, which is what I'm currently using. How does Ceph compare to that?

I have ConnectX4 and X3 (40G) NICs on these servers.
 
The performance is largely dependent on the network; with 4 DC NVMe drives per host you can definitely saturate 100-200G, let alone your 40G. Do you need more than that?

The consideration is mostly cost vs. performance vs. safety: if a host dies, how will you recover your current environment? Is the data loss or downtime you can currently experience acceptable? If so, Ceph with unsafe caching and 2 copies will perform just as well; you'll get the same data security, but with the benefit of a shorter data-loss window and recovery time frame.

If you want the synchronous 3-way data safety of a default Ceph install, you obviously have to sacrifice some speed, but your data loss will be near zero and the recovery time frame measured in seconds.

Most of the performance horror stories here involve desktop-grade equipment.
 
The performance is largely dependent on the network, with 4 DC NVMe per host you can definitely saturate up to 100-200G if you have 40G, do you need more than that? [...]
Thank you for the insight. Regarding hardware, I'm running R740s for compute with 2x Platinum 8268. How would things look if I populate each storage node with 1x Xeon E5-2683 v4?
 
That processor is about 10 years old. You may still be able to saturate 40G, but a single E5 with DDR4 will add latency and won't be able to saturate your NVMe. It's good enough for testing or a spinning-disk backup/archive target, but I wouldn't rely on anything that old for production.
 
Space will be pre-allocated (that is, thin provisioning will be lost) on any non-shared storage if you live migrate the VM, because QEMU needs to put the source disk into "mirror" state so that every write to the source disk is written synchronously to the destination disk too. That even copies "zeros" instead of preserving sparse areas of the source disk. If you migrate the VM while it's off, QEMU uses a disk clone, which preserves sparse areas.
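The difference is easy to reproduce with plain files: copying zeros block-by-block (what the mirror does) fully allocates the target, while a sparse-aware copy (what an offline clone does) preserves the holes. A rough illustration with ordinary files rather than VM disks, using GNU `cp`'s `--sparse` flag:

```shell
#!/bin/sh
# Illustration only: what happens to sparse areas during a live
# migration (zeros mirrored as real data) vs an offline clone
# (holes preserved). Uses plain files, not real VM disks.
set -e
dir="${TMPDIR:-/tmp}/sparse-demo"
mkdir -p "$dir"

# A 64 MiB sparse file: apparent size 64 MiB, almost no blocks allocated.
truncate -s 64M "$dir/thin.img"

# "Live migration" behaviour: zeros get written out as real data.
cp --sparse=never "$dir/thin.img" "$dir/preallocated.img"

# "Offline clone" behaviour: holes are preserved.
cp --sparse=always "$dir/thin.img" "$dir/still-thin.img"

# preallocated.img now occupies ~64 MiB on disk, the others almost nothing.
du -h "$dir"/*.img
```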

You should definitely tell the VM administrator to install the QEMU guest agent (on Windows, the full VirtIO drivers plus the balloon driver are a must too). Besides the option to automatically run fstrim after a disk clone or disk migration, the QEMU agent helps make backups consistent at the filesystem level.
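On the Proxmox side both of those are per-VM options set with `qm`. A sketch (the VM ID is a placeholder; outside a PVE node the script just reports and does nothing):

```shell
#!/bin/sh
VMID=100   # placeholder: substitute the real VM ID

if command -v qm >/dev/null 2>&1; then
    # Enable the guest agent and have QEMU request an fstrim inside
    # the guest after a disk move or clone completes.
    qm set "$VMID" --agent enabled=1,fstrim_cloned_disks=1
else
    echo "qm not found - run this on a Proxmox VE node"
fi
```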

IMHO, local storage can only be justified if you both require very, very low disk latency and can tolerate a high RTO/RPO in case of host failure. I would use Ceph and dimension the storage accordingly.

to a directory backed store
If you want snapshots, you have to use the qcow2 format for disks, and creating/deleting snapshots with it is quite a bit slower than with LVM/ZFS/Ceph.
 
The issue is that we are considering setting up Ceph for non-latency-sensitive VMs and migrating VMs from our current LVM-thin storage over to it. From what I understand so far, regardless of which storage backend we migrate to, pre-allocation will happen if we live-migrate off our current LVM-thin storage. What we're considering now is taking a backup of the VM, spinning up the backup on the new storage backend, and shutting down the existing VM once the new one is live (though we are trying to avoid this, since it has its own problems with data consistency on the VMs).

I also tried running fstrim inside a test VM but that didn’t release the unused space.
 
considering setting up CEPH for non latency intensive VMs
Do you have numbers on what the performance should be? Without them, you can't decide which VMs are "non latency intensive". Ceph isn't slow by any means, but of course you have the added latency and capacity limit of the network. How much that affects real-world performance depends on many factors, like write patterns or sync vs. async writes. Take data from your current storage to find out your current performance profile (netdata and the Proxmox-integrated metric server + Grafana will help you here).
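Beyond the monitoring tools mentioned above, one quick way to get a baseline latency number to compare against Ceph later is a small synchronous-write test with `fio` (not mentioned in the thread; file path, size and job parameters below are illustrative, and the script skips itself if fio isn't installed):

```shell
#!/bin/sh
# Hypothetical baseline test: 4k random writes with an fsync per write,
# queue depth 1 - roughly the worst case for a latency-sensitive VM.
if command -v fio >/dev/null 2>&1; then
    fio --name=sync-write-latency --filename=/tmp/fio-test.dat \
        --rw=randwrite --bs=4k --size=64M --fsync=1 \
        --iodepth=1 --numjobs=1 --group_reporting
    rm -f /tmp/fio-test.dat
else
    echo "fio not installed (e.g. apt install fio)"
fi
```

Run it on the current LVM-thin storage and again on a Ceph RBD test volume; the `clat` percentiles are the numbers to compare.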

the preallocation will be made if we want to live-migrate from our current Thin-LVM storage
Yes, it will. That's why you need the QEMU agent in every VM. Or make customers pay for the whole drive they provision, so you can buy more disks and allocate the full space ;)

What we’re considering now is that we can take a backup of the VM
You will lose minutes, maybe even hours, of data in those VMs. How important that is is for you to decide.

I also tried running fstrim inside a test VM but that didn’t release the unused space.
You are doing it wrong. It's detailed in the post linked before [1]. With the proper VM configuration it works 100% of the time, provided the VM's OS supports trimming.

[1] https://forum.proxmox.com/threads/v...fter-migrating-it-to-lvm-thin-storage.142070/
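For reference, the recipe from that thread boils down to a discard-capable disk (`discard=on` on a VirtIO SCSI disk) plus a trim issued inside the guest, which the agent can do for you. A sketch with the VM ID and disk volume name as placeholders (it only acts when run on a PVE node):

```shell
#!/bin/sh
VMID=100   # placeholder VM ID

if command -v qm >/dev/null 2>&1; then
    # Discard-capable controller and disk (volume name is a placeholder):
    qm set "$VMID" --scsihw virtio-scsi-single
    qm set "$VMID" --scsi0 "local-lvm:vm-$VMID-disk-0,discard=on"
    # Trim via the guest agent, no guest login required:
    qm guest cmd "$VMID" fstrim
else
    echo "qm not found - run this on a Proxmox VE node"
fi
```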
 
Do you have numbers on what the performance should be? Without them, you can't decide which VMs are "non latency intensive". [...]
The prerequisite for us to set up Ceph is being able to keep the thin provisioning, so we are trying to figure that out before gathering the data. Once we have a clear roadmap, we will work on aggregating the data.
You will lose minutes, maybe even hours of data in those VMs. How important that is for you to decide.
This would likely be the last resort for us.
You are doing it wrong. It's detailed in the post linked before [1]. [...]
You're right; I followed the post to the T and was able to get the space released. Now, would it be possible to:

1) Install the QEMU guest agent via cloud-init? We have cloud-init set up on these VMs. I'm thinking something like this:
-> Set up a cloud-init snippet: /var/lib/vz/snippets/qemu-ga.yaml

Code:
#cloud-config
packages:
  - qemu-guest-agent

runcmd:
  - systemctl enable --now qemu-guest-agent


2) Enable the QEMU guest agent trim on these VMs (is enabling this possible without a power cycle?)


TIA :-)
 
Note that most modern database systems use some form of network replication before considering data to be "stable". If Ceph adding a few hundred microseconds is a problem, you must consider the entire VM/SDN network architecture as well.

As for the trim: depending on the OS, if you have cloud-init you can just install/enable qemu-guest-agent, make sure you have VirtIO disks, and continuous trim should happen automatically on most mainstream modern Linux distributions.
 
Note that most modern database systems use some form of network replication before considering data to be "stable". If Ceph adding a few hundred microseconds is a problem, you must consider the entire VM/SDN network architecture as well.
We do have archival backups for that purpose, but you're right, it does make the RTO much longer.
As far as the trim, depending on the OS, if you have cloudinit, then you can just install/enable qemu-agent,
Perfect. I think this would be the way to go, since we need the trim even if we move over to Ceph.

make sure you have VirtIO disks and continuous trim should happen automatically on most mainstream modern Linux.
We do have VirtIO disks and discard is on (please find attached the screen grab). Does this mean that if we do a phased migration and wait it out, the trims will happen automatically? We are trying to avoid rebooting the VMs (for cloud-init to do its thing and for us to enable the guest agent on the Proxmox side for each VM).
 

Attachments: SS.png (28.9 KB)
Again, it depends on your environment, but default Ubuntu 22.04/24.04 and RHEL 9 run fstrim on a timer (see systemctl list-timers). Windows 10 / Server 2016 and later do as well.
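To check whether a given Linux guest already does this, look at the timer and at the mount options. A sketch (it silently skips on systems without systemd; the filesystem list is just the common cases):

```shell
#!/bin/sh
if command -v systemctl >/dev/null 2>&1; then
    # Is periodic trim scheduled?
    systemctl list-timers fstrim.timer --no-pager 2>/dev/null
    # Are filesystems mounted with continuous discard, or with nodiscard?
    command -v findmnt >/dev/null 2>&1 && \
        findmnt -t ext4,xfs,btrfs -o TARGET,FSTYPE,OPTIONS
fi
checked=done
```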

Now, if you use ZFS or FAT32 partitions, or you disabled discard on XFS, or set nodiscard in the mount options, then you need to check what's required for trim to work.

You don't need to reboot to enable qemu-guest-agent: you should be able to start it on the guest manually, through Ansible deployments, or however you manage your guests, and trim will work immediately, provided discard is set in the VM configuration as well.
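Assuming the agent device was already enabled in the VM config when the guest booted (if not, the virtio channel only appears on the next power cycle), the steps above can be sketched as follows; the VM ID is a placeholder, and the script only acts on a PVE node:

```shell
#!/bin/sh
VMID=100   # placeholder VM ID

# Inside the guest (no reboot needed once the package is installed):
#   systemctl enable --now qemu-guest-agent

if command -v qm >/dev/null 2>&1; then
    # From the PVE node, check that the agent answers:
    qm agent "$VMID" ping && echo "guest agent is responding" \
        || echo "agent not responding yet"
else
    echo "qm not found - run this on a Proxmox VE node"
fi
```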
 