Live Migration / ZFS / Discard

Mar 24, 2024
Hello everyone.

A little bit of background
I'm quite new to the Proxmox world - I'm coming from the VMware world and have been implementing vSphere solutions as a systems engineer for over 15 years now, so I'm pretty familiar with virtualization. Due to the Broadcom fallout we've decided to abandon VMware and use Proxmox from now on. I've been playing with Proxmox VE for the past month and I'm pretty happy with it. Now we are looking at clustering and migration, and this is where I'm struggling. I've searched the interwebs and these forums, but I can't find an answer or explanation for what I'm seeing happen.

The System
We are running the following configuration:
- 2x HPE ProLiant DL380 Gen10 servers (one node with 1 CPU and 128 GB RAM, one node with 2 CPUs and 256 GB RAM)
- Both systems are using local ZFS RAID10 storage on 6x 600 GB HDDs. The disks are backed by a RAID controller in HBA mode.
- The two nodes are connected to the management/VM network with a 4x 1Gbit Linux bond and to the SAN network with 1x 10Gbit
- One node is on 8.1.4 (with a subscription), the other on 8.1.5 (without a subscription) - both report that no updates are available.

The Problem
I've got a test VM running on Node1 and want to (live) migrate it to Node2 and back again.
The disk attached to the VM uses the VirtIO SCSI single controller with "Write back" caching, and Discard and SSD emulation turned on. The disk is 100 GiB in size (thin provisioned, actual usage 33 GiB).
When I migrate the VM the process takes 43 minutes. The target node uses 80%-95% CPU for a long time while no network traffic is being generated, sitting at "drive mirror is starting for drive-scsi0".
Then after some time the network traffic starts (300-400 MiB/s). When the transfer hits the 33 GB mark (after which the VM disk is empty), data is still being transferred over the network, but the target disk on the target node stops growing (I would expect the transfer to be much faster after the actual data has been sent, but that doesn't happen).
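
For reference, this is roughly what the controller and disk lines look like in the VM config - the VMID and storage name here are generic placeholders, not our real ones:

  # /etc/pve/qemu-server/100.conf (excerpt)
  scsihw: virtio-scsi-single
  scsi0: local-zfs:vm-100-disk-0,cache=writeback,discard=on,size=100G,ssd=1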

Now to the conundrum I can't get my head around
If I stop the VM, disable "Discard" and "SSD Emulation", start the VM and do the migration again, the process takes 4 minutes: the transfer starts instantly and does not wait at "drive mirror is starting for drive-scsi0". What still happens, though, is that the transfer continues into the "empty space" and generates network traffic - why is data being transferred for an "empty" disk?

What's going on? Why does the discard flag make the live migration take that much longer and put that much stress on the target CPU? I've read that migrating a VM with discard on will "pre-fill the target disk with zeroes", but to my understanding this should not happen with ZFS? Still, what I'm seeing sure feels like the host is filling the disk with zeroes before starting the migration.
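
If it helps to narrow this down, this is roughly how I've been watching the target side during a migration (pool and volume names are placeholders from my test setup):

  # check whether compression is enabled on the target volume
  zfs get compression,compressratio rpool/data/vm-100-disk-0
  # watch the space used by the target volume while the drive mirror runs
  watch -n 5 'zfs list -o name,used,referenced,volsize rpool/data/vm-100-disk-0'
  # with compression=lz4/zstd, blocks of zeroes should compress away to almost nothing,
  # so "used" staying flat while network traffic continues would fit zeroes being sent over the wire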

Can anyone shed some light on this issue for me please? We'd like to use discard in production but also want to use live migration quite a bit. The best workaround so far is to shut down the VM, remove the discard flag, start the VM, migrate the VM, stop the VM, enable discard, start the VM... which is not ideal for production servers.
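
Scripted out, that workaround looks something like this (VMID, storage and node names are placeholders):

  # disable discard / SSD emulation on the disk (we do this with the VM stopped)
  qm shutdown 100
  qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback
  qm start 100
  # live-migrate including the local disk
  qm migrate 100 node2 --online --with-local-disks
  # re-enable discard / SSD emulation afterwards
  qm shutdown 100
  qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback,discard=on,ssd=1
  qm start 100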

I'm happy to share log files with you guys if that helps.

Thanks for any insight into this problem :)

Cheers

Daniel

(edited for typos)
 
I cannot comment on the issue at hand, because normally you won't do it that way and most people will not have experience doing it.
I want to show you ways that are MUCH faster:
  • use a shared storage, migration will be instant without copying files
  • use ZFS replication and do a switch in a second after the asynchronous transfer is finished
Everything else will be very slow and not a setup I would want to use in production.
 
Hi :)
Thanks for the reply. Shared storage would be ideal, yes - but that's out of our budget range atm. Could CEPH be an alternative?

Can you elaborate on ZFS replication? How would that work? I've never used it before.

Thanks!
 
Thanks for the reply. Shared storage would be ideal, yes - but that's out of our budget range atm. Could CEPH be an alternative?
That is a distributed shared storage and totally fine for this (if you have at least 3 nodes).

The two nodes are connected to the management/VM network with a 4x 1Gbit Linux bond and to the SAN network with 1x 10Gbit
What's that SAN then? A SAN is dedicated shared storage that can be used with PVE.


Can you elaborate on ZFS replication? How would that work? I've never used it before.
Sure, see the storage replication chapter in the reference documentation.
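
A minimal sketch, assuming VM 100 and a target node called node2 (the docs have the details):

  # create a replication job for VM 100 to node2, running every 15 minutes
  pvesr create-local-job 100-0 node2 --schedule '*/15'
  # check the state of the replication jobs
  pvesr status

Once the disks are replicated to the target node, a migration only has to transfer the delta since the last sync, which is why the switch-over is so fast.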
 
That is a distributed shared storage and totally fine for this (if you have at least 3 nodes).
I'll give that a good look then. We'll have 3 nodes once we get rid of all the ESXi hosts.

What's that SAN then? A SAN is dedicated shared storage that can be used with PVE.
This is "just" e dedicated 10Gbit Network for Cluster and Storage traffic - so the PVEs will have an exclusive 10Gbit link to talk to each other and transfer data

Cheers! I'm looking at replication as we speak and this might just be the answer to all I've been doing "wrong" with migration! Thanks for pointing me in the right direction!
 
I'll give that a good look then. We'll have 3 nodes once we get rid of all the ESXi hosts.
Then CEPH is the way to go and you don't need to invest time in ZFS replication, which is mainly for a kind of standby system and mostly used in a two-independent-node scenario. CEPH is a hyperconverged cluster and exactly what you need. If I understand you correctly, you may have a chicken-and-egg problem while migrating away from the machines that will become the PVE cluster, so you may still need to rely on ZFS at least for the duration of the migration.
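
Once all three nodes are in the cluster, the rough outline looks like this (network and disk names are just examples, the Ceph chapter in the docs has the details):

  # on every node: install the Ceph packages
  pveceph install
  # on the first node: initialise Ceph on the dedicated 10Gbit network
  pveceph init --network 10.10.10.0/24
  # on each node: create a monitor and add the local disks as OSDs
  pveceph mon create
  pveceph osd create /dev/sdb
  # finally: create a pool and add it as VM storage
  pveceph pool create vmpool --add_storages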
 
