Migration is a nightmare

gk_emmo

Member
Oct 24, 2020
Dear Members!

We are quite new to Proxmox, and we are in the process of deploying our first small cluster with 5 nodes:
5x Dell R730xd with dual Xeons
192GB RAM each
8x Samsung PM 1TB SSDs for OSDs in each node
2x Samsung NVMe drives for WAL and DB in each node

We are trying to migrate from an old VMware / Hyper-V environment, which was not created by us. We have been using Hyper-V for a long time now, but we really want to move over to Proxmox.

We are at the point where the PVE nodes and the cluster are up, and Ceph is also OK, with 700-800 MB/s write in rados bench. So it looks OK.

But, we struggle to proceed with the actual migration.

There are ~25 machines in total, half of them Linux, the other half Windows Server (2008/2012/2016).

We are stuck on the first 2 servers. One of them is 2012 (not R2, I know...), and the other is 2008 R2.

The 2012 won't let us boot into the machine, and with the 2008 the situation is that qm importdisk is doing the import very slowly compared to the other 3 disks it already imported, with the same size and from the same source.

With the 2012 we tried nearly everything we know; the boot starts OK, but there is no joy with the system itself, it keeps reporting errors due to hardware changes.

With importdisk, does anybody know what can cause this big difference in an import process when everything is the same as with the 3 other disks before? Even the size is identical...
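
For reference, the import is done with an invocation of roughly this form (the VM ID, source path and target storage here are placeholders, not our exact values):

# import a VHDX exported from Hyper-V as a new unused disk of the VM
qm importdisk 100 /mnt/migration/srv2008.vhdx VM-POOL --format raw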

I appreciate any kind of direction or advice; we need to finish the deployment by tomorrow evening...

Thank You in advance,

Gabor
 
Hi,
The 2012 won't let us boot into the machine
Yeah, Windows can be quite picky about HW changes...

What CPU, disk and network options did you set for the VMs? Can you post the VM config(s)? (qm config VMID)

In general, it may help to move the disks over to VirtIO-SCSI, and the network to VirtIO as well, if you have not already done so.
See: https://pve.proxmox.com/wiki/Windows_VirtIO_Drivers
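
A minimal sketch of that switch with plain qm commands (VM ID 100, the volume name and the bridge are examples; the dummy-disk step is one common way to get Windows to load the VirtIO-SCSI driver before the real disk is moved):

# while the guest still boots from IDE, attach a small dummy disk so that
# Windows installs the VirtIO-SCSI driver for the controller
qm set 100 --scsi1 VM-POOL:1
# after a reboot with the driver active: detach the IDE disk (it becomes unused0)
qm set 100 --delete ide0
# reattach the same volume on the VirtIO-SCSI controller and boot from it
qm set 100 --scsi0 VM-POOL:vm-100-disk-0
qm set 100 --bootdisk scsi0
# move the NIC to VirtIO as well
qm set 100 --net0 virtio,bridge=vmbr0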

that qm importdisk is doing the import very slowly compared to the other 3 disks it already imported, with the same size and from the same source.
From what storage do you import the disk?
 
Hi,

We tried many options. It is an older VM without UEFI, with 4 CPU cores, 1 socket, 1 virtual disk, 1 network card and 10 GB of RAM. The config now looks like this (we already tried changing the CPU and the network; the disk has been IDE from the start):

bootdisk: ide0
cores: 4
ide0: VM-POOL:vm-100-disk-0,size=237465M
ide2: none,media=cdrom
memory: 10000
name: SRVWIN63-1
numa: 0
onboot: 1
ostype: win8
parent: chkdskutan
scsihw: virtio-scsi-pci
smbios1: uuid=b76fd356-d84a-4d12-b6f1-04153aed4204
sockets: 1
vmgenid: 5bf31f9f-c4fb-4115-a0f1-afb9a01b2605

For now, we are doing another export of the VM, but we will install the VirtIO drivers before the import, and we will go with these devices once the newly exported virtual disk is imported into Ceph.

The slow import is coming from a single 1TB Samsung SSD, which is attached to the same HBA cards as the other SSDs used as OSDs. We use 2 SSDs for copying these files; it just looked easier, since the old servers have no 10Gbit. We successfully migrated 3 virtual disks of ~500GB into the same Ceph pool. The 4th, which is also ~500GB and was exported by the same process on the old server, does not want to do the job at the same speed. atop shows no busy disk (the disk itself is an SSD with exFAT, so Windows was able to write to it as well; we mount it with exfat-utils).

Thank You again!
 
As an addition, I was curious and created a new pool with fewer PGs. It does the same. I can't see the bottleneck, and I think we will suspend the deployment, because I don't know what else I can do with it now.
 
bootdisk: ide0
cores: 4
ide0: VM-POOL:vm-100-disk-0,size=237465M
ide2: none,media=cdrom
memory: 10000
name: SRVWIN63-1
numa: 0
onboot: 1
ostype: win8
parent: chkdskutan
scsihw: virtio-scsi-pci
smbios1: uuid=b76fd356-d84a-4d12-b6f1-04153aed4204
sockets: 1
vmgenid: 5bf31f9f-c4fb-4115-a0f1-afb9a01b2605

No network here in that VM?

the disk has been IDE from the start
What exactly do you mean here? You do switch that to SCSI as well once the VirtIO drivers are installed?
IDE is pretty terrible here and should only be used for CD-ROM drives, or for legacy reasons where one must.

parent: chkdskutan

Is this a snapshot?

The CPU type also seems to be the default (kvm64); maybe try a more fully featured one. If the cluster uses the same CPU in every node you could even go with "host".
Also check the Spectre/Meltdown flags you could enable for the VMs:
https://pve.proxmox.com/pve-docs/chapter-qm.html#_meltdown_spectre_related_cpu_flags
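
For example (the VM ID is a placeholder; the flags need the matching microcode on the host, and the flag list is quoted because of the ; separator):

# all nodes have identical CPUs, so the full feature set can be passed through
qm set 100 --cpu host
# or a named model plus the Meltdown/Spectre mitigation flags from the docs
qm set 100 --cpu 'kvm64,flags=+pcid;+spec-ctrl'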
 

We've tried many different configs. At the point where I exported the config there was no network added; we removed it to rule it out, because we thought it could be causing trouble.

The disk is IDE because in the old Hyper-V environment it was attached to an IDE controller.

The CPU type was also switched many times.

We needed to suspend the whole deployment, so we switched back to Hyper-V and used one of the 4 nodes as a Hyper-V node, which left the other 3 nodes on PVE+Ceph to find out what is causing this issue.

About that, I started to delete the data that was already imported into the Ceph pool, so we can delete the pool and everything and start from square one. There were 3 ~500GB VHDX files already imported. I pushed the remove button on 1 file 2 days ago, and it is still deleting; it's at 14%. So it looks like the performance is terrible even when we try to delete something.

I think the problem lies somewhere in the config, because that's where we lack knowledge. Can somebody please give us some direction on where we should start looking?

Thank you again.
 
I think the problem lies somewhere in the config, because that's where we lack knowledge. Can somebody please give us some direction on where we should start looking?

I gave various pointers, to which you basically replied vaguely that you had "tried everything already", so I'm not sure what directions one can give here.

IMO, this is hard to tell over the forum. With the mix of changes you may or may not have made, it could be something like this, or really just a missing VirtIO driver (which versions did you use?), or still using slow-as-hell IDE. As said, without knowing the exact steps taken, and ideally a close-up look at the current state, it's quite impossible to give clear directions.

I'd start out by creating and importing a new or dummy VM that closely resembles (from the inside) the Windows VM that makes problems, then go from there.
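
Something along these lines, for example (all values here are made up for illustration):

# throwaway test VM on the same pool, with the recommended VirtIO devices
qm create 999 --name win-test --memory 4096 --cores 4 --ostype win8 \
  --scsihw virtio-scsi-pci --scsi0 VM-POOL:64 --net0 virtio,bridge=vmbr0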
 
Sorry Sir, maybe I didn't explain myself correctly. When I wrote slow, I meant the whole cluster. The deletion of an imported 500GB disk has been ongoing since Saturday, and it is at 14%. That's what makes me mad, because all the OSDs are up, there is no network usage, and I think we should start from scratch.

I asked for directions about any kind of bottleneck or usual problem that others have experience with. For example, I tried to calculate the PG count, but I did not find usable intel. I was able to find a calculator and some equations, but no info about the usage type, by which I mean small files, big files, virtualization, storage usage, etc.
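
The commonly cited rule of thumb works out like this for our cluster (assuming a replicated pool with the default size of 3):

# total_pgs ≈ (num_osds * 100) / replica_count, rounded to a power of two
# 5 nodes * 8 OSDs = 40 OSDs, replica count 3:
#   40 * 100 / 3 ≈ 1333  ->  1024 (or 2048) PGs across all pools combined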

I think I researched enough about how to set up the cluster and the OSDs, but I still don't know what happened when Ceph dropped the import and delete performance on Saturday.

All in all, if you can link a guide that contains information about how to check a working cluster from every angle so I can find the bottleneck, that would be very helpful. I appreciate the help already.
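
A few standard checks that should surface such a bottleneck, assuming stock Ceph and sysstat tools (VM-POOL is the pool name from the config above):

ceph -s                          # overall health, recovery/backfill activity
ceph osd perf                    # per-OSD commit/apply latency
ceph osd df tree                 # utilisation and PG distribution per OSD
rados bench -p VM-POOL 60 write  # raw write benchmark against the pool
iostat -x 5                      # per-device load on each node (sysstat)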

Thank You.
 
