Migration is a nightmare

gk_emmo

Member
Oct 24, 2020
Dear Members!

We are quite new to Proxmox, and we are in the process of deploying our first small cluster with 5 nodes:
5x Dell R730xd with dual Xeons
192GB RAM each
8x Samsung PM 1TB SSDs for OSDs in each node
2x Samsung NVMe drives for WAL and DB in each node

We are trying to migrate from an old VMware / Hyper-V environment, which was not created by us. We have been using Hyper-V for a long time now, but we really want to move over to Proxmox.

We are at the point where the PVE nodes and the cluster are up, and Ceph is also OK, with 700-800 MB/s write in rados bench. So it looks OK.

But, we struggle to proceed with the actual migration.

There are ~25 machines in total, half of them Linux, the other half Windows Server (2008/2012/2016).

We are stuck on the first 2 servers. One of them is 2012 (not R2, I know...), and the other is 2008 R2.

The 2012 won't let us boot into the machine, and with the 2008 the situation is that qm importdisk is doing the import very slowly compared to the other 3 disks it already imported, with the same size and from the same source.

With the 2012 we tried nearly everything we know; the boot starts OK, but there is no joy with the system itself, it keeps reporting errors due to hardware changes.

With importdisk, does anybody know what can cause this big difference in an import process when everything is the same as with the 3 other disks before? Even the size is identical...
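
For reference, the import is done with an invocation of roughly this form (the VM ID, source path and target storage here are placeholders, not our exact values):

# import a VHDX exported from Hyper-V as a new unused disk of the VM
qm importdisk 100 /mnt/migration/srv2008.vhdx VM-POOL --format raw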

I appreciate any kind of direction or advice; we need to finish the deployment by tomorrow evening...

Thank You in advance,

Gabor
 
Hi,
The 2012 won't let us boot into the machine
Yeah, Windows can be quite picky about HW changes...

What CPU, disk and network options did you set for the VMs? Can you post the VM config(s)? (qm config VMID)

In general, it may help to move the disks over to VirtIO-SCSI, and the network to VirtIO as well, if you have not already done so.
See: https://pve.proxmox.com/wiki/Windows_VirtIO_Drivers
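
A minimal sketch of that switch with plain qm commands (VM ID 100, the volume name and the bridge are examples; the dummy-disk step is one common way to get Windows to load the VirtIO-SCSI driver before the real disk is moved):

# while the guest still boots from IDE, attach a small dummy disk so that
# Windows installs the VirtIO-SCSI driver for the controller
qm set 100 --scsi1 VM-POOL:1
# after a reboot with the driver active: detach the IDE disk (it becomes unused0)
qm set 100 --delete ide0
# reattach the same volume on the VirtIO-SCSI controller and boot from it
qm set 100 --scsi0 VM-POOL:vm-100-disk-0
qm set 100 --bootdisk scsi0
# move the NIC to VirtIO as well
qm set 100 --net0 virtio,bridge=vmbr0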

that qm importdisk is doing the import very slowly compared to the other 3 disks it already imported, with the same size and from the same source.
From what storage do you import the disk?
 
Hi,

We tried many options. It is an older VM without UEFI, with 4 CPU cores, 1 socket, 1 virtual disk, 1 network card and 10 GB of RAM. The config now looks like this (we already tried changing the CPU and the network; the disk has been IDE from the start):

bootdisk: ide0
cores: 4
ide0: VM-POOL:vm-100-disk-0,size=237465M
ide2: none,media=cdrom
memory: 10000
name: SRVWIN63-1
numa: 0
onboot: 1
ostype: win8
parent: chkdskutan
scsihw: virtio-scsi-pci
smbios1: uuid=b76fd356-d84a-4d12-b6f1-04153aed4204
sockets: 1
vmgenid: 5bf31f9f-c4fb-4115-a0f1-afb9a01b2605

For now, we are doing another export of the VM, but we will install the VirtIO drivers before the import, and we will go with these devices once the newly exported virtual disk is imported into Ceph.

The slow import is coming from a single 1TB Samsung SSD, which is attached to the same HBA cards as the other SSDs used as OSDs. We use 2 SSDs for copying these files; it just looked easier, since the old servers have no 10Gbit. We successfully migrated 3 virtual disks of ~500GB into the same Ceph pool. The 4th, which is also ~500GB and was exported by the same process on the old server, does not want to do the job at the same speed. atop shows no busy disk (the disk itself is an SSD with exFAT, so Windows was able to write to it as well; we mount it with exfat-utils).

Thank You again!
 
As an addition, I was curious and created a new pool with fewer PGs. It does the same. I can't see the bottleneck, and I think we will suspend the deployment, because I don't know what else I can do with it now.
 
bootdisk: ide0
cores: 4
ide0: VM-POOL:vm-100-disk-0,size=237465M
ide2: none,media=cdrom
memory: 10000
name: SRVWIN63-1
numa: 0
onboot: 1
ostype: win8
parent: chkdskutan
scsihw: virtio-scsi-pci
smbios1: uuid=b76fd356-d84a-4d12-b6f1-04153aed4204
sockets: 1
vmgenid: 5bf31f9f-c4fb-4115-a0f1-afb9a01b2605

No network here in that VM?

the disk has been IDE from the start
What exactly do you mean here? You do switch that to SCSI as well once the VirtIO drivers are installed?
IDE is pretty terrible here and should only be used for CD-ROM drives, or for legacy reasons where one must.

parent: chkdskutan

Is this a snapshot?

The CPU type also seems to be the default (kvm64); maybe try a more fully featured one. If the cluster uses the same CPU in every node you could even go with "host".
Also check the Spectre/Meltdown flags you could enable for the VMs:
https://pve.proxmox.com/pve-docs/chapter-qm.html#_meltdown_spectre_related_cpu_flags
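
For example (the VM ID is a placeholder; the flags need the matching microcode on the host, and the flag list is quoted because of the ; separator):

# all nodes have identical CPUs, so the full feature set can be passed through
qm set 100 --cpu host
# or a named model plus the Meltdown/Spectre mitigation flags from the docs
qm set 100 --cpu 'kvm64,flags=+pcid;+spec-ctrl'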
 

We've tried many different configs. At the point where I exported the config there was no network added; we removed it to rule it out, because we thought it could be causing trouble.

The disk is IDE because in the old Hyper-V environment it was attached to an IDE controller.

The CPU type was also switched many times.

We needed to suspend the whole deployment, so we switched back to Hyper-V and used one of the 4 nodes as a Hyper-V node, which left the other 3 nodes on PVE+Ceph to find out what is causing this issue.

About that, I started to delete the data that was already imported into the Ceph pool, so we can delete the pool and everything and start from square one. There were 3 ~500GB VHDX files already imported. I pushed the remove button on 1 file 2 days ago, and it is still deleting; it's at 14%. So it looks like the performance is terrible even when we try to delete something.

I think the problem lies somewhere in the config, because that's where we lack knowledge. Can somebody please give us some direction on where we should start looking?

Thank you again.
 
I think the problem lies somewhere in the config, because that's where we lack knowledge. Can somebody please give us some direction on where we should start looking?

I gave various pointers, to which you basically replied vaguely that you had "tried everything already", so I'm not sure what directions one can give here.

IMO, this is hard to tell over the forum. With the mix of changes you may or may not have made, it could be something like this, or really just a missing VirtIO driver (which versions did you use?), or still using slow-as-hell IDE. As said, without knowing the exact steps taken, and ideally a close-up look at the current state, it's quite impossible to give clear directions.

I'd start out by creating and importing a new or dummy VM that closely resembles (from the inside) the Windows VM that makes problems, then go from there.
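
Something along these lines, for example (all values here are made up for illustration):

# throwaway test VM on the same pool, with the recommended VirtIO devices
qm create 999 --name win-test --memory 4096 --cores 4 --ostype win8 \
  --scsihw virtio-scsi-pci --scsi0 VM-POOL:64 --net0 virtio,bridge=vmbr0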
 
Sorry Sir, maybe I didn't explain myself correctly. When I wrote slow, I meant the whole cluster. The deletion of an imported 500GB disk has been ongoing since Saturday, and it is at 14%. That's what makes me mad, because all the OSDs are up, there is no network usage, and I think we should start from scratch.

I asked for directions about any kind of bottleneck or usual problem that others have experience with. For example, I tried to calculate the PG count, but I did not find usable intel. I was able to find a calculator and some equations, but no info about the usage type, by which I mean small files, big files, virtualization, storage usage, etc.
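
The commonly cited rule of thumb works out like this for our cluster (assuming a replicated pool with the default size of 3):

# total_pgs ≈ (num_osds * 100) / replica_count, rounded to a power of two
# 5 nodes * 8 OSDs = 40 OSDs, replica count 3:
#   40 * 100 / 3 ≈ 1333  ->  1024 (or 2048) PGs across all pools combined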

I think I researched enough about how to set up the cluster and the OSDs, but I still don't know what happened when Ceph dropped the import and delete performance on Saturday.

All in all, if you can link a guide that contains information about how to check a working cluster from every angle so I can find the bottleneck, that would be very helpful. I appreciate the help already.
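
A few standard checks that should surface such a bottleneck, assuming stock Ceph and sysstat tools (VM-POOL is the pool name from the config above):

ceph -s                          # overall health, recovery/backfill activity
ceph osd perf                    # per-OSD commit/apply latency
ceph osd df tree                 # utilisation and PG distribution per OSD
rados bench -p VM-POOL 60 write  # raw write benchmark against the pool
iostat -x 5                      # per-device load on each node (sysstat)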

Thank You.
 
