HELP! - Almost quitting to VMware...

jose.cardoso

New Member
Nov 17, 2023
Hello all.

I'm quite new to Proxmox, but I've been using it in some small business systems.

Now, I have a new project where I want to add a new server to an ecosystem of 3 existing VMware servers. The idea is to implement the new server, migrate some important VMs and, after that first step, migrate the 3 existing VMware servers to Proxmox, ending up with a Proxmox cluster of 4 servers.

But, I'm experiencing performance issues on the new server with Proxmox.

The host is a new machine with 32 x Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz (2 sockets) and 256GB RAM.
It has a hardware RAID controller, an Avago MegaRAID SAS 9341-4i, with 4x 8TB drives in a RAID 5 configuration (24TB).
There are also 2x 10Gb Broadcom BCM57412 NetXtreme-E NICs connected to a Ubiquiti 48 Enterprise switch over 10Gb fiber.

The main purpose of this new server is to build a new Windows file server (yes, it must be Windows in this business, although I work with Linux servers) and domain controller, so network performance is very important.

I've set up 2 Windows VMs: a Windows Server 2022 (which will be the new AD domain controller and file server) and a Windows 10.

For testing purposes, I've downloaded 2 big files (1GB and 10GB) and copied them between the 2 VMs over a network file share.

VirtIO drivers were installed and the QEMU guest utilities are running.

I'm experiencing performance issues when copying.

The 1GB file takes some 5 seconds to copy between the 2 VMs. With VMware it's almost immediate.

The 10GB file can take some 30 seconds or even stall during the copy, recovering some time later. With VMware it's more or less 15 seconds, with just a small hang at 60% (Windows buffer?), but that's more or less what I was expecting.

I've also tried to copy some files from an existing server over the network, and that copy (robocopy) also hangs for several minutes, recovering and hanging over and over.

When these symptoms happen, I try to access the Windows Server VM and the machine is unresponsive (both noVNC and MS RDP).

This doesn't happen with VMware. Everything is smooth and I experience no hangs or stalls.

I'm on the latest Proxmox version 8.1.

I've tried machine types 5.1 and 8.1, and changed the NICs' MTU to 9000, as per some blog suggestions.
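For reference, on the host side an MTU change like that goes into /etc/network/interfaces on the physical NIC and the bridge, roughly like this sketch (interface names and addresses are placeholders, not my real config):

Code:
# /etc/network/interfaces (excerpt) - placeholder names, adjust to the real NIC/bridge
auto eno1
iface eno1 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 9000

Note that jumbo frames only help if the switch ports and the guests' NICs are also set to 9000; a mismatch can make things worse, not better.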

But I'm stuck, and I would like to not return to VMware.

What am I doing wrong?
 
How is the VMware cluster configured? Does it have a vSAN or a HW RAID? Which disks are used? What hardware is this running on?
 
@sb-jw.

Thanks for your reply.
No VMware cluster at all. The tests I made were done with VMware ESXi 8.1 on the same machine, so it's a direct comparison.
 
I remember that I had similar troubles with Proxmox in the beginning, but then I found this page with this particular passage:

Code:
##
# Adjust vfs cache
# https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
# Decrease dirty cache for faster flushing to disk
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

It seems everything goes to RAM first and from RAM to disk, which then causes the pauses in copying that you mentioned.
 
@showiproute, yes.
I've created the file sysctl-proxmox-tune.conf in the /etc/sysctl.d dir with the settings you mentioned.

Rebooted the host, but the behavior is the same.
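For what it's worth, the applied values can be double-checked on the host with something like:

Code:
# confirm the new settings are actually active
sysctl vm.dirty_background_ratio vm.dirty_ratio

# re-read all drop-in files under /etc/sysctl.d (no reboot needed)
sysctl --system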

I don't know if I made something wrong, but I'm lost.
 
@showiproute, thanks for the help.

Well, I think it's OK. The parameters look like they're working as expected.
In sysctl-proxmox-tune.conf I've only added the vfs cache parameters. Should I add all the others from the sergey-dryabzhinsky repo?

[attached screenshots of the sysctl configuration and values]
 
You may, but you also need to think about what is useful for your use case and what is not - simply copy/pasting stuff may not be the best option.

Additionally, can you also share the VM specifications with us?
 
This is the server VM, which is a Windows Server 2022 Standard.
The hard disk I'm copying the files to over the network is a 10TB unit (SCSI1).

[attached screenshot: Windows Server 2022 VM hardware configuration]

This is the Windows 10 VM, where I'm copying files from.

[attached screenshot: Windows 10 VM hardware configuration]
 
Have you tried disabling "iothread" on the drives and also setting cache=default (no cache)?
That would be my standard setup, and with it I do get good throughput.
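If the CLI is handier than the GUI, the disk options can be changed with qm set - the VMID and volume name below are only examples, and the VM needs a full shutdown and start for disk option changes to take effect:

Code:
# example only - substitute your own VMID, storage and disk volume
qm set 101 --scsi1 local-lvm:vm-101-disk-1,cache=none,iothread=0

# check what is currently configured for that disk
qm config 101 | grep scsi1

(As far as I know, "Default (No cache)" in the GUI is equivalent to cache=none.)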
 
OK, I've tried with those settings, but I have the same behavior.
I can't do it now, but I'll make a video to show how the Windows Server gets stuck and the copy hangs.
I'm really frustrated!
 
Have you also checked the real IOPS on the server itself when your VM hangs?
In my opinion you are writing stuff to RAM which needs to be flushed to the storage after a period of time. Until that writing has finished, the IOPS will be drastically reduced.
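For example, something along these lines on the PVE host, while the copy is running, would show whether the array is saturated (sysstat is not installed by default):

Code:
apt install sysstat

# extended device stats every 2 seconds - watch wMB/s, %util and the await columns
iostat -xm 2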

You may try using the whole set of sysctl settings I mentioned earlier, but please adapt them to your own needs.
 
You may want to look at using an IT/HBA-mode controller. I avoid HW RAID controllers when possible.
 
The 1GB file takes some 5 seconds to copy between the 2 VMs. With VMware it's almost immediate.
While I can't speak to what you were getting on your vSphere node, since you didn't provide the hardware loadout (specifically storage), I can tell you that 200MB/s is not unreasonable read-then-write throughput on a single RAID 5 volume. I assume your 8TB disks are HDDs.

The 10GB file can take some 30 seconds or even stall during the copy,
I'd say you might want to reconsider your storage configuration. If performance is of paramount importance, you really want SSD/NVMe in a striped mirror configuration.
I've also tried to copy some files from an existing server over the network, and that copy (robocopy) also hangs for several minutes, recovering and hanging over and over.
Post your host's network configuration for assistance, but it's likely that your disk throughput on one side is slower than the network. Also, robocopy isn't a panacea and is still impacted by the size and quantity of files. Does it behave the same if you send one file versus 10,000? What's the average throughput reported when the job is done?
When these symptoms happen, I try to access the Windows Server VM and the machine is unresponsive (both noVNC and MS RDP).
Unless you see the CPU pegged at 100% when this happens, you're probably just running out of IOPS on your storage. This might be alleviated somewhat by giving the guest more RAM, but the real solution is to move your system drive to another storage pool, preferably SSD and NOT parity RAID. You can keep your RAID 5 volume for payload, e.g. data volumes.
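If you want to put a number on the pool, a quick fio run like the one below gives a rough idea of the random-write IOPS the array can sustain. The directory is only an example - point it at a path on the RAID 5 datastore, and note it writes a 1 GiB test file:

Code:
apt install fio

# 4k random writes with direct I/O - adjust --directory to the RAID 5 storage
fio --name=r5-randwrite --directory=/var/lib/vz --rw=randwrite --bs=4k \
    --size=1G --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --group_reporting

A 4-disk HDD RAID 5 typically sustains only a few hundred random write IOPS because of the parity write penalty, which a busy guest plus a large file copy can easily exhaust.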
 
Unfortunately I had to give up on Proxmox for this project. After multiple server hangs, locks and reboots, I started to get data corruption on the disks, ending with the need to recover data from a "dead" VM.

I assume there was something wrong with my setup, because I don't know PVE deeply, but somehow my trust, at the moment, is lost.

I'll still be following PVE, using it on other projects, and investigating further for future implementations.

Thanks to everyone that helped me on this post.

Case closed.
 
