Proxmox performance

CoMox

For my company I have been using Proxmox through a supplier for a little over a year now, and I am not too happy. I hope there are some people here who can help me find a solution. Please bear with me as I write it all out and try to give a good overview without making it too long.

We have supplied remote desktop services for over 10 years and used VMware all those years. For the last 4 years we were running on a 4-node Dell system with a single SAN underneath, on a 2x 10 Gb network in the same DC rack, and we never had performance issues. As the platform got fuller we started looking for a replacement and found a Proxmox supplier, which sounded great.

So, as advised, we bought 3 compute nodes (80 x Xeon Gold 6138 @ 2.00 GHz, 2 sockets / 768 GiB mem) and 3 storage nodes (I don't have the hardware details), about 80% populated with SSDs only, so ~10 SSDs per node, where the old SAN had only spindle disks. All placed in 3 datacentres with a 10 Gbit network connecting the nodes. The nodes are now running VE 6.3.2 and had one upgrade from some 5.x version somewhere in the middle of 2020, a few weeks before any problems started.

We started moving our clients from VMware to Proxmox in February 2020, mostly reinstalling their systems completely so no (VMware) drivers were left behind. After moving about 90% we started to have performance issues. On the working days (mostly Thursdays) the system would get slow starting around 12:00, and from ~15:00 it would be too slow to work at all, and we would get complaints from everyone. There was no increase in the number of users at those times compared to the start of the day.
We started looking for problems together with our supplier. We measured (from within Windows) and they measured (Windows and Proxmox) disk performance, network performance, Active Directory, DNS (on the AD), and never found anything conclusive. We had an external person with 15+ years of Linux experience and a small Proxmox cluster of his own look at the system. He told us: "if I look at what I measure, you have no problems at all, but clearly, looking at working on the system, you have a problem." After adding NVMe to all compute nodes with no real improvement, around the end of October we told the supplier we would go back to VMware, as we were starting to lose clients and were no closer to a solution (3 weeks of little sleep). This made them put in extra effort, and they said they had found some problem with networking under heavy CPU load on Windows, but they were not sure if this was a real problem as the CPUs were never at 100%. They added a small extra compute node to the cluster and the system started performing okay again.

We finished moving the last clients, added some new ones, and after that, before running into the same problem again, we got 2 extra nodes (48 x AMD EPYC 7272 12-core, 2 sockets / 768 GiB mem) and added them to the cluster. Then Christmas and so on started.

Right now we can continue, and talking to other companies we know we made the right choice with Proxmox, but we are not relaxed, because we have no idea when we will reach the same point where everything starts to crumble all over again, and we don't know what to monitor and watch to make predictions and decisions.

As said, we measured a lot, even in the middle of the night, and never found anything that was always slow or always high. The network performance of the Windows VMs is always about 1/5 of that of the Linux machines (iPerf). Inter-node traffic is slower for both, but around the same 1/5 difference applies. Even now, with the system running fine, that is still the case; we tried several types of vNICs, drivers and driver versions during the issues.
The fact that adding a node fixed it makes me think it is not the network between the nodes, but I have no insight into that.
The nodes themselves looked fine through the eyes of the supplier (and to me, with too little experience), as their monitoring didn't give a warning, and they only added the node after 3 weeks of problems. So there were no direct indicators that the system was full, IMO.
We have a subscription, but I don't know to what extent the supplier used it, as a lot of the time I got the answer that we would not get help with our "Windows" problems.
The last thing we did at the end of the year was buy a "home" machine with NVMe and install Proxmox on it. That machine is for sure a lot faster than my professional systems. A Windows AD with full Exchange boots in seconds, and a login with management tools loaded takes ~3 seconds. I don't even get that with a fresh Windows install on the new, empty nodes (using the Ceph storage nodes underneath, not NVMe). I understand it is not a real comparison, but it gives me the feeling there is room for (big) improvements somewhere.

My goal is to know when the system is getting close to full capacity, and I understand that I will have to answer a lot of questions that I have to relay to the supplier to check whether those things are already monitored.
The other thing is that I would like to find out why my home system is so much faster.

Thanks for reading, and I will try to reply to your questions as soon as I can while in production :)
 
Can you give us more details on your system configuration?
Are you using Ceph as storage for your VMs?

What you should check, from my perspective ([...] slow starting around 12:00 and from ~15:00 would be too slow to work [...]):

Check nf_conntrack: This connection tracking and limiting system is the bane of many production Ceph clusters, and can be insidious in that everything is fine at first. As cluster topology and client workload grow, mysterious and intermittent connection failures and performance glitches manifest, becoming worse over time and at certain times of day. Check syslog history for table fillage events. You can mitigate this bother by raising nf_conntrack_max to a much higher value via sysctl. Be sure to raise nf_conntrack_buckets accordingly to nf_conntrack_max / 4, which may require action outside of sysctl, e.g. "echo 131072 > /sys/module/nf_conntrack/parameters/hashsize". More interdictive but fussier is to blacklist the associated kernel modules to disable processing altogether. This is fragile in that the modules vary among kernel versions, as does the order in which they must be listed. Even when blacklisted there are situations in which iptables or docker may activate connection tracking anyway, so a "set and forget" strategy for the tunables is advised. On modern systems this will not consume appreciable resources.
Look at: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/
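As a rough sketch of that tuning (the values below are only an example, size them to your cluster and check which persistent sysctl mechanism your distribution uses):

# current usage vs. limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# raise the limit persistently via a sysctl drop-in
echo "net.netfilter.nf_conntrack_max = 524288" > /etc/sysctl.d/90-conntrack.conf
sysctl -p /etc/sysctl.d/90-conntrack.conf

# buckets = max / 4, set via the module parameter as described above
echo 131072 > /sys/module/nf_conntrack/parameters/hashsize

# check the syslog history for table-full events
grep -i "nf_conntrack: table full" /var/log/syslog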

Also check what is configured for scrubbing. This can lead to performance issues if it runs during production hours; see the example below the links.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pveceph_scrub
https://docs.ceph.com/en/nautilus/rados/configuration/osd-config-ref/#scrubbing
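If scrubbing does turn out to overlap with office hours, it can be restricted to a time window, for example (only a sketch; the hours are an example, and `ceph config set` assumes a recent Ceph release such as Nautilus):

# only allow (deep-)scrubbing between 23:00 and 07:00
ceph config set osd osd_scrub_begin_hour 23
ceph config set osd osd_scrub_end_hour 7

# optionally throttle scrubbing against client IO
ceph config set osd osd_scrub_sleep 0.1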

There should be enough free space on Ceph, otherwise it switches to read-only. This depends on your cluster design and on your size and min_size settings. Ceph requires free disk space to move storage chunks, called placement groups, between different disks. As this free space is so critical to the underlying functionality, Ceph will go into HEALTH_WARN once any OSD reaches the near_full ratio (generally 85% full), and will stop write operations on the cluster by entering the HEALTH_ERR state once an OSD reaches the full_ratio. However, unless your cluster is perfectly balanced across all OSDs, there is likely much more capacity available, as OSDs are typically unevenly utilized.
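To see how full and how evenly balanced the OSDs actually are, something like this on a node with Ceph admin access (standard Ceph commands, nothing cluster-specific assumed):

# overall raw and per-pool usage
ceph df

# per-OSD utilisation, variance and PG count
ceph osd df tree

# the currently configured nearfull/backfillfull/full ratios
ceph osd dump | grep ratio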
 
Hi Christian,

Thank you for helping. I relayed your questions and hope they will get back to me, but so far no reply. The storage is just over 50% full, so I don't expect any trouble there. Right now we don't have the performance drops anymore after adding a node, so I think it never was the Ceph storage or the 3 dedicated storage nodes running it, especially since the VMs running on the NVMe drives inside the compute nodes aren't extremely quick either.

I don't have access to the CLI. Is there a way to see how much IO the NVMe drives are doing in the web interface? I wonder if they are busy or not reaching their potential for some reason.
 
Check nf_conntrack: This connection tracking and limiting system is the bane of many production Ceph clusters, and can be insidious in that everything is fine at first. As cluster topology and client workload grow, mysterious and intermittent connection failures and performance glitches manifest, becoming worse over time and at certain times of day. Check syslog history for table fillage events. You can mitigate this bother by raising nf_conntrack_max to a much higher value via sysctl. Be sure to raise nf_conntrack_buckets accordingly to nf_conntrack_max / 4, which may require action outside of sysctl, e.g. "echo 131072 > /sys/module/nf_conntrack/parameters/hashsize". More interdictive but fussier is to blacklist the associated kernel modules to disable processing altogether. This is fragile in that the modules vary among kernel versions, as does the order in which they must be listed. Even when blacklisted there are situations in which iptables or docker may activate connection tracking anyway, so a "set and forget" strategy for the tunables is advised. On modern systems this will not consume appreciable resources.
Look at: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/
I don't think it's related, but you can define these directly in the pve-firewall options. (When the conntrack table is full, you can't accept new connections, but it will not get slower.) 5 million entries are around 500 MB, so yes, you can bump it (if you have a lot of public VMs with a lot of connections).
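For reference, a sketch of how that looks in the host firewall options (file path and option name as I understand the pve-firewall docs, the value is only an example; it can also be set under the node's Firewall -> Options in the GUI):

# /etc/pve/nodes/<nodename>/host.fw
[OPTIONS]
nf_conntrack_max: 1048576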
 
@CoMox, can you send a Windows VM config file? (/etc/pve/qemu-server/<vmid>.conf)


do you have any stats about disk latency/throughput, network latency/throughput, cpu/ram when the problem occurs?
agent: 1
bootdisk: scsi0
cores: 8
memory: 16384
name: RDS51
net0: virtio=mac,bridge=vmbr999,tag=1
net1: virtio=mac,bridge=vmbr999,tag=2
numa: 0
ostype: win10
scsi0: nvme_local:vm-203-disk-0,backup=0,cache=unsafe,discard=on,iops_rd=4500,iops_rd_max=5000,iops_wr=4500,iops_wr_max=5000,size=60G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=
sockets: 1
vmgenid:

I am too new to Proxmox and wasn't counting on needing this, as we have a supplier for all of this, so I don't have any stats from within Proxmox yet. I know that we could not find anything inside Windows that was causing it. I'm not saying it is not somewhere in there in combination with the change of platform, but so far we could not find it there.
 
@CoMox, can you send a Windows VM config file? (/etc/pve/qemu-server/<vmid>.conf)


do you have any stats about disk latency/throughput, network latency/throughput, cpu/ram when the problem occurs?
Hi spirit, how can I see the current IOPS to a storage (2x local NVMe as one store)? I don't have an overview in the GUI and could not find a command so far.
 
On the working days (mostly Thursdays) the system would get slow starting around 12:00, and from ~15:00 it would be too slow to work at all, and we would get complaints from everyone.

Do you have backups scheduled at this time?
 
agent: 1
bootdisk: scsi0
cores: 8
memory: 16384
name: RDS51
net0: virtio=mac,bridge=vmbr999,tag=1
net1: virtio=mac,bridge=vmbr999,tag=2
numa: 0
ostype: win10
scsi0: nvme_local:vm-203-disk-0,backup=0,cache=unsafe,discard=on,iops_rd=4500,iops_rd_max=5000,iops_wr=4500,iops_wr_max=5000,size=60G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=
sockets: 1
vmgenid:

I am too new to Proxmox and wasn't counting on needing this, as we have a supplier for all of this, so I don't have any stats from within Proxmox yet. I know that we could not find anything inside Windows that was causing it. I'm not saying it is not somewhere in there in combination with the change of platform, but so far we could not find it there.

iops_rd=4500,iops_rd_max=5000,iops_wr=4500,iops_wr_max=5000,

I never modify these, so I have no idea what is reasonable, but do all the VMs have these restrictions set? If so, have you tried removing them? Do you have the same restrictions set on your "home" machine?

Clear this with your supplier first... but for monitoring system performance you'll need to open a shell on the physical host or log in via SSH, and then install a couple of monitoring tools. You'll want to run the monitor commands first to get a baseline on the slow system. Make changes, then run them again to check for improvements. Repeat. Run the following 2 commands as root to install the monitors:

# install the sys stat package, iotop, and nmon packages
apt install sysstat iotop nmon -y

# this one installs quite a few packages...
apt install blktrace -y


# iostat - run IO stat to monitor read/writes to the disks - lots of display options for iostat
iostat 1 2000
iostat -dxz 1

# vmstat - run vmstat, which monitors swapping (bad), memory, block IO, traps, disks, CPU, and others
vmstat 1 20000
vmstat -d 2000 1
vmstat -t -w -d 2 1000

# iotop - like the monitor called "top", but for disk I/O; after running the command, hit o to see only active processes
iotop

# nmon - try the various options displayed on the start screen
nmon

# blktrace - is a specialized utility for tracing block I/O events


So there are a handful of monitors; google them for more information or check this page. It might be good to run them on your "home" machine while stressing the server, note the various performance indicators and processes, then repeat on your cluster. They are not equivalent systems, but you may find a crack that you can dig into.
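If you want numbers you can compare directly between the "home" machine and the cluster nodes, a synthetic benchmark like fio may also help (not one of the tools above, just a suggestion; the file path and sizes are placeholders, and run it on storage you can afford to stress):

apt install fio -y

# 4k random read/write test, 60 seconds, on a scratch file
fio --name=randrw-test --filename=/root/fio-test.file --size=4G --direct=1 \
    --ioengine=libaio --bs=4k --iodepth=32 --rw=randrw --rwmixread=70 \
    --runtime=60 --time_based --group_reporting

# clean up the scratch file afterwards
rm /root/fio-test.file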
 
scsi0: nvme_local:vm-203-disk-0,backup=0,cache=unsafe,discard=on,iops_rd=4500,iops_rd_max=5000,iops_wr=4500,iops_wr_max=5000,size=60G,ssd=1

cache=unsafe ???? Never do that in production (maybe on a swap disk); you can corrupt your VM in case of a power-off.
Use the default cache=none for your NVMe.
With cache=unsafe, writes go to host memory but are not flushed yet, and you can also get slowdowns when the host flushes its memory.


You can also remove your iops limits on the NVMe.
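Both can be changed in one go with qm set, something like this (only a sketch; the VMID 203 and the volume name are taken from the config posted above, double-check the full option string against your current config before applying):

# switch to cache=none and drop the iops limits on that disk
qm set 203 --scsi0 nvme_local:vm-203-disk-0,cache=none,discard=on,ssd=1,backup=0,size=60G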


You can also enable NUMA; it could help with mapping the guest to the host NUMA nodes.

As you have big servers with a lot of memory && cores, it could be interesting to look at numastat on the host:

#apt install numactl
#numastat
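Enabling it per VM is just a flag (203 again being the VMID from the config above; as far as I know the VM needs a full stop/start afterwards):

# expose a NUMA topology to the guest
qm set 203 --numa 1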
 
On the working days (mostly Thursdays) the system would get slow starting around 12:00, and from ~15:00 it would be too slow to work at all, and we would get complaints from everyone.

Do you have backups scheduled at this time?
No, we have no backups running at that time. And the supplier let us know that they have the default settings for scrubbing on Ceph, so that should be 23:00 - 7:00 as I understand it.
 
cache=unsafe ???? Never do that in production (maybe on a swap disk); you can corrupt your VM in case of a power-off.
Use the default cache=none for your NVMe.
With cache=unsafe, writes go to host memory but are not flushed yet, and you can also get slowdowns when the host flushes its memory.


You can also remove your iops limits on the NVMe.



You can also enable NUMA; it could help with mapping the guest to the host NUMA nodes.

As you have big servers with a lot of memory && cores, it could be interesting to look at numastat on the host:

#apt install numactl
#numastat
We can afford to lose these servers on a power-off. Users will only lose their in-memory work and will be redirected to other servers directly, so no worries for these servers with 'unsafe'. Other servers have other settings according to their role. But thank you for the warning.

We removed the iops limits during the problems. We got the advice from the supplier to set a limit, but with no real advice on how high. Does anyone who does this have a rule of thumb for it?

I will look into 'numa'. But wouldn't this be something that should be enabled at the Proxmox level so it handles this part, or is this something that the VM will always handle?
 
iops_rd=4500,iops_rd_max=5000,iops_wr=4500,iops_wr_max=5000,

I never modify these, so I have no idea what is reasonable, but do all the VMs have these restrictions set? If so, have you tried removing them? Do you have the same restrictions set on your "home" machine?

Clear this with your supplier first... but for monitoring system performance you'll need to open a shell on the physical host or log in via SSH, and then install a couple of monitoring tools. You'll want to run the monitor commands first to get a baseline on the slow system. Make changes, then run them again to check for improvements. Repeat. Run the following 2 commands as root to install the monitors:

# install the sys stat package, iotop, and nmon packages
apt install sysstat iotop nmon -y

# this one installs quite a few packages...
apt install blktrace -y


# iostat - run IO stat to monitor read/writes to the disks - lots of display options for iostat
iostat 1 2000
iostat -dxz 1

# vmstat - run vmstat, which monitors swapping (bad), memory, block IO, traps, disks, CPU, and others
vmstat 1 20000
vmstat -d 2000 1
vmstat -t -w -d 2 1000

# iotop - like the monitor called "top", but for disk I/O; after running the command, hit o to see only active processes
iotop

# nmon - try the various options displayed on the start screen
nmon

# blktrace - is a specialized utility for tracing block I/O events


So there are a handful of monitors; google them for more information or check this page. It might be good to run them on your "home" machine while stressing the server, note the various performance indicators and processes, then repeat on your cluster. They are not equivalent systems, but you may find a crack that you can dig into.
I removed the restrictions on the machines I am testing with, but I will check whether we should remove them everywhere or set a much bigger value.

Some of the tools I have already tried on my machine, but some I haven't used so far. I will look into all of them and study the results. It will take me some time, but I will get back to you.

Thanks!
 
Very basic questions:
- what is the current (physical) CPU configuration (total number of cores and frequency) you are running? I had read 80c @ 2 GHz; is this still the current situation?
- how many VMs do you run in the infrastructure?
- are they all built the same (8 vCPU)?
- what's the total number of virtual CPUs assigned?
 
No, we have no backups running at that time. And the supplier let us know that they have the default settings for scrubbing on Ceph, so that should be 23:00 - 7:00 as I understand it.

Thought I'd check :)

I removed the restrictions on the machines I am testing with, but I will check whether we should remove them everywhere or set a much bigger value.

Some of the tools I have already tried on my machine, but some I haven't used so far. I will look into all of them and study the results. It will take me some time, but I will get back to you.

Thanks!

Reading through all your posts, the performance issues already existed when those restrictions were not present, so I'm only now realizing that removing them likely will not have an impact. But my instinct says that a lot was changed, so you may want to retest by removing the restrictions from your test VMs and see if that helps.

Also, tell us more about the remote desktop environment. How many Win10 instances are you running per Proxmox node? How are clients connecting to these remote desktops? I assume RDP and not VNC? Over what network, local network or over the internet? I assume you've read this guide:
I assume you took a look at the performance tools in Windows; did they give any indication of where the bottleneck could be? Are the Win10 systems all scheduled to perform search indexing at the same time, or do they all install updates or run virus scans around the same time, or is other resource-intensive work scheduled around the same time? How does the impact manifest itself? Does the CPU peg at 100%, and is there ample memory, or is Windows swapping a lot, or is the NIC saturated? I'd think the Win10 performance tools could provide insight.

Very interesting that networking runs at 20% of max speed. I have noticed that Windows will say one thing, but in reality it isn't true. I have an old WinXP VM running and it says the ethernet connection is 10 MB/s, but I can transfer data on/off that VM at over 100 MB/s. I assume something in the virtio drivers is causing WinXP to report the wrong connection speed.

Last question, is there a way to simulate a similar workload on the Win10 clients to force the performance issues to appear? Much easier to debug by forcing the issues to appear than to set traps and wait.

Lastly, you have a license for Proxmox which entitles you to some level of support (unclear which one), so I would open a ticket with them. I assume they can help confirm the Proxmox layer is configured properly and is not the root source of your performance issues. They might also have updated best practices for Win10 clients.
 
Hi All,
I am in contact with someone who has been using Proxmox for many years now, and he will have a look at the system. A first simple test shows his system is 3x faster, so I hope he can boost ours to that level :) I will report back when that is done, whatever the result may be.
Thank you all so far for helping me!
 
Hi All,
The guy still hasn't found time to have a look at the system, so there is no progress on debugging.

But on 12 February we noticed the system performing great. It became fast and hasn't gone slow since. I asked the supplier and they said "no changes done". I found that the PVE hosts were updated on the 10th of February (apt dist-upgrade in history.log), but I don't know if that is the cause of the performance improvement, as we only noticed it 2 days later. Servers that had disk latencies peaking at 120 ms at least once every minute now never go above 40 ms. So: measurable improvements, and good enough for a fast system.

It still worries me that we never found anything in the monitoring to look at for future planning, not even improvements now that all is fast. But we are going to enjoy it and hope it never returns.

Thanks again to all for trying to help solve this.
 
