OK everyone! I've been an avid Proxmox user since 1.x, and I really love the platform. That being said, it's been a HUGE pain trying to figure things out, and now, with questions about performance and, well, the overall sensibility of my setup, I'm asking for some external eyes.
I'd like to lay out how my current production environment is set up, and openly ask what I should be doing and how to do it.
My Current Setup
Currently, I'm running one cluster of 9 hosts, all Dell PowerEdge rack servers of various ages/hardware. Each host has two drive sets. The primary is 2 smaller drives in RAID1, and the secondary is 4+ drives in RAID10. The primary holds the Proxmox installation, and the secondary is mounted as ext4 with local qcow2 files. Drives are mostly 7200RPM SAS and SATA.
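For reference, the RAID10 array shows up as a plain directory storage in /etc/pve/storage.cfg, roughly like this (the storage name and mount point here are placeholders, not my actual values):

    dir: raid10-local
        path /mnt/raid10
        content images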
A single backup server exposes an NFS share, and, using Ayufan's differential backups (https://ayufan.eu/projects/proxmox-ve-differential-backups/), I back up a few times a week over a 1Gb network.
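The NFS target is defined in storage.cfg along these lines (server and export path are placeholders):

    nfs: backup-nfs
        server 192.168.1.250
        export /export/backups
        path /mnt/pve/backup-nfs
        content backup
        maxfiles 3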
All of my VMs are Windows 2008/2012r2 guests with 1 CPU/XX cores depending on need, using the KVM64 CPU type. I use VirtIO for disk and network, and ballooning for memory. Writeback cache is enabled for all drives, and best practices are followed in VM setup.
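To make that concrete, a typical /etc/pve/qemu-server/<vmid>.conf on my hosts looks roughly like the sketch below (VMID, MAC, and sizes are illustrative, not a real guest):

    ostype: win8
    cpu: kvm64
    sockets: 1
    cores: 4
    memory: 8192
    balloon: 4096
    net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr1,tag=10
    virtio0: raid10-local:101/vm-101-disk-1.qcow2,cache=writeback
    bootdisk: virtio0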
Guests are mostly file services/AD, software (database-driven), and terminal services.
Each server uses 2 network interfaces. vmbr0 is tied to eth0 and sits on the management LAN for my servers. vmbr1 is eth1, trunked with VLAN support for my guests.
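For completeness, /etc/network/interfaces on each host is essentially this (addresses are placeholders); the guests get their VLAN tags on their virtual NICs:

    auto vmbr0
    iface vmbr0 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        gateway 10.10.10.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

    auto vmbr1
    iface vmbr1 inet manual
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0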
There are no other "tweaks" being done on these hosts. CPU units/vCPU count is default. I turned off "Use tablet for pointer" under Options per best practices, and that's it. I've read and tried tons of random tweaks/settings, but nothing of note. If you have some that NEED to be used, please let me know what they are and how they help. Thanks!
My Problems
Well, I do run into issues from time to time.
1. CLUSTER - My cluster is super finicky. It loves to fall apart. I've had this issue for years, and it's such a pain. I just did a clean install of 4.1 this weekend, and, fingers crossed, the cluster is fine so far. (The checks I run when it breaks are sketched below this list.)
2. IOWAIT - 99.9999% of the time, this is the bottleneck for me. I run only a VERY small number of VMs per host (seriously, 2-3 at most) because IOWAIT can and will be a HUGE pain. I don't use LVM, which I think would help with this, because, frankly, I need some simple assistance in understanding and deploying it. More on that below. (How I measure the problem is also sketched below this list.)
3. BACKUP SPEEDS - When backing up to my NFS server, which, BTW, runs Proxmox so it can act as a test server/emergency backup if a VM or host dies, I average 100MB/s on our 1Gb network. Then, the backup process will randomly drop to 5MB/s and STAY THERE. I've often come in the next morning and found a guest still backing up when it should have taken 30 minutes max. (The vzdump tuning I've been eyeing is below the list as well.)
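For the cluster problem, these are the checks I run when things fall apart (hostnames are examples); corosync depends on working multicast, so the omping run across all nodes at once is the big one:

    # membership and quorum as each node sees it
    pvecm status
    pvecm nodes
    # multicast sanity check; run simultaneously on every node for ~10 minutes
    omping -c 600 -i 1 -q node1 node2 node3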
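For IOWAIT, this is how I measure it today; the fsyncs/second figure from pveperf is what tanks on my spinning disks (the path assumes my RAID10 mount):

    # per-device utilization and average wait, refreshed every 5 seconds
    apt-get install sysstat
    iostat -xm 5
    # Proxmox's built-in quick benchmark against the VM storage
    pveperf /mnt/raid10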
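For the backup stalls, the knobs I've been eyeing (but haven't committed to) live in /etc/vzdump.conf; the values below are guesses on my part, not tested settings:

    # cap backup bandwidth (KB/s) so one job can't choke the link
    bwlimit: 80000
    # run the backup's disk I/O at the lowest priority
    ionice: 7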
My Questions
One request: without degrading my choices (I was making the best decisions I could based on my understanding), I need feedback, and more so, help.
I don't know/understand LVM well enough. I know how to set it up for local storage. But, with my current config of each host server housing the files of its guest VMs, how do I snapshot to another server? Does that even work with NFS?
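From what I've read, a snapshot-and-ship workflow would look something like the sketch below, but I'd love confirmation that this is sane (the VG/LV names and target host are made up):

    # create a temporary copy-on-write snapshot of the guest's LV
    lvcreate --size 10G --snapshot --name vm-101-snap /dev/vg0/vm-101-disk-1
    # stream it to another box over SSH, then drop the snapshot
    dd if=/dev/vg0/vm-101-snap bs=1M | ssh backuphost "dd of=/backup/vm-101-disk-1.raw bs=1M"
    lvremove -f /dev/vg0/vm-101-snap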
When a VM gets hung/locked by a long backup process, how do you resolve it? Or even get notified that it happened?
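Right now my only recovery is manual, roughly the following (the VMID is an example); is there something better, or at least a way to get alerted?

    # find and kill the stuck vzdump worker
    ps aux | grep vzdump
    kill <PID>
    # clear the backup lock vzdump left on the guest
    qm unlock 101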
With LVM, if it's the solution to my IOWAIT, do I continue with local storage, or is there a better way? If I use network storage, what's best? I'm on a 1Gb network; is that sufficient? I know nothing of NAS/iSCSI/DRBD, but I'm willing to learn. Seriously.
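Related: is a raw throughput test like this even the right way to judge whether 1Gb is enough (the IP is a placeholder for the storage box)?

    # on the storage server
    iperf -s
    # on a Proxmox host; ~940 Mbit/s is about the ceiling for 1GbE
    iperf -c 192.168.1.250 -t 30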
What am I missing? Besides the above, is there anything I SHOULD be doing, or shouldn't be?
Please, if you have suggestions and corrections, give me direction to investigate/resolve them. I'm self-taught and willing to dig in, but I might have to catch up a little on some things. I really want to maximize the equipment I have to deliver the best performance, and I feel I'm falling short.