OK, let's start off with some background. We have been using Proxmox since v1.x, and quite successfully I might add. We've been happy with the performance, usability, features, the whole nine yards.
Our Current "production" setup:
11 Nodes total
6x Nodes = 2x16-Core AMD Opteron w/ 256GB of memory - storage is 2x SSDs mirrored for System, and 6x SSDs in RAID5 for images, all connected to LSI RAID controllers
2x Nodes = 2x Intel E5649 CPUs w/ 96GB of memory (MB Limited) and 4x SSDs in RAID5 connected to LSI controller
2x Nodes = 2x Intel E5-2687W w/ 256GB of memory - storage is 2x SSDs mirrored for System, and 6x SSDs in RAID5 for images, all connected to LSI RAID controllers
1x Node = 1x Intel Xeon 5420 w 16GB of memory - storage is 4x 3TB WD Red drives with a highpoint RAID controller (This node is used only for testing and template cloning)
All servers have either 2x or 4x 1GB network connections (Depending on chassis), all bonded using LACP w/ mLAG across multiple Cisco devices (allowing for the best combo of speed and reliability).
All image storage is local on this setup.
As I mentioned, this has worked very well for us for multiple years. Currently, we are running 178 KVM machines on this setup.
As you might imagine, we're beginning to outgrow our hardware, so beginning last year, we embarked on an initiative to build the next iteration of our cloud.
The new hardware is as follows:
4x nodes = 2x Intel E5-2667v2 w/ 512GB of memory, 8x SSDs in RAID5 attached to an LSI RAID controller, and 6x Intel 10GbE interfaces
1x node = 2x Intel E5-2687W w/ 512GB of memory, 8x SSDs in RAID5 attached to an LSI RAID controller, and 4x Intel 10GbE interfaces
4x Ceph storage nodes = 1x Intel E5-2667v2 w/ 256GB of memory, 24x 1TB SSDs attached in pass-through mode for OSDs (plus 2x SSDs mirrored for the OS), and 10x Intel 10GbE interfaces
In addition, all of the previous nodes will be migrated into this infrastructure and upgraded to 10GbE networking (once all VMs have been moved off them).
Ceph is managed on its own: it uses dedicated MONs (5 of them) and is administered with Inktank's Calamari toolset. The Proxmox servers are NOT managing Ceph.
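For context, the Proxmox side just consumes the cluster as plain external RBD storage. The definition in /etc/pve/storage.cfg looks roughly like the sketch below (the storage name, pool, and monitor IPs are placeholders, not our real values):

rbd: ceph-ssd
        monhost 10.10.10.201 10.10.10.202 10.10.10.203
        pool rbd
        username admin
        content images

The client keyring sits in /etc/pve/priv/ceph/ceph-ssd.keyring to match the storage name, so Proxmox only ever acts as an RBD client.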
All networking is connected using LACP + mLAG bonding across multiple switches for performance and redundancy.
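The bonds themselves are nothing exotic. A minimal sketch of one hypervisor-side /etc/network/interfaces stanza is below (interface names, the address, and the hash policy are illustrative, not copied from our config):

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

The switch side is simply a matching LACP port-channel spread across the mLAG pair.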
After about 3 weeks of testing and tweaking, we have the new cluster working at the performance levels we wanted. We've not only been able to average about 1.4GB/s on individual VMs (both read and write), but have also been able to sustain over 600MB/s on as many as 8 VMs simultaneously. Needless to say, we have the cluster in a very usable state.
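If anyone wants to compare against their own cluster, large-block sequential fio runs inside a test guest are a reasonable way to reproduce figures in this ballpark; something like the following, where the parameters are illustrative and /dev/vdb stands for an unused virtio scratch disk (the write test will destroy whatever is on it):

fio --name=seq-read --filename=/dev/vdb --direct=1 --rw=read --bs=4M --iodepth=32 --ioengine=libaio --runtime=60 --time_based --group_reporting
fio --name=seq-write --filename=/dev/vdb --direct=1 --rw=write --bs=4M --iodepth=32 --ioengine=libaio --runtime=60 --time_based --group_reporting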
However, and here's the rub, we've been unable to get anything even remotely close to acceptable performance when running backups or restores of VMs on this setup. We have a specific VM on current production that we back up every hour. On the 1GbE network, backing up to NFS storage, we get about 70MB/s, and the 64GB VM takes just over 14 minutes every time. However, when I put a copy of the same VM on the Ceph-backed infrastructure, the best we can get is 12MB/s, and the same backup takes almost 2 hours. It's not the network: we've tested the same VM using local image storage and were able to back up to 10GbE NFS shares at over 700MB/s.
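For reference, the backup itself is nothing special; the hourly job boils down to a plain vzdump call along the lines of the one below (the VMID and storage name are placeholders). The command is the same in every test; the only real variables are whether the VM's disk image lives on local RAID or on Ceph RBD, and which NFS export it writes to.

vzdump 100 --storage backup-nfs --mode snapshot --compress lzo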
I've done quite a bit of research over the past few weeks, and the issue seems to be isolated to how KVM reads data in 64KB blocks when performing its internal backup functionality. I understand that this is not easily changed (I even checked out the source myself and confirmed this would be a BIG thing to change), but the 64KB reads are definitely the culprit. Ceph stores data in 4MB objects, and each 64KB read ends up pulling a 4MB object. (In very rough worst-case math, that's 64x read amplification, so Ceph is reading on the order of 4TB to back up a 64GB VM.) This also hammers the IOPS, which in turn leads to the slowdown.
So, my question, after all the back-story, is this: is there any way to resolve this issue? One of the tricks used to speed up much of Ceph's own performance was to use read-ahead buffers; however, I cannot find any information on how to turn these on for vzdump. I've also seen Tom from Proxmox mention that he runs backups out of Ceph nightly; I'm curious, if he sees this, whether he has any pointers for this situation. We currently back up at least about 30% of our VMs on a nightly basis, but at the speeds our testing has shown, this would be impossible using Ceph. Also of note, we have tested Ceph's own tools for exporting and importing images, but since running them against a live VM can lead to data corruption, that is not a viable solution for us.
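To be specific about what I mean by read-ahead buffers: I'm referring to the librbd knobs below, set in ceph.conf on the client/hypervisor side (assuming librbd there picks up /etc/ceph/ceph.conf). The values are only examples, not a tuned recommendation, and whether the vzdump/KVM backup read path actually benefits from them is exactly the part I can't pin down.

[client]
        rbd cache = true
        rbd readahead trigger requests = 10
        rbd readahead max bytes = 4194304
        rbd readahead disable after bytes = 0

Setting rbd readahead disable after bytes = 0 is supposed to keep read-ahead active for the life of the image rather than only during guest boot, but I haven't been able to confirm it helps the backup case.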
Any tips or assistance would be greatly appreciated. I'm at the point where my Google searches are no longer turning up any new information, and I feel like I've read the entire internet by now ;-)
Our Current "production" setup:
11 Nodes total
6x Nodes = 2x16-Core AMD Opteron w/ 256GB of memory - storage is 2x SSDs mirrored for System, and 6x SSDs in RAID5 for images, all connected to LSI RAID controllers
2x Nodes = 2x Intel E5649 CPUs w/ 96GB of memory (MB Limited) and 4x SSDs in RAID5 connected to LSI controller
2x Nodes = 2x Intel E5-2687W w/ 256GB of memory - storage is 2x SSDs mirrored for System, and 6x SSDs in RAID5 for images, all connected to LSI RAID controllers
1x Node = 1x Intel Xeon 5420 w 16GB of memory - storage is 4x 3TB WD Red drives with a highpoint RAID controller (This node is used only for testing and template cloning)
All servers have either 2x or 4x 1GB network connections (Depending on chassis), all bonded using LACP w/ mLAG across multiple Cisco devices (allowing for the best combo of speed and reliability).
All image storage is local on this setup.
As I mentioned, this has worked very well for us for multiple years. Currently, we are running 178 KVM machines on this setup.
As you might imagine, we're beginning to outgrow our hardware, so beginning last year, we embarked on an initiative to build the next iteration of our cloud.
The new hardware is as follows:
4x Nodes = 2x Intel E5-2667v2 w/ 512 GB of memory and 8x SSDs in RAID5 attached to an LSI RAID controller and 6x Intel 10Gbe interfaces
1x Node = 2x Intel E5-2687W w/ 512GB of memory and 8x SSDs in RAID5 attached to an LSI RAID controller and 4x Intel 10Gbe interfaces
4x Ceph Storage Nodes = 1x Intel E5-2667v2 w/ 256GB of memory and 24x 1TB SSDs attached in pass-through mode for OSDs (2x SSDs mirrored for OS) and 10x Intel 10Gbe Interfaces
In addition, all of the previous Nodes will be migrated into this infrastructure and upgraded to 10Gbe networking (Once all VMs have been moved)
Ceph is managed on it's own, and uses dedicated MONs (5 of them) and is managed using Inktank's calamari toolset. Proxmox servers are NOT managing ceph.
All networking is connected using LACP + mLAG bonding across multiple switches for performance and redundancy.
After about 3 weeks of testing and tweaking, we have the new cluster working at the performance levels we wanted. We've not only been able to avg about 1.4GB/s performace on individual VMs (Both read and write), but have also been able to sustain over 600MB/s on as many as 8 VMs simultaneously. Needless to say, we have the cluster to a very useable state.
However, and here's the rub, we've been unable to get anything even remotely close to acceptable performance running backups or restores of VMs using this setup. We have a specific VM on current production that we backup every hour. On the 1Gig network, backing up to NFS storage, we get about 70MB/s and it takes just over 14 mins for the 64GB VM every time. However, when I put a copy of the same VM over on the Ceph-backed infrastructure, the best we can get is 12MB/s and it take almost 2 hours to run the same backup. It's not the network, as we've tested the same VM using local image storage, and were able to backup to 10Gbe NFS shares at over 700MB/s.
I've done quite a bit of research over the past few weeks, and have found that the issues seems to be isolated to how KVM reads 64kb blocks when performing it's internal backup functionality. I understand that this is not easily changed (even checked out the source myself and confirmed this is a BIG thing to change), but the 64kb reads are definitely the culprit. Ceph uses 4MB "blocks" to storage data, and each 64kb read is actually reading a 4MB block. (in very rough math, this means Ceph is reading 625GB to backup a 64GB VM). This is also decimating the IOPS situation, which in turn leads to the slowdown.
So, my question, after all the back-story, is this... Is there any way to resolve this issue? One of the tricks used to speed up much of the performance of ceph itself was to use read-ahead buffers. However, I cannot find any information as to how to turn these on for vzdump. I've also seen Tom from Proxmox themselves mention that he runs backups out of Ceph nightly. I'm curious, if he sees this, if he has any pointers that he uses to help with this situation. We currently backups about 30% of our VMs on a nightly basis at a minimum, however with the speed our testing has shown, this would be impossible using Ceph. Also to note, we have tested out using ceph's own tools for exporting and importing images, but since this can lead to data-corruption when run against running VMs, this is not a viable solution.
Any tips or assistance would be greatly appreciated. I'm at the point where my google searches are no longer turning up any new information and feel like I've read the entire internet by now ;-)