Hi all,
I've been running Proxmox for a number of years and now have a 13-node cluster. Last year I added Ceph to the mix (after the 5.2 upgrade), using the empty drive bays in some of the Proxmox nodes. Last Friday I upgraded all nodes to version 5.3.
The Ceph system has always felt slow, and I've never really figured out what the read/write/IOPS numbers are supposed to be telling me about the cluster's performance. Now I'm regularly seeing "slow requests are blocked" warnings, and what concerns me is that when it happens, the blocked count sometimes climbs to around 2000 before dropping back to zero (at which point the warning goes away). Earlier today it got way out of hand, up in the 7k range, and not just slow requests but [whatever the next step up from slow is]. After about 30 minutes it all settled back down.
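To get a better handle on when these spikes happen, I'm thinking of simply logging the blocked-request count over time so I can line it up against guest activity later. Here's a rough sketch of what I mean; it parses the summary text from `ceph health detail`, and the health-check names and wording are what I see on Luminous, so treat those details as assumptions:

```python
#!/usr/bin/env python3
"""Log the blocked/slow request count over time so spikes can be lined up
against guest workload later. Rough sketch only: the health-check names and
summary wording are from Luminous and may differ on other releases."""

import json
import re
import subprocess
import time

INTERVAL = 30  # seconds between samples

while True:
    out = subprocess.check_output(
        ["ceph", "health", "detail", "--format", "json"],
        universal_newlines=True,
    )
    health = json.loads(out)

    blocked = 0
    # On Luminous the REQUEST_SLOW summary reads roughly
    # "123 slow requests are blocked > 32 sec"; pull out the leading number.
    for name, check in health.get("checks", {}).items():
        if name in ("REQUEST_SLOW", "REQUEST_STUCK"):
            message = check.get("summary", {}).get("message", "")
            match = re.match(r"(\d+)", message)
            if match:
                blocked += int(match.group(1))

    print("{} blocked={}".format(time.strftime("%F %T"), blocked), flush=True)
    time.sleep(INTERVAL)
```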
I have three servers each hosting five 5TB OSDs, and three other servers each hosting three 4TB OSDs. All nodes have a dedicated 10G network for Ceph traffic. Using iperf, the network appears to be running at full speed (with an MTU of 9000). Across all the nodes, I have about 60 guests, all using the Ceph pool for their disk image storage. I don't want to get too deep into the specifics of my setup, to avoid rabbit holes and keep the focus on sound troubleshooting technique, but if there is something important to know, I can answer questions.
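One thing I still want to rule out on the network side is a jumbo-frame mismatch somewhere along the path, since an iperf throughput test can look fine even if some hop silently fragments or drops large frames. Something like this quick don't-fragment ping sweep is what I have in mind (the addresses are placeholders for my actual Ceph-network IPs):

```python
#!/usr/bin/env python3
"""End-to-end jumbo-frame check on the Ceph network: send don't-fragment
pings sized for a 9000 MTU (8972 bytes of ICMP payload + 28 bytes of
headers) to every other node. The address list is a placeholder for my
actual Ceph-network IPs."""

import subprocess

CEPH_HOSTS = ["10.10.10.{}".format(i) for i in range(1, 14)]  # placeholder addresses
PAYLOAD = 9000 - 28  # 8972: largest ICMP payload that fits in one 9000-byte frame

for host in CEPH_HOSTS:
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "3", "-s", str(PAYLOAD), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    status = "ok" if result.returncode == 0 else "FAILED (fragmentation needed?)"
    print("{:<15} jumbo frames: {}".format(host, status))
```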
What can I monitor to troubleshoot this issue? What things do I need to consider? I haven't really found any good documentation laying out solid troubleshooting steps, just lots of posts dealing with specific issues, usually followed up with "I fixed it with a server reboot".
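For what it's worth, one concrete metric I'm considering watching is per-OSD commit/apply latency from `ceph osd perf`, to see whether the slow requests track a particular OSD or host. A rough sketch of that idea follows; the JSON field names are taken from the Luminous output on my cluster and may differ elsewhere, and the 100 ms threshold is just an arbitrary starting point:

```python
#!/usr/bin/env python3
"""Sample per-OSD commit/apply latency from 'ceph osd perf' and flag outliers,
to see whether slow requests follow a particular OSD or host. The JSON field
names ('osd_perf_infos', 'perf_stats', ...) are from the Luminous output and
may differ on other releases; the threshold is an arbitrary starting point."""

import json
import subprocess
import time

THRESHOLD_MS = 100  # flag anything slower than this
INTERVAL = 30       # seconds between samples

while True:
    out = subprocess.check_output(
        ["ceph", "osd", "perf", "--format", "json"], universal_newlines=True
    )
    perf = json.loads(out)

    for osd in perf.get("osd_perf_infos", []):
        stats = osd.get("perf_stats", {})
        commit = stats.get("commit_latency_ms", 0)
        apply_ms = stats.get("apply_latency_ms", 0)
        if commit > THRESHOLD_MS or apply_ms > THRESHOLD_MS:
            print("{} osd.{} commit={}ms apply={}ms".format(
                time.strftime("%F %T"), osd.get("id"), commit, apply_ms),
                flush=True)

    time.sleep(INTERVAL)
```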
Thanks.
Troy