Slow moving of storage

deepcloud

Hello team,

I am on the latest version of Proxmox with the enterprise repo.

We are moving the disks from a ZFS storage to a Ceph storage, all on enterprise PCIe 4.0 NVMe SSDs (2 per node across 3 nodes = 6 SSDs) with a 100G Ethernet network.
So the IO throughput should be high and should not be an issue.

It's stuck at "drive-scsi0: mirror-job finished." for a long time, over 30–40 minutes for each disk being moved. Any ideas on how to fix this?

 
In general, you should be careful with only 2 OSDs per node. If one fails, the other must be able to take over the entire capacity and still have some write reserve. So never fill the pool above 40%.
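
As a rough example with two equal OSDs per node:

    each OSD filled to 40%  ->  one OSD fails  ->  the survivor takes over and sits at ~80%
    that still leaves ~20% of headroom for rebalancing and new writes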

During the next migration, can you check whether you see high utilization on the disks or in the network?
I use the tool nmon to check; it can display disk and network stats together.
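
For example (nmon is in the Debian repos; iostat and sar come with the sysstat package, adjust to whatever you have installed):

    apt install nmon sysstat
    nmon            # press d for disks, n for network
    iostat -x 2     # per-device utilization (%util column)
    sar -n DEV 2    # per-interface throughput
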
Hi Falk,

Thanks for your reply. We are in the process of migrating from the old Proxmox cluster (EPYC 7002 based) to the new one (EPYC 9554 based). As we move the workload to the new cluster, we will move the SSDs from the old cluster over as well. These SSDs are WD SN650 15.36TB PCIe Gen4, so they are pretty fast too.

We have a 100G network in place, but utilization is not crossing 8 Gbit/s, as seen in the screenshot below (around 800 MByte/s, roughly 6.4 Gbit/s).

My concern is the low Ceph performance, when each disk is capable of over 2 GByte/s of write throughput.

(screenshots: network and disk utilization during the move)
 
Please show me your network configuration and the Ceph configuration.
 
I have small setups with 3 nodes and 4 small 1.9TB NVMe drives, where I achieve up to 60 Gbit utilization in Ceph. Therefore I suspect some small configuration error.
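
To rule out the mover job itself, a raw benchmark directly against the pool would also help, for example something like this (the pool name is just a placeholder, use your RBD pool):

    rados bench -p <poolname> 60 write -b 4M -t 16
    ceph tell osd.0 bench        # write test against a single OSD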
 
You can increase the throughput somewhat by activating jumbo frames (MTU 9000), but the network switches must first be checked to ensure that jumbo frames are permitted everywhere.
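
A minimal sketch of what that could look like in /etc/network/interfaces (interface names are just examples, adapt them to your bond/bridge and set the MTU on every Ceph node):

    auto bond0
    iface bond0 inet manual
        bond-slaves ens1f0np0 ens1f1np1
        bond-mode 802.3ad
        mtu 9000

Afterwards you can verify the path end-to-end with ping (8972 = 9000 minus 28 bytes of IP/ICMP headers):

    ping -M do -s 8972 10.10.10.117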

I wonder what the VLAN1000 is for, but I can probably tell once you have posted the Ceph configuration.
 
Ceph config.

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.112/24
fsid = 24cd5cef-ad3f-44a6-a257-a5f6a7080180
mon_allow_pool_delete = true
mon_host = 10.10.10.112 10.10.10.217 10.10.10.117
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.112/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.w12]
host = w12
mds_standby_for_name = pve

[mds.w17]
host = w17
mds_standby_for_name = pve

[mds.w27]
host = w27
mds_standby_for_name = pve

[mon.w12]
public_addr = 10.10.10.112

[mon.w17]
public_addr = 10.10.10.117

[mon.w27]
public_addr = 10.10.10.217
 
Will check the jumbo frames tonight. VLAN1000 is there so that the storage network is segmented and nobody unauthorized has access to it.
 
You have already given the 100G Bond an IP. What is this used for? Normally I use the Ceph NICs for Ceph only and use the network natively.

Also be aware that with the ConnectX-4 NICs you get a maximum throughput of about 64 Gbit per port.
If these are used cards, they were converted from InfiniBand to Ethernet at some point, as they are usually delivered in InfiniBand mode.
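
If you want to double-check the configured port type, something like this should work with the Mellanox firmware tools installed (the device path is just an example, mst status shows the real one):

    mst start
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
    # 1 = InfiniBand, 2 = Ethernet; switch both ports to Ethernet if needed:
    mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2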
 
We had received them as IB and had changed them to Ethernet. Thanks for letting me know that we can only get up to 64 Gbit per port, so about 6400 MByte/s. Internally it is 4 channels (25G * 4 = 100G), so each channel should be limited to roughly 1800 MByte/s, but we are still nowhere near that.

Hope this thread also helps others optimize their setups and get better output from their existing infrastructure.
 
The limitation comes from the PCIe3 bus. The 16 lanes are divided into 8 lanes per port and thus a theoretical maximum of 64 GBit is possible. Realistically, 60-62 GBit is more likely.
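
For reference, the numbers work out like this (PCIe 3.0 delivers roughly 985 MByte/s per lane after 128b/130b encoding):

    8 lanes x ~985 MByte/s ≈ 7.9 GByte/s ≈ 63 Gbit/s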
 
Hi Falk,

I agree that the Mellanox ConnectX-4 is a PCIe Gen3 card, and even though I have a PCIe Gen5 board it will run at Gen3 speeds. But at x16 that is about 15.7 GByte/s, which is more than 100 Gbit/s, so I am not sure that is the bottleneck; it must be somewhere else.

 
And if you look at the screenshot, Ceph is not crossing 500–550 MByte/s. Is that due to the 3-way replication and the 3 copies being written to the disks?

(screenshot: Ceph throughput around 500–550 MByte/s)
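
As a rough sanity check of those numbers (assuming the graph shows client throughput): with size=3 every client write is written three times, so

    500 MByte/s of client writes x 3 replicas ≈ 1.5 GByte/s of raw writes across the 6 OSDs,
    of which roughly 1 GByte/s travels over the cluster network to the replica OSDs.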
 
This is a dual-port NIC, and each port has only 8 lanes.
 
