Slow moving of storage

deepcloud

Hello team,

I am on the latest version of Proxmox with the enterprise repo.

We are moving the disks from a ZFS storage to a Ceph storage, all on enterprise PCIe 4.0 NVMe SSDs (2 per node across 3 nodes = 6 SSDs) with a 100G Ethernet network.
So the IO throughput should be high and should not be an issue.

It's stuck at "drive-scsi0: mirror-job finished." for a long time, over 30–40 minutes for each disk being moved. Any ideas on how to fix this?

 
In general, you should be careful with only 2 OSDs per node. If one fails, the other must be able to take over the entire capacity and still have some write reserve. So never fill the pool above 40%.
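
As a rough example with two equal OSDs per node:

    each OSD filled to 40%  ->  one OSD fails  ->  the survivor takes over and sits at ~80%
    that still leaves ~20% of headroom for rebalancing and new writes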

During the next migration, can you check whether you see high utilization on the disks or in the network?
I use the tool nmon to check; it can display disk and network stats together.
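
For example (nmon is in the Debian repos; iostat and sar come with the sysstat package, adjust to whatever you have installed):

    apt install nmon sysstat
    nmon            # press d for disks, n for network
    iostat -x 2     # per-device utilization (%util column)
    sar -n DEV 2    # per-interface throughput
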
Hi Falk,

Thanks for your reply. We are in the process of migrating from the old Proxmox cluster (EPYC 7002 based) to the new one (EPYC 9554 based). As we move the workload to the new cluster, we will move the SSDs from the old cluster over as well. These SSDs are WD SN650 15.36TB PCIe Gen4, so they are pretty fast too.

We have a 100G network in place, but utilization is not crossing 8 Gbit/s, as seen in the screenshot below (around 800 MByte/s, roughly 6.4 Gbit/s).

My concern is the low Ceph performance, when each disk is capable of over 2 GByte/s of write throughput.

(screenshots: network and disk utilization during the move)
 
Please show me your network configuration and the Ceph configuration.
 
I have small setups with 3 nodes and 4 small 1.9TB NVMe drives, where I achieve up to 60 Gbit utilization in Ceph. Therefore I suspect some small configuration error.
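
To rule out the mover job itself, a raw benchmark directly against the pool would also help, for example something like this (the pool name is just a placeholder, use your RBD pool):

    rados bench -p <poolname> 60 write -b 4M -t 16
    ceph tell osd.0 bench        # write test against a single OSD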
 
You can increase the throughput somewhat by activating jumbo frames (MTU 9000), but the network switches must first be checked to ensure that jumbo frames are permitted everywhere.
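
A minimal sketch of what that could look like in /etc/network/interfaces (interface names are just examples, adapt them to your bond/bridge and set the MTU on every Ceph node):

    auto bond0
    iface bond0 inet manual
        bond-slaves ens1f0np0 ens1f1np1
        bond-mode 802.3ad
        mtu 9000

Afterwards you can verify the path end-to-end with ping (8972 = 9000 minus 28 bytes of IP/ICMP headers):

    ping -M do -s 8972 10.10.10.117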

I wonder what the VLAN1000 is for, but I can probably tell once you have posted the Ceph configuration.
 
Ceph config.

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.112/24
fsid = 24cd5cef-ad3f-44a6-a257-a5f6a7080180
mon_allow_pool_delete = true
mon_host = 10.10.10.112 10.10.10.217 10.10.10.117
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.112/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.w12]
host = w12
mds_standby_for_name = pve

[mds.w17]
host = w17
mds_standby_for_name = pve

[mds.w27]
host = w27
mds_standby_for_name = pve

[mon.w12]
public_addr = 10.10.10.112

[mon.w17]
public_addr = 10.10.10.117

[mon.w27]
public_addr = 10.10.10.217
 
Will check the jumbo frames tonight. VLAN1000 is there so that the storage network is segmented and nobody unauthorized has access to it.
 
You have already given the 100G Bond an IP. What is this used for? Normally I use the Ceph NICs for Ceph only and use the network natively.

Also be aware that with the ConnectX-4 NICs you get a maximum throughput of about 64 Gbit per port.
If these are used cards, they were converted from InfiniBand to Ethernet at some point, as they are usually delivered in InfiniBand mode.
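
If you want to double-check the configured port type, something like this should work with the Mellanox firmware tools installed (the device path is just an example, mst status shows the real one):

    mst start
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
    # 1 = InfiniBand, 2 = Ethernet; switch both ports to Ethernet if needed:
    mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2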
 
We had received them as IB and had changed them to Ethernet. Thanks for letting me know that we can only get up to 64 Gbit per port, so about 6400 MByte/s. Internally it is 4 channels (25G * 4 = 100G), so each channel should be limited to roughly 1800 MByte/s, but we are still nowhere near that.

Hope this thread also helps others optimize their setups and get better output from their existing infrastructure.
 
The limitation comes from the PCIe3 bus. The 16 lanes are divided into 8 lanes per port and thus a theoretical maximum of 64 GBit is possible. Realistically, 60-62 GBit is more likely.
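
For reference, the numbers work out like this (PCIe 3.0 delivers roughly 985 MByte/s per lane after 128b/130b encoding):

    8 lanes x ~985 MByte/s ≈ 7.9 GByte/s ≈ 63 Gbit/s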
 
Hi Falk,

I agree that the Mellanox ConnectX-4 is a PCIe Gen3 card, and even though I have a PCIe Gen5 board it will run at Gen3 speeds. But at x16 that is about 15.7 GByte/s, which is more than 100 Gbit/s, so I am not sure that is the bottleneck; it must be somewhere else.

 
And if you look at the screenshot, Ceph is not crossing 500–550 MByte/s. Is that due to the 3-way replication and the 3 copies being written to the disks?

(screenshot: Ceph throughput around 500–550 MByte/s)
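
As a rough sanity check of those numbers (assuming the graph shows client throughput): with size=3 every client write is written three times, so

    500 MByte/s of client writes x 3 replicas ≈ 1.5 GByte/s of raw writes across the 6 OSDs,
    of which roughly 1 GByte/s travels over the cluster network to the replica OSDs.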
 
This is a dual-port NIC, and each port has only 8 lanes.
 
