Virtual Machines and Containers extremely slow

weirdwiesel

Member
Jul 20, 2021
Dear Proxmox experts,

For some days now, the performance of every VM and container in my cluster has been extremely slow.

Here is some general info about my setup:
I am running a 3-node Proxmox cluster with up-to-date packages.
proxmox-ve: 8.0.2 (running kernel: 6.2.16-12-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
pve-kernel-5.15: 7.4-3
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
proxmox-kernel-6.2: 6.2.16-12
proxmox-kernel-6.2.16-10-pve: 6.2.16-10
proxmox-kernel-6.2.16-8-pve: 6.2.16-8
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.8
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-5
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

All three cluster nodes are almost identical in their hardware specs:
CPU
Intel® Core™ i7-12650H processor, 10 cores/16 threads
(24 MB cache, up to 4.70 GHz)
GPU
Intel® UHD Graphics for Intel® Processors 12th gen (frequency 1.40 GHz)

RAM
DDR4 16GB×2 Dual Channel SODIMM

Storage
M.2 2280 512 GB PCIe4.0 SSD

In each node, I have added a Samsung QVO 4 TB 2.5-inch SATA SSD and put them into a Ceph cluster. This is the Ceph config:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.0.1/24
fsid = fe55267f-6e22-4e16-b49e-3ff82fa193a4
mon_allow_pool_delete = true
mon_host = 10.0.0.1 10.0.0.2 10.0.0.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.0.0.1/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.prx-host01]
host = prx-host01
mds_standby_for_name = pve
[mds.prx-host02]
host = prx-host02
mds_standby_for_name = pve
[mds.prx-host03]
host = prx-host03
mds_standby_for_name = pve
[mon.prx-host01]
public_addr = 10.0.0.1
[mon.prx-host02]
public_addr = 10.0.0.2
[mon.prx-host03]
public_addr = 10.0.0.3

Since there are two 2.5 GBit network ports on each node, I have separated the Ceph network from the normal data-access network to the machines (sketched below). All nodes are connected through a Ubiquiti Enterprise switch, which supports 2.5 GBit connections. The throughput on the switch is not very high in general:
1695025127316.png
The blue graph shows download and the purple graph shows upload traffic.
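
For reference, this is roughly how the two networks are split on each node. The interface names, the bridge, and the management address below are placeholders, not the exact values from my nodes:
Code:
# /etc/network/interfaces (simplified sketch)
auto lo
iface lo inet loopback

iface enp1s0 inet manual

# management / VM traffic on the first 2.5 GBit port via the default bridge
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11/24
        gateway 192.168.1.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0

# second 2.5 GBit port, dedicated to the Ceph public/cluster network
auto enp2s0
iface enp2s0 inet static
        address 10.0.0.1/24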


Current situation
All three systems have 4-10 VMs and containers running, and the CPU load is very low. RAM usage is between 40 and 55% on all hosts, and the system storage is also not running full. Here are screenshots of the hosts:
1695025462441.png
1695025481007.png
1695025518210.png

What I noticed is the high IO delay - some forum posts say it shouldn't be over 10%. I assume this is the reason for the really bad performance.

For instance, I tried to issue "docker ps" on one of the Ubuntu machines, and it took several minutes for the system to display the output - this is definitely not normal. Here are the specs for this particular machine:
1695025747449.png
1695025767209.png
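
To see where the time goes, I have been watching the IO wait on the host and inside the guest roughly like this (sysstat is the only extra package needed; device names will of course differ):
Code:
apt install sysstat
# extended stats per device: %util, await (ms per request), request queue size
iostat -x 2 5
# "top" also shows the share of CPU time spent waiting for IO in the "wa" column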

I read through several posts about low Ceph performance and issued the following commands to test my Ceph cluster:
ceph -s
Code:
  cluster:
    id:     fe55267f-6e22-4e16-b49e-3ff82fa193a4
    health: HEALTH_WARN
            Module 'restful' has failed dependency: PyO3 modules may only be initialized once per interpreter process
            1 subtrees have overcommitted pool target_size_bytes
 
  services:
    mon: 3 daemons, quorum prx-host01,prx-host02,prx-host03 (age 11h)
    mgr: prx-host01(active, since 11h), standbys: prx-host02, prx-host03
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 11h), 3 in (since 4d)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 105.13k objects, 403 GiB
    usage:   1.2 TiB used, 9.7 TiB / 11 TiB avail
    pgs:     96 active+clean
             1  active+clean+scrubbing+deep
 
  io:
    client:   6.6 MiB/s rd, 666 KiB/s wr, 84 op/s rd, 78 op/s wr

ceph tell osd.x bench
Code:
ceph tell osd.0 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 11.266977012,
    "bytes_per_sec": 95299903.679256752,
    "iops": 22.721267623724163
}

ceph tell osd.1 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 12.500486085,
    "bytes_per_sec": 85896005.699205577,
    "iops": 20.479203629304308
}

 ceph tell osd.2 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 11.687114086999999,
    "bytes_per_sec": 91873991.817566141,
    "iops": 21.904466585532699
}
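
In addition to the per-OSD bench, I could also benchmark at the pool level with rados bench (poolname is a placeholder for the RBD pool, 60 is the runtime in seconds):
Code:
# 60 s write test, keeping the objects so the read tests have something to read
rados bench -p <poolname> 60 write --no-cleanup
# sequential and random read tests against the objects written above
rados bench -p <poolname> 60 seq
rados bench -p <poolname> 60 rand
# remove the benchmark objects again
rados -p <poolname> cleanup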

I also moved some volumes off of the Ceph storage, and it takes a VERY long time to copy: 10 GB took about 30 minutes to move from the Ceph pool (on the SATA SSDs) to the local NVMe storage, which works out to roughly 10 GB / 1800 s ≈ 5-6 MB/s.

The problems started, I think, when I exchanged the 3 SSDs. At first I had a 1 TB Samsung EVO connected to each node. I exchanged them one by one by doing the following (a rough CLI equivalent is sketched after the list):
  1. disable backfilling-flag
  2. set OSD to down
  3. set OSD to out
  4. removed the OSD-entry from the cluster-manager
  5. shutdown the node and replace the drives
  6. add in the new OSD in the cluster-manager
  7. set OSD to in and up
  8. enabled backfilling-flag
  9. wait till backfilling finished, then move on to the next node
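
For completeness, this is roughly the CLI equivalent of the GUI steps above (I actually used the web UI; osd.0 and /dev/sdb are placeholders for the OSD being replaced and the new disk on each node):
Code:
# 1. pause backfilling while the OSD is swapped
ceph osd set nobackfill
# 2./3. mark the OSD down and out
ceph osd down 0
ceph osd out 0
# 4. remove the OSD entry (GUI "Destroy"); purge also removes it from the CRUSH map and deletes its auth key
ceph osd purge 0 --yes-i-really-mean-it
# 5. shut down the node and swap the drive
# 6./7. create the new OSD on the new disk; it joins as up/in once created
pveceph osd create /dev/sdb
# 8. allow backfilling again, then wait for HEALTH_OK before the next node
ceph osd unset nobackfill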
 

Now the Apply/Commit latency is very high (in my opinion):
1695026330359.png

The write speed is also very low in my opinion, at way under 1 MiB/s:
1695030809314.png

Today I also adjusted the only CephFS pool I added by setting a specific target size, so it looks like this right now:
1695026398393.png
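
The adjustment itself is just a target_size_bytes on the data pool so the PG autoscaler knows what to expect; on the CLI it would look something like this (pool name and size are placeholders for the values I picked in the GUI):
Code:
# tell the pg_autoscaler how large this pool is expected to grow
ceph osd pool set <cephfs_data_pool> target_size_bytes 1T
# alternatively, a relative ratio instead of an absolute size
ceph osd pool set <cephfs_data_pool> target_size_ratio 0.2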

And this is the overview of all pools:
1695026428140.png
The Ceph setup was created using (almost always) the default values. I created the Ceph instance BEFORE I upgraded from Proxmox 7 to 8, and it is running Ceph Quincy now. The slow performance started well after upgrading the software packages, so I don't think that can be the reason.

I assume it's me not configuring Ceph the right way, so the error sits in front of the device, because I'm a newbie in the whole Ceph thing.

Right now I'm moving everything off of the Ceph pool in case I need to recreate it, and to test whether the performance is better on the local LVM storage.
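
To compare the two storages I plan to run the same small sync-write fio job once against a disk on the Ceph pool and once against local-lvm (the target file path is a placeholder):
Code:
# 4k random writes with an fsync after every write, queue depth 1
fio --name=synctest --filename=/mnt/test/fio.bin --size=2G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --fsync=1 \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based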

Feel free to ask for further info and tests I should run.

Thanks so much in advance for helping me out!
 
