Hi
Last week I upgraded our cluster of 3 identical nodes to Proxmox 6.2. The upgrade itself went through without problems, but since then the CPU cores are often waiting for I/O.
Sometimes all cores are waiting, which blocks the clients from accessing files for a few seconds. Since one of the client VMs is a Samba server, this is really problematic.
I think it is a Ceph RBD problem, but I cannot figure out how to fix it. I tried to identify what is using so much I/O with iotop, but when the waiting happens there is actually very little throughput, only a few KB/s. The problem is hard to reproduce reliably, but one case that has worked several times is installing a lot of Debian packages (e.g. upgrading stretch to buster) in a container. The stall then happens while dpkg unpacks a package and can put all 8 cores of the host at 100% I/O wait for several seconds.
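For anyone who wants to try it, this is roughly how I trigger and observe it; the container ID is just an example, and any large package installation should do:

# on the PVE host, watch I/O wait and per-device latency
vmstat 1       # the "wa" column is the I/O wait percentage
iostat -x 1    # from the sysstat package; shows per-device await/util

# in parallel, inside one of the Debian containers (100 is just an example CT ID)
pct enter 100
apt-get update && apt-get dist-upgrade -y   # or install any big set of packages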
I have already tried updating BIOS/firmware and the network drivers, and shutting down the whole cluster including the switches and starting it again. For hours (actually days) now I have been trying to find the source of the problem.
Any advice is appreciated.
The servers are Supermicro 1029P-WTRT with:
- Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz (8 cores)
- 2 disks per node for Ceph (NVMe SSDs)
- Ceph network using Intel X722 10GBASE-T
- 64 GB RAM
There are about 17 containers running Debian Buster, 2 Windows Server 2019 VMs and 3 Linux VMs.
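Package versions on the nodes (pveversion -v):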
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-11
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-10
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
Ceph itself is fast enough:
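For reference, these numbers are from a 10-second rados write benchmark, i.e. an invocation roughly like this (pool name replaced by a placeholder):

rados bench -p <pool> 10 write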
Total time run: 10.081
Total writes made: 1995
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 791.585
Stddev Bandwidth: 71.7229
Max bandwidth (MB/sec): 852
Min bandwidth (MB/sec): 600
Average IOPS: 197
Stddev IOPS: 17.9307
Max IOPS: 213
Min IOPS: 150
Average Latency(s): 0.0807779
Stddev Latency(s): 0.0511501
Max latency(s): 0.471042
Min latency(s): 0.0213218
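The cluster status at the time (ceph -s):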
cluster:
id: 20d9beef-c58c-434e-b025-f14db5e1c5b3
health: HEALTH_WARN
1 nearfull osd(s)
1 pool(s) nearfull
services:
mon: 3 daemons, quorum pm1,pm2,pm3 (age 88m)
mgr: pm3(active, since 89m), standbys: pm1, pm2
osd: 6 osds: 6 up (since 88m), 6 in
data:
pools: 1 pools, 256 pgs
objects: 786.57k objects, 3.0 TiB
usage: 8.9 TiB used, 2.0 TiB / 11 TiB avail
pgs: 256 active+clean
io:
client: 682 B/s rd, 193 KiB/s wr, 0 op/s rd, 36 op/s wr
The following benchmark also produces a lot of I/O wait:
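Judging from the header line below, this was an rbd bench invocation roughly like the following (pool/image name replaced by a placeholder):

rbd bench --io-type write --io-size 8192 --io-threads 512 --io-total 1G --io-pattern seq <pool>/<image>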
bench type write io_size 8192 io_threads 512 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 8192 8337.17 68298112.19
2 13824 7000.01 57344049.14
3 19456 6466.33 52972148.98
4 25088 6201.56 50803144.23
5 30208 6071.15 49734872.56
6 35840 5503.19 45082127.65
7 40960 5418.53 44388638.45
8 47104 5534.03 45334789.54
9 51712 5363.42 43937146.07
10 57856 5425.44 44445174.23
12 61440 4147.77 33978512.18
13 62464 3221.09 26387195.40
14 64000 2530.86 20732796.36
15 66048 2361.00 19341339.05
16 70656 2159.25 17688543.85
17 73216 2093.89 17153106.81
18 73728 2410.96 19750591.83
20 74752 1950.65 15979763.67
21 79360 2233.56 18297314.55
22 83968 2233.56 18297314.53
23 89088 3080.75 25237486.74
24 94208 3644.13 29852722.20
25 100352 5293.64 43365461.77
26 105472 5226.59 42816188.99
27 110592 5270.00 43171810.64
28 116224 5329.15 43656381.49
29 121344 5435.90 44530908.30
30 126464 5311.64 43512953.41
elapsed: 31 ops: 131072 ops/sec: 4164.72 bytes/sec: 34117397.08
And it gets much faster, with much less I/O wait, when using a smaller io_threads value:
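(This is the same rbd bench invocation as above, just with --io-threads 64.)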
bench type write io_size 8192 io_threads 64 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 25536 25196.86 206412711.79
2 49600 24832.01 203423855.68
3 72832 24234.06 198525385.43
4 96448 24128.01 197656684.32
5 120832 24121.32 197601870.32
elapsed: 5 ops: 131072 ops/sec: 23092.33 bytes/sec: 189172376.95
If I don't find a solution, I could also use hints for good workarounds (Ceph alternatives), since this is a production system and the users are getting unhappy.
Thanks
Raffael