I/O wait after upgrading from Proxmox 5.x to 6.2 and Ceph Luminous to Nautilus

raffael

Hi

Last week I upgraded our cluster of 3 identical nodes to Proxmox 6.2. The upgrade itself went without problems, but since then the CPU cores are often waiting for I/O.

Sometimes all cores are waiting, which blocks the clients from accessing files for several seconds. Since one of the client VMs is a Samba server, this is a real problem.

I think it is a Ceph RBD problem, but I cannot figure out how to fix it. I tried to identify what is using so much I/O with iotop, but when the waiting happens there is actually very little throughput, only a few KB/s. It is hard to reproduce the problem reliably, but one case that worked several times was installing a lot of Debian packages (like upgrading stretch to buster) in a container. The stall happens when dpkg unpacks a package and can put all 8 cores of the host at 100% I/O wait for several seconds.
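While reproducing this I watch the per-core I/O wait with something like the following (mpstat and iostat come with the sysstat package; the 1-second interval is just an example):
Code:
# per-core utilisation including %iowait, refreshed every second
mpstat -P ALL 1
# per-device queue depth and write latency
iostat -x 1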

I have already tried updating BIOS/firmware and network drivers, and shutting down the whole cluster including the switches and starting it again. I have been trying for hours (actually days) to find the source of the problem.

Any advice is appreciated.


Servers are Supermicro 1029P-WTRT with

- Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz (8 Cores)

- 2 disks per node for Ceph (NVMe SSD)

- Ceph network using Intel X722 10GBASE-T

- 64 GB RAM


There are about 17 containers running Debian buster, 2 Windows Server 2019 VMs and 3 Linux VMs.


proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)

pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)

pve-kernel-5.4: 6.2-4

pve-kernel-helper: 6.2-4

pve-kernel-5.4.44-2-pve: 5.4.44-2

pve-kernel-4.15: 5.4-19

pve-kernel-4.13: 5.2-2

pve-kernel-4.15.18-30-pve: 4.15.18-58

pve-kernel-4.15.18-26-pve: 4.15.18-54

pve-kernel-4.15.18-24-pve: 4.15.18-52

pve-kernel-4.15.18-21-pve: 4.15.18-48

pve-kernel-4.15.18-20-pve: 4.15.18-46

pve-kernel-4.15.18-18-pve: 4.15.18-44

pve-kernel-4.15.18-13-pve: 4.15.18-37

pve-kernel-4.15.18-11-pve: 4.15.18-34

pve-kernel-4.15.18-10-pve: 4.15.18-32

pve-kernel-4.15.18-9-pve: 4.15.18-30

pve-kernel-4.15.18-8-pve: 4.15.18-28

pve-kernel-4.15.18-1-pve: 4.15.18-19

pve-kernel-4.13.16-4-pve: 4.13.16-51

pve-kernel-4.13.16-2-pve: 4.13.16-48

pve-kernel-4.13.13-6-pve: 4.13.13-42

pve-kernel-4.13.13-5-pve: 4.13.13-38

pve-kernel-4.13.13-4-pve: 4.13.13-35

pve-kernel-4.13.13-3-pve: 4.13.13-34

pve-kernel-4.13.13-1-pve: 4.13.13-31

pve-kernel-4.13.4-1-pve: 4.13.4-26

ceph: 14.2.9-pve1

ceph-fuse: 14.2.9-pve1

corosync: 3.0.4-pve1

criu: 3.11-3

glusterfs-client: 5.5-3

ifupdown: residual config

ifupdown2: 3.0.0-1+pve2

ksm-control-daemon: 1.3-1

libjs-extjs: 6.0.1-10

libknet1: 1.16-pve1

libproxmox-acme-perl: 1.0.4

libpve-access-control: 6.1-2

libpve-apiclient-perl: 3.0-3

libpve-common-perl: 6.1-5

libpve-guest-common-perl: 3.1-1

libpve-http-server-perl: 3.0-6

libpve-storage-perl: 6.2-5

libqb0: 1.0.5-1

libspice-server1: 0.14.2-4~pve6+1

lvm2: 2.03.02-pve4

lxc-pve: 4.0.2-1

lxcfs: 4.0.3-pve3

novnc-pve: 1.1.0-1

proxmox-mini-journalreader: 1.1-1

proxmox-widget-toolkit: 2.2-9

pve-cluster: 6.1-8

pve-container: 3.1-11

pve-docs: 6.2-5

pve-edk2-firmware: 2.20200531-1

pve-firewall: 4.1-2

pve-firmware: 3.1-1

pve-ha-manager: 3.0-9

pve-i18n: 2.1-3

pve-qemu-kvm: 5.0.0-11

pve-xtermjs: 4.3.0-1

qemu-server: 6.2-10

smartmontools: 7.1-pve2

spiceterm: 3.1-1

vncterm: 1.6-1

zfsutils-linux: 0.8.4-pve1


Ceph itself is fast enough:


Total time run: 10.081

Total writes made: 1995

Write size: 4194304

Object size: 4194304

Bandwidth (MB/sec): 791.585

Stddev Bandwidth: 71.7229

Max bandwidth (MB/sec): 852

Min bandwidth (MB/sec): 600

Average IOPS: 197

Stddev IOPS: 17.9307

Max IOPS: 213

Min IOPS: 150

Average Latency(s): 0.0807779

Stddev Latency(s): 0.0511501

Max latency(s): 0.471042

Min latency(s): 0.0213218
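For reference, those numbers are from a short rados write bench, roughly like this (the pool name is a placeholder; the 10-second runtime and 4 MiB write size match the output above):
Code:
# 10-second write benchmark with 4 MiB objects against the pool
rados bench -p <pool> 10 write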


cluster:

id: 20d9beef-c58c-434e-b025-f14db5e1c5b3

health: HEALTH_WARN

1 nearfull osd(s)

1 pool(s) nearfull



services:

mon: 3 daemons, quorum pm1,pm2,pm3 (age 88m)

mgr: pm3(active, since 89m), standbys: pm1, pm2

osd: 6 osds: 6 up (since 88m), 6 in



data:

pools: 1 pools, 256 pgs

objects: 786.57k objects, 3.0 TiB

usage: 8.9 TiB used, 2.0 TiB / 11 TiB avail

pgs: 256 active+clean



io:

client: 682 B/s rd, 193 KiB/s wr, 0 op/s rd, 36 op/s wr

This also produces a lot of I/O wait:
bench type write io_size 8192 io_threads 512 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 8192 8337.17 68298112.19
2 13824 7000.01 57344049.14
3 19456 6466.33 52972148.98
4 25088 6201.56 50803144.23
5 30208 6071.15 49734872.56
6 35840 5503.19 45082127.65
7 40960 5418.53 44388638.45
8 47104 5534.03 45334789.54
9 51712 5363.42 43937146.07
10 57856 5425.44 44445174.23
12 61440 4147.77 33978512.18
13 62464 3221.09 26387195.40
14 64000 2530.86 20732796.36
15 66048 2361.00 19341339.05
16 70656 2159.25 17688543.85
17 73216 2093.89 17153106.81
18 73728 2410.96 19750591.83
20 74752 1950.65 15979763.67
21 79360 2233.56 18297314.55
22 83968 2233.56 18297314.53
23 89088 3080.75 25237486.74
24 94208 3644.13 29852722.20
25 100352 5293.64 43365461.77
26 105472 5226.59 42816188.99
27 110592 5270.00 43171810.64
28 116224 5329.15 43656381.49
29 121344 5435.90 44530908.30
30 126464 5311.64 43512953.41
elapsed: 31 ops: 131072 ops/sec: 4164.72 bytes/sec: 34117397.08

It gets much faster, and with less I/O wait, with a smaller io-threads number:

bench type write io_size 8192 io_threads 64 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 25536 25196.86 206412711.79
2 49600 24832.01 203423855.68
3 72832 24234.06 198525385.43
4 96448 24128.01 197656684.32
5 120832 24121.32 197601870.32
elapsed: 5 ops: 131072 ops/sec: 23092.33 bytes/sec: 189172376.95
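For completeness: both runs above are rbd bench with identical parameters except --io-threads, roughly like this (pool and image names are placeholders):
Code:
# sequential 8 KiB writes, 1 GiB in total; only the thread count differs between the runs
rbd bench --io-type write --io-size 8192 --io-threads 512 --io-total 1073741824 --io-pattern seq <pool>/<image>
rbd bench --io-type write --io-size 8192 --io-threads 64 --io-total 1073741824 --io-pattern seq <pool>/<image>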

If I can't find a solution, I would also appreciate hints for good workarounds (Ceph alternatives), since this is a production system and the users are getting unhappy.

Thanks
Raffael
 
usage: 8.9 TiB used, 2.0 TiB / 11 TiB avail
Add more OSDs. A drive failure will result in a stalled cluster (no space left), since the data can't be placed on the other OSDs.
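On PVE 6 an unused disk can be initialised as an additional OSD with pveceph, roughly like this (the device name is just an example):
Code:
# create a new OSD on an unused disk and watch the rebalance afterwards
pveceph osd create /dev/nvme2n1
ceph -s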

It gets much faster, and with less I/O wait, with a smaller io-threads number
Of course, the test runs 512 threads on a CPU with 16 threads.

Go through the Ceph release notes, many things have changed (e.g. osd_memory_target).
https://docs.ceph.com/docs/master/releases/nautilus/
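For example, osd_memory_target can be checked on a running OSD and changed cluster-wide via the config database; a rough sketch (the OSD id and the 4 GiB value are just examples):
Code:
# what an OSD is actually running with (run on the node hosting osd.0)
ceph daemon osd.0 config show | grep osd_memory_target
# set a new default for all OSDs, in bytes (4 GiB here)
ceph config set osd osd_memory_target 4294967296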
 
Hi Alwin

Thanks for your answer.

Add more OSDs. A drive failure will result in a stalled cluster (no space left), since the data can't be placed on the other OSDs.
I had already planned to add a second pool with new SATA drives (not NVMe like the others) and distribute the data. But since Ceph now misbehaves I'm not sure if this is the way to go. The reason for Ceph in our setup is to not have a single point of failure (in hardware).
I'm now thinking about adding two disks per node in a RAID 1 and syncing the data between the nodes. But the Ceph solution (if it works) still sounds better to me.

Of course, the test runs 512 threads on a CPU with 16 threads.
I know, but I was trying to somehow reproduce the problems I'm experiencing.

Go through the Ceph release notes, many things have changed (e.g. osd_memory_target).
https://docs.ceph.com/docs/master/releases/nautilus/
Thanks, I'll use that for further investigation.
 
I think I'm starting to narrow down the problem. With the help of iostat I found that a lot of write requests get added to the queue of an NVMe disk. It happens on all the disks from time to time, and then the w_await time goes up drastically. Using iotop I found that it is the rocksdb:low1 thread of ceph-osd that blocks the CPU.

In iostat -x this looks like this:
Code:
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    2.00      0.00     24.00     0.00    46.00   0.00  95.83    0.00    0.00   0.00     0.00    12.00   2.00   0.40
nvme1n1          0.00 1495.00      0.00   3924.00     0.00  6099.00   0.00  80.31    0.00  352.39 523.78     0.00     2.62   0.67 100.00

iotop
Code:
Total DISK READ:         0.00 B/s | Total DISK WRITE:      1573.47 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       3.43 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                                                    
   2306 be/4 ceph        0.00 B/s 1533.22 K/s  0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]


Searching through the Ceph logs I see that RocksDB is "compacting". I don't know exactly what RocksDB does in Ceph, or why and when it compacts. I attached the ceph-osd log.
Does anyone know what this is? Is there a way to improve this behavior?
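From what I have read so far, RocksDB is the key/value database BlueStore keeps its metadata in, and a compaction can apparently also be triggered by hand per OSD, e.g. to see whether the stalls correlate with it (the OSD id is an example; I am not sure yet whether this helps):
Code:
# trigger a manual RocksDB compaction on a single OSD, preferably during a quiet period
ceph tell osd.3 compact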

Thanks
Raffael
 

There seems to be a firmware update. I'll try that in about two hours. I don't want to try it while the cluster is nearfull, in case the update kills a disk. And to move some data away I have to restart some containers that are currently in use. As soon as the users are gone I'll try it.
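To compare the firmware versions before and after, something like this should show them (nvme-cli and the already installed smartmontools; the device name is an example):
Code:
# list NVMe devices with model and current firmware revision (nvme-cli package)
nvme list
# more detail for a single drive
smartctl -a /dev/nvme0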


The answer rate here and on the Ceph list is great. Thank you very much!
 
BTW: we also have Ceph Octopus in the Ceph testing repo. Just in case.
Before I upgrade even further, I think I would rather try to downgrade back to Luminous, if that is at all possible?! o_O
But that is a last resort.

I tried to boot an older kernel from Proxmox 5.4 (pve-kernel-4.15.18-30-pve) to see if that has any effect, but it did not boot properly anymore.
 
Before I upgrade even further, I think I would rather try to downgrade back to Luminous, if that is at all possible?! o_O
No.

I tried to boot an older kernel from Proxmox 5.4 (pve-kernel-4.15.18-30-pve) to see if that has any effect, but it did not boot properly anymore.
4.15 is old and doesn't support all the features needed for krbd (e.g. containers). There have been 5.0 and 5.3 kernels, but at the moment I doubt it is a kernel problem.
 
Hi Alwin

I hope I have solved the problem by destroying and recreating all OSDs one at a time. It is still in progress but looks promising.
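In case it helps someone else, the cycle per OSD looks roughly like this (OSD id and device name are examples, the exact pveceph syntax may differ, and I wait for HEALTH_OK / all PGs active+clean between the steps):
Code:
# take the OSD out and let the data rebalance away
ceph osd out 3
# once all PGs are active+clean again, stop the daemon
systemctl stop ceph-osd@3
# remove the OSD, wipe the disk and recreate it on the same device
pveceph osd destroy 3 --cleanup
pveceph osd create /dev/nvme1n1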
Just wanted to thank you again for helping. Next time I'm in Vienna I'll bring a crate of beer ("Kasten Bier") to the Proxmox office.
Or send a message when you are in Zurich to get a free beer!

Cheers,
Raffael
 
From the mailing list: I agree that the queue size is unusually high for those NVMe OSDs. With the re-creation of the OSDs the BlueStore format changed to a newer version. Possibly it's fixed, but it might just come back when the OSDs are near full. Just to be cautious.
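To catch that early, it is worth keeping an eye on the per-OSD fill level and the nearfull warnings, for example:
Code:
# per-OSD usage and distribution
ceph osd df tree
# nearfull warnings show up here
ceph health detail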