Hi, I am Hans. I have been using Proxmox for quite some time now and have often found valuable help reading this community. Thanks a lot for so much valuable information!
Today I have some questions I could not solve on my own, so this is my first post.
I recently inherited a 3-node hyperconverged PVE 7.0 / Ceph cluster that ran, I'd say, flawlessly for quite some time, until a few weeks ago the disk I/O degraded massively and most of the guests became nearly unusable. It is most noticeable in Windows guests, which now take ages to perform disk operations, sit at around 100% disk utilization, and show response times of several hundred milliseconds.
Ceph shows "HEALTH_OK", and there is no scrub job, no backup job, and no known heavy I/O job running.
The cluster consists of 3 identical nodes, each providing 4 enterprise NVMe drives and 3 consumer-grade SATA SSDs (for bulk storage / very light workloads). Storage traffic runs on a dedicated 25G network.
These OSDs are used by the VMs and containers via two pools, named "ceph-nvme" and "ceph-ssd". A CRUSH map sorts this out.
What I noticed so far: one OSD (osd.14, one of the consumer-grade SATA SSDs) shows a significantly higher "Apply/Commit Latency", around 500 ms (sometimes even more), while all the other OSDs are usually at 0 or below 30 ms. This SSD also shows more wearout (S.M.A.R.T.) than the others (on the "Node" -> "Disks" page). Please have a look at the attached screenshots.
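For completeness, I believe the same per-OSD latency numbers the GUI shows can also be pulled on the CLI like this (I have not attached that output yet):
Code:
root@pve-node-02:~# ceph osd perf
# prints commit_latency(ms) and apply_latency(ms) per OSD;
# osd.14 is the one I expect to stand out here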
1.) As far as I understand, such high "Apply/Commit Latency" on this one device can dramatically slow down the whole Ceph I/O (of this pool?), right? So should I swap this osd.14 for a new SSD?
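If swapping it out is the right call, this is roughly how I would plan to proceed. Just a sketch, assuming the replacement disk shows up as /dev/sdX (placeholder), please correct me if any step is wrong:
Code:
# mark the OSD out and let Ceph rebalance the data off it
root@pve-node-01:~# ceph osd out 14
# once rebalancing has finished, stop the OSD daemon
root@pve-node-01:~# systemctl stop ceph-osd@14
# remove the OSD and clean up the disk
root@pve-node-01:~# pveceph osd destroy 14 --cleanup
# after physically replacing the disk, create a new OSD on it
root@pve-node-01:~# pveceph osd create /dev/sdX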
2.) But what seems totally weird to me is that the VMs suffering from I/O problems almost entirely use the "ceph-nvme" storage (for their system drives)... How can this be? As per the CRUSH configuration they should not touch the SSDs at all?
Maybe there is something wrong with my CRUSH? See some configuration details below, I hope this is helpful:
Code:
root@pve-node-02:~# ceph osd crush class ls
[
"nvme",
"ssd"
]
Code:
root@pve-node-02:~# ceph osd crush class ls-osd ssd
12
13
14
15
16
17
18
19
20
Code:
root@pve-node-02:~# ceph osd crush class ls-osd nvme
0
1
2
3
4
5
6
7
8
9
10
11
Code:
root@pve-node-02:~# ceph osd crush rule ls
replicated_rule
repl-ssd
Code:
root@pve-node-02:~# ceph osd crush tree --show-shadow
ID CLASS WEIGHT TYPE NAME
-12 ssd 16.37457 root default~ssd
-9 ssd 5.45819 host pve-node-01~ssd
12 ssd 1.81940 osd.12
13 ssd 1.81940 osd.13
14 ssd 1.81940 osd.14
-10 ssd 5.45819 host pve-node-02~ssd
15 ssd 1.81940 osd.15
16 ssd 1.81940 osd.16
17 ssd 1.81940 osd.17
-11 ssd 5.45819 host pve-node-03~ssd
18 ssd 1.81940 osd.18
19 ssd 1.81940 osd.19
20 ssd 1.81940 osd.20
-2 nvme 20.95917 root default~nvme
-4 nvme 6.98639 host pve-node-01~nvme
0 nvme 1.74660 osd.0
1 nvme 1.74660 osd.1
2 nvme 1.74660 osd.2
3 nvme 1.74660 osd.3
-6 nvme 6.98639 host pve-node-02~nvme
4 nvme 1.74660 osd.4
5 nvme 1.74660 osd.5
6 nvme 1.74660 osd.6
7 nvme 1.74660 osd.7
-8 nvme 6.98639 host pve-node-03~nvme
8 nvme 1.74660 osd.8
9 nvme 1.74660 osd.9
10 nvme 1.74660 osd.10
11 nvme 1.74660 osd.11
-1 37.33374 root default
-3 12.44458 host pve-node-01
0 nvme 1.74660 osd.0
1 nvme 1.74660 osd.1
2 nvme 1.74660 osd.2
3 nvme 1.74660 osd.3
12 ssd 1.81940 osd.12
13 ssd 1.81940 osd.13
14 ssd 1.81940 osd.14
-5 12.44458 host pve-node-02
4 nvme 1.74660 osd.4
5 nvme 1.74660 osd.5
6 nvme 1.74660 osd.6
7 nvme 1.74660 osd.7
15 ssd 1.81940 osd.15
16 ssd 1.81940 osd.16
17 ssd 1.81940 osd.17
-7 12.44458 host pve-node-03
8 nvme 1.74660 osd.8
9 nvme 1.74660 osd.9
10 nvme 1.74660 osd.10
11 nvme 1.74660 osd.11
18 ssd 1.81940 osd.18
19 ssd 1.81940 osd.19
20 ssd 1.81940 osd.20
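One thing I have not pasted yet is the mapping of the pools to the CRUSH rules. If I understand correctly, this is how to check which rule each pool actually uses and what a rule selects from (rule and pool names taken from the output above):
Code:
root@pve-node-02:~# ceph osd pool get ceph-nvme crush_rule
root@pve-node-02:~# ceph osd pool get ceph-ssd crush_rule
# dump a rule to see which root / device class it draws from
root@pve-node-02:~# ceph osd crush rule dump repl-ssd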
TL;DR: even assuming this one consumer SSD drags down performance, the effect should be limited to the SSD OSDs and the VMs using the SSD-based pool. So what is the connection to the slow I/O in the VMs that are supposed to use only NVMe storage?
Thanks for any help with this!
Best regards, Hans.