Hello,
I've upgraded a Proxmox 6.4-13 cluster with Ceph 15.2.x, which was running fine without any issues, to Proxmox 7.0-14 and Ceph 16.2.6. The cluster still works fine until a node is rebooted. Which OSDs generate the front and back slow ops is not predictable; each time different OSDs are affected.
Cluster:
3x Dell R740xd
Each server:
2x Xeon Gold 6130
384 GB RAM
1x Intel X550 10 GbE NIC for WAN - VMs
1x Intel X550 10 GbE NIC for Corosync
2x bonded Intel X550 10 GbE for migration - switchless
2x Mellanox ConnectX-6 100 GbE (Ethernet mode) as mesh network - switchless - for Ceph (public/cluster), with the default drivers provided by Proxmox
8x Intel P4500 NVMe drives
So I thought it might be a problem with the Mellanox cards and switched to the Intel NICs - same issue. While the node is offline, the Ceph network is not reachable for a long time and ceph -s times out. The nodes that are still online are pingable (10.10.10.x). The disks show no S.M.A.R.T. errors, and the network ping stays around 0.030 ms, even when one node is down.
Code:
Slow OSD heartbeats on back (longest 22272.685ms)
Slow OSD heartbeats on back from osd.8 [] to osd.2 [] 22272.685 msec
Slow OSD heartbeats on back from osd.8 [] to osd.5 [] 22271.643 msec
Slow OSD heartbeats on back from osd.13 [] to osd.2 [] 21997.950 msec
Slow OSD heartbeats on back from osd.13 [] to osd.5 [] 21997.931 msec
Slow OSD heartbeats on back from osd.11 [] to osd.2 [] 21806.339 msec
Slow OSD heartbeats on back from osd.5 [] to osd.9 [] 21188.398 msec
Slow OSD heartbeats on back from osd.5 [] to osd.14 [] 21188.013 msec possibly improving
Slow OSD heartbeats on back from osd.5 [] to osd.10 [] 21184.563 msec possibly improving
Slow OSD heartbeats on back from osd.5 [] to osd.8 [] 21184.539 msec
Slow OSD heartbeats on back from osd.5 [] to osd.11 [] 21184.367 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
Code:
Slow OSD heartbeats on front (longest 22272.255ms)
Slow OSD heartbeats on front from osd.8 [] to osd.2 [] 22272.255 msec
Slow OSD heartbeats on front from osd.8 [] to osd.5 [] 22272.178 msec
Slow OSD heartbeats on front from osd.13 [] to osd.2 [] 21998.689 msec possibly improving
Slow OSD heartbeats on front from osd.13 [] to osd.5 [] 21998.052 msec possibly improving
Slow OSD heartbeats on front from osd.11 [] to osd.2 [] 21806.150 msec
Slow OSD heartbeats on front from osd.5 [] to osd.13 [] 21188.659 msec possibly improving
Slow OSD heartbeats on front from osd.5 [] to osd.11 [] 21188.538 msec
Slow OSD heartbeats on front from osd.5 [] to osd.12 [] 21188.376 msec
Slow OSD heartbeats on front from osd.5 [] to osd.8 [] 21187.845 msec possibly improving
Slow OSD heartbeats on front from osd.5 [] to osd.9 [] 21184.700 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
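The full heartbeat list can be pulled from the active manager as the truncated message suggests - roughly like this (mgr.pve-01 is just an assumed name here, use whichever mgr is currently active):
Code:
# Dump the complete OSD heartbeat/network statistics behind the truncated warning above
# (run on the node hosting the active mgr; mgr.pve-01 is an assumed name)
ceph daemon mgr.pve-01 dump_osd_network

# Current cluster status and slow-ops details
ceph -s
ceph health detail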
pveversion:
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14 (running version: 7.0-14/a9dbe7e3)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve1
Interfaces for the Mellanox mesh:
Code:
auto enp59s0f0np0
iface enp59s0f0np0 inet manual
mtu 9000
auto enp59s0f1np1
iface enp59s0f1np1 inet manual
mtu 9000
auto bond1
iface bond1 inet static
address 10.10.10.1/24
bond-slaves enp59s0f0np0 enp59s0f1np1
bond-miimon 100
bond-mode broadcast
mtu 9000
#San
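This is roughly how I check the bond state and that jumbo frames actually pass over the mesh (10.10.10.2 is simply the peer node's mesh address in this subnet):
Code:
# Show bond mode, slave state and link of both ConnectX-6 ports
cat /proc/net/bonding/bond1

# Verify MTU 9000 end-to-end: 8972 bytes ICMP payload + 28 bytes headers, don't fragment
ping -M do -s 8972 -c 4 10.10.10.2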
Ceph configuration:
Code:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.0/24
fsid = d7fb8413-521b-43eb-9deb-c24fd2f8fec4
mon_allow_pool_delete = true
mon_host = 10.10.10.1 10.10.10.2 10.10.10.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.0/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
rbd_cache_size = 134217728
[mon.pve-01]
public_addr = 10.10.10.1
[mon.pve-02]
public_addr = 10.10.10.2
[mon.pve-03]
public_addr = 10.10.10.3
Configuration database:
Code:
mon auth_allow_insecure_global_id_reclaim false
mgr mgr/pg_autoscaler/autoscale_profile scale-down
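Both entries were set through the config database, roughly like this (ceph config dump shows the listing above):
Code:
# Set the two options shown above in the cluster configuration database
ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph config set mgr mgr/pg_autoscaler/autoscale_profile scale-down

# Verify
ceph config dump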
Crush Map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class nvme
device 13 osd.13 class nvme
device 14 osd.14 class nvme
device 15 osd.15 class nvme
device 16 osd.16 class nvme
device 17 osd.17 class nvme
device 18 osd.18 class nvme
device 19 osd.19 class nvme
device 20 osd.20 class nvme
device 21 osd.21 class nvme
device 22 osd.22 class nvme
device 23 osd.23 class nvme
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host pve-01 {
id -3 # do not change unnecessarily
id -2 class nvme # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.7 weight 0.910
item osd.6 weight 0.910
item osd.5 weight 0.910
item osd.4 weight 0.910
item osd.3 weight 0.910
item osd.2 weight 0.910
item osd.0 weight 0.910
item osd.1 weight 0.910
}
host pve-02 {
id -5 # do not change unnecessarily
id -4 class nvme # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.15 weight 0.910
item osd.14 weight 0.910
item osd.13 weight 0.910
item osd.12 weight 0.910
item osd.11 weight 0.910
item osd.10 weight 0.910
item osd.9 weight 0.910
item osd.8 weight 0.910
}
host pve-03 {
id -7 # do not change unnecessarily
id -6 class nvme # do not change unnecessarily
# weight 7.205
alg straw2
hash 0 # rjenkins1
item osd.23 weight 0.873
item osd.22 weight 0.910
item osd.21 weight 0.910
item osd.20 weight 0.910
item osd.19 weight 0.873
item osd.18 weight 0.910
item osd.17 weight 0.910
item osd.16 weight 0.910
}
root default {
id -1 # do not change unnecessarily
id -8 class nvme # do not change unnecessarily
# weight 21.760
alg straw2
hash 0 # rjenkins1
item pve-01 weight 7.278
item pve-02 weight 7.278
item pve-03 weight 7.205
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
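For completeness, the map above was dumped and decompiled roughly like this (the file names are just examples):
Code:
# Export the binary CRUSH map and decompile it to the text form shown above
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt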
After the node is back, the slow ops disappear after a few minutes and everything works fine again.
Any ideas? On Proxmox 6.4-13 with Ceph 15.2.x I did not have such issues.