[SOLVED] OSD down and in on one of 3 hosts

sarman

New Member
Jul 30, 2023
So I have been fighting this issue for a while, but cannot seem to figure out what is happening. My setup is this:
Proxmox VE 7.4-16
I have 2 datacenters, and each datacenter has 3 hosts. My main VLAN (Proxmox) is separate from my Ceph VLAN. In each datacenter there is one host whose OSDs refuse to reach the up and in state. The issue tends to be isolated to a single host and affects all 4 of its OSDs, but which host it is can change.
I will post details from a single host; I am sure that whatever the issue is, if I can resolve it in one datacenter I can follow the same steps to resolve the other:
(screenshot of the OSD status attached: 1690681112143.png)
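For anyone following along without the screenshot: the affected OSDs show up as "down" with REWEIGHT still 1.00000 in `ceph osd tree`. A quick sketch of how I'm counting them from a mon node (the `count_down` helper name is mine, and the RUN_LIVE guard just keeps it from touching a live cluster by accident):

```shell
#!/bin/sh
# Count OSDs that `ceph osd tree` reports as down.
# Usage on a mon node:  ceph osd tree | count_down

count_down() {
    # grep -c prints the match count; it exits 1 when the count is 0,
    # so swallow that exit status to keep the function's exit clean.
    grep -c ' down ' || true
}

if [ "${RUN_LIVE:-0}" = 1 ]; then
    ceph osd tree | count_down
fi
```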
pve version:
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
ceph.conf
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = Y.Y.10.Y/27
     fsid = [redacted]
     mon_allow_pool_delete = true
     mon_host = X.X.X.66 X.X.X.67 X.X.X.68
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = X.X.X.66/27

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.vmhost1]
     host = vmhost1
     mds_standby_for_name = pve

[mds.vmhost2]
     host = vmhost2
     mds_standby_for_name = pve

[mds.vmhost3]
     host = vmhost3
     mds_standby_for_name = pve

[mon.vmhost1]
     public_addr = X.X.X.66

[mon.vmhost2]
     public_addr = X.X.X.67

[mon.vmhost3]
     public_addr = X.X.X.68

crush map
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host vmhost4 {
    id -3        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 3.49316
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.87329
    item osd.1 weight 0.87329
    item osd.2 weight 0.87329
    item osd.3 weight 0.87329
}
host vmhost5 {
    id -5        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    id -9 class hdd        # do not change unnecessarily
    # weight 3.49316
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.87329
    item osd.5 weight 0.87329
    item osd.6 weight 0.87329
    item osd.7 weight 0.87329
}
host vmhost6 {
    id -10        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    id -11 class hdd        # do not change unnecessarily
    # weight 3.63678
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.90919
    item osd.9 weight 0.90919
    item osd.10 weight 0.90919
    item osd.11 weight 0.90919
}
root default {
    id -1        # do not change unnecessarily
    id -7 class ssd        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    # weight 10.62311
    alg straw2
    hash 0    # rjenkins1
    item vmhost4 weight 3.49316
    item vmhost5 weight 3.49316
    item vmhost6 weight 3.63678
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

My network configuration
Code:
auto lo
iface lo inet loopback
####################
# 1G interfaces

auto eno1
iface eno1 inet manual
mtu 9000

auto eno2
iface eno2 inet manual
mtu 9000

auto enp175s0f0
iface enp175s0f0 inet manual
mtu 9000

auto enp175s0f1
iface enp175s0f1 inet manual
mtu 9000

auto enp59s0f0
iface enp59s0f0 inet manual
    mtu 9000

auto enp59s0f1
iface enp59s0f1 inet manual
mtu 9000

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2 enp175s0f0 enp175s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
    mtu 9000
#1GB Data Network

auto bond1
iface bond1 inet manual
    bond-slaves enp59s0f0 enp59s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
    mtu 9000
#10G CEPH network

auto bond1.102
iface bond1.102 inet manual
    mtu 9000
#CEPH VLAN 102

auto bond0.101
iface bond0.101 inet manual
    mtu 9000
#KVM VLAN 101

auto vmbr1
iface vmbr1 inet static
    address X.X.X.66/27
    gateway X.X.X.65
    bridge-ports bond0.101
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    mtu 9000
    bridge-vds 101
#KVM VLAN

auto vmbr2
iface vmbr2 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100 103 104 105 106 107 108 109 110 111 112 113 114 115
    mtu 9000
#1G DATA Network
auto cephbr0
iface cephbr0 inet static
    address Y.Y.10.1/27
    bridge_ports bond1.102
    bridge_stp off
    bridge_vids 102
    bridge_vlan_aware 1
    mtu 9000

auto ep59s0f0
iface ep59s0f0 inet manual
mtu 9000

post-up ip route add default via X.X.X.65 dev bond0.101
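Since both bonds run LACP, here is how I'm sanity-checking that the LAG actually formed (a sketch: the `same_aggregator` helper and RUN_LIVE guard are my own, and bond1 is the Ceph bond here). Every slave in `/proc/net/bonding/bond1` should report the same Aggregator ID; differing IDs mean the switch-side LAG is misconfigured and the bond is split:

```shell
#!/bin/sh
# Check that all slaves of an 802.3ad bond landed in one LACP aggregator.

same_aggregator() {
    # stdin: contents of /proc/net/bonding/bondX.
    # Succeeds when at most one distinct "Aggregator ID" value appears.
    [ "$(awk -F': ' '/Aggregator ID/ {print $2}' | sort -u | wc -l)" -le 1 ]
}

if [ "${RUN_LIVE:-0}" = 1 ]; then
    if same_aggregator < /proc/net/bonding/bond1; then
        echo "bond1: all slaves in one aggregator"
    else
        echo "bond1: SPLIT aggregators - check the switch-side LAG"
    fi
    ip -d link show bond1 | grep -o 'mtu [0-9]*'   # should say mtu 9000
fi
```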

Any assistance here would be greatly appreciated!
 
I hate being that guy but does nobody have any suggestions on how to address this CEPH issue? If there is more data you need, please ask and I am happy to provide what I have.
 
I don't know exactly where to look (maybe /var/log/ceph), but OSDs being down is not by itself an error. Any messages when you start them?
 
Check the following on the affected host, in no particular order:

1. Verify that the Ceph public and cluster networks of the other two nodes are reachable (ping).
2. Check the service status of your OSDs:
systemctl status ceph-osd@*
2a. If any of the daemons are stopped, restart them.
3. Check the running logs of your OSDs (one should suffice to let you know what's going on):
tail -f /var/log/ceph/ceph-osd.11.log
If you see activity, wait. If you don't, restart the daemon and look again, since it has probably stopped.

If the daemons are running and the logs show activity, you may just be running out of resources on the node. Check the system load with top.
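The steps above can be sketched as one script to run on the affected node. The peer cluster-network IPs and the OSD id 11 are placeholders (substitute your own), and the RUN_LIVE guard keeps it from doing anything until you opt in:

```shell
#!/bin/sh
# Checks from the steps above, run on the node whose OSDs are down|in.

osd_unit() {                          # systemd unit name for an OSD id
    printf 'ceph-osd@%s' "$1"
}

if [ "${RUN_LIVE:-0}" = 1 ]; then
    for peer in Y.Y.10.2 Y.Y.10.3; do                 # step 1: ping peers
        ping -c 2 -W 2 "$peer" >/dev/null || echo "unreachable: $peer"
    done
    systemctl --no-pager status 'ceph-osd@*'          # step 2: daemon state
    systemctl restart "$(osd_unit 11)"                # step 2a: restart one
    tail -n 50 "/var/log/ceph/ceph-osd.11.log"        # step 3: log activity
fi
```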
 
So after some digging and a lot of trial and error, I found my issue. I have 2 10G ports that I set up with LACP in the hope of getting a 20G backend for Ceph. It turns out I had something misconfigured on my switch, which was causing a small broadcast storm. The host that had all 4 OSDs in a "down | in" state could not reach the other hosts over the network. Once I removed the 2nd NIC on each host, my issue was resolved. I will continue to review my LAGs and other network settings to see if I can still bond my ports the way I originally intended.
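For anyone who lands here with the same symptom: since everything runs at MTU 9000, a quick way to prove the Ceph VLAN really carries jumbo frames host-to-host is a no-fragment ping. The helper just subtracts the 20-byte IPv4 and 8-byte ICMP headers from the MTU; the peer address is a placeholder, and the RUN_LIVE guard keeps the ping from firing by accident:

```shell
#!/bin/sh
# Largest ICMP payload that fits a given MTU (20 B IPv4 + 8 B ICMP headers).
icmp_payload() {
    echo $(( $1 - 28 ))
}

if [ "${RUN_LIVE:-0}" = 1 ]; then
    # -M do forbids fragmentation, so this fails loudly anywhere the path
    # does not genuinely support MTU 9000 (peer IP is a placeholder):
    ping -M do -s "$(icmp_payload 9000)" -c 3 Y.Y.10.2
fi
```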
 