I have an unusual problem and I am certain it is rooted in the network.
Hardware:
3 Dell R720s (8 OSDs each), running Proxmox 8.1.3 and Ceph Quincy (waiting for a healthy system before upgrading Ceph).
Network:
domain: amaranthos.local
Proxmox network: 10.0.0.0/19 (vmbr0 on bond0, which bonds the three 1 GbE NICs eth0, eth1 and eth2)
Ceph network: 10.10.10.0/24 (vmbr1 on eth3)
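For reference, the per-node layout looks roughly like this (a simplified /etc/network/interfaces sketch for pve1; the bond mode, gateway and exact addressing shown here are illustrative, not copied from the live config):
Code:
# bond0 aggregates the three 1 GbE NICs (bond mode shown is an assumption)
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1 eth2
        bond-miimon 100
        bond-mode 802.3ad

# Proxmox / corosync network on the bonded bridge
auto vmbr0
iface vmbr0 inet static
        address 10.0.0.200/19
        gateway 10.0.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

# Ceph network on the dedicated NIC
auto vmbr1
iface vmbr1 inet static
        address 10.10.10.2/24
        bridge-ports eth3
        bridge-stp off
        bridge-fd 0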
After several power failures (about 5 in 24 hours) the system ended up in a very bad state.
Networking was not working (Ubiquiti UDM Pro + USW24-Pro); some basic troubleshooting brought DNS back up.
In the process, the main search domain was changed, which invalidated all the certificates.
The cluster was not recognizing the 3 nodes.
I now have it communicating properly, and the certs appear valid.
I have run pvecm updatecerts on all nodes
DNS is on a dedicated machine running bind9 and everything is resolving correctly.
/etc/hosts matches on all nodes
/etc/corosync/corosync.conf, /etc/pve/corosync.conf, /etc/pve/ceph.conf all match on all 3 nodes
the corosync config_version matches the file on all nodes
ssh to pve1, 2 and 3 works from each node to every other node, using certificate validation
ping times are < 0.25 ms
pings to 10.0.0.200, .201 and .203 all work
pings to 10.10.10.2, .3 and .4 all work, from all nodes
I can ssh to pve1, 2 and 3 with ssh root@pveX from all nodes
networking both into and out of the local system is working as expected (a sketch of the checks I ran is below)
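For reference, these are roughly the commands behind the checks above (run from each node; pve2's addresses are used as the example target):
Code:
# refresh cluster certificates (run on every node)
pvecm updatecerts

# cluster membership / quorum as the PVE stack sees it
pvecm status

# name resolution, reachability on both subnets, and SSH
dig +short pve2.amaranthos.local
ping -c 3 10.0.0.201
ping -c 3 10.10.10.3
ssh root@pve2 hostname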
I have a ton of errors and I can't pinpoint the cause; for example:
Code:
Dec 27 08:48:11 pve1 pmxcfs[118187]: [main] notice: exit proxmox configuration filesystem (0)
Dec 27 08:48:11 pve1 pmxcfs[118187]: [confdb] crit: cmap_finalize failed: 9
Dec 27 08:48:11 pve1 pmxcfs[118187]: [confdb] crit: cmap_track_delete version failed: 9
Dec 27 08:48:11 pve1 pmxcfs[118187]: [confdb] crit: cmap_track_delete nodelist failed: 9
Dec 27 08:48:11 pve1 pmxcfs[118187]: [quorum] crit: quorum_finalize failed: 9
Dec 27 08:48:11 pve1 pmxcfs[118187]: [status] crit: cpg_leave failed: 2
Dec 27 08:48:11 pve1 pmxcfs[118187]: [status] crit: cpg_dispatch failed: 2
Dec 27 08:48:11 pve1 pmxcfs[118187]: [dcdb] crit: cpg_leave failed: 2
Dec 27 08:48:11 pve1 pmxcfs[118187]: [dcdb] crit: cpg_dispatch failed: 2
Dec 27 08:48:11 pve1 pmxcfs[118187]: [status] notice: node lost quorum
Dec 27 08:48:11 pve1 pmxcfs[118187]: [quorum] crit: quorum_dispatch failed: 2
The GUI is responsive and shows an active cluster, but nothing for Ceph, just 500 errors (timeouts).
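The 500s are reproducible from the CLI too; this is how I trigger them (assuming /nodes/<node>/ceph/status is the same call the GUI's Ceph panel makes):
Code:
# hangs and eventually times out, same as the GUI
pvesh get /nodes/pve1/ceph/status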
Is this a conflict between corosync.conf, ceph.conf and the backing DBs inside containers?
If so, how do I reach into the DBs and extract the proper info to reset the conf files to the expected IPs?
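If pulling the monitor map straight out of a mon's on-disk store is the right way to cross-check the IPs, this is what I have in mind (standard monmap tooling; I have not run it yet, and the mon has to be stopped first):
Code:
# stop the local monitor so its store can be read safely
systemctl stop ceph-mon@pve1

# dump the monmap the monitor actually has on disk
ceph-mon -i pve1 --extract-monmap /tmp/monmap

# print it and compare the mon addresses against /etc/pve/ceph.conf
monmaptool --print /tmp/monmap

systemctl start ceph-mon@pve1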
My CLI status commands show no really helpful errors; they seem to say everything is running, while the journal shows all the errors.
ceph -s times out after 5 minutes.
Attached is the boot sequence from journalctl -b; I see the errors, but I can't identify a root cause.
Trying to mount CephFS manually results in:
Code:
root@pve1:~# systemctl status ceph-mds@pve1.service
● ceph-mds@pve1.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mds@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Wed 2023-12-27 10:39:22 MST; 3min 50s ago
   Main PID: 27420 (ceph-mds)
      Tasks: 9
     Memory: 11.2M
        CPU: 527ms
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@pve1.service
             └─27420 /usr/bin/ceph-mds -f --cluster ceph --id pve1 --setuser ceph --setgroup ceph

Dec 27 10:39:22 pve1 systemd[1]: Started ceph-mds@pve1.service - Ceph metadata server daemon.

root@pve1:~# mount -v -t ceph 10.10.10.2,10.10.10.3,10.10.10.4:/ /mnt/pve/cephfs -o name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
parsing options: rw,name=admin,secretfile=/etc/pve/priv/ceph/cephfs.secret,conf=/etc/pve/ceph.conf,fs=cephfs
mount.ceph: options "name=admin,mds_namespace=cephfs".
invalid new device string format
mount.ceph: resolved to: "10.10.10.2,10.10.10.3,10.10.10.4"
mount.ceph: trying mount with old device syntax: 10.10.10.2,10.10.10.3,10.10.10.4:/
mount.ceph: options "name=admin,mds_namespace=cephfs,key=admin,fsid=b3445d50-80e3-405e-b3cd-a5b7251876e2" will pass to kernel
mount error: no mds server is up or the cluster is laggy
Code:
root@pve1:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = 10.0.0.200
status:
nodeid: 1: localhost
nodeid: 2: connected
nodeid: 3: connected
root@pve1:~#
Code:
ceph --admin-daemon /var/run/ceph/ceph-mon.pve1.asok mon_status
{
"name": "pve1",
"rank": 2,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"features": {
"required_con": "2449958755906961412",
"required_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging",
"quincy"
],
"quorum_con": "0",
"quorum_mon": []
},
"outside_quorum": [
"pve1"
],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 10,
"fsid": "b3445d50-80e3-405e-b3cd-a5b7251876e2",
"modified": "2023-10-13T18:40:15.247220Z",
"created": "2022-10-13T20:53:32.107284Z",
"min_mon_release": 17,
"min_mon_release_name": "quincy",
"election_strategy": 1,
"disallowed_leaders: ": "",
"stretch_mode": false,
"tiebreaker_mon": "",
"removed_ranks: ": "",
"features": {
"persistent": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging",
"quincy"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "pve2",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.10.10.3:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.10.10.3:6789",
"nonce": 0
}
]
},
"addr": "10.10.10.3:6789/0",
"public_addr": "10.10.10.3:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
},
{
"rank": 1,
"name": "pve3",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.10.10.4:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.10.10.4:6789",
"nonce": 0
}
]
},
"addr": "10.10.10.4:6789/0",
"public_addr": "10.10.10.4:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
},
{
"rank": 2,
"name": "pve1",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "10.10.10.2:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "10.10.10.2:6789",
"nonce": 0
}
]
},
"addr": "10.10.10.2:6789/0",
"public_addr": "10.10.10.2:6789/0",
"priority": 0,
"weight": 0,
"crush_location": "{}"
}
]
},
"feature_map": {
"mon": [
{
"features": "0x3f01cfbf7ffdffff",
"release": "luminous",
"num": 1
}
],
"client": [
{
"features": "0x3f01cfbf7ffdffff",
"release": "luminous",
"num": 11
}
]
},
"stretch_mode": false
}
root@pve1:~#
The monitors never come out of 'probing'.
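In case it matters, this is how I am checking that the mons can at least reach each other on the monitor ports over the Ceph network (run on pve1 against pve2; nc is simply what I have installed):
Code:
# is the local mon listening on the v2/v1 ports?
ss -tlnp | grep ceph-mon

# can it reach a peer mon on those ports?
nc -zv 10.10.10.3 3300
nc -zv 10.10.10.3 6789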