I performed a PVE 6 to 7 upgrade last night following Proxmox's documented procedure. Two of our nodes came up without issue. One of our nodes is giving this error:
1/3 mons down, quorum prox-ceph1,prox-ceph2
mon.prox-ceph3 (rank 2) addr [v2:192.168.235.13:3300/0,v1:192.168.235.13:6789/0] is down (out of quorum)
I have tried restarting the affected node.
I have tried restarting the monitor service with combinations of the following commands:
Code:
systemctl stop ceph-mon@prox-ceph3
systemctl start ceph-mon@prox-ceph3
systemctl restart ceph-mon@prox-ceph3
systemctl status ceph-mon@prox-ceph3 returns this:
Code:
root@prox-ceph3:~# systemctl status ceph-mon@prox-ceph3
● ceph-mon@prox-ceph3.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Fri 2021-10-29 08:28:02 EDT; 1h 13min ago
Process: 182103 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id prox-ceph3 --setuser ceph --setgroup c>
Main PID: 182103 (code=exited, status=1/FAILURE)
CPU: 52ms
Oct 29 08:28:02 prox-ceph3 systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 29 08:28:02 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Start request repeated too quickly.
Oct 29 08:28:02 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Failed with result 'exit-code'.
Oct 29 08:28:02 prox-ceph3 systemd[1]: Failed to start Ceph cluster monitor daemon.
Oct 29 08:32:11 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Start request repeated too quickly.
Oct 29 08:32:11 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Failed with result 'exit-code'.
Oct 29 08:32:11 prox-ceph3 systemd[1]: Failed to start Ceph cluster monitor daemon.
Oct 29 08:33:32 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Start request repeated too quickly.
Oct 29 08:33:32 prox-ceph3 systemd[1]: ceph-mon@prox-ceph3.service: Failed with result 'exit-code'.
Oct 29 08:33:32 prox-ceph3 systemd[1]: Failed to start Ceph cluster monitor daemon.
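The status output above only shows systemd throttling the restarts ("start request repeated too quickly"), not why the mon itself exits. If it helps, I can clear the throttle and pull the daemon's own output from the journal with something like this (assuming the unit name shown above):
Code:
# clear systemd's failed/throttled state for the mon unit
systemctl reset-failed ceph-mon@prox-ceph3
# show the most recent mon daemon log lines from this boot
journalctl -b -u ceph-mon@prox-ceph3 --no-pager | tail -n 50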
Querying mon_status on a known-good node (which includes the monmap) returns this:
Code:
root@prox-ceph1:~# ceph daemon mon.prox-ceph1 mon_status
{
"name": "prox-ceph1",
"rank": 0,
"state": "leader",
"election_epoch": 10120,
"quorum": [
0,
1
],
"quorum_age": 27933,
"features": {
"required_con": "2449958747315978244",
"required_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus"
],
"quorum_con": "4540138292840890367",
"quorum_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 4,
"fsid": "27c7fb73-57f0-4d1d-8801-1db89fc9b7c8",
"modified": "2021-10-22T01:34:27.238492Z",
"created": "2020-10-20T16:18:19.674390Z",
"min_mon_release": 15,
"min_mon_release_name": "octopus",
"features": {
"persistent": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "prox-ceph1",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "192.168.235.11:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "192.168.235.11:6789",
"nonce": 0
}
]
},
"addr": "192.168.235.11:6789/0",
"public_addr": "192.168.235.11:6789/0",
"priority": 0,
"weight": 0
},
{
"rank": 1,
"name": "prox-ceph2",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "192.168.235.12:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "192.168.235.12:6789",
"nonce": 0
}
]
},
"addr": "192.168.235.12:6789/0",
"public_addr": "192.168.235.12:6789/0",
"priority": 0,
"weight": 0
},
{
"rank": 2,
"name": "prox-ceph3",
"public_addrs": {
"addrvec": [
{
"type": "v2",
"addr": "192.168.235.13:3300",
"nonce": 0
},
{
"type": "v1",
"addr": "192.168.235.13:6789",
"nonce": 0
}
]
},
"addr": "192.168.235.13:6789/0",
"public_addr": "192.168.235.13:6789/0",
"priority": 0,
"weight": 0
}
]
},
"feature_map": {
"mon": [
{
"features": "0x3f01cfb8ffedffff",
"release": "luminous",
"num": 1
}
],
"mds": [
{
"features": "0x3f01cfb8ffedffff",
"release": "luminous",
"num": 2
}
],
"osd": [
{
"features": "0x3f01cfb8ffedffff",
"release": "luminous",
"num": 8
}
],
"client": [
{
"features": "0x2f018fb87aa4aafe",
"release": "luminous",
"num": 1
},
{
"features": "0x3f01cfb8ffedffff",
"release": "luminous",
"num": 10
}
],
"mgr": [
{
"features": "0x3f01cfb8ffedffff",
"release": "luminous",
"num": 1
}
]
}
}
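For reference, when the prox-ceph3 daemon does come up briefly after a reboot, I believe the same admin-socket query can be run on that node, which is where the 'probing' state mentioned below shows up (assuming the default admin socket path):
Code:
# run on prox-ceph3 while the mon process is up
ceph daemon mon.prox-ceph3 mon_status | grep '"state"'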
The IPs and ports listed above appear to be correct. I have tried for hours to get the prox-ceph3 monitor daemon to start and stay running. I have gotten it to start after a reboot of the node, but it stays stuck in a 'probing' state. I read that the monmap on prox-ceph3 may be corrupt, so I tried to inject the monmap from our prox-ceph1 node; it gives me an error stating that the monitor data directory doesn't exist, even though it does. I was trying to use one of these commands to do it:
Code:
ceph-mon -i prox-ceph3 --inject-monmap /tmp/monmap
or
ceph-mon -i mon.prox-ceph3 --inject-monmap /tmp/monmap
The known-good monmap was exported from prox-ceph1 into /tmp/monmap. This is the error I get when I run the inject:
Code:
root@prox-ceph1:~# ceph-mon -i prox-ceph3 --inject-monmap /tmp/monmap
2021-10-29T09:49:03.612-0400 7f4f96dcf580 -1 monitor data directory at '/var/lib/ceph/mon/ceph-prox-ceph3' does not exist: have you run 'mkfs'?
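For completeness, this is roughly how the map was exported on prox-ceph1, and how it can be sanity-checked, if I understand monmaptool correctly:
Code:
# export the current monmap from the running quorum on prox-ceph1
ceph mon getmap -o /tmp/monmap
# print the exported map to confirm all three mons are listed
monmaptool --print /tmp/monmap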
Our nodes are up, the OSDs are running, and VMs are running and operational. I do have these warnings in the Ceph health detail:
Code:
root@prox-ceph1:~# ceph health detail
HEALTH_WARN 1/3 mons down, quorum prox-ceph1,prox-ceph2; noout flag(s) set; 2 pools have too many placement groups; 4 slow ops, oldest one blocked for 975 sec, mon.prox-ceph3 has slow ops
[WRN] MON_DOWN: 1/3 mons down, quorum prox-ceph1,prox-ceph2
mon.prox-ceph3 (rank 2) addr [v2:192.168.235.13:3300/0,v1:192.168.235.13:6789/0] is down (out of quorum)
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] POOL_TOO_MANY_PGS: 2 pools have too many placement groups
Pool VMOS has 512 placement groups, should have 64
Pool cephfs_data has 128 placement groups, should have 32
[WRN] SLOW_OPS: 4 slow ops, oldest one blocked for 975 sec, mon.prox-ceph3 has slow ops
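The noout flag is still set from the upgrade procedure; my understanding is it just needs to be cleared once everything is healthy again:
Code:
# clear the noout flag that was set for the upgrade
ceph osd unset noout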
The 'too many PGs' warning started when I upgraded from Nautilus to Octopus as part of the PVE 7 update. I haven't gotten back to resolving it yet, as the mon warning is more concerning right now.
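When I do get back to the PG warning, my assumption is that I can either turn on the autoscaler or set pg_num to the suggested values directly, roughly like this (pool names and targets taken from the health detail above):
Code:
# option 1: let Ceph adjust PG counts automatically
ceph osd pool set VMOS pg_autoscale_mode on
ceph osd pool set cephfs_data pg_autoscale_mode on
# option 2: set the suggested pg_num targets directly
ceph osd pool set VMOS pg_num 64
ceph osd pool set cephfs_data pg_num 32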
Any help with this issue would be appreciated. I realize this is a wall of text, but I wanted to provide everything I've done up to this point. The only thing I haven't tried is rebooting all the nodes in an attempt to get the mon quorum to renegotiate. As it stands now, I can't get the prox-ceph3 mon daemon to start or join the quorum. I've exhausted my knowledge of PVE as I'm new to it; I've tried to do my due diligence, but I'm up against a wall now.
Thank you.