One cluster node inop after update

Luftikus
Mar 19, 2024
Hello Community

I am having trouble getting my 3-node cluster up and running again after an update.

I was away for 5 weeks and shut down most of the IT equipment in my home lab, including the Proxmox cluster. After I returned, I started everything up and noticed that the Proxmox cluster was having issues: two nodes seemed operational and could be selected in the GUI, while the third node just showed as disconnected. I shut the cluster down again and started only the node that had the problem. Everything seemed normal, I could SSH into the node and it showed up normally in the GUI. As soon as I started up the other two nodes, the first one went offline again.

After some checking I realised that the system clock was off by about 4 minutes on the problem node. That resolved itself automatically once the node synced its system time to the other nodes via NTP. However, the problem with that one node persisted. As my next step I decided to update all VMs, LXCs, nodes and everything else. This is automated with Ansible and has worked flawlessly for over a year.
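
Just for reference, the time sync can be verified with something like the following (assuming chrony, which I believe is the default NTP client on current Proxmox VE):

Code:
# check whether the clock is synchronised and the NTP service is active
timedatectl

# with chrony, list the configured time sources and their current offsets
chronyc sources -v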

After the update the problem still existed, but it seemed slightly different now. I noticed some errors in the syslog about SSH keys. After googling the problem, I followed some recommendations, which basically came down to running pvecm updatecerts -F and restarting the cluster. But that didn't help either. I cannot SSH into the problem node when the other two nodes are up and running; when only the problem node is online, I can SSH into it. It is similar with the GUI: when the whole cluster is running, the GUI intermittently shows the problem node as disconnected.
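
For completeness, the recommended steps boiled down to something like this on the problem node (just a sketch of what I took from the articles, not an official procedure):

Code:
# regenerate/redistribute the node certificates and the shared SSH known_hosts entries
pvecm updatecerts -F

# restart the cluster filesystem and corosync on this node
systemctl restart pve-cluster corosync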

I have read and followed so many online articles, but nothing seems to help. I hope to get some guidance from the community that would help me reinstate the cluster. I don't really want to reinstall the problem node, but I am prepared to do so if there is no other way to fix this issue. Below is an excerpt from the syslog of the problem node. It shows that the node connects to the cluster and disconnects again after a moment.

Code:
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 12:59:49 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [status] notice: node lost quorum
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] crit: received write while not quorate - trigger resync
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] crit: leaving CPG group
Mar 19 12:59:49 pve-1 pve-ha-lrm[1562]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve-1/lrm_status.tmp.1562' - Permission denied
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] notice: start cluster connection
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] crit: cpg_join failed: 14
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] crit: can't initialize service
Mar 19 12:59:54 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 12:59:56 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 12:59:56 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 12:59:59 pve-1 pvestatd[1255]: got timeout
Mar 19 13:00:04 pve-1 pvestatd[1255]: got timeout
Mar 19 13:00:04 pve-1 pvestatd[1255]: status update time (18.124 seconds)
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] rx: host: 2 link: 0 is up
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] rx: host: 3 link: 0 is up
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:00:08 pve-1 corosync[1186]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 19 13:00:08 pve-1 corosync[1186]:   [QUORUM] Sync members[3]: 1 2 3
Mar 19 13:00:08 pve-1 corosync[1186]:   [QUORUM] Sync joined[2]: 2 3
Mar 19 13:00:08 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62b0) was formed. Members joined: 2 3
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: starting data syncronisation
Mar 19 13:00:08 pve-1 corosync[1186]:   [QUORUM] This node is within the primary component and will provide service.
Mar 19 13:00:08 pve-1 corosync[1186]:   [QUORUM] Members[3]: 1 2 3
Mar 19 13:00:08 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: cpg_send_message retried 1 times
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: node has quorum
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: starting data syncronisation
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: received sync request (epoch 1/1067/00001242)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received sync request (epoch 1/1067/00000928)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: received all states
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: leader is 2/897
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: synced members: 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: waiting for updates from leader
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received all states
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: all data is up to date
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received log
Mar 19 13:00:09 pve-1 kernel: libceph: mon1 (1)192.168.1.21:6789 session lost, hunting for new mon
Mar 19 13:00:09 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 session established
Mar 19 13:00:12 pve-1 kernel: libceph: osd2 (1)192.168.1.22:6801 socket closed (con state OPEN)
Mar 19 13:00:12 pve-1 kernel: rbd: rbd1: encountered watch error: -107
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] link: host: 3 link: 0 is down
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] link: host: 2 link: 0 is down
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] host: host: 3 has no active links
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:00:44 pve-1 corosync[1186]:   [KNET  ] host: host: 2 has no active links
Mar 19 13:00:45 pve-1 corosync[1186]:   [TOTEM ] Token has not been received in 2737 ms
Mar 19 13:00:46 pve-1 corosync[1186]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Mar 19 13:00:49 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 13:00:50 pve-1 corosync[1186]:   [QUORUM] Sync members[1]: 1
Mar 19 13:00:50 pve-1 corosync[1186]:   [QUORUM] Sync left[2]: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62b4) was formed. Members left: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]:   [TOTEM ] Failed to receive the leave message. failed: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 19 13:00:50 pve-1 corosync[1186]:   [QUORUM] Members[1]: 1
Mar 19 13:00:50 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [status] notice: node lost quorum
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] crit: received write while not quorate - trigger resync
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] crit: leaving CPG group
Mar 19 13:00:50 pve-1 pve-ha-lrm[1562]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve-1/lrm_status.tmp.1562' - Permission denied
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] notice: start cluster connection
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] crit: cpg_join failed: 14
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] crit: can't initialize service
Mar 19 13:00:52 pve-1 pvestatd[1255]: status update time (8.144 seconds)
Mar 19 13:00:57 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 13:00:57 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:00:59 pve-1 pvestatd[1255]: got timeout
Mar 19 13:01:04 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 13:01:04 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 socket closed (con state OPEN)
Mar 19 13:01:04 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 session lost, hunting for new mon
Mar 19 13:01:04 pve-1 kernel: libceph: mon1 (1)192.168.1.21:6789 session established
Mar 19 13:01:05 pve-1 pvestatd[1255]: status update time (11.151 seconds)
Mar 19 13:01:07 pve-1 corosync[1186]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Mar 19 13:01:07 pve-1 corosync[1186]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:01:07 pve-1 corosync[1186]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 19 13:01:11 pve-1 pvescheduler[486298]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 19 13:01:11 pve-1 pvescheduler[486297]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 19 13:01:11 pve-1 corosync[1186]:   [QUORUM] Sync members[1]: 1
Mar 19 13:01:11 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62b8) was formed. Members
Mar 19 13:01:11 pve-1 corosync[1186]:   [QUORUM] Members[1]: 1
Mar 19 13:01:11 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:01:13 pve-1 pvestatd[1255]: got timeout
Mar 19 13:01:13 pve-1 pvestatd[1255]: status update time (8.460 seconds)
Mar 19 13:01:15 pve-1 corosync[1186]:   [QUORUM] Sync members[1]: 1
Mar 19 13:01:15 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62bc) was formed. Members
Mar 19 13:01:15 pve-1 corosync[1186]:   [QUORUM] Members[1]: 1
Mar 19 13:01:15 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:01:20 pve-1 corosync[1186]:   [QUORUM] Sync members[1]: 1
Mar 19 13:01:20 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62c0) was formed. Members
Mar 19 13:01:20 pve-1 corosync[1186]:   [QUORUM] Members[1]: 1
Mar 19 13:01:20 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:01:22 pve-1 corosync[1186]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Mar 19 13:01:22 pve-1 corosync[1186]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:01:22 pve-1 corosync[1186]:   [QUORUM] Sync members[3]: 1 2 3
Mar 19 13:01:22 pve-1 corosync[1186]:   [QUORUM] Sync joined[2]: 2 3
Mar 19 13:01:22 pve-1 corosync[1186]:   [TOTEM ] A new membership (1.62c4) was formed. Members joined: 2 3
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: starting data syncronisation
Mar 19 13:01:22 pve-1 corosync[1186]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 19 13:01:22 pve-1 corosync[1186]:   [QUORUM] This node is within the primary component and will provide service.
Mar 19 13:01:22 pve-1 corosync[1186]:   [QUORUM] Members[3]: 1 2 3
Mar 19 13:01:22 pve-1 corosync[1186]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: cpg_send_message retried 1 times
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: node has quorum
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: starting data syncronisation
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: received sync request (epoch 1/1067/00001246)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: received sync request (epoch 1/1067/0000092A)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: received all states
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: leader is 2/897
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: synced members: 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: waiting for updates from leader
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: received all states
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: all data is up to date
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: dfsm_deliver_queue: queue length 29
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 2

The status of the problem node, queried while it is online, doesn't show any issues as far as I can tell:

Code:
root@pve-1:~# pvecm status
Cluster information
-------------------
Name:             Cluster
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 19 13:25:22 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.63b0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.20 (local)
0x00000002          1 192.168.1.21
0x00000003          1 192.168.1.22

This is the status of node 3 when node 1 is offline:

Code:
root@pve-3:~# pvecm status
Cluster information
-------------------
Name:             Cluster
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 19 13:29:52 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          2.63d8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.1.21
0x00000003          1 192.168.1.22 (local)

I appreciate your help, thank you.
 
Is Ceph using the same NIC/link? It's possible that the cluster cold start overwhelms it. But the log looks okay - unless the link continues to flap afterwards?
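
To check that, something like the following on the affected node should show whether the knet link keeps going down (adjust to your setup):

Code:
# current knet link/ring status as corosync sees it
corosync-cfgtool -s

# follow the corosync log and watch for repeated "link ... is down" messages
journalctl -fu corosync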
 
Thanks Fabian, appreciate your help.

You could be right. I haven't looked much into Ceph yet, because I focussed on permission issues, as I often see the "Connection error 595: Connection refused" message. And yes, there is only one NIC (1G) in use in each node. I know, not ideal, but it has worked OK so far.

I have now stopped all VMs and LXCs and have only the nodes running. None of the status information in the GUI shows a high load, neither CPU, memory nor network. When I check the Ceph status, it shows HEALTH_WARN in yellow when one node is down, which I assume is normal. When the problem node stays up long enough (more than about a minute), the health status goes to green and all seems normal. But then, after a random time (usually less than a minute), the problem node goes down again and the cycle repeats. I don't understand what is wrong and don't know what to do to fix it.
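
In case it helps, the cycle can be watched with something like the following (just a way to correlate the Ceph health changes with the cluster membership, adjust as needed):

Code:
# on one of the good nodes: refresh the quorum/membership view every 2 seconds
watch -n 2 pvecm status

# in a second shell: stream Ceph cluster events (health changes, OSD/mon up/down)
ceph -w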

Here is a bit more information from the Ceph log, plus some Ceph status output:

Code:
1710901175.2776194 mgr.pve-3 (mgr.87114113) 80323 : cluster 0 pgmap v80116: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901177.2798426 mgr.pve-3 (mgr.87114113) 80324 : cluster 0 pgmap v80117: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901179.2820923 mgr.pve-3 (mgr.87114113) 80325 : cluster 0 pgmap v80118: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901181.2842724 mgr.pve-3 (mgr.87114113) 80326 : cluster 0 pgmap v80119: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901183.2865095 mgr.pve-3 (mgr.87114113) 80327 : cluster 0 pgmap v80120: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901185.2879767 mgr.pve-3 (mgr.87114113) 80328 : cluster 0 pgmap v80121: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901186.5745275 mon.pve-2 (mon.1) 402118 : cluster 0 mgrmap e2012: pve-3(active, since 44h), standbys: pve-2
1710901187.2903886 mgr.pve-3 (mgr.87114113) 80329 : cluster 0 pgmap v80122: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901189.2926931 mgr.pve-3 (mgr.87114113) 80330 : cluster 0 pgmap v80123: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901191.295514 mgr.pve-3 (mgr.87114113) 80331 : cluster 0 pgmap v80124: 193 pgs: 193 active+undersized+degraded; 74 GiB data, 208 GiB used, 1.2 TiB / 1.4 TiB avail; 19932/59796 objects degraded (33.333%)
1710901194.6782255 mon.pve-2 (mon.1) 402120 : cluster 1 mon.pve-2 calling monitor election
1710901194.7023394 mon.pve-1 (mon.0) 535749 : cluster 1 mon.pve-1 calling monitor election
1710901194.7348 mon.pve-1 (mon.0) 535750 : cluster 1 mon.pve-1 calling monitor election
1710901195.4186401 mon.pve-1 (mon.0) 535751 : cluster 1 mon.pve-1 is new leader, mons pve-1,pve-2,pve-3 in quorum (ranks 0,1,2)
1710901195.4532053 mon.pve-1 (mon.0) 535753 : cluster 0 monmap e3: 3 mons at {pve-1=[v2:192.168.1.20:3300/0,v1:192.168.1.20:6789/0],pve-2=[v2:192.168.1.21:3300/0,v1:192.168.1.21:6789/0],pve-3=[v2:192.168.1.22:3300/0,v1:192.168.1.22:6789/0]} removed_ranks: {}
1710901195.4539037 mon.pve-1 (mon.0) 535754 : cluster 0 fsmap
1710901195.453913 mon.pve-1 (mon.0) 535755 : cluster 0 osdmap e5034: 3 total, 2 up, 3 in
1710901195.454096 mon.pve-1 (mon.0) 535756 : cluster 0 mgrmap e2012: pve-3(active, since 44h), standbys: pve-2
1710901195.454191 mon.pve-1 (mon.0) 535757 : cluster 1 Health check cleared: MON_DOWN (was: 1/3 mons down, quorum pve-2,pve-3)
1710901195.4615695 mon.pve-1 (mon.0) 535758 : cluster 3 Health detail: HEALTH_WARN 1 osds down; 1 host (1 osds) down; Degraded data redundancy: 19932/59796 objects degraded (33.333%), 193 pgs degraded, 193 pgs undersized
1710901195.46158 mon.pve-1 (mon.0) 535759 : cluster 3 [WRN] OSD_DOWN: 1 osds down
1710901195.4615843 mon.pve-1 (mon.0) 535760 : cluster 3     osd.0 (root=default,host=pve-1) is down
1710901195.4615874 mon.pve-1 (mon.0) 535761 : cluster 3 [WRN] OSD_HOST_DOWN: 1 host (1 osds) down
1710901195.4615903 mon.pve-1 (mon.0) 535762 : cluster 3     host pve-1 (root=default) (1 osds) is down
1710901195.4615932 mon.pve-1 (mon.0) 535763 : cluster 3 [WRN] PG_DEGRADED: Degraded data redundancy: 19932/59796 objects degraded (33.333%), 193 pgs degraded, 193 pgs undersized
1710901195.4615965 mon.pve-1 (mon.0) 535764 : cluster 3     pg 4.27 is stuck undersized for 10m, current state active+undersized+degraded, last acting [1,2]
1710901195.461599 mon.pve-1 (mon.0) 535765 : cluster 3     pg 4.28 is stuck undersized for 10m, current state active+undersized+degraded, last acting [2,1]
1710901195.4616024 mon.pve-1 (mon.0) 535766 : cluster 3     pg 4.29 is stuck undersized for 10m, current state active+undersized+degraded, last acting [2,1]
1710901195.461607 mon.pve-1 (mon.0) 535767 : cluster 3     pg 4.2a is stuck undersized for 10m, current state active+undersized+degraded, last acting [1,2]
1710901195.4616098 mon.pve-1 (mon.0) 535768 : cluster 3     pg 4.2b is stuck undersized for 10m, current state active+undersized+degraded, last acting [2,1]

Ceph status with node 1 down:
Code:
root@pve-2:~# ceph -s
  cluster:
    id:     b397cfa1-6c5b-4d56-b574-3c49f7ffa12c
    health: HEALTH_WARN
            Degraded data redundancy: 19932/59796 objects degraded (33.333%), 193 pgs degraded, 193 pgs undersized
            10 pgs not deep-scrubbed in time
            10 pgs not scrubbed in time
            1 subtrees have overcommitted pool target_size_bytes
 
  services:
    mon: 3 daemons, quorum pve-1,pve-2,pve-3 (age 14s)
    mgr: pve-3(active, since 44h), standbys: pve-2, pve-1
    osd: 3 osds: 2 up (since 17m), 2 in (since 3m)
 
  data:
    pools:   4 pools, 193 pgs
    objects: 19.93k objects, 74 GiB
    usage:   139 GiB used, 793 GiB / 932 GiB avail
    pgs:     19932/59796 objects degraded (33.333%)
             193 active+undersized+degraded

OSD status when node 1 is up:
Code:
root@pve-2:~# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME    
 -1         1.36467         -  1.4 TiB  208 GiB  207 GiB  1.5 MiB  1.7 GiB  1.2 TiB  14.91  1.00    -          root default
 -3         0.45509         -  466 GiB   69 GiB   69 GiB  506 KiB  577 MiB  397 GiB  14.90  1.00    -              host pve-1
  0   nvme  0.45509   1.00000  466 GiB   69 GiB   69 GiB  506 KiB  577 MiB  397 GiB  14.90  1.00  193      up          osd.0
 -7         0.45479         -  466 GiB   69 GiB   69 GiB  498 KiB  583 MiB  396 GiB  14.91  1.00    -              host pve-2
  1    ssd  0.45479   1.00000  466 GiB   69 GiB   69 GiB  498 KiB  583 MiB  396 GiB  14.91  1.00  193      up          osd.1
-10         0.45479         -  466 GiB   69 GiB   69 GiB  497 KiB  583 MiB  396 GiB  14.91  1.00    -              host pve-3
  2    ssd  0.45479   1.00000  466 GiB   69 GiB   69 GiB  497 KiB  583 MiB  396 GiB  14.91  1.00  193      up          osd.2
                        TOTAL  1.4 TiB  208 GiB  207 GiB  1.5 MiB  1.7 GiB  1.2 TiB  14.91                                  
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.01

Balancer status:
Code:
root@pve-2:~# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.001964",
    "last_optimize_started": "Wed Mar 20 12:04:51 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

Autoscale status:
Code:
root@pve-2:~# ceph osd pool autoscale-status
POOL        SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK  
.mgr      29956k                3.0         1397G  0.0001                                  1.0       1              on         False
win11     39442M       256.0G   3.0         1397G  0.5495                                  1.0      64              on         False
opnsense  16623M       65536M   3.0         1397G  0.1374                                  1.0      64              on         False
servers   14417M                3.0         1397G  0.0302                                  1.0      64              on         False

Maybe the PG number and target size are wrong? I haven't really understood yet how to configure that properly. I have had the current settings in use for about a year or so with no issues whatsoever.
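
As far as I understand, the target size hint is set per pool, so the overcommitted warning could probably be addressed with something like the following (pool name taken from my output above; please correct me if this is the wrong approach):

Code:
# lower or remove the target size hint for a pool (0 removes the hint,
# the autoscaler then works from actual usage)
ceph osd pool set win11 target_size_bytes 0

# verify the effect
ceph osd pool autoscale-status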

Appreciate any further help.
 
Irrespective of any Ceph optimizations, if you share a single link for everything, then cluster outages are to be expected. Corosync really requires low latency on the link(s) it uses, and other traffic can quickly push it over its limits.
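
Just as a sketch of what that looks like in practice: if a second NIC is ever added, each node entry in /etc/pve/corosync.conf gets an additional ring address on the dedicated network, and config_version in the totem section has to be incremented on every edit (the 10.10.10.x network below is purely hypothetical):

Code:
node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.20
    # hypothetical second, dedicated network just for corosync
    ring1_addr: 10.10.10.20
}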
 
I understand the issue with the shared network link. However, this 3-node cluster was in operation for over a year without any issues. There are no high loads, the cluster is mainly used for testing, and the shared network link was never a problem. Therefore, I am convinced that something else is causing the problem in this case.
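
To double-check that, I suppose the latency on the shared link could be measured while the cluster is running, for example with something like this (the omping variant is roughly what the Proxmox docs suggest for testing the cluster network; the omping package has to be installed first):

Code:
# simple latency check from the problem node to a peer
ping -c 100 -i 0.2 192.168.1.21

# longer test across all three nodes (run the same command on each node)
omping -c 600 -i 1 -q pve-1 pve-2 pve-3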

I still have the same issue as described above. Is there anybody else out there with the same or similar problems? If so, any help would be much appreciated.
 
