Hi,
It's not really a Proxmox question, but rather about the Ceph support available in Proxmox.
I know it's not an ideal setup, but here is my question:
I have 3 servers:
- server A: Proxmox with Ceph: this server has OSDs
- server B: Proxmox with Ceph: this server has OSDs
- server C: Proxmox with Ceph: this server is only used for cluster quorum (no OSDs on it).
The installation itself is fine.
The replication is 2/1, i.e. osd_pool_default_size = 2 and osd_pool_default_min_size = 1.
The goal is a replication of 2 (one copy per server), but if one OSD server goes down (e.g. B), I want to keep using the data from server A. I applied the settings roughly as shown below.
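For reference, here is roughly how the replication is set, cluster-wide and per pool (the pool name cephfs_data is just an example, not necessarily my real pool name):

# Cluster-wide defaults, applied to newly created pools
ceph config set global osd_pool_default_size 2
ceph config set global osd_pool_default_min_size 1

# The same settings on an existing pool (example pool name)
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_data min_size 1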
Writing data to CephFS works fine.
But if I power down server B, I have to wait 1800 seconds (30 minutes) before my cluster is available again.
In ceph.log, after those 1800 seconds, I see:
2024-07-19T18:44:14.967069+0200 mon.nas1 (mon.0) 119376 : cluster 3 Health check update: 369 slow ops, oldest one blocked for 1803 sec, daemons [osd.21,osd.23,mon.hypernas1] have slow ops. (SLOW_OPS)
2024-07-19T18:44:14.967198+0200 mon.nas1 (mon.0) 119377 : cluster 1 osd.1 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967208+0200 mon.nas1 (mon.0) 119378 : cluster 1 osd.3 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967215+0200 mon.nas1 (mon.0) 119379 : cluster 1 osd.5 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967219+0200 mon.nas1 (mon.0) 119380 : cluster 1 osd.7 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967223+0200 mon.nas1 (mon.0) 119381 : cluster 1 osd.9 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967228+0200 mon.nas1 (mon.0) 119382 : cluster 1 osd.11 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967234+0200 mon.nas1 (mon.0) 119383 : cluster 1 osd.13 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967239+0200 mon.nas1 (mon.0) 119384 : cluster 1 osd.15 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967243+0200 mon.nas1 (mon.0) 119385 : cluster 1 osd.17 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967249+0200 mon.nas1 (mon.0) 119386 : cluster 1 osd.19 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967255+0200 mon.nas1 (mon.0) 119387 : cluster 1 osd.20 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.967261+0200 mon.nas1 (mon.0) 119388 : cluster 1 osd.22 marked down after no beacon for 900.615915 seconds
2024-07-19T18:44:14.969288+0200 mon.nas1 (mon.0) 119389 : cluster 3 Health check failed: 12 osds down (OSD_DOWN)
2024-07-19T18:44:14.969298+0200 mon.nas1 (mon.0) 119390 : cluster 3 Health check failed: 1 host (12 osds) down (OSD_HOST_DOWN)
I have changed mon_osd_report_timeout from 900 seconds to 60 seconds, but it didn't help.
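For clarity, this is roughly how I changed it (via the centralized config; the values are the ones I tried):

# Check the current beacon timeout (default is 900 s)
ceph config get mon mon_osd_report_timeout

# Lower it from 900 s to 60 s
ceph config set mon mon_osd_report_timeout 60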
Do you know how to reduce this 1800-second timeout, so that my cluster becomes available again with only one OSD node?
Thanks very much ...