Ceph on HPE DL380 Gen10+ not working

fjmo2008

New Member
May 3, 2025
I have a Proxmox 8.4 cluster with two nodes and a qdevice, with Ceph Squid 19.2.1 recently installed, plus an additional device that runs a third monitor to keep Ceph quorum. Each node has one SATA SSD, so I have two OSDs (osd.18 and osd.19) and a pool called poolssd built on both of them. Since Ceph was installed and configured, I have been getting the message below, and it won't let me create any virtual machine on that pool:

HEALTH_WARN: Reduced data availability: 33 pgs inactive, 33 pgs peering
pg 1.0 is stuck peering since forever, current state peering, last acting [19,18]
pg 4.0 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.1 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.2 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.3 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.4 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.5 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.6 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.7 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.8 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.9 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.a is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.b is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.c is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.d is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.e is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.f is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.10 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.11 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.12 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.13 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.14 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.15 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.16 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.17 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.18 is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.19 is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.1a is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.1b is stuck peering for 26h, current state creating+peering, last acting [19,18]
pg 4.1c is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.1d is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.1e is stuck peering for 26h, current state peering, last acting [18,19]
pg 4.1f is stuck peering for 26h, current state creating+peering, last acting [19,18]
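(In case it helps, I understand the usual next step is to query one of the stuck PGs directly and look at its recovery_state section to see what it is waiting on; 4.0 below is just the first PG from the list above.)

Code:
# list only the stuck/inactive PGs instead of the full health detail
ceph pg dump_stuck inactive
# ask a single PG what it is blocked on; check the "recovery_state"
# and any "peering_blocked_by" entries in the JSON output
ceph pg 4.0 query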

I have 3 monitors configured: the two on the Proxmox nodes (mon.pve1 and mon.pve2) and the quorum monitor (mon.ceph-mon3). I also get the following messages:

HEALTH_WARN: 2 daemons have recently crashed
mon.pve1 crashed on host pve1 at 2025-07-03T05:24:48.235164Z
mon.pve1 crashed on host pve1 at 2025-07-03T05:45:50.830345Z

HEALTH_WARN: 14 slow ops, oldest one blocked for 8610 sec, daemons [osd.18,osd.19,mon.pve1] have slow ops.
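(If the crash details are needed, I believe they can be pulled with Ceph's crash module; the id below is a placeholder taken from the crash ls output.)

Code:
# list recent daemon crashes and their ids
ceph crash ls
# show metadata and backtrace for a single crash (id is a placeholder)
ceph crash info <crash-id>
# after reviewing, archive them so the "recently crashed" warning clears
ceph crash archive-all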

I have one dedicated network for the private Ceph (cluster) network and another for the public network, as can be seen in the ceph.conf configuration file below:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.70.0/24
fsid = eb409a91-affd-487a-a02c-4df2e46e0a2e
mon_allow_pool_delete = true
mon_initial_members = pve1-pub pve2-pub ceph-mon3-pub
mon_host = 192.168.60.11 192.168.60.12 192.168.60.130
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 1
osd_pool_default_size = 2
public_network = 192.168.60.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve1]
host = pve1
public_addr = 192.168.60.11
cluster_addr = 192.168.70.11

[mon.pve2]
host = pve2
public_addr = 192.168.60.12
cluster_addr = 192.168.70.12

[mon.ceph-mon3]
host = ceph-mon3
public_addr = 192.168.60.130
cluster_addr = 192.168.70.130
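(For reference, since the OSDs peer with each other over the cluster_network above, these are the kinds of checks I can run from pve1 against pve2 on the dedicated 192.168.70.0/24 network; the large ping size only applies if those NICs use a 9000-byte MTU, otherwise 1472 is the right payload.)

Code:
# plain reachability from pve1 to pve2 on the dedicated cluster network
ping -c 3 192.168.70.12
# same check with fragmentation forbidden and a large payload, to catch
# MTU mismatches (8972 assumes MTU 9000; use 1472 for the default 1500)
ping -c 3 -M do -s 8972 192.168.70.12
# show which public/cluster addresses each OSD actually registered
ceph osd dump | grep '^osd\.'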

Both Proxmox nodes have subscriptions to the Proxmox Enterprise repository, so they are up to date with the stable repository.

I had previously built this same configuration in a test environment using virtual machines as nodes, and everything worked correctly there. I replicated that setup on the HPE physical servers to build the production environment, but I can't get it to work.

Can anyone give me a clue?

Thank you very much.
 
I have a Proxmox 8.4 cluster with two nodes and one qdevice, with Ceph Squid 19.2.1
The minimum number of nodes needed for a stable Ceph cluster is 3!

How did you configure your pools regarding size/min-size?

What is the output of the following commands?
Code:
ceph osd df tree
pveceph status

Please use code blocks for the output to be easily readable. Either by using the </> button of the editor, or placing [code][/code] around it yourself.
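(The current values can also be read directly from Ceph; poolssd is the pool name from your first post.)

Code:
# per-pool replication settings at a glance
ceph osd pool ls detail
# or just the two relevant values for the VM pool
ceph osd pool get poolssd size
ceph osd pool get poolssd min_size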
 
Hi aaron, thanks for your reply

Code:
root@pve1:~# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META    AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         0.43658         -  447 GiB   69 MiB   12 MiB  31 KiB  56 MiB  447 GiB  0.01  1.00    -          root default
-3         0.21829         -  224 GiB   34 MiB  6.0 MiB  18 KiB  28 MiB  224 GiB  0.01  1.00    -              host pve1
18    ssd  0.21829   1.00000  224 GiB   34 MiB  6.0 MiB  18 KiB  28 MiB  224 GiB  0.01  1.00   33      up          osd.18
-5         0.21829         -  224 GiB   34 MiB  6.0 MiB  13 KiB  28 MiB  224 GiB  0.01  1.00    -              host pve2
19    ssd  0.21829   1.00000  224 GiB   34 MiB  6.0 MiB  13 KiB  28 MiB  224 GiB  0.01  1.00   33      up          osd.19
                       TOTAL  447 GiB   69 MiB   12 MiB  33 KiB  56 MiB  447 GiB  0.01
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Code:
root@pve1:~# pveceph status
  cluster:
    id:     eb409a91-affd-487a-a02c-4df2e46e0a2e
    health: HEALTH_WARN
            Reduced data availability: 33 pgs inactive, 33 pgs peering
            2 daemons have recently crashed
            10 slow ops, oldest one blocked for 3988 sec, daemons [osd.18,osd.19,mon.pve1] have slow ops.

  services:
    mon: 3 daemons, quorum pve1,pve2,ceph-mon3 (age 33m)
    mgr: pve1(active, since 66m), standbys: pve2
    osd: 2 osds: 2 up (since 33m), 2 in (since 33m)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   69 MiB used, 447 GiB / 447 GiB avail
    pgs:     100.000% pgs not active
             19 peering
             14 creating+peering

As you can see, there's a warning that 2 daemons have recently crashed.

Thanks
 
As you can see, there's a warning that 2 daemons have recently crashed.
Yeah, the MON on pve1, according to the info in your first post. There should still be two working monitors, so that is not too much of a worry.

What I find interesting is that you do have 33 PGs present, but none are active.

What is the size/min-size of the pool? (We can ignore the .mgr pool for now.) And unless you plan to add a 3rd node very soon, don't bother with Ceph! Go with local ZFS + guest replication instead if you don't want external storage.
 
Hello, the problem is solved: I reconfigured Ceph so that the cluster network uses the same subnet as the public network. It's not the best option, but this way everything works and the cluster is in HEALTH_OK status.

The public network is 10 Gb and the cluster only runs three virtual machines, so I'm going to monitor performance while OSD traffic shares the public network.
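In case it helps someone else, the change boils down to something like this in /etc/pve/ceph.conf (subnets as in my earlier post), followed by restarting the OSDs so they rebind:

Code:
# /etc/pve/ceph.conf -- relevant lines only
cluster_network = 192.168.60.0/24   # now the same subnet as the public network
public_network  = 192.168.60.0/24

# then restart each node's OSD so it picks up the new cluster address
systemctl restart ceph-osd@18.service   # on pve1
systemctl restart ceph-osd@19.service   # on pve2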

Thanks everyone.