[SOLVED] Issue with Ceph 5/3 configuration and corruption

d2600hz · Apr 24, 2025

For context I have a nine node proxmox cluster. We've just added five new nodes which all have 4 x 3TB NVMe drives and I've set them up with a separate crush rule and added a device-class of nvme so they can be used separately from the original 4 nodes which are SSD are now on a new replicated rule (so `ceph osd pool autoscale-status` works).

But I'm having some issues. Here is the pool detail:

What I'm expecting is that I 'should' be able to drop two nodes at once and things should keep working. But it.... doesn't.

I've run up a simple VM to run fio on and if I drop more than one host, the disk gets corrupted and nothing works from that point, it's a full reinstall, disk is corrupted and won't boot. I can't get dmesg from the VM anymore (binary error etc... it can't run it). And I'm struggling to work out why this would be the case.

Am I doing something fundamentally wrong? I've spent probably two weeks doing some benchmarking and testing and this feels like a deal breaker to the N+2 redundancy I'm expecting from ceph.

Any pointers at what I could look at to see what's going on - I'm not seeing any reasons why this should happen with the 5/3 configuration. But it is.

fabian · Apr 24, 2025

please provide the crush map, storage config, VM config, ceph config and ceph status when you "drop" the two nodes..

you can dump the dmesg output by using a serial terminal and connecting to that from the host and piping the output to a file (or with ssh, if you prefer that)

d2600hz · Apr 24, 2025

Crush Map:

Code:

218
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class nvme
device 21 osd.21 class nvme
device 22 osd.22 class nvme
device 23 osd.23 class nvme
device 24 osd.24 class nvme
device 25 osd.25 class nvme
device 26 osd.26 class nvme
device 27 osd.27 class nvme
device 28 osd.28 class nvme
device 29 osd.29 class nvme
device 30 osd.30 class nvme
device 31 osd.31 class nvme
device 32 osd.32 class nvme
device 33 osd.33 class nvme
device 34 osd.34 class nvme
device 35 osd.35 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve002-sjc {
        id -3           # do not change unnecessarily
        id -4 class ssd         # do not change unnecessarily
        id -13 class nvme               # do not change unnecessarily
        # weight 6.98639
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.74660
        item osd.1 weight 1.74660
        item osd.2 weight 1.74660
        item osd.3 weight 1.74660
}
host pve003-sjc {
        id -5           # do not change unnecessarily
        id -6 class ssd         # do not change unnecessarily
        id -14 class nvme               # do not change unnecessarily
        # weight 6.98639
        alg straw2
        hash 0  # rjenkins1
        item osd.4 weight 1.74660
        item osd.5 weight 1.74660
        item osd.6 weight 1.74660
        item osd.7 weight 1.74660
}
host pve004-sjc {
        id -7           # do not change unnecessarily
        id -8 class ssd         # do not change unnecessarily
        id -15 class nvme               # do not change unnecessarily
        # weight 6.98639
        alg straw2
        hash 0  # rjenkins1
        item osd.16 weight 1.74660
        item osd.17 weight 1.74660
        item osd.18 weight 1.74660
        item osd.19 weight 1.74660
}
host pve001-sjc {
        id -9           # do not change unnecessarily
        id -10 class ssd                # do not change unnecessarily
        id -16 class nvme               # do not change unnecessarily
        # weight 6.98639
        alg straw2
        hash 0  # rjenkins1
        item osd.13 weight 1.74660
        item osd.14 weight 1.74660
        item osd.15 weight 1.74660
        item osd.12 weight 1.74660
}
host pve011-sjc {
        id -11          # do not change unnecessarily
        id -12 class ssd                # do not change unnecessarily
        id -17 class nvme               # do not change unnecessarily
        # weight 11.64366
        alg straw2
        hash 0  # rjenkins1
        item osd.9 weight 2.91089
        item osd.10 weight 2.91089
        item osd.11 weight 2.91089
        item osd.8 weight 2.91100
}
host pve012-sjc {
        id -19          # do not change unnecessarily
        id -20 class ssd                # do not change unnecessarily
        id -21 class nvme               # do not change unnecessarily
        # weight 11.64355
        alg straw2
        hash 0  # rjenkins1
        item osd.20 weight 2.91089
        item osd.21 weight 2.91089
        item osd.22 weight 2.91089
        item osd.23 weight 2.91089
}
host pve013-sjc {
        id -22          # do not change unnecessarily
        id -23 class ssd                # do not change unnecessarily
        id -24 class nvme               # do not change unnecessarily
        # weight 11.64355
        alg straw2
        hash 0  # rjenkins1
        item osd.24 weight 2.91089
        item osd.26 weight 2.91089
        item osd.25 weight 2.91089
        item osd.27 weight 2.91089
}
host pve014-sjc {
        id -25          # do not change unnecessarily
        id -26 class ssd                # do not change unnecessarily
        id -27 class nvme               # do not change unnecessarily
        # weight 11.64355
        alg straw2
        hash 0  # rjenkins1
        item osd.28 weight 2.91089
        item osd.29 weight 2.91089
        item osd.30 weight 2.91089
        item osd.31 weight 2.91089
}
host pve015-sjc {
        id -28          # do not change unnecessarily
        id -29 class ssd                # do not change unnecessarily
        id -30 class nvme               # do not change unnecessarily
        # weight 11.64355
        alg straw2
        hash 0  # rjenkins1
        item osd.32 weight 2.91089
        item osd.33 weight 2.91089
        item osd.34 weight 2.91089
        item osd.35 weight 2.91089
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd         # do not change unnecessarily
        id -18 class nvme               # do not change unnecessarily
        # weight 86.16344
        alg straw2
        hash 0  # rjenkins1
        item pve002-sjc weight 6.98639
        item pve003-sjc weight 6.98639
        item pve004-sjc weight 6.98639
        item pve001-sjc weight 6.98639
        item pve011-sjc weight 11.64366
        item pve012-sjc weight 11.64355
        item pve013-sjc weight 11.64355
        item pve014-sjc weight 11.64355
        item pve015-sjc weight 11.64355
}

# rules
rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated-nvme {
        id 1
        type replicated
        step take default class nvme
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated-ssd {
        id 2
        type replicated
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}

storage.cfg

Code:

dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: xlocal-lvm
        thinpool data
        vgname pve
        content images,rootdir

rbd: sjc-pool
        content rootdir,images
        krbd 0
        pool sjc-pool

rbd: sjc-pool-nvme
        content rootdir,images
        krbd 0
        pool sjc-pool-nvme

Ceph Config

Code:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.30.250.2/24
        fsid = 46c43c19-dc66-4b87-a50f-be598b4ebef6
        mon_allow_pool_delete = true
        mon_host = 10.30.250.13 10.30.250.12 10.30.250.11 10.30.250.4 10.30.250.3 10.30.250.2 10.30.250.1
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.30.250.2/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.pve001-sjc]
        public_addr = 10.30.250.1

[mon.pve002-sjc]
        public_addr = 10.30.250.2

[mon.pve003-sjc]
        public_addr = 10.30.250.3

[mon.pve004-sjc]
        public_addr = 10.30.250.4

[mon.pve011-sjc]
        public_addr = 10.30.250.11

[mon.pve012-sjc]
        public_addr = 10.30.250.12

[mon.pve013-sjc]
        public_addr = 10.30.250.13

VM Config:

Code:

agent: 1
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
memory: 2048
meta: creation-qemu=9.2.0,ctime=1745483451
name: test-migration-sjc
net0: virtio=BC:24:11:BD:72:BA,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: sjc-pool-nvme:vm-105-disk-0,cache=writeback,discard=on,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=5411e440-a6a9-484e-91c4-52f16bb987f4
sockets: 1
vmgenid: b76af415-4340-4f92-987e-ff8e1f6dbccd

Ceph Status:

Code:

  cluster:
    id:     46c43c19-dc66-4b87-a50f-be598b4ebef6
    health: HEALTH_WARN
            1/7 mons down, quorum pve012-sjc,pve011-sjc,pve004-sjc,pve003-sjc,pve002-sjc,pve001-sjc
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 1236130/5584795 objects degraded (22.134%), 512 pgs degraded, 512 pgs undersized
 
  services:
    mon: 7 daemons, quorum pve012-sjc,pve011-sjc,pve004-sjc,pve003-sjc,pve002-sjc,pve001-sjc (age 2m), out of quorum: pve013-sjc
    mgr: pve003-sjc(active, since 7d), standbys: pve002-sjc, pve001-sjc, pve004-sjc, pve012-sjc, pve011-sjc
    osd: 36 osds: 28 up (since 2m), 36 in (since 9h)
 
  data:
    pools:   3 pools, 641 pgs
    objects: 1.45M objects, 3.3 TiB
    usage:   10 TiB used, 76 TiB / 86 TiB avail
    pgs:     1236130/5584795 objects degraded (22.134%)
             512 active+undersized+degraded
             128 active+clean
             1   active+clean+scrubbing+deep
 
  io:
    client:   7.7 KiB/s rd, 4.4 MiB/s wr, 2 op/s rd, 217 op/s wr

Just working on getting a VM status now.

d2600hz · Apr 24, 2025

spirit said:
without talking about osd, you need 5 monitors if you want to be able to loose 2 nodes

Yeah I have 7 run up on this cluster.

spirit · Apr 24, 2025

d2600hz said:
Yeah I have 7 run up on this cluster.

oh yes, sorry . so with 5/7 mon online, it's ok.

spirit · Apr 24, 2025

and you cluster seem fine, no pg down, just a simple warning.

spirit · Apr 24, 2025

disk corruption in the vm is really strange. in the worse case, the ceph should be readonly some some blocks of the vm disk.

fabian · Apr 24, 2025

hichy7 monitors is probably a bit overkill, upstream recommends 3 for smaller and 5 for bigger clusters (or if you want to tolerate two of them going down, like in your case).

the logs from the VM and the node where it is running covering the time period where you did your test would be interesting, as well as the information which two nodes you took down.

jsterr · Apr 24, 2025

Just because Im wondering: why not 4/2?

d2600hz · Apr 24, 2025

jsterr said:
Just because Im wondering: why not 4/2?

We only have 5 nodes in this nvme crush map. I think if I used 4/2 that there'd be a chance that it'd fail? Am I incorrect in that?

spirit · Apr 24, 2025

d2600hz said:
We only have 5 nodes in this nvme crush map. I think if I used 4/2 that there'd be a chance that it'd fail? Am I incorrect in that?

3/1 should be enough if you want to loose 2 nodes. (but risky if you loose a third node at the same time). 4/2 is safe

d2600hz · Apr 24, 2025

spirit said:
3/1 should be enough if you want to loose 2 nodes. (but risky if you loose a third node at the same time). 4/2 is safe

There's so much confusing stuff on the internet about what is best. The hardware is all enterprise grade, but the aim would be to handle an N+2 type scenario, so we've lost two nodes.

Back to more testing scenarios me thinks!

jsterr · Apr 24, 2025

Min-Size 1 is not recommended imo, I personally would go for a 4/2 (you can loose 2 servers at the same time).
@d2600hz please also post the output of ceph pg ls

d2600hz · Apr 24, 2025

jsterr said:
Min-Size 1 is not recommended imo, I personally would go for a 4/2 (you can loose 2 servers at the same time).
@d2600hz please also post the output of ceph pg ls

With the two hosts down?

jsterr · Apr 24, 2025

d2600hz said:
With the two hosts down?

No. Ceph pg ls shows all pgs and tells you, if data is correctly put on the right hosts/osds. so the output would be nice with all hosts beeing online.

d2600hz · Apr 24, 2025

Too large to post here's a link:

https://gist.github.com/doogienz/f4a83d71c183e981928a8f8ef39132b0

jsterr · Apr 24, 2025

d2600hz said:
Too large to post here's a link:

https://gist.github.com/doogienz/f4a83d71c183e981928a8f8ef39132b0

Thanks and also: ceph osd pool ls detail

d2600hz · Apr 24, 2025

jsterr said:
Thanks and also: ceph osd pool ls detail

Code:

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 10034 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 15.79
pool 2 'sjc-pool' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 10858 lfor 0/505/5974 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.38
pool 8 'sjc-pool-nvme' replicated size 4 min_size 2 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 34208 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 1 application rbd read_balance_score 1.45

I just changed the pool to 4/2

jsterr · Apr 24, 2025

I cant see any problem, you should be able to loose 2 nodes without interrupting vms that are NOT running on those 2 now offline nodes.

Code:

8.1d1     1204         0          0        0    302431820            0           0   2629      3000  active+clean    11m      34115'48729      34298:60681  [26,35,30,22]p26  [26,35,30,22]p26  2025-04-23T23:31:40.309863+0000  2025-04-23T23:31:40.309863+0000                    0  periodic scrub scheduled @ 2025-04-25T00:00:41.351980+0000     
8.1d2     1278         0          0        0    306924840            0           0   2313      3000  active+clean    11m      34115'55713      34298:69472  [22,10,33,31]p22  [22,10,33,31]p22  2025-04-23T23:31:40.309863+0000  2025-04-23T23:31:40.309863+0000                    0  periodic scrub scheduled @ 2025-04-25T03:01:31.610453+0000     
8.1d3     1207         0          0        0    302444096            0           0   2466      3000  active+clean    11m      34102'62166      34298:77081   [34,30,9,27]p34   [34,30,9,27]p34  2025-04-23T23:31:40.309863+0000  2025-04-23T23:31:40.309863+0000                    0  periodic scrub scheduled @ 2025-04-25T05:19:21.663206+0000

Just for explanation. The ID 8.1d1 for example, first number = pool-id. the numbers in the [ ] are the osds where the data was put. Its matching correctly, as your osds with nvmes are 8-11 and 20-35.

Can you try again and exactly tell, what nodes you shut down, where the vm (that gets problems) was placed (on which node)?

d2600hz · Apr 30, 2025

Okay update for this - as everyone pretty much confirmed that the configuration and setup should have been fine, I started absolutely hammering any and all hardware in this setup.

For reference they were new boxes, and unfortunately because of corporate overloads they'd forced the company HPE solution on us (we'd been using Dell's up to now). And then we had 5 boxes delivered from the same order to another DC and would you believe it, 3 were DOA.

So I'm now not trusting the hardware, so I tested everything network related. And low and behold one of the nics serving the storage network had a faulty port. As it has in an LACP bond it was hard to find, but replacing the nic has resolved the issue.

Thanks to everyone for your time, in this case, hardware fault.

[SOLVED] Issue with Ceph 5/3 configuration and corruption

New Member

Proxmox Staff Member

New Member

New Member

Distinguished Member

Distinguished Member

Distinguished Member

Proxmox Staff Member

Famous Member

New Member

Distinguished Member

New Member

Famous Member

New Member

Famous Member

New Member

Famous Member

New Member

Famous Member

New Member

We value your privacy