For context, I have a nine-node Proxmox cluster. We've just added five new nodes, each with 4 x 3TB NVMe drives. I've set them up with a separate CRUSH rule and a device class of nvme so they can be used separately from the original four nodes, which are SSD-backed and now on their own replicated rule (so `ceph osd pool autoscale-status` works).
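For reference, this is roughly the shape of the setup (OSD IDs, pool and rule names below are placeholders, not necessarily what's in my cluster):

```
# Tag the new OSDs with the nvme device class (clear the auto-detected class first)
ceph osd crush rm-device-class osd.20
ceph osd crush set-device-class nvme osd.20

# Replicated rule that only selects nvme-class OSDs, failure domain = host
ceph osd crush rule create-replicated nvme_rule default host nvme

# Pool using that rule, sized to survive two hosts going down
ceph osd pool create nvme_pool 128 128 replicated nvme_rule
ceph osd pool set nvme_pool size 5
ceph osd pool set nvme_pool min_size 3
```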
But I'm having some issues. Here is the pool detail:

What I'm expecting is that I should be able to drop two nodes at once and things should keep working. But it doesn't.
I've spun up a simple VM to run fio on, and if I drop more than one host the VM's disk gets corrupted and nothing works from that point: it won't boot and needs a full reinstall. I can't even get dmesg from inside the VM any more (the binary errors out and won't run). I'm struggling to work out why this would be the case.
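If it helps, I can repeat the test and capture something like the following while the two hosts are down (the commands below are the obvious ones I know of; happy to post the output):

```
# Overall cluster state and anything blocking I/O
ceph -s
ceph health detail

# PGs that have gone inactive or undersized while the hosts are out
ceph pg dump_stuck inactive
ceph pg dump_stuck undersized

# Confirm which OSDs are actually down and how they map to hosts
ceph osd tree
```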
Am I doing something fundamentally wrong? I've spent probably two weeks benchmarking and testing, and this feels like a deal breaker for the N+2 redundancy I'm expecting from Ceph.
Any pointers on what I could look at to see what's going on? I'm not seeing any reason why this should happen with the 5/3 (size/min_size) configuration, but it is.
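For completeness, the 5/3 claim and the rule assignment can be double-checked with something like this (pool and rule names are placeholders again):

```
# Size, min_size and CRUSH rule assigned to the pool
ceph osd pool get nvme_pool size
ceph osd pool get nvme_pool min_size
ceph osd pool get nvme_pool crush_rule

# Full rule definition, to confirm failure domain is host and class is nvme
ceph osd crush rule dump nvme_rule

# One-line summary of all pools
ceph osd dump | grep 'replicated size'
```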