Ceph Luminous - Lost data

We migrated our existing FileStore OSDs to BlueStore successfully on 2 out of 5 hosts before deciding to change the size of the SSD partitions that the BlueStore DB and WAL sit on. We had previously sized them at 10GB, when they served as journals for the HDD FileStore OSDs, and wanted to grow them to 60GB partitions to ensure the BlueStore DBs never spill over onto the HDDs.
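
A minimal sketch of the repartition-and-recreate step, assuming GPT partitioning with sgdisk and the Luminous ceph-volume tooling; the device names and partition number are placeholders:
Code:
# Carve a new 60GB partition on the SSD for the BlueStore DB (partnum 0 = next free slot):
sgdisk --new=0:0:+60G /dev/sda
# Re-create the OSD on the HDD with its DB on the new SSD partition:
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sda5

If no separate --block.wal is given, the WAL lives on the DB device, so a single 60GB partition covers both.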

Our RBD pools are configured with 3 replicas and we naturally checked that Ceph was healthy before marking all OSDs in a host as out, destroying them and then re-creating them after repartitioning the SSDs.
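
A minimal sketch of the out-and-destroy step per OSD, assuming the Luminous CLI; the OSD id is just an example:
Code:
ceph osd set noout                            # once, before starting on a host
ceph osd out 12                               # take the OSD out
systemctl stop ceph-osd@12                    # stop the daemon
ceph osd destroy 12 --yes-i-really-mean-it    # destroy it, keeping the id for reuse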

This worked perfectly on the first host and everything replicated again. When we started on the second host, after again confirming that Ceph was healthy, we observed that one object was unfound. The CRUSH map should avoid placing multiple copies of the same object on a single OSD, and Ceph should only acknowledge a write once it has landed on all replicas, correct?
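
To sanity-check the placement side of that assumption, the replicated CRUSH rule can be dumped to confirm its failure domain is 'host'; the rule name below is the Luminous default and an assumption for this cluster:
Code:
# The chooseleaf step should show "type": "host", i.e. copies on different hosts:
ceph osd crush rule dump replicated_rule | grep -A 3 chooseleaf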

Ceph healthy before we started:
Code:
[admin@kvm5c ~]# ceph -s
  cluster:
    id:     a3f1c21f-f883-48e0-9bd2-4f869c72b17d
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 3 daemons, quorum 1,2,3
    mgr: kvm5b(active), standbys: kvm5c, kvm5d
    mds: cephfs-1/1/1 up  {0=kvm5b=up:active}, 2 up:standby
    osd: 20 osds: 20 up, 20 in
         flags noout

  data:
    pools:   3 pools, 592 pgs
    objects: 1202k objects, 4609 GB
    usage:   14377 GB used, 23350 GB / 37728 GB avail
    pgs:     589 active+clean
             3   active+clean+scrubbing+deep

  io:
    client:   414 kB/s rd, 8770 kB/s wr, 117 op/s rd, 971 op/s wr

We took the 4 OSDs on host kvm5c offline and destroyed them. All pools run as 3/1 (size 3, min_size 1), so I assume all data is replicated three times across different hosts; the sketch after the status output below shows how to verify this. Health, however, reports one object as unfound:
Code:
[admin@kvm5c ~]# ceph -s
  cluster:
    id:     a3f1c21f-f883-48e0-9bd2-4f869c72b17d
    health: HEALTH_WARN
            noout flag(s) set
            1/1231025 objects unfound (0.000%)
            Degraded data redundancy: 718772/3689399 objects degraded (19.482%), 336 pgs unclean, 336 pgs degraded, 319 pgs undersized

  services:
    mon: 3 daemons, quorum 1,2,3
    mgr: kvm5b(active), standbys: kvm5c, kvm5d
    mds: cephfs-1/1/1 up  {0=kvm5b=up:active}, 2 up:standby
    osd: 20 osds: 16 up, 16 in; 319 remapped pgs
         flags noout

  data:
    pools:   3 pools, 592 pgs
    objects: 1202k objects, 4610 GB
    usage:   11678 GB used, 18601 GB / 30280 GB avail
    pgs:     718772/3689399 objects degraded (19.482%)
             1/1231025 objects unfound (0.000%)
             287 active+undersized+degraded+remapped+backfill_wait
             256 active+clean
             23  active+undersized+degraded+remapped+backfilling
             17  active+recovery_wait+degraded
             9   active+recovery_wait+undersized+degraded+remapped

  io:
    client:   1130 kB/s rd, 12092 kB/s wr, 177 op/s rd, 835 op/s wr
    recovery: 232 MB/s, 60 objects/s
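
For completeness, the 3/1 replication setting mentioned above can be verified per pool; the pool name 'rbd' is taken from the image lookup further down, and the other pools would be checked the same way:
Code:
ceph osd pool get rbd size       # expect: size: 3
ceph osd pool get rbd min_size   # expect: min_size: 1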

We listed the missing object in the affected placement group:
Code:
[admin@kvm5c ~]# ceph health detail
    pg 0.177 has 1 unfound objects

[admin@kvm5c ~]# ceph pg 0.177 list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.3338a3238e1f29.00000000000006f6",
                "key": "",
                "snapid": -2,
                "hash": 3281629559,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "8993'34050777",
            "have": "8993'34050774",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
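
Before deciding what to do with the object, querying the PG shows which OSDs Ceph would still like to probe for the newer version (the might_have_unfound list in the recovery state); if one of those OSDs could be brought back up, recovery would find the object there:
Code:
ceph pg 0.177 query | grep -A 10 might_have_unfound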

Next we looked up which RBD image this affected:
Code:
[admin@kvm5c ~]# rbd --pool rbd ls | while read image ; do rbd --pool rbd info $image; done | grep -C 5 3338a3238e1f29
rbd image 'vm-142-disk-1':
        size 102400 MB in 25600 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.3338a3238e1f29
        format: 2
        features: layering
        flags:


VM 142 had locked up so we turned it off and told Ceph to delete the missing data:
Code:
ceph pg 0.177 mark_unfound_lost delete
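
For reference, mark_unfound_lost also accepts 'revert', which for a replicated pool rolls the object back to the version the surviving replicas still hold (the "have" epoch in the list_missing output) instead of deleting it outright:
Code:
# Alternative to delete: fall back to the older copy the remaining replicas have.
ceph pg 0.177 mark_unfound_lost revert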


We finally booted the VM (Linux) using a rescue ISO image and ran file system integrity tests, which luckily didn't yield any errors...
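
Concretely, that amounted to something like the following from the rescue shell; the device name and an ext4 guest filesystem are assumptions here:
Code:
fsck.ext4 -f -n /dev/vda1   # dry run: report problems without touching the disk
fsck.ext4 -f /dev/vda1      # full check, repairing interactively if anything is found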
 
I assume the VM tried to commit data and that the write landed on the OSD moments before it was marked down, but before it could be replicated to the other OSDs. The write most probably wasn't acknowledged as successful to the guest, so the filesystem there remained in a consistent state...
 
We are replacing OSDs, so I did not want Ceph to start replicating a third copy to OSDs on other hosts, as we were going to bring up the replacement OSDs almost immediately. We unset the 'noout' flag once we complete our maintenance on the larger cluster.
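
For completeness, the flag handling is just the standard pair of commands:
Code:
ceph osd set noout     # before taking a host's OSDs down for maintenance
ceph osd unset noout   # once the replacement OSDs are back up and in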

We have another small development cluster, which is made up of bits and pieces. There is a single 2TB SAS disc in each node, plus 4 x 300GB discs. We run with 'noout' permanently set on that cluster, as we don't want Ceph to automatically replicate the data from one of the 2TB discs onto the 300GB discs; that could fill those discs and take them offline as well...
 