After reinstalling PVE (OSDs reused), Ceph OSDs can't start

Will the recovery process modify the data on the OSDs?
Extracting the maps shouldn't change any data on the OSD.

The PGs are 100% unknown; maybe I will have to say goodbye to my data.
What's in the log files? That said, it will probably be quicker to restore the latest backup. Check what epoch the OSDs are on.

I recovered the monmap from the 3 OSDs on one host last week, and I could read some data. Now I can't.
Since at some point the OSDs have been in contact with different versions of the MON DB, it may well be a mix of epochs now. Hard to tell from afar.

As you only had two OSD nodes, you might be able to extract the raw objects. Though that's an even longer shot.
 

"Extracting the maps shouldn't change any data on the OSD."

If I recovered my maps and started my Ceph cluster, but something was wrong, would any recovery process then try to modify the OSDs' data as a corrective action?

"You only had two OSD nodes, you might be able to extract the raw objects. "

Is there any documentation for "extracting the raw objects"?
 
you means:"ceph-objectstore-tool --data-path $PATH_TO_OSD --pgid $PG_ID $OBJECT get-bytes > $OBJECT_FILE_NAME"?
 
you means:"ceph-objectstore-tool --data-path $PATH_TO_OSD --pgid $PG_ID $OBJECT get-bytes > $OBJECT_FILE_NAME"?
Yes, I meant that for extraction. But I have never tried it, so I don't know the outcome.
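If you try it, a rough sketch could look like this (untested on my side; the OSD path and PG ID are only examples, and the OSD daemon has to be stopped so ceph-objectstore-tool gets exclusive access to the store):

Code:
# stop the OSD first, ceph-objectstore-tool needs the store to itself
systemctl stop ceph-osd@2

# list all objects of one PG (one JSON object spec per line)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 1.6c --op list > /root/recovery/pg_1.6c.objects

# export every object of that PG; the file name here is just a hash of the object spec
mkdir -p /root/recovery/pg_1.6c
while read -r OBJ; do
    NAME=$(echo "$OBJ" | md5sum | cut -d' ' -f1)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 1.6c "$OBJ" get-bytes > /root/recovery/pg_1.6c/"$NAME"
done < /root/recovery/pg_1.6c.objects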

If I recovered my maps and started my Ceph cluster, but something was wrong, would any recovery process then try to modify the OSDs' data as a corrective action?
You had a partially running cluster; the OSDs in the cluster moved forward with the epoch, but the others didn't.
 
The cluster crashed at epoch 3263; now it is at 3438. Can I reset to 3263 for my recovery testing?
You can try to get the older map.
https://arvimal.blog/2016/05/08/how-to-get-a-ceph-monosd-map-at-a-specific-epoch/
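The short version of that post, assuming the MONs are still reachable for the getmap call (I have not run this against your exact situation):

Code:
# fetch the OSD map of a specific epoch from the monitors
ceph osd getmap 3263 -o /root/recovery/osdmap.3263

# sanity-check what was fetched
osdmaptool --print /root/recovery/osdmap.3263 | head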

And a thought, since you have only two OSD nodes. I suppose the size is 2 and min_size hopefully as well. If the default distribution at host level was kept, then a node with all its OSDs in would be enough. The OSDs on the other node could be destroyed and re-created. Ceph would then recover the missing copy onto the new OSDs. But be aware that this will destroy data irretrievably.
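If you ever go down that road, it would look roughly like this (only a sketch; device and OSD IDs are examples, and only after making absolutely sure the surviving node has a complete copy of everything you need):

Code:
# on the node whose OSDs will be rebuilt
systemctl stop ceph-osd@0

# remove the OSD from the cluster - this destroys its data irreversibly
ceph osd purge 0 --yes-i-really-mean-it

# wipe the disk and create a fresh OSD on it
ceph-volume lvm zap /dev/sdb --destroy
pveceph osd create /dev/sdb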
 
You can try to get the older map.
https://arvimal.blog/2016/05/08/how-to-get-a-ceph-monosd-map-at-a-specific-epoch/

And a thought, since you have only two OSD nodes. I suppose the size is 2 and min_size hopefully as well. If the default distribution at host level was kept, then a node with all its OSDs in would be enough. The OSDs on the other node could be destroyed and re-created. Ceph would then recover the missing copy onto the new OSDs. But be aware that this will destroy data irretrievably.
That may be better. But I get slow ops and everything seems to hang.

Is this because some of the pools have min_size = 1?

Code:
root@pve:/etc/pve# ceph daemon osd.2 ops

{
    "ops": [
        {
            "description": "osd_op(client.30014.0:2 1.6c 1.2bb7eec (undecoded) ondisk+read+known_if_redirected e3458)",
            "initiated_at": "2021-02-01 19:40:07.167206",
            "age": 134.20371570899999,
            "duration": 134.203736408,
            "type_data": {
                "flag_point": "queued for pg",
                "client_info": {
                    "client": "client.30014",
                    "client_addr": "192.168.3.5:0/2169637478",
                    "tid": 2
                },
                "events": [
                    {
                        "time": "2021-02-01 19:40:07.167206",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167206",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167205",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167208",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167208",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167211",
                        "event": "queued_for_pg"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}

The pool ceph_work_l1 has size=2 and min_size=1, but "rbd list ceph_work_l1" hangs.
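"rbd list" has to read the pool's rbd_directory object, so one way to narrow this down (just a suggestion) is to check which PG that object maps to and whether that PG is active:

Code:
# which PG / OSDs serve the directory object that 'rbd list' reads?
ceph osd map ceph_work_l1 rbd_directory

# list the inactive PGs and look for that PG ID
ceph pg dump_stuck inactive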


Does e3458 mean it reads epoch 3458? But I ran:
"ceph-objectstore-tool --op set-osdmap --no-mon-config --epoch 3263 --data-path /var/lib/ceph/osd/ceph-2 --type bluestore --file /root/recovery/osdmap3263"
for all my OSDs (excluding the OSDs that are out).
 
" I suppose the size is 2 and min_size hopefully as well", the min_size of my pool is 1,so i can't recovery from my half osd host?
 
No, that just means that the IO is blocked till the pool has reached size = 2 again. It is worth a try, as a last resort.

The important part is, that there were only two nodes with OSDs and that the distribution of copies was on host level (default rule).
 
No, that just means that the IO is blocked till the pool has reached size = 2 again. It is worth a try, as a last resort.

The important part is, that there were only two nodes with OSDs and that the distribution of copies was on host level (default rule).
A pool with size 2 and min_size 1 can still read and write when one of the two hosts is down. Why can't it read/write now?
 
size is the target number of copies; Ceph will always try to create all copies. min_size is the number of copies below which Ceph stops IO in order not to lose data.
In my past experience, the recovery process didn't block normal operations like reads and writes, except when the cluster didn't reach min_size.
 
To clarify: if even one PG of a pool drops below the pool's min_size replica count, Ceph stops IO for that pool. Depending on what failed, non-blocking self-healing is done.
https://docs.ceph.com/en/latest/rados/operations/pools/#set-pool-values
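For completeness, the values from that page can be checked and changed per pool like this (pool name taken from this thread):

Code:
# current replication settings of the pool
ceph osd pool get ceph_work_l1 size
ceph osd pool get ceph_work_l1 min_size

# example: allow IO with a single remaining copy (use with care)
ceph osd pool set ceph_work_l1 min_size 1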
But my ceph_work_l1 pool was size 2 and min_size 1. When I recovered one of the two hosts, why couldn't I read my data?

On the other hand, my other pools that were size 1 and min_size 1 couldn't be read; that makes sense.
 
Do all PGs have a copy left? And what's the ceph -s state now?
Code:
root@pve:~# ceph -s
  cluster:
    id:     856cb359-a991-46b3-9468-a057d3e78d7c
    health: HEALTH_WARN
            5 pool(s) have no replicas configured
            Reduced data availability: 499 pgs inactive, 255 pgs down
            Degraded data redundancy: 3641/2905089 objects degraded (0.125%), 33 pgs degraded, 33 pgs undersized
            424 pgs not deep-scrubbed in time
            492 pgs not scrubbed in time
            1 slow ops, oldest one blocked for 61 sec, osd.2 has slow ops
            too many PGs per OSD (256 > max 250)

  services:
    mon: 1 daemons, quorum pve (age 75s)
    mgr: pve(active, since 74s)
    osd: 5 osds: 3 up (since 63s), 3 in (since 2d)

  data:
    pools:   10 pools, 768 pgs
    objects: 2.57M objects, 9.7 TiB
    usage:   9.3 TiB used, 20 TiB / 29 TiB avail
    pgs:     31.771% pgs unknown
             33.203% pgs not active
             3641/2905089 objects degraded (0.125%)
             255 down
             244 unknown
             233 active+clean
             33  active+undersized+degraded
             3   active+clean+scrubbing+deep

root@pve:~# ceph daemon osd.2 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "c9036164-5359-4461-bb19-2296821acebb",
    "whoami": 2,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 129
}

root@pve:~# ceph daemon osd.3 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "7bd4adc8-e750-49f3-b729-16376edebcc6",
    "whoami": 3,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 120
}

root@pve:~# ceph daemon osd.4 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "aacfd858-3605-4f76-a870-3920e9b64db2",
    "whoami": 4,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 276
}

root@pve:~# ceph daemon osd.2 ops
{
    "ops": [
        {
            "description": "osd_op(client.20010.0:2 1.6c 1.2bb7eec (undecoded) ondisk+read+known_if_redirected e3489)",
            "initiated_at": "2021-02-04 15:25:56.234766",
            "age": 224.498877765,
            "duration": 224.49889509499999,
            "type_data": {
                "flag_point": "queued for pg",
                "client_info": {
                    "client": "client.20010",
                    "client_addr": "192.168.3.5:0/1494568603",
                    "tid": 2
                },
                "events": [
                    {
                        "time": "2021-02-04 15:25:56.234766",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234766",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234765",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234768",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234769",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234772",
                        "event": "queued_for_pg"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}




"rbd list ceph_work_l1" hangs, my pools config:
1612424125483.png
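To see which pools the down/unknown PGs belong to, one can compare the pool IDs with the PG IDs (the part of a PG ID before the dot is the pool ID); just a suggestion:

Code:
# pool IDs and names
ceph osd pool ls detail

# which PGs are inactive/down/unknown, and where they were last seen
ceph health detail
ceph pg dump_stuck inactive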
 
when i ran "ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph" to start the osd of host pve,the osd of pve8 will shutdown

logs such as :

Code:
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[10.1e( v 3146'1542 (0'0,3146'1542] local-lis/les=3518/3519 n=184 ec=2140/2140 lis/c 3433/3425 les/c/f 3434/3426/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3425,3520)/3 crt=3146'1542 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [0,4] -> [0], acting [0,4] -> [0], acting_primary 0 -> 0, up_primary 0 -> 0, role 0 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[10.1e( v 3146'1542 (0'0,3146'1542] local-lis/les=3518/3519 n=184 ec=2140/2140 lis/c 3433/3425 les/c/f 3434/3426/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3425,3520)/3 crt=3146'1542 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[12.5( v 3255'552 (0'0,3255'552] local-lis/les=3433/3434 n=78 ec=2154/2154 lis/c 3433/3420 les/c/f 3434/3421/0 3520/3520/3520) [0] r=0 lpr=3520 pi=[3420,3520)/2 crt=3255'552 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [3,0] -> [0], acting [3,0] -> [0], acting_primary 3 -> 0, up_primary 3 -> 0, role 1 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[12.5( v 3255'552 (0'0,3255'552] local-lis/les=3433/3434 n=78 ec=2154/2154 lis/c 3433/3420 les/c/f 3434/3421/0 3520/3520/3520) [0] r=0 lpr=3520 pi=[3420,3520)/2 crt=3255'552 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[1.3c( v 3263'185256 (2790'182156,3263'185256] local-lis/les=3433/3434 n=6116 ec=10/10 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3518 pi=[3433,3518)/2 crt=3263'185256 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[1.3c( v 3263'185256 (2790'182156,3263'185256] local-lis/les=3433/3434 n=6116 ec=10/10 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3520 pi=[3433,3518)/2 crt=3263'185256 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3518/3518/3518) [0,2] r=0 lpr=3518 pi=[3408,3518)/2 crt=3255'450 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3408,3520)/3 crt=3255'450 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [0,2] -> [0], acting [0,2] -> [0], acting_primary 0 -> 0, up_primary 0 -> 0, role 0 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3408,3520)/3 crt=3255'450 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[2.28( v 3263'348310 (3148'345300,3263'348310] local-lis/les=3433/3434 n=29259 ec=18/18 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3518 pi=[3433,3518)/2 crt=3263'348310 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[2.28( v 3263'348310 (3148'345300,3263'348310] local-lis/les=3433/3434 n=29259 ec=18/18 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3520 pi=[3433,3518)/2 crt=3263'348310 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:54.987 7fa6af248700  0 log_channel(cluster) log [DBG] : 1.4b deep-scrub starts
2021-02-04 15:46:15.719 7fa6c6cef700 -1 received  signal: Interrupt, si_code : 128, si_value (int): 0, si_value (ptr): 0, si_errno: 0, si_pid : 0, si_uid : 0, si_addr0, si_status0
2021-02-04 15:46:15.719 7fa6c6cef700 -1 osd.0 3521 *** Got signal Interrupt ***
2021-02-04 15:46:15.719 7fa6c6cef700 -1 osd.0 3521 *** Immediate shutdown (osd_fast_shutdown=true) ***
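The "Got signal Interrupt" at the end looks like the OSD was told to stop rather than crashing on its own, so it may be worth checking on pve8 what stopped it (assuming the usual systemd units):

Code:
# on pve8: what happened around the time osd.0 went down?
journalctl -u ceph-osd@0 --since "2021-02-04 15:40" --until "2021-02-04 15:50"

# current unit state
systemctl status ceph-osd@0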
 
some slow ops:

"description": "osd_op(client.20010.0:2 1.6c 1.2bb7eec (undecoded) ondisk+retry+read+known_if_redirected e3538)",
"initiated_at": "2021-02-04 18:33:36.376080",
"age": 585.254502419,
"duration": 585.25477198700003,
"type_data": {
"flag_point": "queued for pg",
"client_info": {
"client": "client.20010",
"client_addr": "192.168.3.5:0/1494568603",
"tid": 2
},
"events": [
{
"time": "2021-02-04 18:33:36.376080",
"event": "initiated"
},
{
"time": "2021-02-04 18:33:36.376080",
"event": "header_read"
},
{
"time": "2021-02-04 18:33:36.376079",
"event": "throttled"
},
{
"time": "2021-02-04 18:33:36.376082",
"event": "all_read"
},
{
"time": "2021-02-04 18:33:36.376083",
"event": "dispatched"
},
{
"time": "2021-02-04 18:33:36.376086",
"event": "queued_for_pg"
}
]
}
}
],

I think it means some PGs are broken, and Ceph tries to recover and blocks all ops.

Would it be OK if I deleted the broken pools that were configured with size 1 (no replicas)?
 
As long as there isn't a second node with OSDs, the PGs of the size=2 pools will never fully recover, and the pools with only one copy will have missing PGs. So currently only the pools with size=2 may be recoverable.

"event": "queued_for_pg"
Seems to wait to get a lock on the PG.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
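The admin socket can show a bit more detail about such a stuck op, for example:

Code:
# ops currently blocked / in flight on osd.2, plus the recent history
ceph daemon osd.2 dump_blocked_ops
ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_ops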

Since the cluster seems to have filestore OSDs, another possible recovery may be to extract the objects of the RBD images themselves.
https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
https://gitlab.lbader.de/kryptur/ceph-recovery
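What both tools essentially do is collect the rbd_data.<prefix>.<object-number> chunks of an image and write them back at the right offsets. A very rough hand-rolled sketch of that last step, assuming exported chunk files named by their hex object number and the default 4 MiB object size (both would have to be confirmed for the real image):

Code:
#!/bin/bash
# reassemble an RBD image from exported object files (hypothetical paths)
CHUNK_DIR=/root/recovery/rbd_data.abcdef12345   # directory with the exported chunks
IMG=/root/recovery/restored.raw
OBJ_SIZE=$((4 * 1024 * 1024))                   # default RBD object size, check with 'rbd info'

for f in "$CHUNK_DIR"/*; do
    n=$((16#$(basename "$f")))                  # hex object number -> block index in the image
    dd if="$f" of="$IMG" bs=$OBJ_SIZE seek=$n conv=notrunc,sparse
done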
 
