After reinstalling PVE (OSDs reused), Ceph OSDs can't start

Will the recovery process modify the data on the OSDs?
Extracting the maps shouldn't change any data on the OSD.

The PGs are 100% unknown; maybe I will have to say goodbye to my data.
What's in the log files? That said, it will probably be quicker to restore the latest backup. Check what epoch the OSDs are on.

I recovered the monmap from the 3 OSDs on one host last week, and I could read some data. Now I can't.
Since at some point the OSDs have been in contact with different versions of the MON DB, it may well be a mix of epochs now. Hard to tell from afar.

As you only had two OSD nodes, you might be able to extract the raw objects. Though that's an even longer shot.
 

"Extracting the maps shouldn't change any data on the OSD."

If I recovered my maps and started my Ceph cluster, but something was wrong, would any recovery process then try to modify the OSDs' data as a corrective action?

"You only had two OSD nodes, you might be able to extract the raw objects. "

Is there any documentation for "extracting the raw objects"?
 
you means:"ceph-objectstore-tool --data-path $PATH_TO_OSD --pgid $PG_ID $OBJECT get-bytes > $OBJECT_FILE_NAME"?
 
you means:"ceph-objectstore-tool --data-path $PATH_TO_OSD --pgid $PG_ID $OBJECT get-bytes > $OBJECT_FILE_NAME"?
Yes, I meant that for extraction. But I have never tried it, so I don't know the outcome.
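If you try it, a rough sketch could look like this (untested on my side; the OSD path and PG ID are only examples, and the OSD daemon has to be stopped so ceph-objectstore-tool gets exclusive access to the store):

Code:
# stop the OSD first, ceph-objectstore-tool needs the store to itself
systemctl stop ceph-osd@2

# list all objects of one PG (one JSON object spec per line)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 1.6c --op list > /root/recovery/pg_1.6c.objects

# export every object of that PG; the file name here is just a hash of the object spec
mkdir -p /root/recovery/pg_1.6c
while read -r OBJ; do
    NAME=$(echo "$OBJ" | md5sum | cut -d' ' -f1)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 1.6c "$OBJ" get-bytes > /root/recovery/pg_1.6c/"$NAME"
done < /root/recovery/pg_1.6c.objects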

If I recovered my maps and started my Ceph cluster, but something was wrong, would any recovery process then try to modify the OSDs' data as a corrective action?
You had a partially running cluster; the OSDs in the cluster moved forward with the epoch, but the others didn't.
 
The cluster crashed at epoch 3263; now it is at 3438. Can I reset to 3263 for my recovery testing?
You can try to get the older map.
https://arvimal.blog/2016/05/08/how-to-get-a-ceph-monosd-map-at-a-specific-epoch/
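The short version of that post, assuming the MONs are still reachable for the getmap call (I have not run this against your exact situation):

Code:
# fetch the OSD map of a specific epoch from the monitors
ceph osd getmap 3263 -o /root/recovery/osdmap.3263

# sanity-check what was fetched
osdmaptool --print /root/recovery/osdmap.3263 | head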

And a thought, since you have only two OSD nodes. I suppose the size is 2 and min_size hopefully as well. If the default distribution at host level was kept, then a node with all its OSDs in would be enough. The OSDs on the other node could be destroyed and re-created. Ceph would then recover the missing copy onto the new OSDs. But be aware that this will destroy data irretrievably.
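If you ever go down that road, it would look roughly like this (only a sketch; device and OSD IDs are examples, and only after making absolutely sure the surviving node has a complete copy of everything you need):

Code:
# on the node whose OSDs will be rebuilt
systemctl stop ceph-osd@0

# remove the OSD from the cluster - this destroys its data irreversibly
ceph osd purge 0 --yes-i-really-mean-it

# wipe the disk and create a fresh OSD on it
ceph-volume lvm zap /dev/sdb --destroy
pveceph osd create /dev/sdb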
 
You can try to get the older map.
https://arvimal.blog/2016/05/08/how-to-get-a-ceph-monosd-map-at-a-specific-epoch/

And a thought, since you have only two OSD nodes. I suppose the size is 2 and min_size hopefully as well. If the default distribution at host level was kept, then a node with all its OSDs in would be enough. The OSDs on the other node could be destroyed and re-created. Ceph would then recover the missing copy onto the new OSDs. But be aware that this will destroy data irretrievably.
That may be better. But I get slow ops and everything seems to hang.

Is this because some of the pools have min_size = 1?

Code:
root@pve:/etc/pve# ceph daemon osd.2 ops

{
    "ops": [
        {
            "description": "osd_op(client.30014.0:2 1.6c 1.2bb7eec (undecoded) ondisk+read+known_if_redirected e3458)",
            "initiated_at": "2021-02-01 19:40:07.167206",
            "age": 134.20371570899999,
            "duration": 134.203736408,
            "type_data": {
                "flag_point": "queued for pg",
                "client_info": {
                    "client": "client.30014",
                    "client_addr": "192.168.3.5:0/2169637478",
                    "tid": 2
                },
                "events": [
                    {
                        "time": "2021-02-01 19:40:07.167206",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167206",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167205",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167208",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167208",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-02-01 19:40:07.167211",
                        "event": "queued_for_pg"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}

The pool ceph_work_l1 has size=2 and min_size=1, but "rbd list ceph_work_l1" hangs.
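"rbd list" has to read the pool's rbd_directory object, so one way to narrow this down (just a suggestion) is to check which PG that object maps to and whether that PG is active:

Code:
# which PG / OSDs serve the directory object that 'rbd list' reads?
ceph osd map ceph_work_l1 rbd_directory

# list the inactive PGs and look for that PG ID
ceph pg dump_stuck inactive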


Does e3458 mean it reads epoch 3458? But I ran:
"ceph-objectstore-tool --op set-osdmap --no-mon-config --epoch 3263 --data-path /var/lib/ceph/osd/ceph-2 --type bluestore --file /root/recovery/osdmap3263"
for all my OSDs (excluding the OSDs that are out).
 
" I suppose the size is 2 and min_size hopefully as well", the min_size of my pool is 1,so i can't recovery from my half osd host?
 
No, that just means that the IO is blocked till the pool has reached size = 2 again. It is worth a try, as a last resort.

The important part is, that there were only two nodes with OSDs and that the distribution of copies was on host level (default rule).
 
No, that just means that the IO is blocked till the pool has reached size = 2 again. It is worth a try, as a last resort.

The important part is, that there were only two nodes with OSDs and that the distribution of copies was on host level (default rule).
A pool with size 2 and min_size 1 can still read and write when one of the two hosts is down. Why can't it read/write now?
 
size is the target number of copies; Ceph will always try to create all copies. min_size is the number of copies below which Ceph stops IO in order not to lose data.
In my past experience, the recovery process didn't block normal operations like reads and writes, except when the cluster didn't reach min_size.
 
To clarify: if even one PG of a pool drops below the pool's min_size replica count, Ceph stops IO for that pool. Depending on what failed, non-blocking self-healing is done.
https://docs.ceph.com/en/latest/rados/operations/pools/#set-pool-values
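For completeness, the values from that page can be checked and changed per pool like this (pool name taken from this thread):

Code:
# current replication settings of the pool
ceph osd pool get ceph_work_l1 size
ceph osd pool get ceph_work_l1 min_size

# example: allow IO with a single remaining copy (use with care)
ceph osd pool set ceph_work_l1 min_size 1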
But my ceph_work_l1 pool was size 2 and min_size 1. When I recovered one of the two hosts, why couldn't I read my data?

On the other hand, my other pools that were size 1 and min_size 1 couldn't be read; that makes sense.
 
Do all PGs have a copy left? And what's the ceph -s state now?
Code:
root@pve:~# ceph -s
  cluster:
    id:     856cb359-a991-46b3-9468-a057d3e78d7c
    health: HEALTH_WARN
            5 pool(s) have no replicas configured
            Reduced data availability: 499 pgs inactive, 255 pgs down
            Degraded data redundancy: 3641/2905089 objects degraded (0.125%), 33 pgs degraded, 33 pgs undersized
            424 pgs not deep-scrubbed in time
            492 pgs not scrubbed in time
            1 slow ops, oldest one blocked for 61 sec, osd.2 has slow ops
            too many PGs per OSD (256 > max 250)

  services:
    mon: 1 daemons, quorum pve (age 75s)
    mgr: pve(active, since 74s)
    osd: 5 osds: 3 up (since 63s), 3 in (since 2d)

  data:
    pools:   10 pools, 768 pgs
    objects: 2.57M objects, 9.7 TiB
    usage:   9.3 TiB used, 20 TiB / 29 TiB avail
    pgs:     31.771% pgs unknown
             33.203% pgs not active
             3641/2905089 objects degraded (0.125%)
             255 down
             244 unknown
             233 active+clean
             33  active+undersized+degraded
             3   active+clean+scrubbing+deep

root@pve:~# ceph daemon osd.2 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "c9036164-5359-4461-bb19-2296821acebb",
    "whoami": 2,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 129
}

root@pve:~# ceph daemon osd.3 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "7bd4adc8-e750-49f3-b729-16376edebcc6",
    "whoami": 3,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 120
}

root@pve:~# ceph daemon osd.4 status
{
    "cluster_fsid": "856cb359-a991-46b3-9468-a057d3e78d7c",
    "osd_fsid": "aacfd858-3605-4f76-a870-3920e9b64db2",
    "whoami": 4,
    "state": "active",
    "oldest_map": 2652,
    "newest_map": 3490,
    "num_pgs": 276
}

root@pve:~# ceph daemon osd.2 ops
{
    "ops": [
        {
            "description": "osd_op(client.20010.0:2 1.6c 1.2bb7eec (undecoded) ondisk+read+known_if_redirected e3489)",
            "initiated_at": "2021-02-04 15:25:56.234766",
            "age": 224.498877765,
            "duration": 224.49889509499999,
            "type_data": {
                "flag_point": "queued for pg",
                "client_info": {
                    "client": "client.20010",
                    "client_addr": "192.168.3.5:0/1494568603",
                    "tid": 2
                },
                "events": [
                    {
                        "time": "2021-02-04 15:25:56.234766",
                        "event": "initiated"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234766",
                        "event": "header_read"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234765",
                        "event": "throttled"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234768",
                        "event": "all_read"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234769",
                        "event": "dispatched"
                    },
                    {
                        "time": "2021-02-04 15:25:56.234772",
                        "event": "queued_for_pg"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}




"rbd list ceph_work_l1" hangs, my pools config:
1612424125483.png
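To see which pools the down/unknown PGs belong to, one can compare the pool IDs with the PG IDs (the part of a PG ID before the dot is the pool ID); just a suggestion:

Code:
# pool IDs and names
ceph osd pool ls detail

# which PGs are inactive/down/unknown, and where they were last seen
ceph health detail
ceph pg dump_stuck inactive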
 
when i ran "ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph" to start the osd of host pve,the osd of pve8 will shutdown

logs such as :

Code:
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[10.1e( v 3146'1542 (0'0,3146'1542] local-lis/les=3518/3519 n=184 ec=2140/2140 lis/c 3433/3425 les/c/f 3434/3426/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3425,3520)/3 crt=3146'1542 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [0,4] -> [0], acting [0,4] -> [0], acting_primary 0 -> 0, up_primary 0 -> 0, role 0 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[10.1e( v 3146'1542 (0'0,3146'1542] local-lis/les=3518/3519 n=184 ec=2140/2140 lis/c 3433/3425 les/c/f 3434/3426/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3425,3520)/3 crt=3146'1542 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[12.5( v 3255'552 (0'0,3255'552] local-lis/les=3433/3434 n=78 ec=2154/2154 lis/c 3433/3420 les/c/f 3434/3421/0 3520/3520/3520) [0] r=0 lpr=3520 pi=[3420,3520)/2 crt=3255'552 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] start_peering_interval up [3,0] -> [0], acting [3,0] -> [0], acting_primary 3 -> 0, up_primary 3 -> 0, role 1 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[12.5( v 3255'552 (0'0,3255'552] local-lis/les=3433/3434 n=78 ec=2154/2154 lis/c 3433/3420 les/c/f 3434/3421/0 3520/3520/3520) [0] r=0 lpr=3520 pi=[3420,3520)/2 crt=3255'552 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[1.3c( v 3263'185256 (2790'182156,3263'185256] local-lis/les=3433/3434 n=6116 ec=10/10 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3518 pi=[3433,3518)/2 crt=3263'185256 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[1.3c( v 3263'185256 (2790'182156,3263'185256] local-lis/les=3433/3434 n=6116 ec=10/10 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3520 pi=[3433,3518)/2 crt=3263'185256 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3518/3518/3518) [0,2] r=0 lpr=3518 pi=[3408,3518)/2 crt=3255'450 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3408,3520)/3 crt=3255'450 lcod 0'0 mlcod 0'0 unknown mbc={}] start_peering_interval up [0,2] -> [0], acting [0,2] -> [0], acting_primary 0 -> 0, up_primary 0 -> 0, role 0 -> 0, features acting 4611087854035861503 upacting 4611087854035861503
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[8.28( v 3255'450 (0'0,3255'450] local-lis/les=3433/3434 n=98 ec=2124/2124 lis/c 3433/3408 les/c/f 3434/3409/0 3520/3520/3518) [0] r=0 lpr=3520 pi=[3408,3520)/3 crt=3255'450 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[2.28( v 3263'348310 (3148'345300,3263'348310] local-lis/les=3433/3434 n=29259 ec=18/18 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3518 pi=[3433,3518)/2 crt=3263'348310 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2021-02-04 15:45:53.391 7fa6af248700  1 osd.0 pg_epoch: 3520 pg[2.28( v 3263'348310 (3148'345300,3263'348310] local-lis/les=3433/3434 n=29259 ec=18/18 lis/c 3433/3433 les/c/f 3434/3434/0 3518/3518/3518) [0] r=0 lpr=3520 pi=[3433,3518)/2 crt=3263'348310 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2021-02-04 15:45:54.987 7fa6af248700  0 log_channel(cluster) log [DBG] : 1.4b deep-scrub starts
2021-02-04 15:46:15.719 7fa6c6cef700 -1 received  signal: Interrupt, si_code : 128, si_value (int): 0, si_value (ptr): 0, si_errno: 0, si_pid : 0, si_uid : 0, si_addr0, si_status0
2021-02-04 15:46:15.719 7fa6c6cef700 -1 osd.0 3521 *** Got signal Interrupt ***
2021-02-04 15:46:15.719 7fa6c6cef700 -1 osd.0 3521 *** Immediate shutdown (osd_fast_shutdown=true) ***
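The "Got signal Interrupt" at the end looks like the OSD was told to stop rather than crashing on its own, so it may be worth checking on pve8 what stopped it (assuming the usual systemd units):

Code:
# on pve8: what happened around the time osd.0 went down?
journalctl -u ceph-osd@0 --since "2021-02-04 15:40" --until "2021-02-04 15:50"

# current unit state
systemctl status ceph-osd@0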
 
some slow ops:

"description": "osd_op(client.20010.0:2 1.6c 1.2bb7eec (undecoded) ondisk+retry+read+known_if_redirected e3538)",
"initiated_at": "2021-02-04 18:33:36.376080",
"age": 585.254502419,
"duration": 585.25477198700003,
"type_data": {
"flag_point": "queued for pg",
"client_info": {
"client": "client.20010",
"client_addr": "192.168.3.5:0/1494568603",
"tid": 2
},
"events": [
{
"time": "2021-02-04 18:33:36.376080",
"event": "initiated"
},
{
"time": "2021-02-04 18:33:36.376080",
"event": "header_read"
},
{
"time": "2021-02-04 18:33:36.376079",
"event": "throttled"
},
{
"time": "2021-02-04 18:33:36.376082",
"event": "all_read"
},
{
"time": "2021-02-04 18:33:36.376083",
"event": "dispatched"
},
{
"time": "2021-02-04 18:33:36.376086",
"event": "queued_for_pg"
}
]
}
}
],

I think it means some PGs are broken, and Ceph tries to recover and blocks all ops.

Would it be OK if I deleted the broken pools that were configured with size 1 (no replicas)?
 
As long as there isn't a second node with OSDs, the PGs of the size=2 pools will never fully recover, and the pools with only one copy will have missing PGs. So currently only the pools with size=2 may be recoverable.

"event": "queued_for_pg"
Seems to wait to get a lock on the PG.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
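The admin socket can show a bit more detail about such a stuck op, for example:

Code:
# ops currently blocked / in flight on osd.2, plus the recent history
ceph daemon osd.2 dump_blocked_ops
ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_ops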

Since the cluster seems to have filestore OSDs, another possible recovery may be to extract the objects of the RBD images themselves.
https://github.com/ceph/ceph/tree/master/src/tools/rbd_recover_tool
https://gitlab.lbader.de/kryptur/ceph-recovery
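What both tools essentially do is collect the rbd_data.<prefix>.<object-number> chunks of an image and write them back at the right offsets. A very rough hand-rolled sketch of that last step, assuming exported chunk files named by their hex object number and the default 4 MiB object size (both would have to be confirmed for the real image):

Code:
#!/bin/bash
# reassemble an RBD image from exported object files (hypothetical paths)
CHUNK_DIR=/root/recovery/rbd_data.abcdef12345   # directory with the exported chunks
IMG=/root/recovery/restored.raw
OBJ_SIZE=$((4 * 1024 * 1024))                   # default RBD object size, check with 'rbd info'

for f in "$CHUNK_DIR"/*; do
    n=$((16#$(basename "$f")))                  # hex object number -> block index in the image
    dd if="$f" of="$IMG" bs=$OBJ_SIZE seek=$n conv=notrunc,sparse
done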
 
