VMs offline after OSD change

Irek Zayniev

New Member
May 29, 2018
Hello!
I have an 8-node cluster running PVE 6.0.4 with Ceph Nautilus.
There is 1 OSD on each node.
Ceph cluster usage is about 48%.

Ceph config is
Code:
[global] 
 auth client required = cephx
 auth cluster required = cephx 
 auth service required = cephx 
 cluster network = 10.200.201.0/22 
 fsid = ede0d6ae-81ec-4137-a918-5daf79ae0ff2 
 mon allow pool delete = true 
 osd journal size = 5120 
 osd pool default min size = 2 
 osd pool default size = 3 
 public network = 10.200.201.0/22 
 mon_host = 10.200.201.73 10.200.201.74 10.200.201.76 

[client] 
 keyring = /etc/pve/priv/$cluster.$name.keyring

I added 2 OSDs and removed 1 at the same time on one node.
67 PGs became inactive and all the VMs are now offline from the outside; for some of them the console does not work either, and no migration is possible.

What is wrong? How can I avoid this kind of issue in the future?
 
Post the actual Ceph data. The easiest is to run
Code:
pvereport

At the end you will find some information regarding Ceph.
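
If only the Ceph part is of interest, a couple of plain status queries along these lines will also show the relevant state (standard Ceph CLI, read-only, run on any node):
Code:
# overall cluster state, including any inactive or degraded PGs
ceph -s

# per-pool settings such as size, min_size and pg_num
ceph osd pool ls detail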
 
A really interesting, long report.
Ceph is healthy.
Code:
# pveceph lspools
Name                       size   min_size     pg_num     %-used                 used
VMs                           2          2        512       0.17        2879961897728

# echo VMs
rbd ls VMs

What is interesting is that all the VMs go offline even if only 1 OSD is down, and everything is fine again as soon as the OSD is back or Ceph finishes recovery.

Code:
ceph osd df tree

ID  CLASS WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1       19.09865        -  19 TiB 3.8 TiB 3.8 TiB 105 MiB   21 GiB  15 TiB 19.78 1.00   -        root default
 -3        0.90999        - 931 GiB 193 GiB 192 GiB  36 MiB  988 MiB 738 GiB 20.75 1.05   -            host galaxy01-rubby102
  0   hdd  0.90999  1.00000 931 GiB 193 GiB 192 GiB  36 MiB  988 MiB 738 GiB 20.75 1.05  51     up         osd.0
 -5        2.72838        - 2.7 TiB 540 GiB 537 GiB 8.0 MiB  3.0 GiB 2.2 TiB 19.31 0.98   -            host galaxy02-rubby202
  1   hdd  0.90999  1.00000 931 GiB 173 GiB 172 GiB 7.9 MiB 1016 MiB 759 GiB 18.54 0.94  46     up         osd.1
 19   hdd  0.90919  1.00000 931 GiB 155 GiB 154 GiB  48 KiB 1024 MiB 776 GiB 16.67 0.84  41     up         osd.19
 20   hdd  0.90919  1.00000 931 GiB 212 GiB 211 GiB  32 KiB 1024 MiB 719 GiB 22.74 1.15  56     up         osd.20
 -7        2.72838        - 2.7 TiB 543 GiB 540 GiB 5.8 MiB  3.0 GiB 2.2 TiB 19.44 0.98   -            host galaxy03-rubby702
  2   hdd  0.90999  1.00000 931 GiB 181 GiB 180 GiB 5.7 MiB 1018 MiB 750 GiB 19.48 0.98  48     up         osd.2
 17   hdd  0.90919  1.00000 931 GiB 170 GiB 169 GiB  16 KiB 1024 MiB 761 GiB 18.26 0.92  45     up         osd.17
 18   hdd  0.90919  1.00000 931 GiB 192 GiB 191 GiB  32 KiB 1024 MiB 739 GiB 20.58 1.04  51     up         osd.18
 -9        2.72838        - 2.7 TiB 593 GiB 590 GiB  15 MiB  3.1 GiB 2.1 TiB 21.22 1.07   -            host galaxy04-rubby802
  3   hdd  0.90999  1.00000 931 GiB 184 GiB 183 GiB  15 MiB 1009 MiB 747 GiB 19.81 1.00  49     up         osd.3
 15   hdd  0.90919  1.00000 931 GiB 196 GiB 195 GiB  48 KiB 1024 MiB 735 GiB 21.09 1.07  52     up         osd.15
 16   hdd  0.90919  1.00000 931 GiB 212 GiB 211 GiB   4 KiB  1.1 GiB 719 GiB 22.75 1.15  56     up         osd.16
-13        2.72838        - 2.7 TiB 537 GiB 534 GiB  14 MiB  3.0 GiB 2.2 TiB 19.21 0.97   -            host galaxy06-rubby121
  4   hdd  0.90999  1.00000 931 GiB 170 GiB 169 GiB  14 MiB 1010 MiB 761 GiB 18.24 0.92  45     up         osd.4
 11   hdd  0.90919  1.00000 931 GiB 159 GiB 158 GiB  80 KiB 1024 MiB 772 GiB 17.04 0.86  42     up         osd.11
 12   hdd  0.90919  1.00000 931 GiB 208 GiB 207 GiB  16 KiB 1024 MiB 723 GiB 22.36 1.13  55     up         osd.12
-15        2.72838        - 2.7 TiB 496 GiB 493 GiB  14 MiB  3.0 GiB 2.2 TiB 17.75 0.90   -            host galaxy07-rubby131
  6   hdd  0.90999  1.00000 931 GiB 153 GiB 152 GiB  14 MiB 1010 MiB 779 GiB 16.37 0.83  40     up         osd.6
 13   hdd  0.90919  1.00000 931 GiB 174 GiB 173 GiB  20 KiB 1024 MiB 757 GiB 18.67 0.94  46     up         osd.13
 14   hdd  0.90919  1.00000 931 GiB 170 GiB 169 GiB 160 KiB 1024 MiB 761 GiB 18.22 0.92  45     up         osd.14
-11        1.81839        - 1.8 TiB 462 GiB 460 GiB 140 KiB  2.0 GiB 1.4 TiB 24.79 1.25   -            host galaxy10-leo703
  8   hdd  0.90919  1.00000 931 GiB 232 GiB 231 GiB 120 KiB 1024 MiB 699 GiB 24.91 1.26  61     up         osd.8
 10   hdd  0.90919  1.00000 931 GiB 230 GiB 229 GiB  20 KiB  1.0 GiB 701 GiB 24.68 1.25  61     up         osd.10
-21        2.72838        - 2.7 TiB 505 GiB 502 GiB  12 MiB  3.0 GiB 2.2 TiB 18.09 0.91   -            host galaxy11-mirco303
  5   hdd  0.90919  1.00000 931 GiB 249 GiB 248 GiB  52 KiB 1024 MiB 682 GiB 26.70 1.35  66     up         osd.5
  7   hdd  0.90919  1.00000 931 GiB 140 GiB 139 GiB  52 KiB 1024 MiB 791 GiB 15.05 0.76  37     up         osd.7
  9   hdd  0.90999  1.00000 931 GiB 117 GiB 116 GiB  12 MiB 1012 MiB 815 GiB 12.51 0.63  31     up         osd.9
                      TOTAL  19 TiB 3.8 TiB 3.8 TiB 105 MiB   21 GiB  15 TiB 19.78
MIN/MAX VAR: 0.63/1.35  STDDEV: 3.38

What can be wrong?
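
To see exactly which PGs go inactive while the OSD is down, queries along these lines should list them (read-only, nothing here changes the cluster):
Code:
# health warnings with the affected PG IDs
ceph health detail

# PGs currently stuck inactive or undersized
ceph pg dump_stuck inactive
ceph pg dump_stuck undersized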
 
You have a pool with size=2 and min_size=2. If an OSD is down, some placement groups have only one copy available, which is less than min_size, so any I/O on them is blocked.
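
A minimal way to check and fix this, assuming the pool is named VMs as shown above and the cluster has enough free raw space for a third copy (the size change triggers backfill):
Code:
# current replication settings of the pool
ceph osd pool get VMs size
ceph osd pool get VMs min_size

# recommended for production: 3 copies, I/O keeps flowing as long as 2 are available
ceph osd pool set VMs size 3
ceph osd pool set VMs min_size 2

For planned OSD replacement it also helps to set the noout flag first (ceph osd set noout, and ceph osd unset noout when done) and to replace one OSD at a time, so Ceph does not start rebalancing in the middle of the change.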
 
