VMs offline after OSD change

Irek Zayniev

New Member
May 29, 2018
Hello!
I have an 8-node cluster running PVE 6.0.4 with Ceph Nautilus.
On each node there is 1 OSD.
Ceph cluster usage is about 48%.

The Ceph config is:
Code:
[global] 
 auth client required = cephx
 auth cluster required = cephx 
 auth service required = cephx 
 cluster network = 10.200.201.0/22 
 fsid = ede0d6ae-81ec-4137-a918-5daf79ae0ff2 
 mon allow pool delete = true 
 osd journal size = 5120 
 osd pool default min size = 2 
 osd pool default size = 3 
 public network = 10.200.201.0/22 
 mon_host = 10.200.201.73 10.200.201.74 10.200.201.76 

[client] 
 keyring = /etc/pve/priv/$cluster.$name.keyring

I added 2 OSDs and removed 1 at the same time on one node.
67 PGs became inactive and all VMs are now offline from the outside; for some of them the console does not work, and no migration is possible.

What is wrong? How can I avoid this kind of issue in the future?
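For reference, a generic, less disruptive workflow (this is a sketch, not from the original post; device names and the OSD ID below are placeholders) is to add the new OSDs first, let the cluster rebalance, and only then drain and remove the old one:
Code:
# pause data movement while changing the topology
ceph osd set noout
ceph osd set norebalance

# add the new OSDs (device names are examples)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc

# re-enable data movement and wait until ceph -s reports HEALTH_OK
ceph osd unset norebalance
ceph osd unset noout

# only afterwards drain the old OSD (ID is an example)
ceph osd out 5
# once recovery has finished, stop and destroy it
systemctl stop ceph-osd@5.service
pveceph osd destroy 5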
 


Please post actual Ceph data. The easiest is to run
Code:
pvereport

At the end you will find some information regarding Ceph.
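For reference (a generic sketch, not part of the original reply), the Ceph-related part of that report can also be gathered directly with a few commands:
Code:
ceph -s
ceph health detail
ceph osd df tree
pveceph lspools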
 
The report is really long and interesting.
Ceph is healthy.
Code:
# pveceph lspools
Name                       size   min_size     pg_num     %-used                 used
VMs                           2          2        512       0.17        2879961897728

# echo VMs
rbd ls VMs

What is interesting is that all VMs go offline even if only 1 OSD is down, and everything is fine as soon as the OSD is back or Ceph finishes recovery.

Code:
ceph osd df tree

ID  CLASS WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1       19.09865        -  19 TiB 3.8 TiB 3.8 TiB 105 MiB   21 GiB  15 TiB 19.78 1.00   -        root default
 -3        0.90999        - 931 GiB 193 GiB 192 GiB  36 MiB  988 MiB 738 GiB 20.75 1.05   -            host galaxy01-rubby102
  0   hdd  0.90999  1.00000 931 GiB 193 GiB 192 GiB  36 MiB  988 MiB 738 GiB 20.75 1.05  51     up         osd.0
 -5        2.72838        - 2.7 TiB 540 GiB 537 GiB 8.0 MiB  3.0 GiB 2.2 TiB 19.31 0.98   -            host galaxy02-rubby202
  1   hdd  0.90999  1.00000 931 GiB 173 GiB 172 GiB 7.9 MiB 1016 MiB 759 GiB 18.54 0.94  46     up         osd.1
 19   hdd  0.90919  1.00000 931 GiB 155 GiB 154 GiB  48 KiB 1024 MiB 776 GiB 16.67 0.84  41     up         osd.19
 20   hdd  0.90919  1.00000 931 GiB 212 GiB 211 GiB  32 KiB 1024 MiB 719 GiB 22.74 1.15  56     up         osd.20
 -7        2.72838        - 2.7 TiB 543 GiB 540 GiB 5.8 MiB  3.0 GiB 2.2 TiB 19.44 0.98   -            host galaxy03-rubby702
  2   hdd  0.90999  1.00000 931 GiB 181 GiB 180 GiB 5.7 MiB 1018 MiB 750 GiB 19.48 0.98  48     up         osd.2
 17   hdd  0.90919  1.00000 931 GiB 170 GiB 169 GiB  16 KiB 1024 MiB 761 GiB 18.26 0.92  45     up         osd.17
 18   hdd  0.90919  1.00000 931 GiB 192 GiB 191 GiB  32 KiB 1024 MiB 739 GiB 20.58 1.04  51     up         osd.18
 -9        2.72838        - 2.7 TiB 593 GiB 590 GiB  15 MiB  3.1 GiB 2.1 TiB 21.22 1.07   -            host galaxy04-rubby802
  3   hdd  0.90999  1.00000 931 GiB 184 GiB 183 GiB  15 MiB 1009 MiB 747 GiB 19.81 1.00  49     up         osd.3
 15   hdd  0.90919  1.00000 931 GiB 196 GiB 195 GiB  48 KiB 1024 MiB 735 GiB 21.09 1.07  52     up         osd.15
 16   hdd  0.90919  1.00000 931 GiB 212 GiB 211 GiB   4 KiB  1.1 GiB 719 GiB 22.75 1.15  56     up         osd.16
-13        2.72838        - 2.7 TiB 537 GiB 534 GiB  14 MiB  3.0 GiB 2.2 TiB 19.21 0.97   -            host galaxy06-rubby121
  4   hdd  0.90999  1.00000 931 GiB 170 GiB 169 GiB  14 MiB 1010 MiB 761 GiB 18.24 0.92  45     up         osd.4
 11   hdd  0.90919  1.00000 931 GiB 159 GiB 158 GiB  80 KiB 1024 MiB 772 GiB 17.04 0.86  42     up         osd.11
 12   hdd  0.90919  1.00000 931 GiB 208 GiB 207 GiB  16 KiB 1024 MiB 723 GiB 22.36 1.13  55     up         osd.12
-15        2.72838        - 2.7 TiB 496 GiB 493 GiB  14 MiB  3.0 GiB 2.2 TiB 17.75 0.90   -            host galaxy07-rubby131
  6   hdd  0.90999  1.00000 931 GiB 153 GiB 152 GiB  14 MiB 1010 MiB 779 GiB 16.37 0.83  40     up         osd.6
 13   hdd  0.90919  1.00000 931 GiB 174 GiB 173 GiB  20 KiB 1024 MiB 757 GiB 18.67 0.94  46     up         osd.13
 14   hdd  0.90919  1.00000 931 GiB 170 GiB 169 GiB 160 KiB 1024 MiB 761 GiB 18.22 0.92  45     up         osd.14
-11        1.81839        - 1.8 TiB 462 GiB 460 GiB 140 KiB  2.0 GiB 1.4 TiB 24.79 1.25   -            host galaxy10-leo703
  8   hdd  0.90919  1.00000 931 GiB 232 GiB 231 GiB 120 KiB 1024 MiB 699 GiB 24.91 1.26  61     up         osd.8
 10   hdd  0.90919  1.00000 931 GiB 230 GiB 229 GiB  20 KiB  1.0 GiB 701 GiB 24.68 1.25  61     up         osd.10
-21        2.72838        - 2.7 TiB 505 GiB 502 GiB  12 MiB  3.0 GiB 2.2 TiB 18.09 0.91   -            host galaxy11-mirco303
  5   hdd  0.90919  1.00000 931 GiB 249 GiB 248 GiB  52 KiB 1024 MiB 682 GiB 26.70 1.35  66     up         osd.5
  7   hdd  0.90919  1.00000 931 GiB 140 GiB 139 GiB  52 KiB 1024 MiB 791 GiB 15.05 0.76  37     up         osd.7
  9   hdd  0.90999  1.00000 931 GiB 117 GiB 116 GiB  12 MiB 1012 MiB 815 GiB 12.51 0.63  31     up         osd.9
                      TOTAL  19 TiB 3.8 TiB 3.8 TiB 105 MiB   21 GiB  15 TiB 19.78
MIN/MAX VAR: 0.63/1.35  STDDEV: 3.38

What can be wrong?
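(As an aside, not part of the original post: while the outage is happening, the stuck PGs and the pool's replication settings can be checked with commands like the following; the pool name VMs is taken from the lspools output above.)
Code:
ceph health detail
ceph pg dump_stuck inactive
ceph osd pool get VMs size
ceph osd pool get VMs min_size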
 
You have a pool with size=2 and min_size=2. If an OSD is down, some placement groups have only one copy available, which is less than min_size, so any I/O to them is blocked.
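A common remedy (a sketch, not from the original reply) is to raise the pool to size=3 while keeping min_size=2, so a single OSD failure no longer blocks I/O:
Code:
ceph osd pool set VMs size 3
ceph osd pool set VMs min_size 2

Note that raising size triggers a rebalance and increases raw space usage by roughly 50%, so check free capacity first.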
 
