Upgrade to PVE 6 and Ceph Nautilus Failed

colwhackamole

New Member
Oct 14, 2019
Hi All,

I'm hoping I can get some assistance here. I have been reading forums and guides to try to resolve this issue, to no avail. Last night I upgraded my Proxmox VE to v6 and my Ceph to Nautilus (I followed the upgrade guide on Proxmox's website). I assume I did something wrong at some point, as my Ceph pool now has incomplete PGs and my guests won't start. The upgrade to v6 went well; the issues started during the Ceph upgrade. Please find my ceph health details below. Any advice or assistance pointing me in the right direction would be much appreciated (I really don't want to have to rebuild all my VMs from scratch). Thank you very much in advance.
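For reference, the per-daemon release and the require_osd_release flag (which the upgrade guide has you set once all OSDs are upgraded) can be double-checked with the standard Ceph CLI; a quick sketch, output omitted here:

ceph versions                              # summary of which release each daemon type reports
ceph osd dump | grep require_osd_release   # should report "nautilus" once the upgrade is complete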

ceph health detail
HEALTH_ERR Reduced data availability: 77 pgs inactive, 77 pgs incomplete; 63 pgs not deep-scrubbed in time; 63 pgs not scrubbed in time; 5 slow requests are blocked > 32 sec; 13 stuck requests are blocked > 4096 sec
PG_AVAILABILITY Reduced data availability: 77 pgs inactive, 77 pgs incomplete
pg 4.0 is incomplete, acting [4,2,0]
pg 4.3 is incomplete, acting [0,5,2]
pg 4.5 is incomplete, acting [0,2,5]
pg 4.6 is incomplete, acting [4,2,0]
pg 4.9 is incomplete, acting [5,1,2]
pg 4.a is incomplete, acting [2,0,1]
pg 4.b is incomplete, acting [0,2,5]
pg 4.c is incomplete, acting [3,4,0]
pg 4.d is incomplete, acting [3,5,0]
pg 4.10 is incomplete, acting [1,0,5]
pg 4.12 is incomplete, acting [1,0,2]
pg 4.14 is incomplete, acting [5,1,2]
pg 4.17 is incomplete, acting [1,3,0]
pg 4.18 is incomplete, acting [2,0,1]
pg 4.19 is incomplete, acting [1,0,5]
pg 4.20 is incomplete, acting [2,1,5]
pg 4.21 is incomplete, acting [1,0,3]
pg 4.22 is incomplete, acting [3,1,5]
pg 4.24 is incomplete, acting [0,2,1]
pg 4.25 is incomplete, acting [3,0,4]
pg 4.26 is incomplete, acting [1,2,5]
pg 4.27 is incomplete, acting [5,3,0]
pg 4.29 is incomplete, acting [4,0,3]
pg 4.2d is incomplete, acting [1,0,2]
pg 4.2e is incomplete, acting [3,1,0]
pg 4.2f is incomplete, acting [1,0,5]
pg 4.30 is incomplete, acting [3,4,0]
pg 4.32 is incomplete, acting [2,4,5]
pg 4.33 is incomplete, acting [4,2,0]
pg 4.35 is incomplete, acting [2,1,5]
pg 4.36 is incomplete, acting [3,4,0]
pg 4.37 is incomplete, acting [4,2,0]
pg 4.38 is incomplete, acting [3,4,0]
pg 4.3e is incomplete, acting [5,0,3]
pg 4.41 is incomplete, acting [2,0,5]
pg 4.43 is incomplete, acting [3,1,0]
pg 4.44 is incomplete, acting [0,2,4]
pg 4.45 is incomplete, acting [4,3,0]
pg 4.46 is incomplete, acting [3,5,4]
pg 4.4a is incomplete, acting [2,4,5]
pg 4.4b is incomplete, acting [4,0,5]
pg 4.4c is incomplete, acting [1,3,5]
pg 4.4e is incomplete, acting [1,2,0]
pg 4.50 is incomplete, acting [1,0,5]
pg 4.51 is incomplete, acting [0,4,5]
pg 4.52 is incomplete, acting [4,3,0]
pg 4.54 is incomplete, acting [4,2,5]
pg 4.55 is incomplete, acting [1,3,5]
pg 4.56 is incomplete, acting [1,2,0]
pg 4.57 is incomplete, acting [0,2,5]
pg 4.5d is stuck inactive since forever, current state incomplete, last acting [3,1,0]
PG_NOT_DEEP_SCRUBBED 63 pgs not deep-scrubbed in time
pg 4.4e not deep-scrubbed since 2019-08-08 17:38:30.541375
pg 4.4c not deep-scrubbed since 2019-08-09 00:36:04.827949
pg 4.43 not deep-scrubbed since 2019-08-08 09:18:51.727078
pg 4.41 not deep-scrubbed since 2019-08-09 07:13:53.639825
pg 4.46 not deep-scrubbed since 2019-08-07 08:08:33.056561
pg 4.45 not deep-scrubbed since 2019-08-05 20:03:01.359555
pg 4.38 not deep-scrubbed since 2019-08-07 07:42:00.306092
pg 4.32 not deep-scrubbed since 2019-08-09 17:13:09.716832
pg 4.33 not deep-scrubbed since 2019-08-07 19:55:17.960813
pg 4.30 not deep-scrubbed since 2019-08-03 10:32:50.107089
pg 4.36 not deep-scrubbed since 2019-08-07 05:32:08.733661
pg 4.37 not deep-scrubbed since 2019-08-06 17:18:10.678911
pg 4.35 not deep-scrubbed since 2019-08-04 00:57:54.992951
pg 4.29 not deep-scrubbed since 2019-08-05 16:31:15.559160
pg 4.2e not deep-scrubbed since 2019-08-08 01:02:46.726850
pg 4.2f not deep-scrubbed since 2019-08-04 10:43:14.844307
pg 4.2d not deep-scrubbed since 2019-08-09 22:37:39.907829
pg 4.20 not deep-scrubbed since 2019-08-05 01:21:15.869116
pg 4.21 not deep-scrubbed since 2019-08-05 18:14:18.475446
pg 4.a not deep-scrubbed since 2019-08-08 11:42:20.696481
pg 4.9 not deep-scrubbed since 2019-08-05 16:47:51.160285
pg 4.3 not deep-scrubbed since 2019-08-09 16:58:43.074139
pg 4.5 not deep-scrubbed since 2019-08-02 21:24:55.102930
pg 4.6 not deep-scrubbed since 2019-08-04 13:29:45.752570
pg 4.0 not deep-scrubbed since 2019-08-07 00:25:25.380248
pg 4.14 not deep-scrubbed since 2019-08-09 10:49:06.983061
pg 4.17 not deep-scrubbed since 2019-08-05 08:05:03.284850
pg 4.10 not deep-scrubbed since 2019-08-06 06:13:35.967461
pg 4.12 not deep-scrubbed since 2019-08-05 18:50:21.048520
pg 4.19 not deep-scrubbed since 2019-08-05 23:09:34.653928
pg 4.25 not deep-scrubbed since 2019-08-09 16:52:40.003880
pg 4.27 not deep-scrubbed since 2019-08-05 09:56:14.737333
pg 4.26 not deep-scrubbed since 2019-08-08 07:44:01.528375
pg 4.4b not deep-scrubbed since 2019-08-03 07:23:54.227877
pg 4.55 not deep-scrubbed since 2019-08-02 16:38:44.609272
pg 4.54 not deep-scrubbed since 2019-08-09 02:06:41.122968
pg 4.57 not deep-scrubbed since 2019-08-05 23:58:25.582045
pg 4.56 not deep-scrubbed since 2019-08-07 09:53:22.378615
pg 4.51 not deep-scrubbed since 2019-08-06 09:46:12.568070
pg 4.50 not deep-scrubbed since 2019-08-04 23:38:01.629711
pg 4.52 not deep-scrubbed since 2019-08-08 22:53:22.048507
pg 4.5d not deep-scrubbed since 2019-08-08 00:08:02.311031
pg 4.5f not deep-scrubbed since 2019-08-08 15:03:25.888232
pg 4.59 not deep-scrubbed since 2019-08-06 21:33:45.450309
pg 4.58 not deep-scrubbed since 2019-08-05 18:11:33.951199
pg 4.5a not deep-scrubbed since 2019-08-04 09:05:41.329865
pg 4.64 not deep-scrubbed since 2019-08-08 14:06:20.872913
pg 4.67 not deep-scrubbed since 2019-08-06 05:19:39.515020
pg 4.61 not deep-scrubbed since 2019-08-08 04:28:30.512770
pg 4.60 not deep-scrubbed since 2019-08-02 19:00:28.902131
13 more pgs...
PG_NOT_SCRUBBED 63 pgs not scrubbed in time
pg 4.4e not scrubbed since 2019-08-09 23:41:15.501075
pg 4.4c not scrubbed since 2019-08-09 00:36:04.827949
pg 4.43 not scrubbed since 2019-08-09 16:43:52.319864
pg 4.41 not scrubbed since 2019-08-09 07:13:53.639825
pg 4.46 not scrubbed since 2019-08-09 16:12:53.266564
pg 4.45 not scrubbed since 2019-08-09 03:15:49.529053
pg 4.38 not scrubbed since 2019-08-09 15:01:23.224579
pg 4.32 not scrubbed since 2019-08-09 17:13:09.716832
pg 4.33 not scrubbed since 2019-08-09 05:47:51.383642
pg 4.30 not scrubbed since 2019-08-09 15:01:04.878778
pg 4.36 not scrubbed since 2019-08-09 10:51:56.326858
pg 4.37 not scrubbed since 2019-08-09 14:34:41.344565
pg 4.35 not scrubbed since 2019-08-09 05:26:08.638891
pg 4.29 not scrubbed since 2019-08-09 10:09:30.690568
pg 4.2e not scrubbed since 2019-08-09 03:41:47.419563
pg 4.2f not scrubbed since 2019-08-09 16:24:32.349538
pg 4.2d not scrubbed since 2019-08-09 22:37:39.907829
pg 4.20 not scrubbed since 2019-08-09 06:07:05.066173
pg 4.21 not scrubbed since 2019-08-09 07:00:56.200580
pg 4.a not scrubbed since 2019-08-09 13:54:19.146873
pg 4.9 not scrubbed since 2019-08-09 16:58:44.945182
pg 4.3 not scrubbed since 2019-08-09 16:58:43.074139
pg 4.5 not scrubbed since 2019-08-09 20:09:52.000192
pg 4.6 not scrubbed since 2019-08-09 15:01:29.687018
pg 4.0 not scrubbed since 2019-08-09 12:31:27.153942
pg 4.14 not scrubbed since 2019-08-09 10:49:06.983061
pg 4.17 not scrubbed since 2019-08-09 02:27:50.421299
pg 4.10 not scrubbed since 2019-08-09 23:41:14.554786
pg 4.12 not scrubbed since 2019-08-09 13:58:29.837920
pg 4.19 not scrubbed since 2019-08-09 14:41:26.654210
pg 4.25 not scrubbed since 2019-08-09 16:52:40.003880
pg 4.27 not scrubbed since 2019-08-08 23:42:29.942359
pg 4.26 not scrubbed since 2019-08-09 12:49:00.580797
pg 4.4b not scrubbed since 2019-08-09 00:36:11.837797
pg 4.55 not scrubbed since 2019-08-09 07:59:50.643326
pg 4.54 not scrubbed since 2019-08-09 02:06:41.122968
pg 4.57 not scrubbed since 2019-08-09 22:14:24.377315
pg 4.56 not scrubbed since 2019-08-09 17:13:13.845480
pg 4.51 not scrubbed since 2019-08-08 23:00:58.570717
pg 4.50 not scrubbed since 2019-08-09 12:15:59.289107
pg 4.52 not scrubbed since 2019-08-08 22:53:22.048507
pg 4.5d not scrubbed since 2019-08-09 02:28:34.571034
pg 4.5f not scrubbed since 2019-08-09 16:20:36.448028
pg 4.59 not scrubbed since 2019-08-09 02:45:07.147397
pg 4.58 not scrubbed since 2019-08-09 10:21:07.569893
pg 4.5a not scrubbed since 2019-08-09 08:55:13.054821
pg 4.64 not scrubbed since 2019-08-09 15:37:52.195432
pg 4.67 not scrubbed since 2019-08-09 22:14:26.959688
pg 4.61 not scrubbed since 2019-08-09 14:37:28.729140
pg 4.60 not scrubbed since 2019-08-09 16:58:47.062602
13 more pgs...
REQUEST_SLOW 5 slow requests are blocked > 32 sec
5 ops are blocked > 2097.15 sec
osd.5 has blocked requests > 2097.15 sec
REQUEST_STUCK 13 stuck requests are blocked > 4096 sec
13 ops are blocked > 4194.3 sec
osds 1,2,3,4 have stuck requests > 4194.3 sec
root@pgdr7101:/etc/ceph#
 
Also, please find my specs below

ceph -s
cluster:
id: a1a20d8b-c8e0-4fd9-8892-75d978ba49fa
health: HEALTH_ERR
Reduced data availability: 77 pgs inactive, 77 pgs incomplete
63 pgs not deep-scrubbed in time
63 pgs not scrubbed in time
18 stuck requests are blocked > 4096 sec

services:
mon: 3 daemons, quorum pguas1,pgdr7101,pgdr7102 (age 7h)
mgr: pguas2(active, since 6h), standbys: pguas1, pgdr7101, pgdr7102
mds: cephfs:1 {0=pguas1=up:active} 1 up:standby
osd: 6 osds: 6 up, 6 in

data:
pools: 3 pools, 288 pgs
objects: 74.34k objects, 285 GiB
usage: 847 GiB used, 10 TiB / 11 TiB avail
pgs: 26.736% pgs not active
211 active+clean
77 incomplete
 
osd: 6 osds: 6 up, 6 in
Are these all the OSDs? And do you see any heartbeat failures in the Ceph logs?

And what hardware was the cluster built on?
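A quick way to check for heartbeat failures is to grep the OSD logs on each node, something like this (assuming the default log location under /var/log/ceph/):

grep -i 'heartbeat_check' /var/log/ceph/ceph-osd.*.log
# failures typically look like "heartbeat_check: no reply from ... osd.N"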
 
Correct, those are all the OSDs. As far as I am aware, I don't see any heartbeat failures. I scrolled back pretty far and all I see is a repeat of the lines below.

[Attachment: ceph log.PNG (screenshot of the repeating Ceph log lines)]

My hardware consists of 2 UniFi Application Server XGs and 2 Dell R710s. This environment supports my home VoIP system (FreePBX), media system (Emby), AD, VPN, CCTV, and several light tools/VMs. Backups are stored nightly on my NAS.
 
Is there a way I can repair my 77 incomplete PGs? Currently, Proxmox won't let me restore from my backups on my NAS.

TASK ERROR: command 'set -o pipefail && zcat /mnt/pve/PGNAS2-VMBackups-Vol2/dump/vzdump-qemu-105-2019_09_21-01_15_31.vma.gz | vma extract -v -r /var/tmp/vzdumptmp1057830.fifo - /var/tmp/vzdumptmp1057830' failed: error with cfs lock 'storage-CEPHPOOL': got lock request timeou
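For reference, an individual incomplete PG can be queried to see why peering is blocked; a sketch using pg 4.0 from the health output above (any of the listed PGs works the same way):

ceph pg 4.0 query | less        # the "recovery_state" section usually explains what peering is waiting on
ceph pg dump_stuck inactive     # list all PGs that are stuck inactive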
 
TASK ERROR: command 'set -o pipefail && zcat /mnt/pve/PGNAS2-VMBackups-Vol2/dump/vzdump-qemu-105-2019_09_21-01_15_31.vma.gz | vma extract -v -r /var/tmp/vzdumptmp1057830.fifo - /var/tmp/vzdumptmp1057830' failed: error with cfs lock 'storage-CEPHPOOL': got lock request timeou
The requests for this are stuck (slow requests). What size/min_size do the pools have?

Can you also post the crushmap?
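The size/min_size can also be read straight from the CLI, for example (the pool name below is just a placeholder):

ceph osd pool ls detail                 # size, min_size, pg_num and flags for every pool
ceph osd pool get <poolname> size       # or query a single pool
ceph osd pool get <poolname> min_size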
 
Here's my pool info. I'm running 3/2.
[Attachment: Ceph-Pools.PNG (screenshot of the pool list)]

Here's my crush map
ceph osd crush tree --show-shadow
ID CLASS WEIGHT TYPE NAME
-1 10.82014 root default
-7 0.81639 host pgdr7101
5 hdd 0.81639 osd.5
-9 2.72659 host pgdr7102
0 hdd 2.72659 osd.0
-3 3.63858 host pguas1
1 hdd 1.81929 osd.1
4 hdd 1.81929 osd.4
-5 3.63858 host pguas2
2 hdd 1.81929 osd.2
3 hdd 1.81929 osd.3

Thank you again for your time and assistance with this.
 
ceph osd crush tree --show-shadow
That is not the crushmap; it is easiest to get from the GUI: Datacenter -> node -> Ceph -> Configuration, and the crushmap is shown on the right side. Please copy the text and post it.

And please also post a ceph osd df tree; it seems that the data distribution might be the underlying issue.
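If the GUI is inconvenient, the crushmap can also be dumped and decompiled on the shell, roughly like this (crushtool ships with the Ceph packages):

ceph osd getcrushmap -o /tmp/crushmap.bin             # binary crushmap
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # decompile to readable text
cat /tmp/crushmap.txt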
 
My apologies. Please find the crushmap below.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pguas1 {
	id -3		# do not change unnecessarily
	id -4 class hdd		# do not change unnecessarily
	# weight 3.639
	alg straw2
	hash 0	# rjenkins1
	item osd.1 weight 1.819
	item osd.4 weight 1.819
}
host pguas2 {
	id -5		# do not change unnecessarily
	id -6 class hdd		# do not change unnecessarily
	# weight 3.639
	alg straw2
	hash 0	# rjenkins1
	item osd.2 weight 1.819
	item osd.3 weight 1.819
}
host pgdr7101 {
	id -7		# do not change unnecessarily
	id -8 class hdd		# do not change unnecessarily
	# weight 0.816
	alg straw2
	hash 0	# rjenkins1
	item osd.5 weight 0.816
}
host pgdr7102 {
	id -9		# do not change unnecessarily
	id -10 class hdd		# do not change unnecessarily
	# weight 2.727
	alg straw2
	hash 0	# rjenkins1
	item osd.0 weight 2.727
}
root default {
	id -1		# do not change unnecessarily
	id -2 class hdd		# do not change unnecessarily
	# weight 10.820
	alg straw2
	hash 0	# rjenkins1
	item pguas1 weight 3.639
	item pguas2 weight 3.639
	item pgdr7101 weight 0.816
	item pgdr7102 weight 2.727
}

# rules
rule replicated_rule {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
# end crush map

Please also find the df tree output below.

ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 10.82014 - 11 TiB 847 GiB 841 GiB 2.4 MiB 6.0 GiB 10 TiB 7.65 1.00 - root default
-7 0.81639 - 836 GiB 17 GiB 16 GiB 7 KiB 1024 MiB 819 GiB 2.05 0.27 - host pgdr7101
5 hdd 0.81639 1.00000 836 GiB 17 GiB 16 GiB 7 KiB 1024 MiB 819 GiB 2.05 0.27 113 up osd.5
-9 2.72659 - 2.7 TiB 270 GiB 269 GiB 1.7 MiB 1022 MiB 2.5 TiB 9.68 1.27 - host pgdr7102
0 hdd 2.72659 1.00000 2.7 TiB 270 GiB 269 GiB 1.7 MiB 1022 MiB 2.5 TiB 9.68 1.27 241 up osd.0
-3 3.63858 - 3.6 TiB 266 GiB 264 GiB 591 KiB 2.0 GiB 3.4 TiB 7.14 0.93 - host pguas1
1 hdd 1.81929 1.00000 1.8 TiB 152 GiB 151 GiB 7 KiB 1024 MiB 1.7 TiB 8.15 1.07 129 up osd.1
4 hdd 1.81929 1.00000 1.8 TiB 114 GiB 113 GiB 584 KiB 1023 MiB 1.7 TiB 6.13 0.80 126 up osd.4
-5 3.63858 - 3.6 TiB 294 GiB 292 GiB 104 KiB 2.0 GiB 3.4 TiB 7.89 1.03 - host pguas2
2 hdd 1.81929 1.00000 1.8 TiB 139 GiB 138 GiB 27 KiB 1024 MiB 1.7 TiB 7.43 0.97 121 up osd.2
3 hdd 1.81929 1.00000 1.8 TiB 155 GiB 154 GiB 77 KiB 1024 MiB 1.7 TiB 8.34 1.09 134 up osd.3
TOTAL 11 TiB 847 GiB 841 GiB 2.4 MiB 6.0 GiB 10 TiB 7.65
MIN/MAX VAR: 0.27/1.27 STDDEV: 2.53

Thanks again
 
Please post it as code (the small three dots in the editor); it is not readable otherwise.
 
