[SOLVED] Web interface not working, and /etc/pve not mounting

bondif

Member
Sep 21, 2021
Hi,

We have a cluster of 9 nodes with a Ceph pool for VM storage. Recently we lost access to the web UI, but we still have SSH access to the servers, and all the VMs are still running fine.
We've tried to debug the problem with no luck; the only thing we're sure about is that there is a problem with the filesystem.
Today we decided to reboot one of the nodes (pve0), and it shows these messages on boot:
`Failed to start Import ZFS pool zfs\x2dvm.`
`Failed to start Import ZFS pools by cache file.`
After that it boots fine, but /etc/pve is empty!
-----
Running systemctl status pveproxy takes a very long time and then shows "Transport endpoint is not connected".

Any help please? We have 5 VMs down :(
 
Please provide the output of the following commands:
pveversion -v
systemctl status pve-cluster.service
systemctl status corosync.service
pvecm status
 
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.11.22-4-pve: 5.11.22-8
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
Failed to get properties: Transport endpoint is not connected
Failed to get properties: Transport endpoint is not connected

`pvecm status` has now been running for more than 20 minutes without returning anything
 
If that is the error you get when running systemctl commands, there's already something wrong with systemd.
Please provide the complete journal since the last boot: journalctl -b > journal.txt
The above command creates a file called journal.txt, please attach it here.
 
Thank you for the journal.
This is from pve7, but pve0 would be more interesting, as it is the one recently rebooted.
Please provide it from pve0.
 
Code:
May 16 16:00:13 pve0 systemd-logind[1602]: Failed to start user service 'user@0.service', ignoring: Connection timed out
May 16 16:00:38 pve0 systemd-logind[1602]: Failed to start session scope session-1.scope: Connection timed out
May 16 16:00:38 pve0 login[2187]: pam_systemd(login:session): Failed to create session: Connection timed out
We can see those symptoms, but sadly not the reason for them.

Based on the pveversion -v output you're still running PVE 7.0.
Could you try updating to the latest version?
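For reference, a minimal sketch of the usual CLI update path on a PVE 7 node (assuming the appropriate Proxmox repository, enterprise or no-subscription, is already configured):
Code:
apt update          # refresh package lists from the configured Proxmox repository
apt dist-upgrade    # pull in all updated PVE packages and the new kernel
pveversion -v       # verify the installed versions afterwards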
 
Just for the record: we found that it was a problem with the system date. The node was on GMT+1; we brought it back to GMT, restarted the cluster service with "systemctl restart pve-cluster", and the web UI came back.
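For anyone running into the same thing, a rough sketch of the steps described above (the exact timezone name depends on your setup):
Code:
timedatectl                          # check current time, timezone and NTP status
timedatectl set-timezone Etc/GMT     # example: set the timezone back to GMT
timedatectl set-ntp true             # make sure time synchronisation is enabled
systemctl restart pve-cluster        # restart pmxcfs so /etc/pve is mounted again
systemctl restart pveproxy           # restart the web UI service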

But now there is another issue: the Ceph cluster is degraded and 6 OSDs (3 on pve0 + 3 on pve6) are down/out and won't come back up/in. Any idea how to bring them back, please?


When we try to bring them back, we get these logs:
Code:
May 18 10:30:34 pve6 pvedaemon[1676]: <root@pam> end task UPID:pve6:000495A6:006B4ECA:6284CACA:srvstart:osd.26:root@pam: OK
May 18 10:30:49 pve6 ceph-osd[300460]: 2022-05-18T10:30:49.924+0000 7fda4bda9f00 -1 osd.26 17970 log_to_monitors {default=true}
May 18 10:30:50 pve6 ceph-osd[195360]: 2022-05-18T10:30:50.592+0000 7ff25d843700 -1 osd.24 18153 osdmap fullness state needs update
May 18 10:30:51 pve6 ceph-osd[300460]: 2022-05-18T10:30:51.096+0000 7fda442b2700 -1 osd.26 18153 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 10:30:51 pve6 ceph-osd[195360]: 2022-05-18T10:30:51.848+0000 7ff25d843700 -1 osd.24 18154 osdmap fullness state needs update
May 18 10:30:52 pve6 ceph-osd[195360]: 2022-05-18T10:30:52.592+0000 7ff25d843700 -1 osd.24 18155 osdmap fullness state needs update
May 18 10:30:53 pve6 ceph-osd[195360]: 2022-05-18T10:30:53.704+0000 7ff25d843700 -1 osd.24 18156 osdmap fullness state needs update
May 18 10:30:54 pve6 ceph-osd[195360]: 2022-05-18T10:30:54.816+0000 7ff25d843700 -1 osd.24 18157 osdmap fullness state needs update
May 18 10:30:59 pve6 ceph-osd[195360]: 2022-05-18T10:30:59.512+0000 7ff25d843700 -1 osd.24 18158 osdmap fullness state needs update
May 18 10:31:00 pve6 systemd[1]: Starting Proxmox VE replication runner...
May 18 10:31:00 pve6 ceph-osd[195360]: 2022-05-18T10:31:00.620+0000 7ff25d843700 -1 osd.24 18159 osdmap fullness state needs update
May 18 10:31:00 pve6 systemd[1]: pvesr.service: Succeeded.
May 18 10:31:00 pve6 systemd[1]: Finished Proxmox VE replication runner.
May 18 10:31:03 pve6 ceph-osd[195360]: 2022-05-18T10:31:03.712+0000 7ff25d843700 -1 osd.24 18160 osdmap fullness state needs update
May 18 10:31:04 pve6 ceph-osd[195360]: 2022-05-18T10:31:04.816+0000 7ff25d843700 -1 osd.24 18161 osdmap fullness state needs update
May 18 10:31:14 pve6 ceph-osd[195360]: 2022-05-18T10:31:14.056+0000 7ff25d843700 -1 osd.24 18162 osdmap fullness state needs update
May 18 10:31:15 pve6 ceph-osd[195360]: 2022-05-18T10:31:15.128+0000 7ff25d843700 -1 osd.24 18163 osdmap fullness state needs update
May 18 10:31:16 pve6 ceph-osd[300460]: 2022-05-18T10:31:16.044+0000 7fda442b2700 -1 osd.26 18163 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
May 18 10:31:16 pve6 ceph-osd[195360]: 2022-05-18T10:31:16.256+0000 7ff25d843700 -1 osd.24 18164 osdmap fullness state needs update
May 18 10:31:17 pve6 ceph-osd[195360]: 2022-05-18T10:31:17.400+0000 7ff25d843700 -1 osd.24 18165 osdmap fullness state needs update
May 18 10:31:18 pve6 ceph-osd[195360]: 2022-05-18T10:31:18.440+0000 7ff25d843700 -1 osd.24 18166 osdmap fullness state needs update
May 18 10:31:26 pve6 ceph-osd[195360]: 2022-05-18T10:31:26.920+0000 7ff25d843700 -1 osd.24 18167 osdmap fullness state needs update
May 18 10:31:27 pve6 ceph-osd[195360]: 2022-05-18T10:31:27.980+0000 7ff25d843700 -1 osd.24 18168 osdmap fullness state needs update

Code:
root@pve6:/etc/pve# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         11.18625  root default
 -3          1.63737      host pve0
  0    hdd   0.54579          osd.0     down         0  1.00000
  1    hdd   0.54579          osd.1     down         0  1.00000
 14    hdd   0.54579          osd.14    down         0  1.00000
 -5          2.72896      host pve1
  2    hdd   0.54579          osd.2       up   1.00000  1.00000
  3    hdd   0.54579          osd.3       up   1.00000  1.00000
  4    hdd   0.54579          osd.4       up   1.00000  1.00000
  5    hdd   0.54579          osd.5       up   1.00000  1.00000
  6    hdd   0.54579          osd.6       up   1.00000  1.00000
 -7          1.90958      host pve2
  7    hdd   0.27280          osd.7       up   1.00000  1.00000
  8    hdd   0.27280          osd.8       up   1.00000  1.00000
  9    hdd   0.27280          osd.9       up   1.00000  1.00000
 10    hdd   0.27280          osd.10      up   1.00000  1.00000
 11    hdd   0.27280          osd.11      up   1.00000  1.00000
 12    hdd   0.27280          osd.12      up   1.00000  1.00000
 13    hdd   0.27280          osd.13      up   1.00000  1.00000
 -9          0.81839      host pve3
 15    hdd   0.27280          osd.15      up   1.00000  1.00000
 16    hdd   0.27280          osd.16      up   1.00000  1.00000
 17    hdd   0.27280          osd.17      up   1.00000  1.00000
-11          1.09119      host pve4
 18    hdd   0.27280          osd.18      up   1.00000  1.00000
 19    hdd   0.27280          osd.19      up   1.00000  1.00000
 20    hdd   0.27280          osd.20      up   1.00000  1.00000
 21    hdd   0.27280          osd.21      up   1.00000  1.00000
-13          0.54559      host pve5
 22    hdd   0.27280          osd.22      up   1.00000  1.00000
 23    hdd   0.27280          osd.23      up   1.00000  1.00000
-15          0.81839      host pve6
 24    hdd   0.27280          osd.24    down         0  1.00000
 25    hdd   0.27280          osd.25    down         0  1.00000
 26    hdd   0.27280          osd.26    down         0  1.00000
-17          0.81839      host pve7
 27    hdd   0.27280          osd.27      up   1.00000  1.00000
 28    hdd   0.27280          osd.28      up   1.00000  1.00000
 29    hdd   0.27280          osd.29      up   1.00000  1.00000
-19          0.81839      host pve8
 30    hdd   0.27280          osd.30      up   1.00000  1.00000
 31    hdd   0.27280          osd.31      up   1.00000  1.00000
 32    hdd   0.27280          osd.32      up   1.00000  1.00000
 

Attachments

  • ceph-status.JPG
More logs from when we start osd.25 and then it goes down again:
Code:
2022-05-18T10:57:31.508+0000 7fc78327f700  1 osd.25 pg_epoch: 18356 pg[2.282( v 17528'2529901 lc 16523'2529900 (11747'2526900,17528'2529901] local-lis/les=18350/18351 n=502 ec=11802/98 lis/c=18353/18353 les/c/f=18354/18354/0 sis=18356 pruub=13.029947281s) [29,25] r=1 lpr=18356 pi=[16606,18356)/2 crt=17528'2529901 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 159.213790894s@ m=1 mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:31.508+0000 7fc78227d700  1 osd.25 pg_epoch: 18356 pg[2.63( v 11780'3043998 (11775'3038582,11780'3043998] lb 2:c613e9c8:::rbd_data.4f8634c44a746b.0000000000000c48:head local-lis/les=13630/13631 n=302 ec=98/98 lis/c=17610/17610 les/c/f=17611/17611/0 sis=18356) [10,25]/[10] r=-1 lpr=18356 pi=[13630,18356)/1 crt=11780'3043998 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:31.508+0000 7fc78327f700  1 osd.25 pg_epoch: 18356 pg[2.ac( v 12976'2359645 (11780'2357129,12976'2359645] local-lis/les=16557/16558 n=501 ec=2244/98 lis/c=16557/16557 les/c/f=16558/16558/0 sis=18356) [25,5]/[25] r=0 lpr=18356 pi=[16557,18356)/2 crt=12976'2359645 lcod 0'0 mlcod 0'0 remapped mbc={}] start_peering_interval up [25,5] -> [25,5], acting [25,5] -> [25], acting_primary 25 -> 25, up_primary 25 -> 25, role 0 -> 0, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:31.508+0000 7fc78327f700  1 osd.25 pg_epoch: 18356 pg[2.ac( v 12976'2359645 (11780'2357129,12976'2359645] local-lis/les=16557/16558 n=501 ec=2244/98 lis/c=16557/16557 les/c/f=16558/16558/0 sis=18356) [25,5]/[25] r=0 lpr=18356 pi=[16557,18356)/2 crt=12976'2359645 lcod 0'0 mlcod 0'0 remapped mbc={}] state<Start>: transitioning to Primary
2022-05-18T10:57:32.664+0000 7fc78327f700  0 log_channel(cluster) log [INF] : 2.ac continuing backfill to osd.5 from (11780'2357129,12976'2359645] 2:35011190:::rbd_data.5a67fff34445bb.0000000000000c57:head to 12976'2359645
2022-05-18T10:57:32.664+0000 7fc782a7e700  0 log_channel(cluster) log [INF] : 2.12a continuing backfill to osd.19 from (11780'2916530,11780'2922079] MIN to 11780'2922079
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.4ac continuing backfill to osd.5 from (11780'2357129,11780'2359638] MIN to 11780'2359638
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.2ac continuing backfill to osd.5 from (11780'2357129,11780'2359638] 2:35461315:::rbd_data.5a67fff34445bb.000000000001329d:head to 11780'2359638
2022-05-18T10:57:32.664+0000 7fc78327f700  0 log_channel(cluster) log [INF] : 2.21e continuing backfill to osd.3 from (11780'3969559,11780'3972047] 2:784662d6:::rbd_data.4dba77a5b8f793.0000000000002d00:head to 11780'3972047
2022-05-18T10:57:32.664+0000 7fc782a7e700  0 log_channel(cluster) log [INF] : 2.6ac continuing backfill to osd.5 from (11780'2357129,11780'2359638] 2:356fce3c:::rbd_data.52befd43cf3520.00000000000081eb:head to 11780'2359638
2022-05-18T10:57:32.664+0000 7fc78327f700  0 log_channel(cluster) log [INF] : 2.21e continuing backfill to osd.8 from (11780'3969559,11780'3972047] 2:784662d6:::rbd_data.4dba77a5b8f793.0000000000002d00:head to 11780'3972047
2022-05-18T10:57:32.664+0000 7fc784281700  0 log_channel(cluster) log [INF] : 2.32a continuing backfill to osd.19 from (11780'2916530,11780'2922079] MIN to 11780'2922079
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.61e continuing backfill to osd.3 from (11780'3969559,12978'3972057] 2:7863171a:::rbd_data.668c986b195cf.0000000000000e9d:head to 12978'3972057
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.61e continuing backfill to osd.8 from (11780'3969559,12978'3972057] 2:7863171a:::rbd_data.668c986b195cf.0000000000000e9d:head to 12978'3972057
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.72a continuing backfill to osd.19 from (11780'2916530,11780'2922079] 2:54e93c03:::rbd_data.a3160d630815e4.0000000000006486:head to 11780'2922079
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.6e1 continuing backfill to osd.20 from (11780'2528655,11780'2531061] 2:87623ed2:::rbd_data.b508a2bbb4b6cd.0000000000000ab8:head to 11780'2531061
2022-05-18T10:57:32.664+0000 7fc78327f700  0 log_channel(cluster) log [INF] : 2.52a continuing backfill to osd.19 from (11780'2916530,11885'2922080] MIN to 11885'2922080
2022-05-18T10:57:32.664+0000 7fc784281700  0 log_channel(cluster) log [INF] : 2.66d continuing backfill to osd.20 from (11750'4215211,11780'4217579] MIN to 11780'4217579
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.41e continuing backfill to osd.15 from (11780'3969559,11780'3972047] MIN to 11780'3972047
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.4e1 continuing backfill to osd.20 from (11780'2528655,11780'2531061] MIN to 11780'2531061
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.37b continuing backfill to osd.31 from (11780'2777237,11780'2779687] MIN to 11780'2779687
2022-05-18T10:57:32.664+0000 7fc78327f700  0 log_channel(cluster) log [INF] : 2.2e1 continuing backfill to osd.20 from (11780'2528655,13099'2531150] 2:875ab6d3:::rbd_data.52befd43cf3520.00000000000133bc:head to 13099'2531150
2022-05-18T10:57:32.664+0000 7fc784281700  0 log_channel(cluster) log [INF] : 2.1e continuing backfill to osd.15 from (11780'3969559,12584'3972048] 2:780256ab:::rbd_data.a3160d630815e4.0000000000007859:head to 12584'3972048
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.6d continuing backfill to osd.20 from (11750'4215211,13067'4217584] 2:b60108dd:::rbd_data.5a67fff34445bb.00000000000079ed:head to 13067'4217584
2022-05-18T10:57:32.664+0000 7fc783a80700  0 log_channel(cluster) log [INF] : 2.26d continuing backfill to osd.20 from (11750'4215211,11780'4217579] 2:b64364e9:::rbd_data.5a67fff34445bb.0000000000000e3c:head to 11780'4217579
2022-05-18T10:57:32.664+0000 7fc782a7e700  0 log_channel(cluster) log [INF] : 2.46d continuing backfill to osd.20 from (11750'4215211,11780'4217579] 2:b6208f9e:::rbd_data.5a67fff34445bb.00000000000038df:head to 11780'4217579
2022-05-18T10:57:32.664+0000 7fc784281700  0 log_channel(cluster) log [INF] : 2.77b continuing backfill to osd.31 from (11780'2778322,13096'2779689] MIN to 13096'2779689
2022-05-18T10:57:32.664+0000 7fc782a7e700  0 log_channel(cluster) log [INF] : 2.57b continuing backfill to osd.31 from (11780'2777237,11780'2779687] MIN to 11780'2779687
2022-05-18T10:57:32.664+0000 7fc784281700  0 log_channel(cluster) log [INF] : 2.e1 continuing backfill to osd.20 from (11780'2528655,11780'2531061] MIN to 11780'2531061
2022-05-18T10:57:32.664+0000 7fc78227d700  0 log_channel(cluster) log [INF] : 2.17b continuing backfill to osd.31 from (11780'2777237,11780'2779687] MIN to 11780'2779687
2022-05-18T10:57:57.904+0000 7fc78ea96700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.25 down, but it is still running
2022-05-18T10:57:57.904+0000 7fc78ea96700  0 log_channel(cluster) log [DBG] : map e18360 wrongly marked me down at e18360
2022-05-18T10:57:57.904+0000 7fc78ea96700 -1 osd.25 18360 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2022-05-18T10:57:57.904+0000 7fc78ea96700  1 osd.25 18360 start_waiting_for_healthy
2022-05-18T10:57:57.904+0000 7fc784281700  1 osd.25 pg_epoch: 18360 pg[2.63b( v 11780'2414361 (11780'2408831,11780'2414361] lb MIN local-lis/les=18322/18323 n=0 ec=11860/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [13]/[13,4] r=-1 lpr=18360 pi=[17606,18360)/1 luod=0'0 crt=11780'2414361 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [13,25] -> [13], acting [13,4] -> [13,4], acting_primary 13 -> 13, up_primary 13 -> 13, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc783a80700  1 osd.25 pg_epoch: 18360 pg[2.e2( v 12978'3038142 (11747'3032397,12978'3038142] lb MIN local-lis/les=18323/18324 n=0 ec=2244/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [20]/[20,10] r=-1 lpr=18360 pi=[17709,18360)/1 luod=0'0 crt=12978'3038142 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [20,25] -> [20], acting [20,10] -> [20,10], acting_primary 20 -> 20, up_primary 20 -> 20, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc784281700  1 osd.25 pg_epoch: 18360 pg[2.63b( v 11780'2414361 (11780'2408831,11780'2414361] lb MIN local-lis/les=18322/18323 n=0 ec=11860/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [13]/[13,4] r=-1 lpr=18360 pi=[17606,18360)/1 crt=11780'2414361 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc783a80700  1 osd.25 pg_epoch: 18360 pg[2.e2( v 12978'3038142 (11747'3032397,12978'3038142] lb MIN local-lis/les=18323/18324 n=0 ec=2244/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [20]/[20,10] r=-1 lpr=18360 pi=[17709,18360)/1 crt=12978'3038142 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc78327f700  1 osd.25 pg_epoch: 18360 pg[1.66( v 17517'235 (169'1,17517'235] lb MIN local-lis/les=18323/18324 n=0 ec=15/11 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [22]/[22,4] r=-1 lpr=18360 pi=[17918,18360)/1 luod=0'0 crt=17517'235 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [22,25] -> [22], acting [22,4] -> [22,4], acting_primary 22 -> 22, up_primary 22 -> 22, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc782a7e700  1 osd.25 pg_epoch: 18360 pg[4.b7( empty local-lis/les=18355/18356 n=0 ec=179/173 lis/c=18355/18355 les/c/f=18356/18356/0 sis=18360 pruub=13.691542625s) [23] r=-1 lpr=18360 pi=[18355,18360)/1 crt=0'0 mlcod 0'0 active pruub 186.271499634s@ mbc={}] start_peering_interval up [23,25] -> [23], acting [23,25] -> [23], acting_primary 23 -> 23, up_primary 23 -> 23, role 1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc78227d700  1 osd.25 pg_epoch: 18360 pg[3.27( v 11801'75 lc 1824'2 (0'0,11801'75] local-lis/les=18356/18357 n=26 ec=170/170 lis/c=18356/18353 les/c/f=18357/18354/0 sis=18360 pruub=14.793540001s) [4] r=-1 lpr=18360 pi=[17663,18360)/2 luod=0'0 crt=11801'75 lcod 0'0 mlcod 0'0 active pruub 187.373565674s@ m=30 mbc={}] start_peering_interval up [4,25] -> [4], acting [4,25] -> [4], acting_primary 4 -> 4, up_primary 4 -> 4, role 1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc782a7e700  1 osd.25 pg_epoch: 18360 pg[4.b7( empty local-lis/les=18355/18356 n=0 ec=179/173 lis/c=18355/18355 les/c/f=18356/18356/0 sis=18360 pruub=13.691492081s) [23] r=-1 lpr=18360 pi=[18355,18360)/1 crt=0'0 mlcod 0'0 unknown NOTIFY pruub 186.271499634s@ mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc78327f700  1 osd.25 pg_epoch: 18360 pg[1.66( v 17517'235 (169'1,17517'235] lb MIN local-lis/les=18323/18324 n=0 ec=15/11 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [22]/[22,4] r=-1 lpr=18360 pi=[17918,18360)/1 crt=17517'235 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc78ea96700  0 osd.25 18360 _committed_osd_maps shutdown OSD via async signal
2022-05-18T10:57:57.904+0000 7fc78227d700  1 osd.25 pg_epoch: 18360 pg[3.27( v 11801'75 lc 1824'2 (0'0,11801'75] local-lis/les=18356/18357 n=26 ec=170/170 lis/c=18356/18353 les/c/f=18357/18354/0 sis=18360 pruub=14.793498993s) [4] r=-1 lpr=18360 pi=[17663,18360)/2 crt=11801'75 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 187.373565674s@ m=30 mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc79b558700 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2022-05-18T10:57:57.904+0000 7fc79b558700 -1 osd.25 18360 *** Got signal Interrupt ***
2022-05-18T10:57:57.904+0000 7fc79b558700 -1 osd.25 18360 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-05-18T10:57:57.904+0000 7fc783a80700  1 osd.25 pg_epoch: 18360 pg[2.23b( v 12993'2414382 (11780'2409763,12993'2414382] lb MIN local-lis/les=18322/18323 n=0 ec=11802/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [13]/[13,4] r=-1 lpr=18360 pi=[17606,18360)/1 luod=0'0 crt=12993'2414382 lcod 0'0 mlcod 0'0 active+remapped mbc={}] start_peering_interval up [13,25] -> [13], acting [13,4] -> [13,4], acting_primary 13 -> 13, up_primary 13 -> 13, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc784281700  1 osd.25 pg_epoch: 18360 pg[2.5eb( v 17523'2669643 (11780'2667288,17523'2669643] lb MIN local-lis/les=17851/17852 n=0 ec=11802/98 lis/c=17851/17851 les/c/f=17852/17852/0 sis=18360) [2] r=-1 lpr=18360 pi=[17851,18360)/1 luod=0'0 crt=17523'2669643 lcod 0'0 mlcod 0'0 peered mbc={}] start_peering_interval up [2,25] -> [2], acting [2] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role -1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
2022-05-18T10:57:57.904+0000 7fc784281700  1 osd.25 pg_epoch: 18360 pg[2.5eb( v 17523'2669643 (11780'2667288,17523'2669643] lb MIN local-lis/les=17851/17852 n=0 ec=11802/98 lis/c=17851/17851 les/c/f=17852/17852/0 sis=18360) [2] r=-1 lpr=18360 pi=[17851,18360)/1 crt=17523'2669643 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-05-18T10:57:57.904+0000 7fc783a80700  1 osd.25 pg_epoch: 18360 pg[2.23b( v 12993'2414382 (11780'2409763,12993'2414382] lb MIN local-lis/les=18322/18323 n=0 ec=11802/98 lis/c=18355/18352 les/c/f=18356/18353/0 sis=18360) [13]/[13,4] r=-1 lpr=18360 pi=[17606,18360)/1 crt=12993'2414382 lcod 0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
 
Can you provide the complete ceph log (/var/log/ceph/ceph.log) as well as the OSD logs? (OSD 24, 25, 26)

How much free memory is available on that host? free -h
 
Please make sure you have enough free RAM for those OSDs.
An OSD requires roughly 4 GB of RAM plus an additional 1 GB per TB of OSD capacity.

Also make sure the network those OSDs run on (the cluster network, if you have it separate from the public network) is up and working.
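A few quick checks along those lines (the peer address below is only a placeholder for another OSD host on the Ceph cluster network):
Code:
free -h                              # available RAM on the OSD host
grep -i network /etc/ceph/ceph.conf  # which subnets Ceph uses as public/cluster network
ip -br addr                          # which local interface carries which subnet
ping -c 3 192.168.0.11               # placeholder peer IP: test the Ceph cluster network path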
 
Actually, yes, it was a network issue. Here are the details:
We have the Ceph cluster network 192.168.0.0/24 and the Ceph public network 10.10.20.0/24.
Months ago, we added an OVS switch with an OVS bridge on the network 192.168.0.0/16.
Everything was working fine until last week, when we restarted one node (because of the main issue with the web GUI). After that, its OSDs would not rejoin the Ceph cluster and stayed DOWN/OUT; we tried to start them, but they kept falling back to the same status: DOWN/OUT.
The problem is that when the node restarted, its routing table changed and it started sending Ceph cluster network traffic to the OVS bridge, because the two networks overlap (192.168.0.0/16 and 192.168.0.0/24).
Once we disabled the OVS bridge interface and started the OSDs, Ceph kept them UP/IN, started recovering the data, and cleared all the warnings and errors (check the attachment).
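For anyone debugging something similar, a rough way to spot and work around this kind of route overlap (the bridge name is just an example, and the OSD IDs are the ones from pve6 above):
Code:
ip route                              # look for the overlapping 192.168.0.0/16 route via the OVS bridge
ip route get 192.168.0.11             # shows which interface a Ceph cluster peer is actually reached through
ip link set vmbr1 down                # example: take the conflicting OVS bridge down
systemctl restart ceph-osd@24 ceph-osd@25 ceph-osd@26   # then start the affected OSDs again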
I hope this will help someone!
 

Attachments

  • ceph_ok.JPG
That's great to hear!

And thanks for explaining the issue and the solution in the end. This will surely help others in the future. :-)
 