My VM (Home Assistant) will not start up - get a timeout - don't know how to fix this

guitarddns

New Member
Aug 16, 2024
Hi all - I did not upgrade the cluster software or change anything else. When I try to migrate the VM to another node, it just hangs.
I am running 9.2.3 (4-node cluster) and Ceph 19.2.3.
I noticed that 2 OSDs (out of 8) were down and I can't restart them.

I don't know the details of how Ceph stores data, but the 2 failing OSDs are not on the host where the VM lives,
and since there are 4 nodes with 2 OSDs on each and one is down, the data could be OK - but again, I can't see anything.

I ran journalctl, but it doesn't give any clue.

Any suggestions to fix this?

thanks in advance

John
 
Hi John,

thanks for posting in the forum!

Please provide the output of the following commands so we can better assess the current state of your cluster and help figure out a solution:
Code:
pveversion -v
pvecm status
ceph -s
ceph osd df tree
pveceph pool ls

Please also provide the journal for the node with the failing OSD devices. You can use the following command to gather it:
journalctl --since "2026-06-17 08:00" --until "2026-06-18 18:00"
Please adapt the timestamps accordingly
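
If the full journal is too noisy, it may help to narrow it down to a single OSD service. The snippet below is only a sketch: `osd_journal_cmd` is a hypothetical helper, the OSD ID `0` is a placeholder (the failing OSD IDs are not known at this point), and it assumes the OSDs run as the usual `ceph-osd@<id>.service` systemd units. It only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Hypothetical helper: print a journalctl command limited to one OSD unit
# and one time window. Arguments: OSD ID, start time, end time.
osd_journal_cmd() {
  printf 'journalctl -u ceph-osd@%s.service --since "%s" --until "%s"\n' "$1" "$2" "$3"
}

# Placeholder OSD ID and the example timestamps from above; adapt both.
osd_journal_cmd 0 "2026-06-17 08:00" "2026-06-18 18:00"
```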

Also please provide the output of a Start Task on the mentioned VM.

Yours sincerely
Jonas
 
Hi Jonas - thanks for your reply. I was 200 km away from home, which is the reason for the delayed answer.
In the meantime both OSDs are up again - but that said, the cluster is still behaving strangely.

You asked for some extra information - it is below, including the log file.

But what is wrong ...?

thanks in advance

John


root@pve04:~# pveversion -v
proxmox-ve: 9.2.0 (running kernel: 7.0.6-2-pve)
pve-manager: 9.2.3 (running version: 9.2.3/d0fde103346cf89a)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-7.0.2-4-pve-signed: 7.0.2-4
proxmox-kernel-7.0.2-3-pve-signed: 7.0.2-3
proxmox-kernel-7.0.2-2-pve-signed: 7.0.2-2
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.13-9-pve-signed: 6.17.13-9
proxmox-kernel-6.17.13-8-pve-signed: 6.17.13-8
proxmox-kernel-6.17.13-7-pve-signed: 6.17.13-7
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ceph: 19.2.3-pve4
ceph-fuse: 19.2.3-pve4
corosync: 3.1.10-pve2
criu: 4.1.1-1
frr-pythontools: 10.6.1-1+pve2
ifupdown2: 3.3.0-1+pmx12
intel-microcode: 3.20251111.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.1
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.1.1
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.6
libpve-cluster-perl: 9.1.6
libpve-common-perl: 9.1.13
libpve-guest-common-perl: 6.0.3
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.6.6
libpve-notify-perl: 9.1.6
libpve-rs-perl: 0.15.3
libpve-storage-perl: 9.1.5
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 7.0.0-2
lxcfs: 7.0.0-pve1
novnc-pve: 1.7.0-1
proxmox-backup-client: 4.2.1-1
proxmox-backup-file-restore: 4.2.1-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.3
proxmox-kernel-helper: 9.2.0
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.3
pve-cluster: 9.1.6
pve-container: 6.1.10
pve-docs: 9.2.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-4
pve-ha-manager: 5.2.4
pve-i18n: 3.7.5
pve-qemu-kvm: 11.0.0-4
pve-xtermjs: 6.0.0-1
qemu-server: 9.1.16
smartmontools: 7.5-pve2
spiceterm: 3.4.2
swtpm: 0.8.0+pve3
vncterm: 1.9.2
zfsutils-linux: 2.4.2-pve1

root@pve04:~# pvecm status
Cluster information
-------------------
Name: cluster2
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Jun 18 21:35:23 2026
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.f9d
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.1.3
0x00000002 1 10.1.1.4 (local)
0x00000003 1 10.1.1.5
0x00000004 1 10.1.1.6
root@pve04:~#
root@pve04:~# ceph -s
cluster:
id: 5bf6b567-9ef4-4e34-a752-78d5b250525d
health: HEALTH_WARN
Reduced data availability: 217 pgs inactive, 249 pgs peering
1 slow ops, oldest one blocked for 42 sec, daemons [osd.0,osd.6] have slow ops.

services:
mon: 4 daemons, quorum pve01,pve2,pve04,pve05 (age 5h)
mgr: pve05(active, since 7h), standbys: pve2, pve04, pve01
osd: 8 osds: 8 up (since 24s), 8 in (since 6h); 164 remapped pgs

data:
pools: 2 pools, 513 pgs
objects: 67.99k objects, 264 GiB
usage: 866 GiB used, 3.0 TiB / 3.9 TiB avail
pgs: 74.269% pgs not active
217 peering
164 remapped+peering
132 active+clean

root@pve04:~#


root@pve04:~# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 3.87234 - 3.9 TiB 866 GiB 861 GiB 111 KiB 4.2 GiB 3.0 TiB 21.83 1.00 - root default
-3 0.97958 - 1003 GiB 226 GiB 226 GiB 13 KiB 496 MiB 777 GiB 22.57 1.03 - host pve01
0 ssd 0.48979 1.00000 502 GiB 117 GiB 117 GiB 7 KiB 256 MiB 384 GiB 23.36 1.07 232 up osd.0
1 ssd 0.48979 1.00000 502 GiB 109 GiB 109 GiB 6 KiB 240 MiB 392 GiB 21.77 1.00 214 up osd.1
-5 0.93359 - 956 GiB 192 GiB 191 GiB 42 KiB 1.3 GiB 764 GiB 20.11 0.92 - host pve04
2 ssd 0.46680 1.00000 478 GiB 96 GiB 96 GiB 15 KiB 603 MiB 382 GiB 20.17 0.92 66 up osd.2
3 ssd 0.46680 1.00000 478 GiB 96 GiB 95 GiB 27 KiB 705 MiB 382 GiB 20.04 0.92 151 up osd.3
-7 0.97958 - 1003 GiB 220 GiB 219 GiB 19 KiB 570 MiB 783 GiB 21.90 1.00 - host pve05
4 ssd 0.48979 1.00000 502 GiB 108 GiB 108 GiB 12 KiB 276 MiB 393 GiB 21.59 0.99 212 up osd.4
5 ssd 0.48979 1.00000 502 GiB 111 GiB 111 GiB 7 KiB 294 MiB 390 GiB 22.21 1.02 220 up osd.5
-9 0.97958 - 1003 GiB 227 GiB 226 GiB 37 KiB 1.8 GiB 776 GiB 22.68 1.04 - host pve2
6 ssd 0.48979 1.00000 502 GiB 115 GiB 114 GiB 13 KiB 1.1 GiB 386 GiB 22.96 1.05 223 up osd.6
7 ssd 0.48979 1.00000 502 GiB 112 GiB 112 GiB 24 KiB 728 MiB 389 GiB 22.40 1.03 221 up osd.7
TOTAL 3.9 TiB 866 GiB 861 GiB 115 KiB 4.2 GiB 3.0 TiB 21.83
MIN/MAX VAR: 0.92/1.07 STDDEV: 1.12
root@pve04:~# pveceph pool ls
┌───────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────┬─────────────────┬────────────
│ Name │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio │ Crush Rule Name │
╞═══════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════╪═════════════════╪════════════
│ .mgr │ 3 │ 2 │ 1 │ 1 │ 1 │ on │ │ │ replicated_rule │ 7.633550922
├───────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼─────────────────┼────────────
│ ceph_pool │ 3 │ 3 │ 512 │ │ 256 │ on │ 1073741824000 │ │ replicated_rule │ 0.2326791
└───────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────┴─────────────────┴────────────
root@pve04:~#
 

Attachments

Hi, no worries!

│ ceph_pool │ 3 │ 3 │ 512 │ │ 256 │ on │ 1073741824000 │ │ replicated_rule │ 0.2326791
Here is most likely your problem.
You set both the Size and Min Size parameters of your Ceph pool to 3.
This means that as soon as you lose one of your OSDs, and with it the PG copies stored on that OSD, the affected PGs stop serving I/O: the two remaining copies are below Min Size, so they stay blocked until a third copy has been restored on a different host/OSD.
The recommended and default values for these parameters are Size 3 and Min Size 2, see [1].
This allows your cluster to lose one node completely without having data availability issues.
You should be able to set these parameters on the running pool without damaging anything, but it's still always a good idea to have current backups in place.
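
To illustrate the rule, here is a minimal sketch. `pg_serves_io` is a hypothetical helper, not a Ceph command, and it uses a simplified model in which a PG serves I/O only while the number of available copies is at least Min Size. The commented commands at the end are the standard `ceph osd pool get/set` calls, with the pool name `ceph_pool` taken from your output; run them manually once you have checked your backups.

```shell
#!/bin/sh
# Simplified model (assumption): a PG serves I/O only if the number of
# available copies ($1) is at least the pool's min_size ($2).
pg_serves_io() {
  [ "$1" -ge "$2" ]
}

# Pool with size=3 and one copy missing, i.e. 2 copies left.
for min_size in 3 2; do
  if pg_serves_io 2 "$min_size"; then
    state="active"
  else
    state="inactive"
  fi
  echo "size=3 min_size=$min_size, 2 copies available: PG $state"
done

# On the live cluster (run manually, not part of this sketch):
#   ceph osd pool get ceph_pool min_size
#   ceph osd pool set ceph_pool min_size 2
```

With Min Size 3 the loop reports the PG as inactive, with Min Size 2 as active, which is the difference between your current setting and the default.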

pgs: 74.269% pgs not active
This metric is the reason you still have availability issues.
Only about a quarter of your PGs (132 of 513 are active+clean) are currently available to the VMs, which can lead to unexpected behavior.
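
For reference, the percentage can be reproduced from the PG counts in your `ceph -s` output. This is just the arithmetic as a small shell sketch; the variable names are mine, the numbers are copied from the output above.

```shell
#!/bin/sh
# PG state counts copied from the "ceph -s" output above.
peering=217
remapped_peering=164
active_clean=132

total=$((peering + remapped_peering + active_clean))
inactive=$((peering + remapped_peering))

# awk handles the floating-point division; LC_ALL=C keeps "." as decimal separator.
pct=$(LC_ALL=C awk -v i="$inactive" -v t="$total" 'BEGIN { printf "%.3f", 100 * i / t }')

echo "$inactive of $total PGs inactive (${pct}%)"
```

Both peering states count as not active, so 381 of the 513 PGs cannot serve I/O at that moment.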

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_pool_options
 