Questions:
- Why would a host lockup/hang cause a Ceph storage outage for VMs?
- Why or how would a host lockup/hang cause OSDs to go down/out on other hosts?
- What configuration changes can be made to ensure uptime in the event of a host failure?
- Do you see anything about my setup that would cause this type of outage, like a misconfiguration?
- Can you think of any reason a host would lock up like this that isn't hardware related? If you have ideas on the hardware side, I'll take those too.
Details:
We have a five-host HCI cluster running Ceph (RBD only); each host has 12 SSD OSDs, and Ceph runs over a 10G network. The Ceph public interface is shared with PBS backup traffic, and a 100 MiB/s bandwidth limit is set for backups, but it's unclear whether that limit applies to backups or only to restores.
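If it helps, this is roughly where I'd expect that limit to be defined on our setup; I haven't yet confirmed which of these is actually in effect, so treat the paths and keys below as a sketch:
Code:
# vzdump's own bwlimit (KiB/s) applies to the backup jobs themselves
root@pve1:~# grep -i bwlimit /etc/pve/vzdump.conf
# datacenter-wide limits; the "restore" key only affects restore operations
root@pve1:~# grep -i bwlimit /etc/pve/datacenter.cfg
# a per-storage bwlimit can also sit on the PBS storage definition
root@pve1:~# grep -A6 "^pbs:" /etc/pve/storage.cfg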
We recently had an issue where a host ("host 1") locked up. The host became unresponsive both over the network and on the console. I used IPMI to reset the machine and it came back up. When it did, it started booting the VMs per their configuration, but the VMs didn't seem to have access to their images. I then noticed that six other OSDs were down/out across the cluster, specifically three on host 4 and three on host 5. I migrated VMs around and rebooted those machines too.
Once all of the hosts were back up and Ceph returned to "healthy", I rebooted all of the VMs on host 1 and they came up fine. I then went through the remaining VMs and found that some others had hung due to storage access issues. I rebooted them as well and they came up cleanly.
I understand why the guests running on host 1 went offline, but I don't understand why the other six OSDs went down/out or why VMs on other hosts seemed to lose access to their storage. There are 60 OSDs across the cluster: 12 went missing because of the host lockup, and I assume the other 6 that went down/out were somehow related to those 12.
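To work out why those six were marked down, my plan is to look for missed heartbeats and daemon crashes in the logs around the time of the incident; the OSD ID and timestamp below are placeholders, not our actual values:
Code:
# missed heartbeats reported by one of the affected OSDs around the lockup
root@pve4:~# journalctl -u ceph-osd@37 --since "2024-05-20 08:00" | grep -iE "heartbeat_check|no reply"
# any daemon crashes recorded by the cluster
root@pve1:~# ceph crash ls
# how long a down OSD waits before being marked out automatically (default 600s)
root@pve1:~# ceph config get mon mon_osd_down_out_interval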
I could see that if a VM image had data on one of the 12 OSDs from host 1 and also on one of the other 6, Ceph would block the write until at least two replicas had it before acknowledging it to the guest. Could the rebalance after the outage have taken so long that, by the time it finished, the guests had already locked up waiting on writes to complete?
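For reference, the pools' replica settings can be confirmed as below; I'm assuming we're still on the defaults (size=3, min_size=2), where a PG blocks client I/O once fewer than two copies are available:
Code:
root@pve1:~# ceph osd pool get datastore1 size
root@pve1:~# ceph osd pool get datastore1 min_size
root@pve1:~# ceph osd pool ls detail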
The cause of the lockup is still unknown; I'm assuming it's hardware related. IPMI metrics for power in/out look normal, never exceeding 250/750 W, and I have been running stress tests to try to make it lock up again, but no luck so far. Could this type of lockup be due to other conditions, like storage latency, perhaps caused by running backups over the same links used by the Ceph public network?
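In case it's useful, this is the kind of evidence I've been looking for from before the reset; ras-mc-ctl is only present if rasdaemon is installed, and the previous-boot journal only exists if journaling is persistent:
Code:
# kernel messages from the previous boot (MCEs, hung tasks, I/O errors, NMIs)
root@pve1:~# journalctl -k -b -1 | grep -iE "mce|hung task|i/o error|nmi"
# IPMI system event log entries around the time of the hang
root@pve1:~# ipmitool sel elist | tail -20
# corrected/uncorrected memory error counts, if rasdaemon is running
root@pve1:~# ras-mc-ctl --summary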
Sorry about the complicated question; let me know how I can provide more clarification, and thanks in advance for the help.
Cluster Info:
- Proxmox 8.2.2 (licensed)
- 5x R730xd, 2x 3.2 GHz CPUs, 128 GB RAM each
- HBA330, 12x 1.92 TB SSD (1 OSD per drive) each (60 OSDs total)
- 1x 4-port 1 GbE NIC in LACP for VM access
- 2x 1 GbE NICs for PVE management, active/backup
- 1x 2-port 10 GbE NIC
Ceph Info:
- Ceph Reef 18.2.2
- Default configuration
- 10G public and 10G cluster (private) networks on dedicated ports of the same NIC, connected to the same 10G switch, on different subnets in isolated VLANs (config check below)
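For completeness, the networks Ceph is actually bound to can be read back from the running config and the PVE-managed ceph.conf, e.g.:
Code:
root@pve1:~# ceph config get mon public_network
root@pve1:~# ceph config get osd cluster_network
root@pve1:~# grep -E "public_network|cluster_network" /etc/pve/ceph.conf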
Code:
root@pve1:~# ceph status
  cluster:
    id:     4dcba9f2-821e-4704-89b4-db0a7846097b
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum pve1,pve2,pve3,pve4,pve5 (age 31h)
    mgr: pve3(active, since 12d), standbys: pve4, pve2, pve5, pve1
    osd: 60 osds: 60 up (since 31h), 60 in (since 12d)

  data:
    pools:   3 pools, 2113 pgs
    objects: 1.99M objects, 7.4 TiB
    usage:   21 TiB used, 84 TiB / 105 TiB avail
    pgs:     2113 active+clean

  io:
    client:   30 MiB/s rd, 4.1 MiB/s wr, 524 op/s rd, 225 op/s wr
Code:
root@pve1:~# ceph osd pool stats
pool .mgr id 1
  nothing is going on

pool datastore1 id 2
  client io 2.0 MiB/s rd, 6.6 MiB/s wr, 163 op/s rd, 289 op/s wr

pool datastore2 id 3
  nothing is going on