PVE HCI with Ceph, question regarding recent host lock-up and storage availability...

bfrd9k

New Member
Apr 23, 2024
Portland Oregon
Questions:

  1. Why would a host lock-up/hang cause a Ceph storage outage for VMs?
  2. Why or how would a host lock-up/hang cause OSDs to go down/out on other hosts?
  3. What configuration changes can be made to ensure uptime in the event of a host failure?
  4. Do you see anything about my setup that would cause this type of outage, like a misconfiguration?
  5. Can you think of any reason a host would lock up like this that isn't hardware related? If you have hardware ideas, I'll take those too.

Details:

We have a five-host HCI cluster with Ceph (RBD only); each host has 12 SSD OSDs, and Ceph runs on a 10G network. The Ceph public interface is shared with PBS backup traffic, and a 100 MiB/s bandwidth limit is set for backups, but it's unclear whether that limit applies to backup jobs or only to restores.
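For comparing notes on the throttling question: as I understand it, the `bwlimit` in `/etc/vzdump.conf` applies to backup jobs (value in KiB/s), while restore bandwidth is limited separately, for example via the `bwlimit` option in `/etc/pve/datacenter.cfg`. A sketch of both (the values are illustrative, not recommendations):

```
# /etc/vzdump.conf -- applies to backup jobs, value in KiB/s
bwlimit: 102400

# /etc/pve/datacenter.cfg -- per-operation limits in MiB/s; restore is its own key
bwlimit: restore=100
```

If only one of these is set, that would explain a limit applying to backups but not restores (or vice versa).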

We recently had an issue where a host ("host 1") locked up. The host became unresponsive from the network and from the console. I used IPMI to reset the machine and it came back up. It started booting its VMs per configuration, but the VMs didn't seem to have access to their images. I then noticed that six other OSDs were down/out across the cluster: three on host 4 and three on host 5. I migrated VMs around and rebooted those machines too.

Once all of the hosts were back up and Ceph returned to HEALTH_OK, I rebooted all of the VMs on host 1 and they came up fine. Going through the other VMs, I found some that had hung due to storage access issues; after a reboot they also came up cleanly.

I understand why the guests running on host 1 went offline, but I don't understand why the other six OSDs were down/out and why VMs elsewhere seemed to lose access to their storage. There are 60 OSDs across the cluster: 12 went missing due to the host lock-up, and I assume the other 6 down/out were related to those 12.

I could see that if a PG backing a VM image had replicas on one of the 12 OSDs from host 1 and also on one of the other 6, Ceph would wait until it had two confirmed writes before acknowledging the write to the guest. Could the recovery after the outage have taken so long that the guests locked up waiting on writes to complete?
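To make that reasoning concrete, here is a toy sketch (plain Python, not Ceph code; the OSD and host names are made up) of a replicated PG with the default size=3 / min_size=2: losing host 1's replica alone keeps I/O flowing, but additionally losing a replica on one of the six other down OSDs drops the PG below min_size, and Ceph blocks I/O to it until recovery restores a second copy.

```python
# Toy illustration of replicated-pool min_size behavior (not real Ceph code).
# With a host failure domain, the three replicas of a PG live on three hosts.
min_size = 2
replica_osds = {"osd.3": "host1", "osd.27": "host4", "osd.51": "host5"}  # hypothetical PG

def pg_accepts_io(down_osds):
    """A PG keeps serving I/O only while live replicas >= min_size."""
    live = [osd for osd in replica_osds if osd not in down_osds]
    return len(live) >= min_size

# Host 1 hangs: 2 of 3 replicas remain, so I/O continues.
print(pg_accepts_io({"osd.3"}))            # True
# One of the six other down OSDs also held a replica: below min_size, I/O blocks.
print(pg_accepts_io({"osd.3", "osd.27"}))  # False
```

That would match what I saw: only the VMs whose images had PGs overlapping both host 1 and the six other down OSDs hung, while the rest kept running.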

The cause of the lock-up is still unknown; I am assuming it's hardware related. IPMI power metrics look normal, never exceeding 250/750 W, and I have been running stress tests to try to reproduce the lock-up, but no such luck. Could this type of lock-up be caused by other conditions, like storage latency, perhaps from running backups over the same links used by the Ceph public network?
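For post-mortem on the hang itself, assuming persistent journald storage is enabled, something like the following might surface hung-task warnings or hardware events from before the IPMI reset (command sketch from memory; adjust as needed):

```
# Kernel log from the previous (crashed) boot
journalctl -k -b -1 | tail -n 200

# Hardware event log via the BMC
ipmitool sel elist

# Hung-task warnings that often precede a full lock-up
journalctl -k | grep -iE "hung_task|blocked for more than"
```

If the journal from the previous boot is empty, enabling `Storage=persistent` in `/etc/systemd/journald.conf` would help for the next occurrence.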

Sorry about the complicated question, let me know how I can provide more clarification and thanks in advance for the help.

Cluster Info:
  • Proxmox 8.2.2 (licensed)
  • 5x Dell R730xd, 2x 3.2 GHz CPUs, 128 GB RAM each
  • HBA330, 12x 1.92 TB SSDs per host (1 OSD per drive, 60 OSDs total)
  • 1x quad-port 1 GbE NIC in LACP for VM access
  • 2x 1 GbE NICs for PVE management (active/backup)
  • 1x dual-port 10 GbE NIC

Ceph Info:
  • Ceph Reef 18.2.2
  • Default configuration
  • 10G public, 10G private; dedicated ports on the same NIC, connected to the same 10G switch, on separate networks in isolated VLANs
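One thought on question 2: since the public and cluster networks share one physical NIC and switch, heavy backup or recovery traffic saturating that link could plausibly delay OSD heartbeats long enough for peer OSDs to report them down, which would fit the down/out OSDs appearing on hosts 4 and 5. To double-check that the network split is actually in effect (commands from memory):

```
ceph config get osd public_network
ceph config get osd cluster_network
```

If both come back with the intended, distinct subnets, the split is configured; the heartbeat-delay theory would then point at link saturation rather than misconfiguration.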

Code:
root@pve1:~# ceph status
  cluster:
    id:     4dcba9f2-821e-4704-89b4-db0a7846097b
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum pve1,pve2,pve3,pve4,pve5 (age 31h)
    mgr: pve3(active, since 12d), standbys: pve4, pve2, pve5, pve1
    osd: 60 osds: 60 up (since 31h), 60 in (since 12d)
 
  data:
    pools:   3 pools, 2113 pgs
    objects: 1.99M objects, 7.4 TiB
    usage:   21 TiB used, 84 TiB / 105 TiB avail
    pgs:     2113 active+clean
 
  io:
    client:   30 MiB/s rd, 4.1 MiB/s wr, 524 op/s rd, 225 op/s wr

Code:
root@pve1:~# ceph osd pool stats
pool .mgr id 1
  nothing is going on

pool datastore1 id 2
  client io 2.0 MiB/s rd, 6.6 MiB/s wr, 163 op/s rd, 289 op/s wr

pool datastore2 id 3
  nothing is going on
 
