Ceph 4-node cluster - Frozen VMs during node reboot

Bran-Ko

Hi, I don't know if this is a bug, a bad configuration, or a feature. I have a 4-node PVE cluster with Ceph on every node. Each node has 5 OSDs and a 10GbE card.
Whenever I reboot a PVE node, all VMs on the other nodes freeze. Once the node has booted and all OSDs are online again, the VMs start working correctly.

I thought there was some Ceph configuration that would let me patch/reboot PVE without downtime. Is that possible?
 
How full are your OSDs? What is the reported Ceph status when one node is down?
 
The OSDs are between 10% and 18% full.
I don't remember the exact status, but I can reboot another node (during the weekend).
Is there any config I need to share, or a status report?
 
When you encounter the problem again, you can run ceph status and post the output here. Please post the output between [code][/code] tags to preserve formatting. Thanks!

In the meantime you can provide the outputs of the following commands:
  • cat /etc/ceph/ceph.conf
  • pveceph pool ls
  • pvecm status
The last command would also be interesting to see while one node is down.
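
For reference, a cluster that is blocking I/O because PGs fell below min_size usually reports a warning along these lines (illustrative, trimmed output only; your pool, PG, and OSD counts will differ):

Code:
# ceph status (illustrative, trimmed)
  cluster:
    health: HEALTH_WARN
            Reduced data availability: 12 pgs inactive
            Degraded data redundancy: 2048/4096 objects degraded (50.000%), 128 pgs undersized
  services:
    osd: 20 osds: 15 up, 20 in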
 
I am still a noob regarding Proxmox/Ceph, but I think this looks like your problem:

Code:
pveceph pool ls
┌───────────┬──────┬──────────┐
│ Name      │ Size │ Min Size │
├───────────┼──────┼──────────┤
│ ceph-pool │    2 │        2 │
└───────────┴──────┴──────────┘

https://pve.proxmox.com/pve-docs/chapter-pveceph.html
Size
The number of replicas per object. Ceph always tries to have this many copies of an object. Default: 3.

Min. Size
The minimum number of replicas per object. Ceph will reject I/O on the pool if a PG has less than this many replicas. Default: 2.
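
You can also verify the current values on the CLI (using the pool name ceph-pool from the listing above):

Code:
ceph osd pool get ceph-pool size
ceph osd pool get ceph-pool min_size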


You have reduced size below the default of 3.
So if you reboot one node which holds copies for this pool, every PG with a copy on that node temporarily drops to a single replica. That is below min size = 2, so Ceph blocks I/O on those PGs until the node is back.
As far as I understood, that's why "size" should always be bigger than "min size".
Reducing min size to 1 would mean a higher risk of data loss, so I guess going up to the default size of 3 would be preferred in your situation: with 3 copies, one node down still leaves 2, which satisfies min size = 2 and keeps I/O flowing.

You can change that easily in the GUI (Node -> Ceph -> Pools), or on the CLI as sketched below.
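
A minimal CLI alternative, assuming the pool name ceph-pool from the listing above (Ceph will then start rebalancing to create the third copies, which takes a while and generates network traffic):

Code:
# raise the replica count; min_size stays at the default of 2
ceph osd pool set ceph-pool size 3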
 
Thanks for the advice. It was a misunderstanding on my side - I meant the parameter "Size".
I changed the Size parameter before the reboot, and rebooted only after the storage had rebalanced successfully.

All the other VMs kept working - no frozen I/O.
Thanks a lot, again...
 
