Ceph 4-node cluster - Frozen VMs during node reboot

Bran-Ko

Hi, I don't know if this is a bug, a bad configuration, or a feature. I have a 4-node PVE cluster with Ceph on every node. Every node has 5 OSDs and a 10GbE card.
Whenever I reboot a PVE node, all VMs on the other nodes freeze. After the node boots and all OSDs are back online, the VMs start working correctly again.

I thought there was some Ceph configuration that would let me patch/reboot a PVE node without downtime. Is that possible?
 
How full are your OSDs? What is the reported Ceph status when one node is down?
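For reference, a quick sketch of how to check both (run on any node with the Ceph admin keyring):

Code:
# Per-OSD utilization (%USE column) and capacity
ceph osd df

# Overall cluster health; worth capturing again while one node is down
ceph status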
 
The OSDs are 10-18% used.
I don't remember the exact status, but I can reboot another node (during the weekend).
Is there any config or status report I should share?
 
When you encounter the problem again, you can run ceph status and post the output here. Please post the output between [code][/code] tags to preserve formatting. Thanks!

In the meantime you can provide the outputs of the following commands:
  • cat /etc/ceph/ceph.conf
  • pveceph pool ls
  • pvecm status
The output of the last command would also be interesting to see while one node is down.
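A sketch for collecting all of the above into attachable text files (assuming a root shell on one of the PVE nodes):

Code:
# Gather the requested outputs into text files for attaching
ceph status             > ceph_status.txt
cat /etc/ceph/ceph.conf > ceph.conf.txt
pveceph pool ls         > pveceph_pool_ls.txt
pvecm status            > pvecm_status.txt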
 
Here are the config and status files.
I'll upload the files taken during a node reboot later.
 

Attachments

  • pveceph_pool_ls.txt
    3.2 KB · Views: 10
  • pvecm_status.txt
    740 bytes · Views: 5
  • ceph.conf.txt
    781 bytes · Views: 5
I am still a noob regarding Proxmox/Ceph, but I think this looks like your problem:

Code:
pveceph pool ls
───────────────────────┬──────┬──────────┬
 Name                  │ Size │ Min Size │
═══════════════════════╪══════╪══════════╪
 ceph-pool             │    2 │        2 │
───────────────────────┼──────┼──────────┼

https://pve.proxmox.com/pve-docs/chapter-pveceph.html
Size
The number of replicas per object. Ceph always tries to have this many copies of an object. Default: 3.

Min. Size
The minimum number of replicas per object. Ceph will reject I/O on the pool if a PG has less than this many replicas. Default: 2.


You have reduced Size below the default of 3, while Min Size is still at 2.
So if you reboot one node that holds replicas for this pool, the affected PGs will temporarily drop below Min Size, resulting in an I/O freeze.
As far as I understood it, that's why Size should always be greater than Min Size.
Reducing Min Size to 1 would mean a higher risk of data loss, so I guess going up to the default Size of 3 would be preferred in your situation.

You can change that easily in the GUI (Node->Ceph->Pools)
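If you prefer the CLI over the GUI, a sketch (assuming the pool name ceph-pool from your output above):

Code:
# Raise the replica count back to the default of 3; Min Size stays at 2
ceph osd pool set ceph-pool size 3

# Verify the setting, then watch the rebalance
ceph osd pool get ceph-pool size
ceph status

Rebooting only after the cluster reports HEALTH_OK again should ensure every PG has its third replica in place.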
 
Thanks for the advice. There was a misunderstanding on my side regarding the Size parameter.
I changed the Size parameter, and the reboot was done after the storage had successfully rebalanced.

All VMs on the other nodes kept working, with no frozen I/O.
Thanks a lot, again...
 

Attachments

  • pveceph_pool_ls.txt
    6.5 KB · Views: 6
  • pvecm_status.txt
    1.4 KB · Views: 4
