logrotate restarts Ceph and VMs hang

pratclot

New Member
Jan 28, 2024
Hello everyone!

I have a primitive homelab with 2 machines; it is a hyper-converged setup with Ceph configured from the UI (VM storage).

Every 2 days or so, all the VMs "hang" and become unresponsive over the network (ssh does not work, though I do not remember whether ping also fails) and on the console (there are kernel stack traces and OOMs in dmesg, trying to log in hangs indefinitely, and in an already logged-in session anything that touches storage hangs it). I understand this is linked to the VM storage suddenly becoming unavailable (although I did not test with a locally stored VM).

On the nodes I see logrotate restarting Ceph services (there is evidence in /etc/logrotate.d/ceph-common):
Bash:
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.282+0100 7ed9d0b956c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.282+0100 789e607c96c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.262+0100 789e607c96c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.262+0100 7ed9d0b956c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
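
The killall in that log is exactly what the postrotate script from /etc/logrotate.d/ceph-common runs; the file looks roughly like this (trimmed, the daemon list and options may differ between Ceph releases):
Code:
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        # send SIGHUP to the Ceph daemons so they reopen their log files
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror || true
    endscript
    missingok
    notifempty
}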

Then there are complaints from ceph-crash:
Bash:
Feb 27 00:07:27 nuc ceph-crash[670]: WARNING:ceph-crash:unable to read crash path /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588

I understand this is a bug with directory permissions, because not even root can list the files there (tab completion shows nothing). I attached the log from under that path; from what I have researched, it looks like Ceph is trying to access something that never existed. This makes me think it is probably not related to anything I configured through the web UI.
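
In case it helps someone else, this is roughly how the ownership under the crash directory can be checked (ceph:ceph is the usual owner of /var/lib/ceph; adjust if your setup differs):
Bash:
# the crash directory itself and the path from the warning
ls -ld /var/lib/ceph/crash
ls -la /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588
# ceph-crash runs as the ceph user, so wrong ownership makes the path unreadable;
# resetting it and restarting the service is one way to clear the warning
chown -R ceph:ceph /var/lib/ceph/crash
systemctl restart ceph-crash.service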

From the node, though, Ceph appears to be working as usual (status reports the volumes as healthy, PGs are active+clean). To get the VMs working again without rebooting the cluster, I once killed every process with "ceph" in its command line. This also killed all the VMs, because they have this argument:
Bash:
-drive file=rbd:testpool/vm-100-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/testpool.keyring,if=none,id=drive-virtio0,format=raw,cache=none,aio=io_uring,detect-zeroes=on

After that I was able to restart the VMs and everything was back to normal. Just restarting the VMs themselves (killing them via "pkill -9") does not help, and connecting through the console in Proxmox's web interface just times out; I don't think I saw any logs explaining why the timeout happens. I did not try starting the VMs from the command line, though.
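
For anyone else who ends up here, the workaround boiled down to something like this (rough sketch; VM 100 is just the example from the disk line above):
Bash:
# show everything with "ceph" in the command line - this includes the qemu
# processes, because their -drive argument references /etc/pve/ceph.conf
pgrep -af ceph
# kill them all (this takes the VMs down too, as described above)
pkill -9 -f ceph
# systemd normally brings the Ceph services back on its own; the VMs can then
# be started again from the UI or with qm
qm start 100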

I figured I would try asking here because it does not look like Ceph is doing everything properly when asked to restart, and I cannot find anything that could help me influence its behavior (not that I know how to influence it, lol).

Here are the versions of PVE and Ceph.

Bash:
root@nuc:~# pveversion
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.13-1-pve)
root@nuc:~# ceph --version
ceph version 18.2.1 (850293cdaae6621945e1191aa8c28ea2918269c3) reef (stable)
 

Attachments

  • log.tar.gz
    1.8 KB
I have a primitive homelab with 2 machines; it is a hyper-converged setup with Ceph configured from the UI (VM storage).

Ceph requires (at least!) three nodes to work reliably.
 
@fabian I am seeing this issue with a 3-node Ceph cluster:

Code:
Jul 29 00:00:53 hv1 systemd[1]: Starting logrotate.service - Rotate log files...
Jul 29 00:00:53 hv1 ceph-mds[5078]: 2024-07-29T00:00:53.337-0700 79e509b5c6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mon[5082]: 2024-07-29T00:00:53.337-0700 71ec97bfd6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mon[5082]: 2024-07-29T00:00:53.337-0700 71ec97bfd6c0 -1 mon.hv1@0(leader) e3 *** Got Signal Hangup ***
Jul 29 00:00:53 hv1 ceph-osd[5972]: 2024-07-29T00:00:53.337-0700 7c3a547336c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5956]: 2024-07-29T00:00:53.337-0700 7eef2abeb6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5969]: 2024-07-29T00:00:53.337-0700 74c9d66176c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5964]: 2024-07-29T00:00:53.337-0700 7bd09c0b86c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5979]: 2024-07-29T00:00:53.337-0700 7d69808206c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5975]: 2024-07-29T00:00:53.337-0700 7f8cc67c66c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5977]: 2024-07-29T00:00:53.337-0700 70dd3cd8a6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mgr[250261]: 2024-07-29T00:00:53.341-0700 7c2b2a58c6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
 
Those messages are normal; they are part of rotating the logs.
 
For people coming here from Google searches:

I realized that there were not enough Ceph daemons running, which is why my "cluster" died every time ceph-mgr or ceph-mon was restarted. The Proxmox UI shows how many of them you have, but it does not warn you if you only have, for example, a single manager instance. As an inexperienced Ceph user this is very easy to overlook, and I believe it could be improved on the Proxmox UI side. Of course, it is amazing that Proxmox lets us create Ceph clusters via the GUI.
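
If you want to check this on your own cluster, the counts are easy to see from the CLI on any node:
Bash:
# cluster health plus a one-line summary of mons, mgrs and osds
ceph -s
# explicit monitor and manager status
ceph mon stat
ceph mgr stat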

By the way, my "cluster" of two consumer SSDs shows 250 MB/s of "sequential" writes (qm disk import is a good test). I recently added a third one, and throughput dropped to 10 MB/s (a 25x drop). So for my primitive homelab purposes, two consumer disks work a lot better than a three-node cluster (and the two-disk setup is very stable, to the point of surviving hard reboots without breaking anything).
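
If anyone wants to compare numbers: qm disk import is what I used, but a pool-level benchmark like rados bench should show the same effect (testpool is my pool name, adjust to yours):
Bash:
# 30 seconds of 4 MB object writes against the pool
rados bench -p testpool 30 write --no-cleanup
# remove the benchmark objects afterwards
rados -p testpool cleanup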
 
I recently added a third one, and throughput dropped to 10 MB/s (a 25x drop).

During the required initial re-balancing this might be normal. After balancing has finished it seems... slow. Remember: the slowest of your (too) cheap SSDs limits the write speed for all of them.

How is the ceph pool organized? What is the output of ceph osd pool ls detail?

Are you still with only two machines? (Edit: with a third OSD - as I understood it?) How are the machines connected? 1 GBit/s or more?


To be clear: the failure domain usually is "host". You probably want to be able to lose one and stay "up". After that one is dead you need two nodes up to administer Proxmox, and the same two for Ceph not to drop down to read-only. So three nodes is the absolute minimum for a cluster. While PVE and Ceph are independent stacks, this is true for the Proxmox quorum and also for the Ceph MON majority. (There is no trick like a quorum device for Ceph.)
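
If you want to see how Ceph itself views this, the CRUSH tree shows the "host" buckets (your failure domains) and the MON quorum can be checked directly:
Bash:
# OSDs grouped by host - with failure domain "host", each replica
# must land under a different host bucket
ceph osd tree
# which monitors are currently in (and required for) the quorum
ceph quorum_status --format json-pretty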

Good luck!
 
Hey Udo, thank you for the tips!
During the required initial re-balancing this might be normal. After balancing has finished it seems... slow. Remember: the slowest of your (too) cheap SSDs limits the write speed for all of them.
I believe the speed stayed the same after the re-balancing. I noticed the process in the UI; it showed around 14 hours to finish, I let it run overnight, and disk import was still slow the next day.
How is the ceph pool organized? What is the output of ceph osd pool ls detail?
Here you go:
Bash:
root@nuc:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 14 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 2.00
pool 8 'testpool' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 1321 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.02
pool 9 'cephfs_data' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 1319 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.06
pool 10 'cephfs_metadata' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1318 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.25

Are you still with only two machines? (Edit: with a third OSD - as I understood it?) How are the machines connected? 1 GBit/s or more?
I have 3 machines now, yes, but only 2 OSDs due to the speed issues :) The Ceph network uses a Thunderbolt connection; iperf shows ~20 Gb/s between the nodes.
To be clear: the failure domain usually is "host". You probably want to be able to lose one and stay "up". After that one is dead you need two nodes up to administer Proxmox, and the same two for Ceph not to drop down to read-only. So three nodes is the absolute minimum for a cluster. While PVE and Ceph are independent stacks, this is true for the Proxmox quorum and also for the Ceph MON majority. (There is no trick like a quorum device for Ceph.)
Yes, I wanted the ability to turn off one of the machines for dust cleaning and still have the VMs running. I am actually considering lowering the replicas to 1 in Ceph for now, just to have that. Later I plan to get some PLP disks; it is just a bit too expensive to buy everything at once for a not-so-important side project.

As a side question, would a USB-attached SSD work well with a DB+WAL on a separate PLP disk? I use "thin" Intel NUC machines and there is only space for one internal SSD. I understand I am pushing this too far with my crappy setup :)
Good luck!
Thank you!
 
replicated size 2 min_size 2
Okay. Data is stored twice. But with min_size=2 you cannot lose one and stay up.
lowering the replicas to 1 in Ceph for now
Sure, for debugging or disaster recovery this may work. (I have no experience with this; personally I would really avoid it.) I am sure this has been discussed elsewhere several times. Read: if anything goes wrong while you are in that state, your data is toast.
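
For reference, size/min_size are just per-pool properties; whenever you change them (ideally back to 3/2 once you have the disks), it is one command per setting - pool name taken from your output above:
Bash:
# keep three copies, allow I/O while at least two are available
ceph osd pool set testpool size 3
ceph osd pool set testpool min_size 2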

would a USB-attached SSD work well with a DB+WAL on a separate PLP disk?
I have no idea, sorry. But... a separate WAL was a really good idea when rotating rust was used for the normal OSDs.

Because you are SSD-only I would not do that, for a specific reason: besides the quality of the SSDs, the number of OSDs is important, and the number of connectors in mini PCs is limited. If you are really thinking about connecting devices via USB (which is NOT recommended - but I do it too... in my homelab, not at my dayjob), then opt for more OSDs!
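
And if you do end up placing DB/WAL on a separate device despite all that: on PVE this is chosen when the OSD is created, roughly like this (device names are only placeholders):
Bash:
# data on the (USB) SSD, RocksDB on the PLP device;
# the WAL lives on the DB device unless a separate --wal_dev is given
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1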

One often-heard recommendation is to have four or more OSDs per node. Especially with the minimum number of nodes and only two OSDs per node this is the typical pitfall: as soon as one of the two fails, the other OSD in the same node has to absorb the data formerly stored on the dead one.

For a three-node cluster with two OSDs each, you can only fill every OSD up to 45%. When a single one fails, its neighbor will be at 90%, which is already above the warning level. (With four OSDs per node this effect is not so dramatic.)
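
You can watch this per OSD with ceph osd df; the arithmetic behind the 45% is simply that the surviving OSD of a host has to absorb its neighbor's data:
Bash:
# per-OSD utilization, grouped by host
ceph osd df tree
# with size=3 and three hosts, every host holds one replica of every PG,
# so if one of a host's two OSDs dies the other one takes over its data:
#   45% + 45% = 90%  (above the default nearfull warning of 85%)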

The only good solution is to go for five [edit: or four - for this specific Ceph aspect] (or more) nodes... ;-)

Ceph scales well up and beyond. But at the low end, in a homelab, it is tricky to reach a stable situation. Oh, and everything I wrote is for size=3, min_size=2 in replication mode. Never go below this if you want a stable system.
 
One often-heard recommendation is to have four or more OSDs per node. Especially with the minimum number of nodes and only two OSDs per node this is the typical pitfall: as soon as one of the two fails, the other OSD in the same node has to absorb the data formerly stored on the dead one.

For a three-node cluster with two OSDs each, you can only fill every OSD up to 45%. When a single one fails, its neighbor will be at 90%, which is already above the warning level. (With four OSDs per node this effect is not so dramatic.)

The only good solution is to go for five [edit: or four - for this specific Ceph aspect] (or more) nodes... ;-)
Thank you Udo, this looks like something that can only be learned through practice, and I appreciate the advice. I was thinking of having one OSD on a USB disk and another on the remaining part of a PLP disk, and now that does not look like a great idea.
Ceph scales well up and beyond. But at the low end, in a homelab, it is tricky to reach a stable situation. Oh, and everything I wrote is for size=3, min_size=2 in replication mode. Never go below this if you want a stable system.
This is a very sad realization for me, haha!
 
This is a very sad realization for me, haha!
Yeah - been there, done that.

The important thing is: to be able to lose one item without interrupting normal operation, you need to have one "too many" in the first place. For Ceph OSDs this is "size - min_size" = 3 - 2 = 1, so one OSD may be lost. If you start with size=2, min_size=2 you get 2 - 2 = 0, so none can be lost.

And the concept of "failure domain = host" introduces those other not-so-nice aspects...
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!