logrotate restarts Ceph and VMs hang

pratclot

New Member
Jan 28, 2024

Hello everyone!

I have a primitive homelab with 2 machines; it is a hyper-converged setup with Ceph configured from the UI (used as VM storage).

Every 2 days or so, all the VMs "hang" and become unresponsive both over the network (ssh does not work, though I do not remember whether ping also fails) and on the console (there are kernel stack traces and OOMs in dmesg, trying to log in hangs indefinitely, and in an already logged-in session anything that touches storage hangs that session). I assume this is linked to the VM storage suddenly becoming unavailable (although I did not test with a locally stored VM).

On the nodes I see logrotate restarting the Ceph services (the trigger is in /etc/logrotate.d/ceph-common):
Bash:
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.282+0100 7ed9d0b956c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.282+0100 789e607c96c0 -1 received  signal: Hangup from  (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.262+0100 789e607c96c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.262+0100 7ed9d0b956c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
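
For context, the SIGHUP comes from the postrotate hook in /etc/logrotate.d/ceph-common. I am not pasting my exact file, but it should look roughly like this (the killall command is the one visible in the log above; the other directives may differ between Ceph versions):
Code:
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        # SIGHUP tells the daemons to reopen their (rotated) log files
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror || true
    endscript
    missingok
    notifempty
    su root ceph
}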

Then there are complaints from ceph-crash:
Bash:
Feb 27 00:07:27 nuc ceph-crash[670]: WARNING:ceph-crash:unable to read crash path /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588

I understand this is a directory-permissions bug, because not even root can list the files there (tab completion shows nothing). I attached the log found under that path; from what I have researched, it looks like Ceph is trying to access something that never existed, which makes me think it is probably not related to anything I configured through the web UI.
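
In case it helps, this is the kind of check I mean; the chown at the end is only a guess based on the usual ceph:ceph ownership under /var/lib/ceph, not something I have verified as the proper fix:
Bash:
# check ownership and modes of the crash directories
ls -ld /var/lib/ceph/crash /var/lib/ceph/crash/*
stat /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588

# ceph-crash normally runs as the unprivileged "ceph" user, so the
# directories have to be readable by it (assumption: ownership is the issue)
chown -R ceph:ceph /var/lib/ceph/crash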

From the node, though, Ceph appears to be working as usual (status reports that the volumes are healthy and the PGs are active+clean). To get the VMs working again without rebooting the cluster, I once killed every process with "ceph" on its command line. This also killed all the VMs, because they are started with this argument:
Bash:
-drive file=rbd:testpool/vm-100-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/testpool.keyring,if=none,id=drive-virtio0,format=raw,cache=none,aio=io_uring,detect-zeroes=on

After that I was able to restart the VMs and everything was back to normal. Just restarting the machines (via "pkill -9") does not help, and connecting through the console in Proxmox's web interface just times out; I don't think I saw any logs explaining why the timeout happens. I did not try starting the machines from the command line though.
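
For reference, the manual recovery above boils down to something like this; ceph.target and the qm command are the standard bits on a PVE node, but the exact sequence is just what I pieced together, not an official procedure:
Bash:
# kill everything with "ceph" on its command line; this also takes down the
# VMs, because their -drive rbd arguments reference /etc/pve/ceph.conf
pkill -9 -f ceph

# systemd usually restarts the Ceph units on its own; if not, this should
# bring them back up
systemctl restart ceph.target

# then start the VMs again from the CLI (100 is just an example VMID)
qm start 100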

I figured I would try asking here because it does not look like Ceph is doing everything properly when asked to restart, and I cannot find anything that would let me influence this behavior (not that I would know how, lol).

Here are the versions of PVE and Ceph.

Bash:
root@nuc:~# pveversion
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.13-1-pve)
root@nuc:~# ceph --version
ceph version 18.2.1 (850293cdaae6621945e1191aa8c28ea2918269c3) reef (stable)
 

Attachments

  • log.tar.gz
    1.8 KB
I have a primitive homelab with 2 machines; it is a hyper-converged setup with Ceph configured from the UI (used as VM storage).

Ceph requires (at least!) three nodes to work reliably.
 
@fabian I am seeing this issue with a 3-node Ceph cluster.

Code:
Jul 29 00:00:53 hv1 systemd[1]: Starting logrotate.service - Rotate log files...
Jul 29 00:00:53 hv1 ceph-mds[5078]: 2024-07-29T00:00:53.337-0700 79e509b5c6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mon[5082]: 2024-07-29T00:00:53.337-0700 71ec97bfd6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mon[5082]: 2024-07-29T00:00:53.337-0700 71ec97bfd6c0 -1 mon.hv1@0(leader) e3 *** Got Signal Hangup ***
Jul 29 00:00:53 hv1 ceph-osd[5972]: 2024-07-29T00:00:53.337-0700 7c3a547336c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5956]: 2024-07-29T00:00:53.337-0700 7eef2abeb6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5969]: 2024-07-29T00:00:53.337-0700 74c9d66176c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5964]: 2024-07-29T00:00:53.337-0700 7bd09c0b86c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5979]: 2024-07-29T00:00:53.337-0700 7d69808206c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5975]: 2024-07-29T00:00:53.337-0700 7f8cc67c66c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-osd[5977]: 2024-07-29T00:00:53.337-0700 70dd3cd8a6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
Jul 29 00:00:53 hv1 ceph-mgr[250261]: 2024-07-29T00:00:53.341-0700 7c2b2a58c6c0 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1179295) UID: 0
 
Those messages are normal; they are part of rotating the logs. The SIGHUP only makes the daemons reopen their log files, it does not restart them.
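
One way to convince yourself of that is to compare the daemons' start times with the logrotate run: after a plain SIGHUP the uptime should be unaffected. Something along these lines (the unit-name globs are the usual per-instance units, adjust to your node):
Bash:
# if logrotate had really restarted the daemons, "Active: ... since" would
# show a time just after midnight; after a SIGHUP it stays unchanged
systemctl status 'ceph-mon@*' 'ceph-mgr@*' 'ceph-osd@*' | grep -E 'ceph-.*@|Active:'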
 
