Hello everyone!
I have a primitive homelab with 2 machines; it is a hyper-converged setup with Ceph configured from the web UI and used as VM storage.
Every couple of days all the VMs "hang" and become unresponsive both over the network (ssh does not work; I do not remember whether ping also fails) and on the console (there are kernel stack traces and OOMs in dmesg, trying to log in hangs indefinitely, and in an already logged-in session anything that touches storage hangs that session). I assume this is linked to the VM storage suddenly becoming unavailable, although I have not yet tested with a locally stored VM.
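When I get around to that test, I will probably just move one VM disk onto local storage, roughly like this (VM 100 and virtio0 are taken from the drive line further down; "local-lvm" is only a placeholder for whatever local storage the node actually has):
Bash:
# move VM 100's first disk from the Ceph pool onto local storage;
# by default the original stays on the pool as an unused disk
qm disk move 100 virtio0 local-lvm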
On the nodes I can see logrotate sending SIGHUP to the Ceph daemons (the killall comes from the postrotate script in /etc/logrotate.d/ceph-common):
Bash:
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.282+0100 7ed9d0b956c0 -1 received signal: Hangup from (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.282+0100 798d3d32e6c0 -1 received signal: Hangup from (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.282+0100 789e607c96c0 -1 received signal: Hangup from (PID: 1878791) UID: 0
Feb 27 00:00:04 nuc ceph-osd[1477]: 2024-02-27T00:00:04.262+0100 789e607c96c0 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mgr[1198]: 2024-02-27T00:00:04.262+0100 7ed9d0b956c0 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 mon.nuc@0(leader) e4 *** Got Signal Hangup ***
Feb 27 00:00:04 nuc ceph-mon[1199]: 2024-02-27T00:00:04.262+0100 798d3d32e6c0 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse rados>
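For reference, the relevant part of that file is the postrotate hook; trimmed down it looks roughly like this (exact directives may differ between package versions, but the killall -1 line is the one showing up in the journal above):
Bash:
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        # SIGHUP tells the daemons to reopen their log files
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw || true
    endscript
    missingok
    notifempty
}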
Then there are complaints from ceph-crash:
Bash:
Feb 27 00:07:27 nuc ceph-crash[670]: WARNING:ceph-crash:unable to read crash path /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588
I assume this is a bug with directory permissions, because not even root can list the files there (tab completion finds nothing). I attached the log found under that path; from what I have researched it looks like Ceph is trying to access something that never existed, which makes me think it is probably not related to anything I configured through the web UI.
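In case it helps, these are the kinds of checks I ran (or would run) on that path; ceph crash ls is the cluster-side view of the same data:
Bash:
# ownership and mode of the crash directory and of the entry ceph-crash complains about
ls -ld /var/lib/ceph/crash
stat /var/lib/ceph/crash/2024-02-15T09:14:19.817639Z_7d1eee2d-ac78-4062-9a13-a29e92644588
# what the cluster itself has recorded
ceph crash ls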
From the node, however, Ceph appears to be working as usual (status reports everything healthy and all PGs active+clean). To get the VMs working again without rebooting the cluster, I once killed every process that had "ceph" anywhere on its command line. That also killed all the VMs, because their KVM processes carry this argument:
Bash:
-drive file=rbd:testpool/vm-100-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/testpool.keyring,if=none,id=drive-virtio0,format=raw,cache=none,aio=io_uring,detect-zeroes=on
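For completeness, this is roughly what I did (from memory, so treat the exact commands as approximate):
Bash:
ceph -s            # still reported a healthy cluster, PGs active+clean
pgrep -af ceph     # matches the Ceph daemons *and* the kvm processes, because of the rbd/ceph.conf bits above
pkill -9 -f ceph   # kills all of them, VMs included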
After that I was able to start the VMs again and everything was back to normal. Just restarting the VMs on their own (killing them with "pkill -9" and starting them again) does not help: connecting to the console through Proxmox's web interface simply times out, and I don't think I saw any logs explaining why. I have not tried starting the VMs from the command line, though.
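Next time I will probably try that from the node shell, something like this (VM ID 100 taken from the drive line above):
Bash:
qm showcmd 100   # print the full kvm command line PVE would use
qm start 100     # start the VM outside the web UI to see the actual error, if any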
I figured I would ask here because it does not look like Ceph handles the restart/reload triggered by logrotate properly, and I cannot find anything that would let me influence that behaviour (not that I would know how, lol).
Here are the versions of PVE and Ceph:
Bash:
root@nuc:~# pveversion
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.13-1-pve)
root@nuc:~# ceph --version
ceph version 18.2.1 (850293cdaae6621945e1191aa8c28ea2918269c3) reef (stable)