Hello there, apologies in advance if this is the wrong place, but I'm at my wits' end. This may require more real-time chatting. I've been trying to debug this for a month, going through all the documentation, changing hardware, and everything else I can think of.
`ceph-csi-cephfs` seems to cause my system (running kernel `6.16.2`, `6.16.1`, `6.15.11`?, `6.12.43`) to freeze or deadlock, causing the watchdog to force a reboot. The affected machine is a VM; the Ceph host is Proxmox running the `6.14.8` kernel (and previous versions). The problem does not happen with a CephFS volume mounted on the host itself. The issue also occurred with Ceph 19.2.2 and the latest Proxmox 8. `ceph-csi-cephfs` is installed via the Helm chart, currently 3.15.0, but it happened with 3.14.x as well.
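For comparison, the host-side mount that has been stable is just a plain kernel CephFS mount, roughly like this (monitor address, mount point, and secret path are placeholders, not my real values):

```bash
# Plain kernel CephFS mount on the Proxmox host itself; this has never hung.
# Monitor address, mount point, and secret file below are placeholders.
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret
```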
Currently Ceph is `19.2.3`. Hardware is 3 nodes, each with a 9700X CPU, 3 OSDs (1 NVMe Micron 7450 Pro with a heatsink and no thermal issues, 2 HDD Seagate Exos), 64 GB RAM, and 2x 10GbE Ethernet ports (1 for Proxmox, 1 for Ceph storage). Each VM has a 2.5GbE NIC.
I cannot seem to trigger it consistently, but I have two pods mounting 2 static CephFS volumes (1 shared, both backed by HDDs), plus their own dynamic volumes on the NVMes. My current workload has one pod adding files to the shared static volume and then moving them to the other static volume. When the hang happens, the node gets rebooted and all of the ceph-csi-cephfs pods show exit code 255, with nothing in the logs.
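For reference, the static volumes are defined roughly like this; the filesystem name, rootPath, secret, and sizes below are placeholders, not my real values (the dynamic volumes just use the StorageClass from the Helm chart):

```bash
# Rough sketch of one of the two static CephFS PVs (values are illustrative).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-static-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: cephfs.csi.ceph.com
    volumeHandle: shared-static-pv
    volumeAttributes:
      fsName: cephfs
      staticVolume: "true"
      rootPath: /volumes/shared
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi-cephfs
EOF
```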
I don't see anything on the host machine. The closest thing is `client.0 error registering admin socket command: (17) File exists` from the mgr on one of the nodes around the same time, but that may just be the other pod picking it up (this seems to be an unrelated issue: everything runs fine, and manually stopping the mgr, deleting the socket file, and restarting it still produces this error).
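What I did for that mgr workaround was roughly the following (node name is a placeholder); the error comes right back after the mgr restarts:

```bash
# Attempted workaround for the admin socket error on one Proxmox node
# (node name is a placeholder); the message reappears after the restart.
systemctl stop ceph-mgr@pve1
rm /var/run/ceph/ceph-mgr.pve1.asok
systemctl start ceph-mgr@pve1
```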
The only other thing I see in the OSD logs is `osd.0 1541 mon_cmd_maybe_osd_create fail: 'osd.0 has already bound to class 'nvme', can not reset class to 'ssd'; use 'ceph osd crush rm-device-class <id>>`. I'm sure that's just a configuration option I haven't found yet; I picked `nvme` as the device class when I created the OSD in the Proxmox UI.
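If that class mismatch ever actually needs fixing, I assume it's the command the message points at, something along these lines (untested on my cluster):

```bash
# Presumably how the device class would be cleared and re-set (untested);
# osd.0 is the OSD from the log line above.
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0
```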
I see nothing on the host indicating something went wrong, but my pods' data (their own dynamic volumes) gets corrupted, and I have to start over again. I think I've ruled out hardware issues: I bought a separate NIC for each VM and use PCI passthrough, and there seems to be no packet loss.
If anyone has any tips on where to begin figuring this out, I would be very grateful, thanks! Because of the hard lock, I figure it must be the Ceph client in the kernel, but this is my first time using Ceph and I'm not sure how to go about debugging this; it isn't in the standard troubleshooting list.
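Would something like the following be a sensible way to catch the hang before the watchdog reboots the node? These are just the standard kernel hung-task/softlockup knobs plus a serial console on the Proxmox side, nothing Ceph-specific, and I haven't verified they actually capture anything useful here:

```bash
# Inside the VM: report and panic on stuck (D-state) tasks instead of
# hanging silently until the watchdog fires.
sysctl -w kernel.hung_task_timeout_secs=60
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.softlockup_panic=1

# On the Proxmox host: attach a serial console to the VM so the panic/oops
# is captured even when the guest is wedged.
# qm set <vmid> --serial0 socket   (and add console=ttyS0 to the guest kernel cmdline)
```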