Hi,
We have a 4-node cluster with Ceph, configured and running for years.
Recently, starting around 6.2-6, our SMB backups began failing regularly (ZSTD compression, snapshot mode). Same situation with 6.2-12.
It's likely due to a network failure, but we have no idea why it happens - nothing obvious was found.
Some dmesg output from that point:
Code:
[32528.438433] CIFS VFS: Close unmatched open
[32528.581811] CIFS VFS: No writable handle in writepages rc=-9
[32528.582850] CIFS VFS: No writable handle in writepages rc=-9
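In case it matters, this is roughly how I plan to gather more CIFS detail the next time the error shows up (these are the standard cifs.ko debug interfaces; whether they reveal anything useful here is just my guess):
Code:
# timestamped kernel log, filtered to CIFS messages
dmesg -T | grep -i cifs

# current state of the CIFS sessions/shares known to the cifs kernel module
cat /proc/fs/cifs/DebugData

# enable verbose CIFS logging until the next failure (can be noisy)
echo 1 > /proc/fs/cifs/cifsFYI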
Even worse, once the backup is stuck, the LXC/VM is locked and does not respond to anything.
It gets even more problematic as soon as I click "abort the backup" in the Proxmox admin GUI: the whole node becomes unresponsive after that.
There is no option left other than rebooting the node.
Starting/stopping/rebooting any LXC/VM via console is no longer possible (GUI/shell/local shell), and even the guests that are still running normally get slower and become unresponsive over time.
I know such things are likely not easy to track down, and I can't reproduce it at the moment; I have to wait for the next failing backup.
Are there any logs that could help find the underlying issue? I could collect them at the next failing backup.
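For reference, this is roughly what I would capture right after the next hang, unless someone suggests something better (the exact selection is my own guess at what's relevant):
Code:
# kernel messages with readable timestamps
dmesg -T > dmesg-$(hostname)-$(date +%F).log

# journal for the current boot (includes pvedaemon/pveproxy/pvestatd)
journalctl -b > journal-$(hostname)-$(date +%F).log

# package versions for the report
pveversion -v > pveversion-$(hostname).log

# full Proxmox system report
pvereport > pvereport-$(hostname)-$(date +%F).log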