For some time now, my Proxmox Backup Server (PBS) VM has been dropping out every night, and the backups that follow then fail.
I have already tried to find a reason in the logs. However, I have failed so far.
As a quick fix, I restart PBS every night via crontab before the scheduled backup:
- qm[2255297]: <root@pam> end task UPID: pve72:...:qmstart:300:root@pam: OK
This is followed by some messages that seem harmless to me:
- got inotify poll request in wrong process - disabling inotify
- pam_unix(cron:session): session closed for user root
- pvedaemon[...]: got timeout
But then messages related to Ceph appear:
- pvestatd[2072]: got timeout
- pveproxy[2268810]: proxy detected vanished client connection
It looks like Ceph is suddenly having problems:
- ceph-osd[2126]: 2022-09-02T03:08:49.715+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
- ceph-osd[2126]: 2022-09-02T03:08:49.767+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
- ceph-osd[2126]: 2022-09-02T03:08:50.107+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
- ceph-osd[2126]: 2022-09-02T03:08:53.271+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
- ceph-osd[2126]: 2022-09-02T03:08:53.287+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
- ceph-osd[2126]: 2022-09-02T03:08:53.679+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
- ceph-osd[2126]: 2022-09-02T03:08:54.339+0200 7f161ba44700 -1 osd.2 3529 get_health_metrics reporting 1 slow ops, oldest is osd_repop(client.69273025.0:25987546 8.>
After that, both VMs of the server are no longer accessible:
- pvestatd[2072]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - got timeout
- pvestatd[2072]: VM 300 qmp command failed - VM 300 qmp command 'query-proxmox-support' failed - got timeout
Somehow VM 108 seems to recover, but not PBS (VM 300).
Subsequent backups then fail, of course.
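
In case it helps anyone narrow this down: the heartbeat_check lines above could be tallied per OSD peer to see which peers stop answering and how often. This is only a minimal sketch, assuming the log has been saved as plain text in the same format as the lines quoted above. The function name `count_missed_heartbeats` and the regular expression are my own, not part of any Ceph or Proxmox tool.

```python
import re
import sys
from collections import Counter

# Assumed line format, taken from the excerpts in this post:
#   ... osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back ...
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>[\d.]+):\d+ (?P<peer>osd\.\d+)"
)


def count_missed_heartbeats(lines):
    """Count 'no reply' reports per (reporting OSD, silent peer OSD, peer IP)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            key = (match["reporter"], match["peer"], match["addr"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Reads the saved log from standard input and prints the most affected peers first.
    for (reporter, peer, addr), n in count_missed_heartbeats(sys.stdin).most_common():
        print(f"{reporter} -> {peer} ({addr}): {n} missed heartbeat reports")
```

If all missed heartbeats point at peers on the other nodes, that would hint at a network or load problem between the nodes rather than at a single failing disk, but that is a guess until the counts are in.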