VM quits unintentionally every night

RoFrie

Member
May 10, 2020
For some time now, my Proxmox Backup Server (PBS) VM has been shutting down unexpectedly every night. The backups scheduled after that then fail, of course.

I have already searched the logs for a cause, but without success so far.

As a quick fix, I start the PBS VM every night via crontab before the scheduled backup:
  • qm[2255297]: <root@pam> end task UPID: pve72:...:qmstart:300:root@pam: OK
So that works.
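
For anyone who wants to reproduce the workaround, here is a minimal sketch of a "start only if stopped" helper that cron could call. It is not my actual crontab entry. It assumes that `qm status <vmid>` prints `status: stopped` for a stopped VM, and the function names `is_stopped` and `ensure_running` are made up for this example.

```python
import subprocess
import sys


def is_stopped(status_output: str) -> bool:
    """Return True if the output of `qm status` reports the VM as stopped."""
    return status_output.strip() == "status: stopped"


def ensure_running(vmid: int) -> bool:
    """Start the VM if it is stopped; return True if a start was issued."""
    result = subprocess.run(
        ["qm", "status", str(vmid)],
        capture_output=True, text=True, check=True,
    )
    if is_stopped(result.stdout):
        subprocess.run(["qm", "start", str(vmid)], check=True)
        return True
    return False


# Only runs when a VM ID is passed on the command line, e.g. "300".
if __name__ == "__main__" and len(sys.argv) > 1:
    started = ensure_running(int(sys.argv[1]))
    print("start issued" if started else "not stopped, nothing to do")
```

Called with `300` as the argument shortly before the backup window, this would only issue a start when the VM is really down, which also shows in the cron output on which nights it had stopped.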

This is followed by some messages that seem harmless to me:
  • got inotify poll request in wrong process - disabling inotify
  • pam_unix(cron:session): session closed for user root
  • pvedaemon[...]: got timeout

But then messages appear that seem to be related to Ceph:
  • pvestatd[2072]: got timeout
  • pveproxy[2268810]: proxy detected vanished client connection

It looks like Ceph is suddenly having problems:
  • ceph-osd[2126]: 2022-09-02T03:08:49.715+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
  • ceph-osd[2126]: 2022-09-02T03:08:49.767+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
  • ceph-osd[2126]: 2022-09-02T03:08:50.107+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
  • ceph-osd[2126]: 2022-09-02T03:08:53.271+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
  • ceph-osd[2126]: 2022-09-02T03:08:53.287+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
  • ceph-osd[2126]: 2022-09-02T03:08:53.679+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
  • ceph-osd[2126]: 2022-09-02T03:08:54.339+0200 7f161ba44700 -1 osd.2 3529 get_health_metrics reporting 1 slow ops, oldest is osd_repop(client.69273025.0:25987546 8.>
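
To see at a glance which peers stop answering, the heartbeat lines above can be tallied per OSD. This is only a rough sketch: the regular expression is derived from the log lines in this post, not from any Ceph specification, and `count_missed_heartbeats` and `SAMPLE` are names made up for this example.

```python
import re
from collections import Counter

# Matches e.g. "osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0"
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) (?P<peer>osd\.\d+)"
)


def count_missed_heartbeats(lines):
    """Count 'no reply' reports per (reporting OSD, silent peer, peer address)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match["reporter"], match["peer"], match["addr"])] += 1
    return counts


# A few lines copied from the journal excerpt above (still truncated with ">").
SAMPLE = [
    "ceph-osd[2126]: 2022-09-02T03:08:49.715+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>",
    "ceph-osd[2126]: 2022-09-02T03:08:49.767+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>",
    "ceph-osd[2126]: 2022-09-02T03:08:53.271+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>",
    "ceph-osd[2126]: 2022-09-02T03:08:54.339+0200 7f161ba44700 -1 osd.2 3529 get_health_metrics reporting 1 slow ops, oldest is osd_repop(client.69273025.0:25987546 8.>",
]

if __name__ == "__main__":
    for (reporter, peer, addr), n in count_missed_heartbeats(SAMPLE).most_common():
        print(f"{reporter} got no reply from {peer} at {addr}: {n}x")
```

In the excerpt, osd.2 reports osd.0, osd.1 and osd.3 as silent, and those peers sit on two other addresses (192.168.178.113 and 192.168.178.123). Running the same count over the full journal of each node might show whether only one host loses contact or all of them do.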

After that, both VMs on this server are no longer accessible:
  • pvestatd[2072]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - got timeout
  • pvestatd[2072]: VM 300 qmp command failed - VM 300 qmp command 'query-proxmox-support' failed - got timeout

VM 108 somehow seems to recover, but PBS (VM 300) does not, so the subsequent backups fail.