VM quits unintentionally every night

RoFrie

Member
May 10, 2020
For some time now, the VM running my Proxmox Backup Server (PBS) has been quitting unexpectedly every night. The backups scheduled after that then fail, of course.

I have already searched the logs for a cause, but so far without success.

As a quick fix, I restart PBS every night via crontab before the scheduled backup:
  • qm[2255297]: <root@pam> end task UPID: pve72:...:qmstart:300:root@pam: OK
So that works.
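
A possible refinement of this workaround, as a minimal sketch only: instead of starting the VM blindly, a small script called from cron could first ask `qm status` whether VM 300 is running and start it only if it is not. The function names are hypothetical, the script assumes it runs as root on the PVE node, and it assumes `qm status` prints a line of the form "status: running".

```python
import subprocess

VMID = 300  # PBS VM ID, taken from the log lines in this post


def vm_is_running(vmid, run=subprocess.run):
    """Return True if `qm status <vmid>` reports the VM as running.

    Assumption: the command prints a single line such as "status: running".
    """
    result = run(
        ["qm", "status", str(vmid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "status: running"


def ensure_running(vmid, run=subprocess.run):
    """Start the VM only if it is not running; return True if it was started."""
    if vm_is_running(vmid, run=run):
        return False
    run(["qm", "start", str(vmid)], check=True)
    return True
```

Calling `ensure_running(VMID)` from a cron job would then log a start task only on nights when the VM had actually stopped, which also makes it easier to see how often the problem occurs. The `run` parameter exists only so the command runner can be replaced when trying the script out away from the node.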

This is followed by some messages that seem harmless to me:
  • got inotify poll request in wrong process - disabling inotify
  • pam_unix(cron:session): session closed for user root
  • pvedaemon[...]: got timeout

But then messages related to Ceph appear:
  • pvestatd[2072]: got timeout
  • pveproxy[2268810]: proxy detected vanished client connection
It looks like Ceph is suddenly having problems:
  • ceph-osd[2126]: 2022-09-02T03:08:49.715+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
  • ceph-osd[2126]: 2022-09-02T03:08:49.767+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
  • ceph-osd[2126]: 2022-09-02T03:08:50.107+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
  • ceph-osd[2126]: 2022-09-02T03:08:53.271+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0 since back 2022-09-02T03:08:49.4>
  • ceph-osd[2126]: 2022-09-02T03:08:53.287+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6814 osd.1 since back 2022-09-02T03:08:19.5>
  • ceph-osd[2126]: 2022-09-02T03:08:53.679+0200 7f161ba44700 -1 osd.2 3529 heartbeat_check: no reply from 192.168.178.123:6806 osd.3 since back 2022-09-02T03:08:19.9>
  • ceph-osd[2126]: 2022-09-02T03:08:54.339+0200 7f161ba44700 -1 osd.2 3529 get_health_metrics reporting 1 slow ops, oldest is osd_repop(client.69273025.0:25987546 8.>
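
To see at a glance which peers an OSD is losing contact with, the heartbeat_check lines can be tallied per reporting OSD and peer. The following is a rough sketch: the regular expression is derived only from the line format shown above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "osd.2 3529 heartbeat_check: no reply from 192.168.178.113:6806 osd.0"
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>[\d.]+:\d+) (?P<peer>osd\.\d+)"
)


def count_missed_heartbeats(lines):
    """Count 'no reply' messages per (reporting OSD, peer OSD, peer address)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            key = (match["reporter"], match["peer"], match["addr"])
            counts[key] += 1
    return counts
```

Fed with the journal lines from the time of the failure, the result shows whether the missing replies are concentrated on one host or address, which would point toward a network or load problem on that node rather than a problem with a single OSD.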

After that, both VMs of the server are no longer accessible:
  • pvestatd[2072]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - got timeout
  • pvestatd[2072]: VM 300 qmp command failed - VM 300 qmp command 'query-proxmox-support' failed - got timeout

VM 108 somehow seems to recover, but PBS (VM 300) does not, so the subsequent backups fail.
 
