Hi!
We have a problem with one of our Proxmox hosts, which is part of a 3-host cluster.
This host runs 7 VMs, all on CentOS 7. Backups of every VM complete without errors except for one VM that has a lot of storage. When that backup starts, the host suddenly reboots partway through. The logs say the host lost its connection to the backup storage, but that does not seem to be true: at the moment it reports the disconnect, I can still access the backup storage from Proxmox on any other host.
We tried running this backup to local storage, but it fails the same way. We have HA set up, so every Monday, after getting the "backup failed" email, we find all services running on the other server because of this error.
We also seem to have issues with ZFS RAM usage. This host has 188GB of RAM and ZFS sometimes uses up to 100GB; I assume that is because the host has 12TB of storage, of which 10TB is assigned to one VM. We worked around it with a cron job that releases 2GB of RAM from the cache each time it runs. We looked into limiting the ZFS cache, but we're doing something wrong and the limit is not being applied. We were told ZFS shouldn't use more than 50% of memory, but in our case it uses far more (70% or more).
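To make the ARC numbers concrete, here is a minimal sketch of how the current ARC size and its ceiling could be read, and how a limit line could be generated. It assumes the usual OpenZFS-on-Linux kstat file `/proc/spl/kstat/zfs/arcstats` and the `zfs_arc_max` module parameter; the helper names and the 16 GiB figure are purely illustrative, not a recommendation for this host.

```python
from pathlib import Path

# Standard OpenZFS-on-Linux kstat file; only present when the zfs module is loaded.
ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")
GIB = 1024 ** 3


def parse_arcstats(text):
    """Return {name: value} for the 'name type data' rows of an arcstats dump."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric last field;
        # this skips the kstat header line and the 'name type data' title row.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def modprobe_line(limit_gib):
    """Build the modprobe option line for an ARC cap given in GiB (hypothetical helper)."""
    return f"options zfs zfs_arc_max={limit_gib * GIB}"


if __name__ == "__main__":
    if ARCSTATS.exists():
        stats = parse_arcstats(ARCSTATS.read_text())
        print(f"ARC size now : {stats['size'] / GIB:.1f} GiB")
        print(f"ARC max (c_max): {stats['c_max'] / GIB:.1f} GiB")
    # 16 GiB is an arbitrary example value, chosen only to show the format.
    print(modprobe_line(16))
```

The generated line would normally go into `/etc/modprobe.d/zfs.conf`. As far as we understand, it only takes effect when the module is loaded, so on a host that boots from ZFS the initramfs has to be refreshed (`update-initramfs -u`) before rebooting; the same value can also be written to `/sys/module/zfs/parameters/zfs_arc_max` at runtime. If `c_max` still shows the old value afterwards, the limit was not applied.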
We can't see any error in the logs other than the storage issue, but it doesn't make sense that the server suddenly reboots (it doesn't even shut down the VMs, it just does a hard reset!). Here is the log leading up to the reboot:
Sep 07 14:00:09 LabHost qm[3615100]: <root@pam> starting task UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam:
Sep 07 14:00:09 LabHost qm[3615159]: suspend VM 402: UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam:
Sep 07 14:00:09 LabHost qm[3615100]: <root@pam> end task UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam: OK
Sep 07 14:00:12 LabHost pvescheduler[3615092]: VM 402 qmp command failed - VM 402 qmp command 'guest-ping' failed - got timeout
Sep 07 14:10:01 LabHost CRON[3619506]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:10:01 LabHost CRON[3619507]: (root) CMD (sync; echo 3 > /proc/sys/vm/drop_caches)
Sep 07 14:10:26 LabHost kernel: sh (3619507): drop_caches: 3
Sep 07 14:10:26 LabHost CRON[3619506]: pam_unix(cron:session): session closed for user root
Sep 07 14:17:01 LabHost CRON[3622238]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:17:01 LabHost CRON[3622241]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 07 14:17:01 LabHost CRON[3622238]: pam_unix(cron:session): session closed for user root
Sep 07 14:19:08 LabHost pve-firewall[2019]: firewall update time (9.590 seconds)
Sep 07 14:19:12 LabHost pvestatd[2023]: status update time (33.853 seconds)
Sep 07 14:19:57 LabHost pve-ha-lrm[2064]: loop take too long (55 seconds)
Sep 07 14:20:01 LabHost CRON[3623200]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:20:01 LabHost CRON[3623201]: (root) CMD (sync; echo 3 > /proc/sys/vm/drop_caches)
Sep 07 14:20:04 LabHost pvestatd[2023]: status update time (52.078 seconds)
Sep 07 14:20:23 LabHost pve-ha-lrm[3623259]: VM 402 qmp command failed - VM 402 qmp command 'query-status' failed - got timeout
Sep 07 14:20:23 LabHost pve-ha-lrm[3623259]: VM 402 qmp command 'query-status' failed - got timeout
Sep 07 14:20:24 LabHost kernel: sh (3623201): drop_caches: 3
Sep 07 14:20:25 LabHost CRON[3623200]: pam_unix(cron:session): session closed for user root
Sep 07 14:20:36 LabHost pvestatd[2023]: storage 'ProxmoxBackups' is not online
Sep 07 14:20:45 LabHost pvestatd[2023]: status update time (40.971 seconds)
Sep 07 14:21:34 LabHost pve-firewall[2019]: firewall update time (15.472 seconds)
Sep 07 14:21:40 LabHost pvestatd[2023]: status update time (24.866 seconds)
Sep 07 14:21:56 LabHost pmxcfs[1924]: [dcdb] notice: data verification successful
Sep 07 14:21:58 LabHost pvestatd[2023]: VM 401 qmp command failed - VM 401 qmp command 'query-proxmox-support' failed - got timeout
Sep 07 14:22:21 LabHost pvestatd[2023]: VM 405 qmp command failed - VM 405 qmp command 'query-proxmox-support' failed - got timeout
Sep 07 14:22:21 LabHost pve-firewall[2019]: firewall update time (23.302 seconds)
Sep 07 14:22:21 LabHost pvestatd[2023]: status update time (30.892 seconds)
Sep 07 14:22:40 LabHost pvestatd[2023]: VM 402 qmp command failed - VM 402 qmp command 'query-proxmox-support' failed - unable to connect to VM 402 qmp socket - timeout after 51 retries
Sep 07 14:23:26 LabHost pvestatd[2023]: storage 'ProxmoxBackups' is not online
Sep 07 14:23:31 LabHost pve-ha-crm[2054]: loop take too long (57 seconds)
Sep 07 14:23:36 LabHost snmpd[1770]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Sep 07 14:23:44 LabHost pvestatd[2023]: status update time (73.041 seconds)
-- Reboot --
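In the log above, the timings reported by pvestatd, pve-firewall, pve-ha-lrm and pve-ha-crm keep growing in the minutes before the reset. A rough sketch for pulling those entries out of a longer journal is below; it assumes the default short journal format shown above, and the function name and the 10-second threshold are arbitrary choices for illustration.

```python
import re
import sys

# Matches lines such as:
#   Sep 07 14:19:57 LabHost pve-ha-lrm[2064]: loop take too long (55 seconds)
DELAY = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ (?P<unit>[\w-]+)\[\d+\]: "
    r".*\((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def slow_loops(lines, threshold=10.0):
    """Return (timestamp, unit, seconds) for every reported delay >= threshold."""
    found = []
    for line in lines:
        match = DELAY.match(line)
        if match and float(match["secs"]) >= threshold:
            found.append((match["ts"], match["unit"], float(match["secs"])))
    return found


if __name__ == "__main__":
    # Reads journal text from standard input and prints one delay per line.
    for ts, unit, secs in slow_loops(sys.stdin):
        print(f"{ts}  {unit:<14} {secs:8.3f} s")
```

It could be fed the journal of the boot that ended in the reset, for example with `journalctl -b -1 | python3 slow_loops.py` (the script name is hypothetical).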