I have a 3-node cluster running Proxmox. I upgraded it to 2.3 last weekend. Yesterday, when we came in, one of the VMs on one of the cluster nodes was hard locked. We rebooted it, and it came up with the dreaded "NTLOADER MISSING".
Our monitoring showed it failed about an hour after its backup started. The backup itself reported that it completed without any issues. We restored from a backup taken a day earlier, and the restored machine on that node was horribly slow (disk IO was maxing out at about 5 MB/s).
We finally scrapped it, and restored from a backup onto the main node, and it started running fine.
Today, I came in, and a Windows VM sitting on a different node wasn't responding. I connected to it, and there were hundreds of "delayed write failed" error messages sitting on the screen. I rebooted it, and while it did come back up (thank god!) it is running very slowly.
The only thing I can find in the logs is some entries like this around the time of the backup (this is in syslog):
Mar 21 00:30:48 node2 pvestatd[1895]: WARNING: command 'df -P -B 1 /mnt/pve/backup-server' failed: got timeout
All 3 nodes back up to the same NFS share (backup-server).
Not really sure where to go from here to debug the issue. For now I'm going to suspend backups.
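One thing I could try in order to narrow this down is timing the same df call that pvestatd runs, on each node, while a backup is in progress. Below is a rough sketch of that, not a tested diagnostic: the function name probe_mount and the 5-second timeout are my own choices, and only the df command and the mount path come from the syslog line above.

```python
import subprocess
import time


def probe_mount(path, timeout_s=5.0, cmd=None):
    """Time one df call against path and return (ok, elapsed_seconds, detail).

    cmd defaults to the df invocation seen in the pvestatd warning; it can be
    overridden so the function can be exercised without a real NFS mount.
    """
    if cmd is None:
        cmd = ["df", "-P", "-B", "1", path]
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Caveat: if df is stuck in uninterruptible sleep on a hard NFS
        # mount, killing it may itself block, so this call can still hang.
        return False, time.monotonic() - start, "timeout"
    except FileNotFoundError:
        return False, time.monotonic() - start, "command not found"
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return False, elapsed, result.stderr.strip()
    return True, elapsed, "ok"
```

Calling probe_mount("/mnt/pve/backup-server") every few seconds from each node and logging the elapsed time should show whether the NFS share stalls only during the backup window, and whether all three nodes see the stall at the same moment.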