We have a Proxmox set-up of three nodes sharing some 15 VMs. The nodes and VMs are behind an actual router.
This has been happily running for a couple of years now without much trouble, until recently. I wanted to update a box from Debian 9 to 10, and during the `apt upgrade` (which would download about 1 GB) I lost my remote connection. All the other VMs and nodes also became unreachable. After about 30 minutes the whole system came back online.
I tried the upgrade a day later with the exact same result. Then a few days later I wanted to download a backup of a VM to do some local tests, which was also quickly interrupted by a lost connection, again making all the VMs and nodes unreachable.
I've attached sections of the syslog for the second and third crash (from 22-03, ~19:04 and 30-03, ~18:43). In both cases the systems had been running fine for days.
The first actual error appears to be:
Mar 30 18:42:56 bismuth corosync[1318]: error [TOTEM ] FAILED TO RECEIVE
Mar 30 18:42:56 bismuth corosync[1318]: [TOTEM ] FAILED TO RECEIVE
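In case it helps anyone checking their own logs, here is a minimal sketch of how lines like the two above could be pulled out of a syslog excerpt. It assumes the classic syslog line layout shown above. The names `corosync_events` and `LINE_RE` are made up for illustration and are not part of Proxmox or corosync.

```python
import re

# Assumed line layout (taken from the two log lines quoted above):
#   Mar 30 18:42:56 bismuth corosync[1318]: error [TOTEM ] FAILED TO RECEIVE
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s"   # month, day, HH:MM:SS
    r"(?P<host>\S+)\s"                        # node name
    r"corosync\[(?P<pid>\d+)\]:\s"            # only corosync entries
    r"(?P<message>.*)$"                       # rest of the line
)


def corosync_events(lines, needle="FAILED TO RECEIVE"):
    """Return (timestamp, host, message) for corosync lines containing needle."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and needle in match.group("message"):
            events.append(
                (match.group("stamp"), match.group("host"), match.group("message"))
            )
    return events


# The two lines from my log, used here as sample input.
sample = [
    "Mar 30 18:42:56 bismuth corosync[1318]: error [TOTEM ] FAILED TO RECEIVE",
    "Mar 30 18:42:56 bismuth corosync[1318]: [TOTEM ] FAILED TO RECEIVE",
]

for stamp, host, message in corosync_events(sample):
    print(stamp, host, message)
```

To run it against a real log, the `sample` list could be replaced with the lines of the log file (on a default Debian install that is usually `/var/log/syslog`, but that path is an assumption about the setup).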
I'm at a loss as to what's happening here. I know how to use Proxmox, but I know little about the underlying mechanics.
This thread may be related: https://forum.proxmox.com/threads/new-cluster-totem-failed-to-receive-after-4mins.58935/. The error it describes is in my logs too.
We're using Proxmox VE 5.2-1. (I know, embarrassingly old. With limited physical access to the servers, we're having a hard time upgrading.)
I would much appreciate any thoughts on what the problem could be!