How to deal with unresponsive lxc and kvm guests in HA context

lifeboy

Renowned Member
We had an interesting situation this morning. For some reason one node in our cluster was not showing as active (the green "running" arrows on the guest icons in the tree) and none of the LXCs were responding. We managed to address the issue as quickly as possible by simply resetting the node, and everything came back up. (If this happens again, we will have to investigate the cause more closely.)

However, a number of those non-responsive containers are under HA management. Of course, if the node is taken down in an orderly fashion, they will migrate to other nodes, but in this case (as has happened in previous instances) the node was responding, but the guests were not. Is there a way in which we can tell HA to restart the LXCs (and VMs, for that matter) on another node if they are not responding for an extended (definable) period of time? Personally I've almost never had node failures, but I've had hanging guests for various reasons. Shouldn't HA be able to address that?

This is what it looks like, yet no services are actually down?

[Screenshot: CopyQ.LjQUGJ.png]
 
This morning it is another node that displays the exact same symptoms: Node B. It gets stuck trying to take a snapshot of an LXC and then all the LXCs become unresponsive.

I upgraded all the nodes yesterday to ensure that I've got all the latest patches, including PBS, so what could be causing this?
 
Regarding the question whether HA is able to detect 'hanging' containers: no, it's not. It monitors the basic state ('started' or 'stopped') and tries to keep the guests in that state.
Detecting 'hanging' would be non-trivial, because when does that happen? When SSH is not reachable? When the guest is fully CPU-loaded? Without knowing what's running inside, it is basically impossible to detect correctly.
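
If you need behaviour like that, it has to be scripted outside of HA with a check that fits your workload. A very rough sketch of the idea (container ID and node names are placeholders based on this thread; it assumes the container is HA-managed as ct:115 and that the script runs on another, healthy cluster node):

Code:
#!/bin/bash
# Rough external watchdog sketch - run from another cluster node, not from
# the node suspected of hanging. Placeholders: CT 115 on FT1-NodeD,
# FT1-NodeA as relocation target.
CT=115
SRC=FT1-NodeD
TARGET=FT1-NodeA        # placeholder target node
FAILS=0

while true; do
    # the timeout is the important part: in the situation described above,
    # 'pct status' does not fail, it simply never returns
    if timeout 15 ssh "root@$SRC" "pct status $CT" | grep -q running; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
    fi

    if [ "$FAILS" -ge 4 ]; then
        # ask the HA stack to stop the service and start it on the target node;
        # note: if the source node itself is stuck, the stop step may hang too
        # and fencing/rebooting that node is still needed
        ha-manager relocate "ct:$CT" "$TARGET"
        break
    fi
    sleep 60
done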

As for your hanging guests: situations like this happen most often when the underlying storage is not fast enough or has problems. I'd check whether that's happening in your case.
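
A few generic things to look at (standard tools, nothing specific to your setup):

Code:
pveperf /var/lib/vz                          # point it at a path on the affected storage; FSYNCS/SECOND is the interesting number
iostat -xm 5                                 # from the sysstat package; high 'await' / '%util' on the disks backing the guests is a bad sign
dmesg -T | grep -i 'blocked for more than'   # kernel hung-task messages usually name the process stuck on I/O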
 
Regarding the question whether HA is able to detect 'hanging' containers: no, it's not. It monitors the basic state ('started' or 'stopped') and tries to keep the guests in that state.
Detecting 'hanging' would be non-trivial, because when does that happen? When SSH is not reachable? When the guest is fully CPU-loaded? Without knowing what's running inside, it is basically impossible to detect correctly.
Regarding the hanging specifically: when the "pct status xxx" command times out, it's not what's inside the container, it's the system. Some of these LXCs run on NVMe storage, and for the rest of the system everything is running perfectly fine.
I still don't know why some nodes just look like they die: the QEMU machines and the node itself keep running, and there isn't even a heavy load, yet the LXCs just seem to freeze.

As for your hanging guests: situations like this happen most often when the underlying storage is not fast enough or has problems. I'd check whether that's happening in your case.
I think separate 25 Gb/s interconnect NICs and NVMe storage should be totally fine, and they have been to date. There are no Ceph health errors; nothing seems wrong on the surface.
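
(The Ceph side looks clean with the standard checks, i.e. along the lines of:)

Code:
ceph -s               # overall cluster state
ceph health detail    # anything hiding behind a bare HEALTH_OK / HEALTH_WARN
ceph osd perf         # per-OSD commit/apply latency, to spot a single slow device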
 
This is what it looks like, yet no services are actually down?
This screencap denotes that pveproxy is not functioning properly. That means there is a monitored device/service that is not responsive, not necessarily a CT or VM. It's just as likely that you have a storage defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.
 
This screencap denotes that pveproxy is not functioning properly. That means there is a monitored device/service that is not responsive, not necessarily a CT or VM. It's just as likely that you have a storage defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.
It seems that my remote backup server (PBS) may be the cause of this. I have now determined that the link to it is very slow (1 Mb/s) instead of the Gb/s it used to be, so I'm investigating that.

I find it peculiar though that not being able to do a fast backup can kill a whole node's running containers. It just doesn't sit right with me that this can happen.
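
For reference, this is roughly how the slow link can be confirmed from the PVE node (the address and storage ID are the ones from my PBS storage config; iperf3 needs a server running on the PBS side):

Code:
time curl -k -s -o /dev/null https://192.168.121.200:8007   # PBS API reachability and latency
pvesm status --storage PBS-one                              # how long the storage status query takes
iperf3 -c 192.168.121.200                                   # raw throughput; needs 'iperf3 -s' on the PBS host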
 
This screencap denotes that pveproxy is not functioning properly. That means there is a monitored device/service that is not responsive, not necessarily a CT or VM. It's just as likely that you have a storage defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.

The only one of these commands that shows a problem is this one:

Code:
root@FT1-NodeD:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-10-19 09:59:23 SAST; 4 days ago
    Process: 1797183 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
   Main PID: 2022 (pvestatd)
      Tasks: 2 (limit: 154191)
     Memory: 133.0M
        CPU: 9h 56min 43.558s
     CGroup: /system.slice/pvestatd.service
             ├─   2022 pvestatd
             └─2710545 lxc-info -n 115 -p

Oct 23 18:12:51 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:12:51 FT1-NodeD pvestatd[2022]: status update time (8.139 seconds)
Oct 23 18:13:00 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:02 FT1-NodeD pvestatd[2022]: status update time (8.525 seconds)
Oct 23 18:13:10 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:11 FT1-NodeD pvestatd[2022]: status update time (8.252 seconds)
Oct 23 18:13:21 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:21 FT1-NodeD pvestatd[2022]: status update time (8.223 seconds)
Oct 23 18:13:31 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:32 FT1-NodeD pvestatd[2022]: status update time (8.352 seconds)

When I restart pvestatd, the node shows green status again, but the LXCs don't respond. The only solution then is to reboot the node.
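
For what it's worth, the stuck lxc-info child visible in the CGroup above can at least be confirmed to be hanging in uninterruptible sleep (generic commands, nothing setup-specific):

Code:
ps -o pid,stat,wchan:32,cmd -C lxc-info      # 'D' in the STAT column = uninterruptible sleep, i.e. stuck in the kernel
ps -eo pid,stat,cmd | awk '$2 ~ /^D/'        # any other tasks stuck in D-state
dmesg -T | grep -i 'blocked for more than'   # matching hung-task messages, if the kernel logged any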
 
Yep.

Remove it from your main clusters and your problems will stop. You can re-add it once you've figured out why it's unreachable.
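
For example (assuming the storage ID PBS-one from the log; disabling keeps the config, removing drops it, and neither touches the backups stored on the PBS side):

Code:
pvesm set PBS-one --disable 1    # stop pvestatd/pveproxy from polling the unreachable PBS storage
# ...and once the link is fixed again:
pvesm set PBS-one --disable 0
# or drop the storage definition entirely instead:
# pvesm remove PBS-one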
Surely this is a bug and can't be by design. If the main service goes down because the place one backs up to cannot be reached consistently, the backup service should balk, but the main service must remain stable.

I'll open a bug report for this.
 
