How to deal with unresponsive lxc and kvm guests in HA context

lifeboy · Oct 18, 2022

We had an interesting situation this morning. For some reason one node in our cluster was not showing as active (the green "running" arrows on the guest icons in the tree were missing) and none of the LXCs were responding. We addressed the issue as quickly as possible by simply resetting the node, and everything came back up. (If this happens again, we will have to investigate the cause more closely.)

However, a number of those non-responsive containers are under HA management. Of course, if the node is taken down in an orderly fashion, they will migrate to other nodes, but in this case (as has happened in previous instances), the node was responding but the guests were not. Is there a way to tell HA to restart the LXCs (and VMs, for that matter) on another node if they have not responded for an extended (definable) period of time? Personally I've almost never had node failures, but I've had hanging guests for various reasons. HA should be able to address that, shouldn't it?

This is what it looks like, but no services are not running?

lifeboy · Oct 19, 2022

Updated original post with image of situation.

lifeboy · Oct 19, 2022

This same thing has happened two days in a row now... :-(

lifeboy · Oct 20, 2022

This morning another node, Node B, displays the exact same symptoms! It gets stuck trying to take a snapshot of an LXC and then all the LXCs become unresponsive.

I upgraded all the nodes yesterday to ensure that I've got all the latest patches, including PBS, so what could be causing this?

dcsapak · Oct 20, 2022

regarding the question if ha is able to detect 'hanging' containers. no it's not, it monitors the basic state 'started' or 'stopped' and tries to keep the guests in that state
detecting 'hanging' would be non trivial, because when does that happen, when ssh is not reachable? when it's fully cpu loaded? without knowing what's running inside, it is basically impossible to detect correctly

as for your hanging, situations like this happen most often when the underlying storage is not fast enough or has problems, i'd check if that's happening in your case
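One generic way to check for stalled storage is to look for processes stuck in uninterruptible sleep (state "D"), which usually means they are waiting on I/O. The following is a minimal sketch, assuming the procps `ps` that ships with Debian; `list_blocked` is a hypothetical helper name, not a Proxmox tool.

```shell
#!/usr/bin/env bash
# list_blocked (hypothetical helper): read `ps` output on stdin and keep only
# the rows whose STAT column starts with "D" (uninterruptible sleep).
# The header row (first line) is skipped.
list_blocked() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Intended use on a node (not run here):
#   ps -eo pid,stat,wchan:32,args | list_blocked
```

If guest-related processes stay in this list for a long time, the WCHAN column may hint at which kernel path they are waiting in. An empty list does not prove the storage is healthy.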

lifeboy · Oct 20, 2022

dcsapak said:
regarding the question if ha is able to detect 'hanging' containers. no it's not, it monitors the basic state 'started' or 'stopped' and tries to keep the guests in that state
detecting 'hanging' would be non trivial, because when does that happen, when ssh is not reachable? when it's fully cpu loaded? without knowing what's running inside, it is basically impossible to detect correctly

Regarding the hanging specifically: when the "pct status xxx" command times out, the problem is not what's inside the container, it's the system. Some of these LXCs run on NVMe storage, and for the rest of the system everything is running perfectly fine.
I still don't know why some nodes just look like they die. The QEMU machines and the node keep running and the node doesn't even have a heavy load, yet the LXCs just seem to freeze.
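The "pct status times out" symptom can be turned into a simple external probe. This is only a sketch of the detection half, assuming GNU coreutils `timeout`; `probe_guest` is a hypothetical name, and nothing here hooks into the HA manager, which (as stated above) only tracks started/stopped.

```shell
#!/usr/bin/env bash
# probe_guest (hypothetical helper): run a status command under a time limit
# and classify the outcome.
# Usage: probe_guest <seconds> <command> [args...]
# Prints "ok", "hung" (time limit hit) or "error" (command failed by itself).
probe_guest() {
    local limit="$1"
    shift
    # -k 2: follow up with SIGKILL 2 seconds after SIGTERM if still running
    timeout -k 2 "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    case "$rc" in
        0)       echo "ok" ;;
        124|137) echo "hung" ;;   # 124 = timed out, 137 = had to be SIGKILLed
        *)       echo "error" ;;
    esac
}

# Intended use on a node (not run here), with 115 as an example CT ID:
#   probe_guest 10 pct status 115
```

One caveat: a process blocked in uninterruptible I/O cannot be killed even with SIGKILL, so in that case the probe itself may take longer than the limit to return.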

dcsapak said:
as for your hanging, situations like this happen most often when the underlying storage is not fast enough or has problems, i'd check if that's happening in your case

I think separate 25Gb/s interconnect NICs and NVMe storage should be totally fine, and they have been to date. There are no Ceph health errors, and nothing seems wrong on the surface.

alexskysilk · Oct 20, 2022

lifeboy said:
This is what it looks like, but no services are not running?

This screencap denotes that pve-proxy is not functioning properly. That means that there is a monitored device/service that is not responsive, not necessarily a ct or vm. It's just as likely you have a store defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.
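For the dmesg step, one clue to look for is the kernel's hung-task warning, which (when that detector is enabled) has the form "INFO: task NAME:PID blocked for more than N seconds." A rough sketch of a filter for it follows; `hung_tasks` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# hung_tasks (hypothetical helper): read kernel log text on stdin and print
# "name pid seconds" for every hung-task warning found.
hung_tasks() {
    sed -nE 's/.*INFO: task ([^ ]+):([0-9]+) blocked for more than ([0-9]+) seconds.*/\1 \2 \3/p'
}

# Intended use on a node (not run here):
#   dmesg | hung_tasks
```

No output only means the kernel logged no such warning; it does not rule out a hang.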

lifeboy · Oct 22, 2022

alexskysilk said:
This screencap denotes that pve-proxy is not functioning properly. That means that there is a monitored device/service that is not responsive, not necessarily a ct or vm. It's just as likely you have a store defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.

It seems that my remote backup server (PBS) may be the cause of this. I have now determined that the link to it is very slow (1Mb/s) instead of the Gb/s it used to be, so I'm investigating that.

I find it peculiar though that not being able to do a fast backup can kill a whole node's running containers. It just doesn't sit right with me that this can happen.

lifeboy · Oct 24, 2022

alexskysilk said:
This screencap denotes that pve-proxy is not functioning properly. That means that there is a monitored device/service that is not responsive, not necessarily a ct or vm. It's just as likely you have a store defined that is not responding. So, what to check:

systemctl status pveproxy
systemctl status pvestatd
dmesg

you should see clues as to what is hanging.

The only one of these commands that shows a problem is this one:

Code:

root@FT1-NodeD:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-10-19 09:59:23 SAST; 4 days ago
    Process: 1797183 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
   Main PID: 2022 (pvestatd)
      Tasks: 2 (limit: 154191)
     Memory: 133.0M
        CPU: 9h 56min 43.558s
     CGroup: /system.slice/pvestatd.service
             ├─   2022 pvestatd
             └─2710545 lxc-info -n 115 -p

Oct 23 18:12:51 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:12:51 FT1-NodeD pvestatd[2022]: status update time (8.139 seconds)
Oct 23 18:13:00 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:02 FT1-NodeD pvestatd[2022]: status update time (8.525 seconds)
Oct 23 18:13:10 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:11 FT1-NodeD pvestatd[2022]: status update time (8.252 seconds)
Oct 23 18:13:21 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:21 FT1-NodeD pvestatd[2022]: status update time (8.223 seconds)
Oct 23 18:13:31 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007
Oct 23 18:13:32 FT1-NodeD pvestatd[2022]: status update time (8.352 seconds)

When I restart pvestatd, the node shows green status again, but the LXCs don't respond. The only solution then is to reboot the node.
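To see at a glance which storage pvestatd keeps failing on, the journal lines can be summarised per storage name. This is a small sketch that relies only on the line format shown in the output above; `storage_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# storage_errors (hypothetical helper): read pvestatd log lines on stdin and
# print "<storage> <count>" for every storage that logged
# "error fetching datastores".
storage_errors() {
    awk '
        /error fetching datastores/ {
            # The storage name is the field just before "error", with a
            # trailing colon, e.g. "PBS-one:".
            for (i = 2; i <= NF; i++) {
                if ($i == "error") {
                    name = $(i - 1)
                    sub(/:$/, "", name)
                    count[name]++
                    break
                }
            }
        }
        END { for (name in count) print name, count[name] }
    '
}

# Intended use on a node (not run here):
#   journalctl -u pvestatd --since "1 hour ago" | storage_errors
```

Fed the ten log lines above, it prints `PBS-one 5`, i.e. five failed connection attempts in roughly 40 seconds.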

alexskysilk · Oct 24, 2022

lifeboy said:
It seems that my remote backup server (PBS) may be the cause of this

lifeboy said:
Oct 23 18:12:51 FT1-NodeD pvestatd[2022]: PBS-one: error fetching datastores - 500 Can't connect to 192.168.121.200:8007

Yep.

Remove it from your main clusters and your problems will stop. You can re-add it once you've figured out why it's unreachable.
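Before re-adding the storage, a quick TCP check against the address from the log (192.168.121.200, port 8007) shows whether the PBS API port answers at all. This is a minimal sketch, assuming bash with `/dev/tcp` support and GNU `timeout`; `port_reachable` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# port_reachable (hypothetical helper): return 0 if a TCP connection to
# <host>:<port> can be opened within <seconds> (default 3), non-zero otherwise.
port_reachable() {
    local host="$1" port="$2" limit="${3:-3}"
    # Inside the child bash, $0 is the host and $1 is the port.
    timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Intended use on a node (not run here):
#   if port_reachable 192.168.121.200 8007 5; then echo "PBS port answers"; fi
#
# As a less drastic alternative to removing the storage entry, pvesm can mark
# it as disabled (storage ID taken from the log above):
#   pvesm set PBS-one --disable 1
```

A successful connect only shows that the port is open. It says nothing about throughput, so a link that has dropped to 1Mb/s would still pass this check.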

lifeboy · Oct 24, 2022

alexskysilk said:
Yep.

Remove it from your main clusters and your problems will stop. You can re-add it once you've figured out why it's unreachable.

Surely this is a bug and can't be by design. If the main service goes down because the place one backs up to cannot be reached consistently, the backup service should balk, but the main service must remain stable.

I'll open a bug report for this.

alexskysilk · Oct 25, 2022

lifeboy said:
Surely this is a bug and can't be by design.

brother if only.... yeah open a bug report, and I really wish you luck.
