[SOLVED] restarting lxcfs kicks running containers in the balls

grin

Active Member
Dec 8, 2008
How to restart lxcfs, or, rather, how to resurrect /proc on running containers?

If lxcfs gets restarted for one reason or another, all the CTs choke:

Code:
# ps awuxf
Error: /proc must be mounted
  To mount /proc at boot you need an /etc/fstab line like:
      proc   /proc   proc    defaults
  In the meantime, run "mount proc /proc -t proc"

Is there any way
  1. to keep them from losing /proc on an lxcfs restart, and
  2. to give /proc back to them when they have lost it anyway?
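For (2), re-running the mount suggested by the error message itself, from the host, is a hedged sketch of a partial fix (the `pct list` filtering is my assumption, and files that lxcfs itself served presumably stay broken until a real container restart):

```shell
# Hedged sketch: apply the error message's suggested mount inside each
# running CT from the host. Plain /proc entries come back; lxcfs-provided
# files (cpuinfo, meminfo, uptime, ...) presumably stay broken until the
# container is restarted.
for ctid in $(pct list | awk 'NR>1 && $2 == "running" {print $1}'); do
    pct exec "$ctid" -- mount -t proc proc /proc
done
```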
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
that's why it gets reloaded on upgrades and not restarted ;) I think you'll need to restart those containers, and not restart lxcfs in the future (it has a live-reloading mechanism built-in for that reason).
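The reload-vs-restart distinction in practice, as a sketch (assuming the stock lxcfs systemd unit, which wires reload to the daemon's built-in live-reload mechanism):

```shell
# Safe: live-reload keeps the FUSE mount (and thus every running
# container's /proc and /sys views) alive.
systemctl reload lxcfs

# Destructive: tears down the FUSE mount; all running CTs lose /proc
# until they are restarted.
#systemctl restart lxcfs
```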
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
what exactly should we mention? "don't restart random services, things might break"? we already ensure things get restarted or reloaded on upgrades, depending on which is the appropriate solution. there is no need for manual action unless instructed to do so (e.g., after manually applying some test/debug patch, or a guest restart when needed to enable new features).
 

grin

Active Member
Dec 8, 2008
There are numerous times when PVE chokes: the UI becomes gray, all tasks get stuck, the system sits in iowait, becomes unresponsive, catches fire and explodes.

Most of the time it requires some services to be restarted, usually after the culprit is found and eradicated, be that STOP tasks that never stop but put the CT in iowait, rbd that was screwed up by outside factors (in the last month usually pbs), or various network-related problems which should not have caused an issue but did anyway.

Most often we repeatedly have to restart pvedaemon and pveproxy, but sometimes that isn't enough, and corosync needs to be restarted, along with various other daemons we have become familiar with in due course. Since not all of them are documented in detail, we sometimes have to make [more or less] educated guesses about which one may cause (or resolve) the problem. If you call this "random", then that's what is sometimes needed to get stuff fixed. Most of the time there isn't enough data to report the issue; that's why you don't see me complaining monthly. Nevertheless it's not really fair to say the services were "randomly" restarted: lxcfs was restarted when the "usual" restarts did not solve the "stuck CT" problem, where the container can't be stopped by any means (not even kill -9 or the direct lxc tools), and it is one of the "last resort" acts. (Or it was, since, as it turns out, it would never help.)
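For the record, the "usual" restarts look roughly like this (a sketch using only the service names mentioned above; order and necessity vary case by case):

```shell
# First attempt when the UI goes gray and tasks hang:
systemctl restart pvedaemon pveproxy

# Sometimes that is not enough and the cluster stack needs a kick too
# (last resort, briefly affects cluster communication):
#systemctl restart corosync
```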

If the documentation had mentioned "never restart this daemon: it will never fix any problem, but all CTs will lose /proc until they are restarted", then obviously I would not have tried to fix the CT problem by restarting the (pretty related) lxcfs daemon. I believe restarting it sounds like a reasonable action considering the problem and the available information.

The main cause behind these problems is that CT online migration does not work; if it did, I would just migrate the containers away and restart the node. But since that's not possible, it's not easy to actually shut everything down, and some of these containers don't start that fast either. I haven't seen any progress on that in the last years: neither real online migration, nor something mixed with suspend or hibernate modes. A shutdown is usually not preferred.

That is the longer explanation.
 
