LXC using NFS crashing if NFS server goes offline

rossd

I'm running OpenMediaVault in a VM. Something is going wrong with it that I still need to figure out, but it occasionally crashes and needs rebooting.

On the same host I'm running an LXC container (running Plex) which uses FUSE and NFS to mount a drive from the OpenMediaVault server.

The issue I have is that if OpenMediaVault goes down, the LXC container totally dies. I've tried using pct to stop the container, but it just won't stop. It also results in all of the containers on the server showing gray question marks (I suspect that because it can't get stats from the LXC container that died, it doesn't report any).

Restarting the server just hangs (I presume while it's waiting for the LXC container to shut down), so in the end I have to pull the power.

Is there a way I can diagnose this, or is it just a known issue?
 
If you've absolutely tried everything (pct stop, unlock, etc.), then you could do the following:

Code:
ps ax | grep lxc
Then identify the PID (process number) associated with your CTID, and issue
Code:
kill PID
replacing PID with the actual process number.
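
Alternatively, on a Proxmox host you can usually get the container's init PID directly with lxc-info (a sketch, assuming CTID 112):

Code:
lxc-info -n 112 -p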
 
I had to use kill -9 PID, but this did work. The server came fully back up when I did this, and I noticed that a backup started almost immediately when I killed that container.

I've tried to restart the container, but I just get "Error: Startup for container 112 failed" now, so I guess I probably do need to bounce the server.

Looking through the syslogs, I see a lot of this at the time the VM went down:

Code:
Mar 11 20:36:14 pve0 pvestatd[1279]: VM 103 qmp command failed - VM 103 qmp command 'query-proxmox-support' failed - unable to connect to VM 103 qmp socket - timeout after 51 retries
Mar 11 20:36:15 pve0 pvestatd[1279]: status update time (8.185 seconds)
Mar 11 20:36:17 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out
Mar 11 20:36:21 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out
Mar 11 20:36:22 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out

Then I see this once I managed to reboot the VM:

Code:
Mar 11 21:26:08 pve0 kernel: nfs: server 192.168.86.59 OK
Mar 11 21:26:08 pve0 kernel: nfs: server 192.168.86.59 OK

Thanks for the tips so far.
 
This is the output of debugging the startup:

Code:
root@pve0:~# pct start 112 --debug
run_buffer: 322 Script exited with status 16
lxc_init: 844 Failed to run lxc.hook.pre-start for container "112"
__lxc_start: 2027 Failed to initialize container "112"
t-hook" for container "112", config section "lxc"
DEBUG    conf - ../src/lxc/conf.c:run_buffer:311 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 112 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/lxc/112/ns/system.slice/plexmediaserver.service': Device or resource busy

ERROR    conf - ../src/lxc/conf.c:run_buffer:322 - Script exited with status 16
ERROR    start - ../src/lxc/start.c:lxc_init:844 - Failed to run lxc.hook.pre-start for container "112"
ERROR    start - ../src/lxc/start.c:__lxc_start:2027 - Failed to initialize container "112"
INFO     conf - ../src/lxc/conf.c:run_script_argv:338 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "112", config section "lxc"
startup for container '112' failed
 
I had to use kill -9 PID
This sends "SIGKILL (9) - Kill signal" as opposed to the default kill "SIGTERM (15) - Termination signal". It's radical, but it should work.
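
For the "Device or resource busy" error in your debug output, one thing worth trying before a full reboot is to check whether any processes are still pinned to the leftover cgroup (a sketch, using the exact path from your log; cgroup.procs is the standard cgroup v2 process list, and note a process stuck in "D" state won't die even to kill -9):

Code:
# list any PIDs still attached to the stale cgroup
cat /sys/fs/cgroup/lxc/112/ns/system.slice/plexmediaserver.service/cgroup.procs
# kill them (replace PID), then remove the directory and retry pct start 112
kill -9 PID
rmdir /sys/fs/cgroup/lxc/112/ns/system.slice/plexmediaserver.service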

You now need to work out a strategy for managing the LXC (112, Plex?) container gracefully, even if the VM is down.

But you still have the problem of the crashing VM (103, OMV?).

If I were you, I'd start by getting the VM into a working state first.
To do this, start by disabling the LXC (112, Plex?) from running on startup (see the sketch below).
Reboot (full power down/up) your Proxmox server, and start working on your VM to make it stable and functioning.
Only after you have a stable and functioning VM should you start work on the LXC.
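
Disabling autostart can be done from the GUI (container Options -> Start at boot) or from the CLI; a minimal sketch, assuming CTID 112:

Code:
pct set 112 --onboot 0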

I don't know anything about your HW, VMs, or LXCs. But it's out there on Google; you just need to sift through it.
Good luck.
 
My gut instinct is saying this is what happened - does this ring true?

Plex was transcoding some files when the NFS share went down. It still had files locked, or something like that.

When I brought the VM back up, Plex picked back up, but it still had files locked.

At 2 AM, when the server tried to back up the LXC container, it was really locked up by the NFS mounts.

A lot of this I'm guessing from other threads I've seen about NFS shares inside LXC containers, and backups. For now I'll turn backups off on this container and see if that helps.
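
If the 2 AM job is a plain vzdump --all run, vzdump also has an --exclude option that skips listed CTIDs; a sketch (the mode and storage here are assumptions):

Code:
vzdump --all 1 --exclude 112 --mode snapshot --storage local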
 
Everything is possible.
But you still haven't dealt with why the share went/goes down.

In my personal setup, I only back up LXCs when they are down; VMs even when they are up, and occasionally I do a backup with a VM down.
That's just my personal way of doing it. It works, so as the saying goes: if it isn't broken, don't fix it!
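
For what it's worth, a one-off stopped backup of a container looks roughly like this (a sketch; the CTID and storage name are assumptions):

Code:
vzdump 112 --mode stop --storage local --compress zstd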
 
The default for NFS is a "hard" mount, which means that if the server goes away, any process that tries to read or write the share will hang until it comes back online. Killing the offending process doesn't help, because it will be stuck in an uninterruptible kernel wait (state "D" in ps output).
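
You can usually confirm this by looking for processes stuck in "D" state; a sketch (the wchan column will often show an NFS-related wait):

Code:
ps axo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'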

You could use a "soft" mount instead, which will error out after a while. The risk with that is data loss if the application does not handle errors correctly; that is actually pretty common, so the risk is real. If this is a read-only mount, there's less risk. See "man nfs" for more information.
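
A minimal sketch of what a soft mount could look like in /etc/fstab (the export path and mount point are assumptions; timeo is in tenths of a second, so each retry here waits about 15 seconds before the client gives up after retrans attempts):

Code:
192.168.86.59:/export/media  /mnt/media  nfs  soft,timeo=150,retrans=3  0  0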

In any case, it is never a good idea to have your remote filesystems on unreliable servers.
 
I'm thinking of moving the NAS onto some dedicated hardware to try to figure out what's going on with it. For now I've added a serial terminal connection so I can try to connect to it when it dies and see what's going on. I've not had any good logs yet, so I don't have much to go on.

For now I've moved Plex onto another node which doesn't back it up. Making the mount soft fixed the issue with that. Now I can just concentrate on figuring out what's up with my NAS!

Many thanks all :)
 
Following... I have the same issue, but I have no time right now to describe my exact situation... will do it later.

What I can tell you is that the culprit may not be the OMV VM going down, since I have your exact same problem (host crashing, question marks in the GUI, etc.) even though my NAS (bare metal, LAN-attached) is running perfectly fine. The problem arises roughly 1/4 of the times I try to stop a Docker LXC from anywhere (GUI, pct shutdown, etc.) other than a shell inside the container itself (via pct enter <ID>). That is the only reliable way of shutting down the container without risking crashing the whole server.

As soon as I fail to shut down the LXC, my host's dmesg gets flooded with

Code:
nfs: server <Truenas IP> not responding, timed out

even though my NAS is perfectly up and running!

So I suspect it has something to do with how LXCs handle unmounting shares.
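
If it helps anyone reproduce this, the flood is easy to catch live with a following dmesg (a sketch):

Code:
dmesg -wT | grep -i 'nfs.*not responding'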
 
