LXC using NFS crashing if NFS server goes offline

rossd

New Member
Nov 16, 2023
I'm running OpenMediaVault in a VM. Something is going wrong with it that I need to figure out - it occasionally crashes and needs rebooting.

On the same host I'm running an LXC container (running Plex) which uses FUSE and NFS to mount a drive from the OpenMediaVault server.

The issue I have is that if OpenMediaVault goes down, the LXC container totally dies - I've tried using pct to stop the container, but it just won't stop. It also results in all of the containers on the server showing gray question marks (I suspect that because it can't get stats from the LXC container that has died, it doesn't report any).

Restarting the server just hangs (I presume while it's waiting for the LXC container to shut down), so in the end I have to pull the power.

Is there a way I can diagnose this, or is it just a known issue?
 
If you've absolutely tried everything (pct stop, unlock, etc.), then you could do the following:

Code:
ps ax | grep lxc
Then identify the PID (process number) associated with your CTID, and issue
Code:
kill PID
replacing PID with the actual process number
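
Alternatively, if the LXC userspace tools are installed on the host (they normally are on a PVE node), you can print the container's init PID directly:
Code:
lxc-info -n CTID -p
again replacing CTID with your actual container ID.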
 
I had to use kill -9 PID, but this did work. The server came fully back up when I did this, and I noticed that a backup started almost immediately after I killed that container.

I've tried to restart the container, but I just get "Error: Startup for container 112 failed" now, so I guess I probably do need to bounce the server.

Looking through syslogs I see a lot of this at the time the VM went down:

Code:
Mar 11 20:36:14 pve0 pvestatd[1279]: VM 103 qmp command failed - VM 103 qmp command 'query-proxmox-support' failed - unable to connect to VM 103 qmp socket - timeout after 51 retries
Mar 11 20:36:15 pve0 pvestatd[1279]: status update time (8.185 seconds)
Mar 11 20:36:17 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out
Mar 11 20:36:21 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out
Mar 11 20:36:22 pve0 kernel: nfs: server 192.168.86.59 not responding, timed out

Then I see this once I managed to reboot the VM:

Code:
Mar 11 21:26:08 pve0 kernel: nfs: server 192.168.86.59 OK
Mar 11 21:26:08 pve0 kernel: nfs: server 192.168.86.59 OK

Thanks for the tips so far.
 
This is the output of debugging the startup:

Code:
root@pve0:~# pct start 112 --debug
run_buffer: 322 Script exited with status 16
lxc_init: 844 Failed to run lxc.hook.pre-start for container "112"
__lxc_start: 2027 Failed to initialize container "112"
t-hook" for container "112", config section "lxc"
DEBUG    conf - ../src/lxc/conf.c:run_buffer:311 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 112 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/lxc/112/ns/system.slice/plexmediaserver.service': Device or resource busy

ERROR    conf - ../src/lxc/conf.c:run_buffer:322 - Script exited with status 16
ERROR    start - ../src/lxc/start.c:lxc_init:844 - Failed to run lxc.hook.pre-start for container "112"
ERROR    start - ../src/lxc/start.c:__lxc_start:2027 - Failed to initialize container "112"
INFO     conf - ../src/lxc/conf.c:run_script_argv:338 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "112", config section "lxc"
startup for container '112' failed
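
The "Device or resource busy" on that cgroup directory usually means some processes from the old container instance are still attached to it, most likely stuck waiting on the dead NFS mount. You can check which PIDs are still in the cgroup (path taken from the error above):
Code:
cat /sys/fs/cgroup/lxc/112/ns/system.slice/plexmediaserver.service/cgroup.procs
Until those processes are gone (or the host is rebooted), the pre-start hook cannot remove the directory and the container won't start.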
 
I had to use kill -9 PID
This sends SIGKILL (9, the kill signal) as opposed to the default SIGTERM (15, the termination signal). It's radical, but it should work.

You now need to work out a strategy for managing the LXC (112, Plex?) container gracefully, even if the VM is down.

But you still have the problem of the crashing VM (103, OMV?).

If I were you, I'd start by getting the VM into a working state first.
To do this, start by disabling the LXC (112, Plex?) from running on startup (see the example below).
Reboot (full power down/up) your Proxmox server and start working on your VM to make it stable and functioning.
Only after you have a stable and functioning VM should you start work on the LXC.
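
A minimal sketch, assuming the CTID really is 112 - the autostart flag can be cleared from the host shell with:
Code:
pct set 112 --onboot 0
and set back with --onboot 1 once the VM is stable again.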

I don't know anything about your HW, VMs or LXCs, but the information is out there on Google; you just need to sift through it.
Good luck.
 
My gut instinct is saying this is what happened - does this ring true?

Plex was transcoding some files when the NFS share went down, and it still had files locked, or something like that.

When I brought the VM back up, Plex picked back up, but it still had files locked.

At 2 am, when the server tried to back up the LXC container, it was really locked up by the NFS mounts.

A lot of this I'm guessing from other threads I've seen about NFS shares inside LXC containers, and backups. For now I'll turn backups off on this container and see if that helps.
 
Everything is possible.
You still haven't dealt with why the share went/goes down.

In my personal setup, I only back up LXCs when they are down. VMs I back up even when they are up, and occasionally I do a backup with a VM down.
That's just my personal way of doing it. It works, so as the saying goes: if it isn't broken, don't fix it!
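
For reference, a one-off stopped-mode backup of a single CT can be run like this (assuming CT 112; in stop mode vzdump shuts the container down, backs it up, and starts it again if it was running):
Code:
vzdump 112 --mode stop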
 
The default for NFS is a "hard" mount, which means that if the server goes away, any process that tries to read or write the share will hang until it comes back online. Killing the offending process doesn't help because it will be stuck in an uninterruptible kernel wait (state "D" in ps output).
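
You can spot such processes on the host; anything stuck on the dead NFS mount will show D as the first letter of its STAT column, for example:
Code:
ps axo pid,stat,comm | awk '$2 ~ /^D/'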

You could use a "soft" mount instead, which will error out after a while. The risk with that is data loss in case the application does not handle errors correctly. That is actually pretty common, so the risk is real. If this is a read-only mount then there's less risk. See "man nfs" for more information.
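
As a rough sketch of what that could look like in /etc/fstab (the export path and mountpoint here are made up; the server IP is taken from your logs):
Code:
# soft: give up after retrans retries of timeo deciseconds each, instead of hanging forever
192.168.86.59:/export/media  /mnt/media  nfs  soft,timeo=150,retrans=3  0  0
Adding ro as well reduces the risk further if the application only ever reads from the share.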

In any case it is never a good idea to have your remote filesystems on unreliable servers.
 
I'm thinking of moving the NAS onto some dedicated hardware to try and figure out what's going on with it. For now I've added a serial terminal connection so I can try to connect to it when it dies and see what's happening. I've not had any good logs yet, so I don't have much to go on.

For now I've moved Plex onto another node which doesn't back it up, and making the mount soft fixed the issue with that. Now I can just concentrate on figuring out what's up with my NAS!

Many thanks all :)
 
