[SOLVED] PVE suddenly stopped working, all CTs unreachable

Typical, now I got a hang just because I said that... But I didn't restart the containers yesterday, so that might have been it.

EDIT: The last thing I saw in dmesg before the reboot was AppArmor denying some read; I'll try to catch it next time.
 

Please collect the debug output if you experience any hangs or issues with the new version!
 
Same issue here... just applied the new packages and will report back. Just as a side note, one effect I see when this occurs is that there are PIDs going wild inside the LXCs, e.g. cron jobs never finish (and pile up more and more):

Code:
root@proxmox:~# lxc-exec-all "pgrep cron"
processing container #100 (owncloud)
981
11739
11762
11842
11870
11943
11945
11974
11981
12006
12010
12035
12043
12068
12073
12098
12106
12131
12136
12161
12169
12194
12199
12224
12232
12257
processing container #101 (www)
1013
9941
10059
10158
10258
10398
10698
10809
10910
11035
11133
11234
11373
processing container #102 (database)
1015
1695
1720
1740
1763
1783
1806
1826
1849
1869
1892
1912
1935
processing container #103 (jabber)
987
1281
...

When everything is running fine, I only see one PID per container.
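
"lxc-exec-all" is not a stock PVE command; for anyone who wants to run the same check, a minimal sketch of such a wrapper using the standard "pct" tool could look like this (the script name, awk field positions and invocation style are my assumptions, not the poster's actual script):

Code:
#!/bin/sh
# Minimal sketch of an "lxc-exec-all" wrapper (not a stock PVE tool).
# Runs the given command string inside every running container via "pct exec".
# Usage: lxc-exec-all "pgrep cron"
for id in $(pct list | awk 'NR > 1 && $2 == "running" { print $1 }'); do
    name=$(pct config "$id" | awk '/^hostname:/ { print $2 }')
    echo "processing container #$id ($name)"
    pct exec "$id" -- sh -c "$*"
done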
 
I suspected something in the daily cron run, since it happened in the morning, but couldn't find anything obvious.
 
That is most probably a symptom, and not the cause. I guess your cron job is accessing files provided via lxcfs, and if the lxcfs process hangs, the cron job hangs in the kernel forever. After whatever interval you configured in your crontab, the cron job starts again and hangs again, and so on.
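
A hedged way to check this from the host (the mountpoint path assumes the default /var/lib/lxcfs/): look for processes stuck in uninterruptible sleep together with their kernel wait channel, and test whether lxcfs still answers a simple read.

Code:
# processes blocked in uninterruptible sleep ("D" state) with their kernel wait channel
ps -eo pid,stat,wchan:24,cmd | awk '$2 ~ /D/'
# a hung lxcfs usually also blocks a plain read below its mountpoint
# (if this command itself hangs, that is already a strong hint)
timeout -k 10 5 cat /var/lib/lxcfs/proc/uptime || echo "lxcfs appears to be hung"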
 
Hi, the error appeared again in my lab environment three days ago.

I've noticed multiple lxcfs processes. After I killed them and started lxcfs again, all problems were gone.
 

Attachments

  • ps-axjf-output.txt (582.2 KB)
  • lxc-panic.txt (17.6 KB)

Please provide the debugging output requested in this thread, otherwise it is impossible to tell why lxcfs was hanging.

Before the issue occurs:
  • update to current version ("apt-get update; apt-get dist-upgrade")
  • "apt-get install gdb lxcfs-dbg"
When the issue occurs:
  • save complete output of "ps faxl" on the host (it is important to have the "l" option here, otherwise the kernel wait channel is not displayed!)
  • get a gdb backtrace of the hanging / all lxcfs processes ("attach <PID>", "bt", "detach" inside gdb)
  • post the above two, the output of "pveversion -v" and the config of the affected containers ("pct config <ID>") here (see the sketch of a full session below)
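
To make the steps above concrete, here is a hedged sketch of a complete collection session; the PID, container ID and output paths are placeholders:

Code:
# before the issue: update and install debugging tools
apt-get update; apt-get dist-upgrade
apt-get install gdb lxcfs-dbg

# when the hang occurs: process list including kernel wait channels
ps faxl > /root/ps-faxl.txt

# backtrace of a hanging lxcfs process (repeat for every lxcfs PID)
gdb
(gdb) attach 2474
(gdb) bt
(gdb) detach
(gdb) quit

# version information and the config of an affected container
pveversion -v > /root/pveversion.txt
pct config 100 > /root/pct-config-100.txt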
 
I got one more hang this morning. CPU load avg > 1200 (never seen it before).

There were 3 lxcfs processes on the server. One was eating huge amounts of memory.
It was not possible to debug. See attached session:

Code:
~# ps auxw|grep lxcfs
root      2067  0.0  0.0 751588  1412 ?        S    03:49   0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root      2068  0.0  0.0 751720  1412 ?        S    03:49   0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root      2474  0.3  0.1 28841028 45420 ?      Ssl  Mar14  64:32 /usr/bin/lxcfs /var/lib/lxcfs/
~# uptime
09:40:47 up 13 days, 14:39,  2 users,  load average: 1221.00, 1172.77, 1120.05


(gdb) attach 2474
Attaching to process 2474
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb)
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x8831ad00:
#0  0x84ff3050 in ?? ()
Cannot access memory at address 0x8831ad00
(gdb) attach 2067
Attaching to process 2067
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7d7f9750:
#0  0x84ff44c9 in ?? ()
Cannot access memory at address 0x7d7f9750

[1177216.490435] Memory cgroup out of memory: Kill process 9637 (mysqld) score 166 or sacrifice child
[1177216.490485] Killed process 9637 (mysqld) total-vm:2311652kB, anon-rss:324280kB, file-rss:0kB
[1177317.132890] systemd-journald[289]: /dev/kmsg buffer overrun, some messages lost.
[1177317.472785] do_general_protection: 1 callbacks suppressed
[1177317.472790] traps: sh[15643] general protection ip:7f7eaef3b2fc sp:7ffce35ccec0 error:0 in libc-2.13.so[7f7eaef06000+184000]
[1177317.808201] traps: sh[15634] general protection ip:7f47cdaf82fc sp:7ffe652a60a0 error:0 in libc-2.13.so[7f47cdac3000+184000]
[1177318.773376] traps: sh[15648] general protection ip:7f350d4fe2fc sp:7ffe0ea4dd10 error:0 in libc-2.13.so[7f350d4c9000+184000]
[1177321.036155] traps: sh[15669] general protection ip:7fe8b5c312fc sp:7ffc5184f0c0 error:0 in libc-2.13.so[7fe8b5bfc000+184000]

Here's pveversion output:
# pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.8-1-pve: 4.2.8-39
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
fence-agents-pve: not correctly installed
openvswitch-switch: 2.3.2-2
 
I have been looking at the server monitoring in an attempt to help you isolate the problem. All seems to point to a memory leak somewhere inside lxcfs.

When the problem starts, I see this:
1. The processes in interruptible state start to increase linearly.
2. The number of total processes starts to increase linearly.
3. The memory used by the server increases linearly. This is all "committed" memory.
4. Network traffic drops.
5. Server load increases exponentially.
(Attached monitoring graphs: threads-day-2.png, multips_memory-day.png, processes-day-2.png, cpu-day.png, load-day.png, memory-day-2.png)
 
I got one more hang this morning. CPU load avg > 1200 (never seen it before).

There were 3 lxcfs processes on the server. One was eating huge amounts of memory.
It was not possible to debug. See attached session:

Code:
~# ps auxw|grep lxcfs
root      2067  0.0  0.0 751588  1412 ?        S    03:49   0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root      2068  0.0  0.0 751720  1412 ?        S    03:49   0:00 /usr/bin/lxcfs /var/lib/lxcfs/
root      2474  0.3  0.1 28841028 45420 ?      Ssl  Mar14  64:32 /usr/bin/lxcfs /var/lib/lxcfs/
~# uptime
09:40:47 up 13 days, 14:39,  2 users,  load average: 1221.00, 1172.77, 1120.05

Please use "ps faxl" and post the complete output, like I said above. This enables us to see if (and where) processes are waiting in kernel space. Your lxcfs process was started before the updated package was put into pvetest - did you upgrade from lxcfs < 2.0.0? The reload mechanism for lxcfs was introduced in 2.0.0, and like I said, might not work in all cases even when upgrading from 2.0.0. It's very likely that you were hit by the issue because your container was still using the old, outdated copy of the lxcfs binaries.

(gdb) attach 2474
Attaching to process 2474
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb)
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x8831ad00:
#0 0x84ff3050 in ?? ()
Cannot access memory at address 0x8831ad00
(gdb) attach 2067
Attaching to process 2067
/usr/bin/lxcfs (deleted): No such file or directory.
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7d7f9750:
#0 0x84ff44c9 in ?? ()
Cannot access memory at address 0x7d7f9750

Again, the (deleted) mark shows that this process is using binaries from before the upgrade. Please test if the issue persists with the correctly upgraded lxcfs - if in doubt, restart the containers to make sure you don't have any lxcfs processes using the old binaries.
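
A hedged shortcut to spot this situation without gdb: a process that is still running a binary which was replaced on disk shows "(deleted)" on its /proc exe link.

Code:
# list lxcfs processes that still run the pre-upgrade (deleted) binary
for pid in $(pidof lxcfs); do
    ls -l /proc/$pid/exe | grep -q '(deleted)' \
        && echo "PID $pid is running an old, deleted lxcfs binary"
done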
 
Fabian,

I think that lxcfs was upgraded and the containers were restarted afterwards. I might be wrong though. Now a couple of servers have been downgraded to kernel 4.2.6 with an upgraded lxcfs 2.0.0-pve2. There are a few other servers running 4.2.8 with lxcfs 2.0.0-pve2, so I'm waiting now.

I will try to post the required info next time it happens (if it happens). The problem is that these are live systems and downtime is very expensive.

It looks related to something that happens under load and/or disk access. I do backups at night (which is when it's most likely to happen). I can make it happen during the day if I load two or three of the heavy containers onto the same node and give it a couple of days. They're Postfix/Dovecot toaster boxes, loading the mail spool from NFS with a bind mount.
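
For context, such a bind mount is usually configured in the container config under /etc/pve/lxc/<ID>.conf; the paths below are made-up placeholders, not the poster's actual setup, and which of the two syntaxes applies depends on the pve-container version:

Code:
# /etc/pve/lxc/<ID>.conf - illustrative placeholders only
# newer pve-container versions: mount point style
mp0: /mnt/nfs/mailspool,mp=/var/vmail
# older versions: raw LXC mount entry (destination is relative to the rootfs)
lxc.mount.entry: /mnt/nfs/mailspool var/vmail none bind,create=dir 0 0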
 
Hey,

I get one "hung_task_timeout_secs" event every day, but I am not sure whether it is the same issue, because every kernel log output contains:
"task corosync:2280 blocked for more than 120 seconds"

When this issue happens:
  • the web interface doesn't work correctly (login buggy or not accessible)
  • LXC containers don't work correctly (no SSH connection possible)
The system is still running in this state, and I can provide more information.

pveversion -v:
Code:
root@vp-proxmoxS2:~# pveversion -v
proxmox-ve: 4.1-41 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
openvswitch-switch: 2.3.2-2
 

Attachments

  • gdb.output.txt (2.9 KB)
  • ps_faxl.txt (59.4 KB)
  • syslog.1.txt (112.4 KB)

This is an unrelated issue (your lxcfs process does not hang). The "task XX blocked for more than.." message is usually just a symptom of an overloaded system: the kernel reports that something was blocked for more than two minutes, which can happen simply because the system is overloaded, without any actual error. If you want to troubleshoot this further, please open a new thread!
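
For reference, the two-minute threshold comes from a kernel sysctl; a hedged example of how to check it and locate the affected tasks in the log:

Code:
# timeout after which the kernel reports blocked tasks (default: 120 seconds)
sysctl kernel.hung_task_timeout_secs
# locate the corresponding reports in the kernel log
dmesg | grep "blocked for more than"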
 
Fabian,

you might be interested to know that I've had no issues for some days now. The servers running 4.2.6+lxcfs pve2 are solid (two of them, different hardware). The ones with 4.2.8 + pve2 have been working too.

jinjer.
 
Good to hear! Thanks for your feedback.
 
I am having regular issues too, but in my case they extend to not being able to log in through either the web GUI or SSH. The last time this happened I lost access to the web GUI and all CTs except one. Unfortunately I lost access while away, after nearly 2 years of relatively trouble-free uptime. (Thank you Proxmox team, you guys rock!)

Has the main repository been updated with the patch, or must we still use the experimental one?
 