[SOLVED] PVE suddenly stopped working, all CTs unreachable

Containers are running, but unresponsive.

Containers will not shut down, and the only way to recover the host is a hard reset.

Code:
root@px-a:~# pct shutdown 801
command 'lxc-wait -n 801 -t 5 -s STOPPED' failed: exit code 1
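
For reference, before hard resetting, the hang can be confirmed with a few standard checks (just a sketch; CT 801 is the stuck container from above, and the exact output will vary):

Code:
# What state do Proxmox and LXC think the container is in?
pct status 801
lxc-info -n 801

# Processes stuck in uninterruptible sleep ("D" state) -- typically the
# readers blocked on the hung lxcfs/fuse mount
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Kernel-side evidence of the hang (the traces quoted below)
dmesg | grep -B2 -A15 "blocked for more than 120 seconds"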

Code:
Mar 14 15:05:20 px-a kernel: [94764.686192]  [<ffffffff81806967>] schedule+0x37/0x80
Mar 14 15:05:20 px-a kernel: [94764.686199]  [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
Mar 14 15:05:20 px-a kernel: [94764.686205]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 14 15:07:20 px-a kernel: [94884.793163]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Mar 14 15:07:20 px-a kernel: [94884.793169]  [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
Mar 14 15:07:20 px-a kernel: [94884.793177]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 14 15:07:20 px-a kernel: [94884.794275]  ffff880e66387be8 0000000000000086 ffff881038aa6e00 ffff880c813e8dc0
Mar 14 15:07:20 px-a kernel: [94884.794281]  [<ffffffff812fb863>] request_wait_answer+0x163/0x280
Mar 14 15:07:20 px-a kernel: [94884.794287]  [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
Mar 14 15:07:20 px-a kernel: [94884.794293]  [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
 
I have had this issue for two weeks, and only since upgrading to the latest kernel.
 

Attachments

  • opera_2016-03-15_08-23-24.png (103.9 KB)
Please install the lxcfs-dbg and gdb packages and, when the issue occurs, provide the output of "ps faxl" on the host as well as gdb backtraces of the lxcfs processes (start gdb, enter "attach <PID>" where <PID> is the lxcfs PID, enter "bt", copy the output, enter "detach", and repeat for each lxcfs process). We cannot reproduce this so far...
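
If it is easier, the collection can be scripted; this is only a sketch that assumes gdb's non-interactive batch mode is acceptable in place of the manual attach/bt/detach steps described above:

Code:
# collect the process listing
ps faxl > ps-faxl.txt

# one backtrace per running lxcfs process
for pid in $(pidof lxcfs); do
    echo "=== lxcfs PID $pid ===" >> lxcfs-backtraces.txt
    gdb -batch -p "$pid" -ex "bt" >> lxcfs-backtraces.txt 2>&1
done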
 
It is happening again right now.

ps faxl output is attached.

gdb output (3 lxcfs processes):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x00007f9e88e14588 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
#2 0x00007f9e88e19d27 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
#3 0x00007f9e88e1c91d in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#4 0x000000000040165a in main (argc=2, argv=<optimized out>) at lxcfs.c:937


#0 0x00007f9e887da4c9 in __libc_waitpid (pid=pid@entry=10201, stat_loc=stat_loc@entry=0x7f9e87c1d774,
options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1 0x00007f9e88bef94c in wait_for_pid (pid=10201) at bindings.c:975
#2 0x00007f9e88bf036a in wait_for_pid (pid=<optimized out>) at bindings.c:889
#3 write_task_init_pid_exit (target=10199, sock=5) at bindings.c:890
#4 get_init_pid_for_task (task=10199) at bindings.c:923
#5 lookup_initpid_in_store (qpid=10199) at bindings.c:955
#6 0x00007f9e88bf5312 in proc_meminfo_read (fi=<optimized out>, offset=<optimized out>, size=<optimized out>,
buf=<optimized out>) at bindings.c:2864
#7 proc_read (path=0x27d9 <error: Cannot access memory at address 0x27d9>, buf=0x7f9e78000b20 "", size=4096,
offset=0, fi=0x7f9e70000aa0) at bindings.c:3738
#8 0x000000000040217b in do_proc_read (fi=<optimized out>, offset=<optimized out>, size=<optimized out>,
buf=<optimized out>, path=<optimized out>) at lxcfs.c:184
#9 lxcfs_read (path=<optimized out>, buf=0x7f9e78000b20 "", size=4096, offset=0, fi=0x7f9e87c1dd40) at lxcfs.c:508
#10 0x00007f9e88e0e574 in fuse_fs_read_buf () from /lib/x86_64-linux-gnu/libfuse.so.2
#11 0x00007f9e88e0e732 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#12 0x00007f9e88e1708e in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#13 0x00007f9e88e17895 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#14 0x00007f9e88e14394 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#15 0x00007f9e887d30a4 in start_thread (arg=0x7f9e87c1e700) at pthread_create.c:309
#16 0x00007f9e8850887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f9e8849e5fb in _L_lock_11305 () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f9e8849c758 in __GI___libc_realloc (oldmem=0x7f9e887c5620 <main_arena>, bytes=bytes@entry=567)
at malloc.c:3025
#3 0x00007f9e8849216b in _IO_vasprintf (result_ptr=0x7f9e87c1d6d8, format=<optimized out>,
args=args@entry=0x7f9e87c1d5b8) at vasprintf.c:84
#4 0x00007f9e88470f47 in ___asprintf (string_ptr=string_ptr@entry=0x7f9e87c1d6d8,
format=format@entry=0x7f9e88587528 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n") at asprintf.c:35
#5 0x00007f9e8844e1c2 in __assert_fail_base (fmt=0x7f9e88587528 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7f9e8858a840 "({ __typeof (self->tid) __value; if (sizeof (__value) == 1) asm volatile (\"movb %%fs:%P2,%b0\" : \"=q\" (__value) : \"0\" (0), \"i\" (__builtin_offsetof (struct pthread, tid))); else if (sizeof (__value) == "..., file=file@entry=0x7f9e8858a808 "../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c",
line=line@entry=141, function=function@entry=0x7f9e88585052 <__PRETTY_FUNCTION__.11206> "__libc_fork")
at assert.c:57
#6 0x00007f9e8844e312 in __GI___assert_fail (
assertion=0x7f9e8858a840 "({ __typeof (self->tid) __value; if (sizeof (__value) == 1) asm volatile (\"movb %%fs:%P2,%b0\" : \"=q\" (__value) : \"0\" (0), \"i\" (__builtin_offsetof (struct pthread, tid))); else if (sizeof (__value) == "..., file=0x7f9e8858a808 "../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c", line=141,
function=0x7f9e88585052 <__PRETTY_FUNCTION__.11206> "__libc_fork") at assert.c:101
#7 0x00007f9e884da235 in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:141
#8 0x00007f9e887dc425 in __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:25
#9 0x00007f9e88bf035d in write_task_init_pid_exit (target=10199, sock=5) at bindings.c:886
#10 get_init_pid_for_task (task=10199) at bindings.c:923
#11 lookup_initpid_in_store (qpid=10199) at bindings.c:955
#12 0x00007f9e88bf5312 in proc_meminfo_read (fi=<optimized out>, offset=<optimized out>, size=<optimized out>,
buf=<optimized out>) at bindings.c:2864
#13 proc_read (path=0x7f9e887c5620 <main_arena> "\002", buf=0x7f9e78000b20 "", size=4096, offset=0, fi=0x7f9e70000aa0)
at bindings.c:3738
#14 0x000000000040217b in do_proc_read (fi=<optimized out>, offset=<optimized out>, size=<optimized out>,
buf=<optimized out>, path=<optimized out>) at lxcfs.c:184
#15 lxcfs_read (path=<optimized out>, buf=0x7f9e78000b20 "", size=4096, offset=0, fi=0x7f9e87c1dd40) at lxcfs.c:508
#16 0x00007f9e88e0e574 in fuse_fs_read_buf () from /lib/x86_64-linux-gnu/libfuse.so.2
#17 0x00007f9e88e0e732 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#18 0x00007f9e88e1708e in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#19 0x00007f9e88e17895 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#20 0x00007f9e88e14394 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#21 0x00007f9e887d30a4 in start_thread (arg=0x7f9e87c1e700) at pthread_create.c:309
#22 0x00007f9e8850887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 

It has happened again, on another host.

It happened for the first time yesterday at 18:30 and I had to hard reboot. It happened again just now (first symptom: Icinga going full red). Hard reboot again.

Same thing:

Code:
Mar 15 11:20:21 xxx kernel: [59849.436873] ffff880b2e58fbe8 0000000000000086 ffffffff81e14580 ffff880e3bd75280
Mar 15 11:20:21 xxx kernel: [59849.455453] ffff8810341c6040 fffffffffffffe00 ffff880108e17c08 ffffffff81806967
Mar 15 11:20:21 xxx kernel: [59849.459213] [<ffffffff812fba47>] fuse_request_send+0x27/0x30
Mar 15 11:22:21 xxx kernel: [59969.646558] [<ffffffff811fd264>] new_sync_read+0x94/0xd0
Mar 15 11:22:21 xxx kernel: [59969.650939] ffff8810341c6040 fffffffffffffe00 ffff880b2e58fc08 ffffffff81806967
Mar 15 11:22:21 xxx kernel: [59969.652641] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Mar 15 11:22:21 xxx kernel: [59969.653817] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
Mar 15 11:22:21 xxx kernel: [59969.656056] [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
Mar 15 11:22:21 xxx kernel: [59969.656538] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 15 11:22:21 xxx kernel: [59969.657674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 11:22:21 xxx kernel: [59969.660989] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Mar 15 11:22:21 xxx kernel: [59969.662818] [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
Mar 15 11:22:21 xxx kernel: [59969.664835] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 15 11:22:21 xxx kernel: [59969.665963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 11:22:21 xxx kernel: [59969.669314] [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Mar 15 11:22:21 xxx kernel: [59969.671183] [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
Mar 15 11:22:21 xxx kernel: [59969.673183] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Mar 15 11:22:21 xxx kernel: [59969.675017] ffff88010c47bbe8 0000000000000086 ffff881038535280 ffff880ef61b3700
Mar 15 11:22:21 xxx kernel: [59969.677900] [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
Mar 15 11:22:21 xxx kernel: [59969.681038] [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
Mar 15 11:22:21 xxx kernel: [59969.685027] [<ffffffff81806967>] schedule+0x37/0x80
Mar 15 11:22:21 xxx kernel: [59969.690358] Tainted: P O 4.2.8-1-pve #1
Mar 15 11:22:21 xxx kernel: [59969.693556] [<ffffffff812fb863>] request_wait_answer+0x163/0x280
Mar 15 11:22:21 xxx kernel: [59969.695024] [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
Mar 15 11:22:21 xxx kernel: [59969.696672] [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
Mar 15 11:22:21 xxx kernel: [59969.699628] ffff8809c773bbe8 0000000000000082 ffffffff81e14580 ffff880f51c36e00
Mar 15 11:22:21 xxx kernel: [59969.701324] [<ffffffff81806967>] schedule+0x37/0x80
Mar 15 11:22:21 xxx kernel: [59969.705867] [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
 
I have downgraded the kernel to 4.2.6-1-pve and lxcfs to 0.13-pve3. So far so good, but I only did it yesterday, so it is too soon to tell whether it really helped.
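
For reference, the downgrade was done roughly like this (only a sketch; the pve-kernel package name and the lxcfs .deb URL are assumptions based on the usual Proxmox naming, not something re-verified here):

Code:
# sketch only -- package name and URL are assumptions, not verified
apt-get install pve-kernel-4.2.6-1-pve     # then select it in the GRUB menu

# downgrade lxcfs to 0.13-pve3
wget ftp://download1.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs_0.13-pve3_amd64.deb
dpkg -i lxcfs_0.13-pve3_amd64.deb

reboot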
 
It is happening again right now.

ps faxl output is attached.

gdb output (3 lxcfs processes):

Thank you! An updated lxcfs package for testing that should fix this issue will be available soon.

@others: please collect the debugging output so we can see whether your issue has the same cause or a different one!
 
Is it possible to use the updated lxcfs with the included bugfix without enabling the testing repository, and without causing problems?

In January, Mr. Maurer told me to hot-fix a similar problem using:

# wget ftp://download1.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs_0.13-pve3_amd64.deb
# dpkg -i lxcfs_0.13-pve3_amd64.deb


This hotfix was usable on nodes using the stable repositories and did not affect any updates/upgrades later on. Will this be the case here, or do I have to wait until the fix is available in the stable branch of Proxmox?
 
The only change between the current pve-enterprise and pve-no-subscription lxcfs packages and the two in pvetest is a bugfix for exactly the issue posted by ianux. You can safely downgrade to the version in pve-no-subscription again after testing if it does not help. The differences between lxcfs 0.13 (which is outdated) and 2.0.0 are a lot bigger, but 0.13 is not supported any more anyway.

If you are installing the test packages on an up-to-date system (i.e., lxcfs 2.0.0-pve1), lxcfs should reload automatically without the need to restart containers or the host (you can observe this in the system log). But this feature is new with 2.0.0, so if possible (and to rule out false negatives regarding the effectiveness of the bug fix), it might be a good idea to restart the containers anyway.

Note: only install those two packages (lxcfs, lxcfs-dbg) from pvetest. There are a lot of other testing packages there that are probably not safe to install on production systems!
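
Put together, the dpkg-only route looks roughly like this; the exact .deb file names in pvetest are placeholders here (check the repository listing for the real ones):

Code:
# sketch -- <version> is a placeholder, check the pvetest repository for the
# actual lxcfs 2.0.0 file names
cd /tmp
wget ftp://download1.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs_<version>_amd64.deb
wget ftp://download1.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/lxcfs-dbg_<version>_amd64.deb
dpkg -i lxcfs_<version>_amd64.deb lxcfs-dbg_<version>_amd64.deb

# confirm the installed version and look for the automatic reload in the log
dpkg -l lxcfs lxcfs-dbg
grep lxcfs /var/log/syslog | tail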
 
Just to clarify: should I change the sources from pve-subscription to testing, rather than only installing the .debs you provided?

If I change to testing, are there other packages that will be upgraded? I cannot use unstable software on nodes that are in production.
 
No! Like I said, only install those two packages (i.e., by downloading them and using "dpkg -i <PKG>"). If you enable the pvetest repository and upgrade, ALL the packages will be upgraded to their version in pvetest. This is not recommended unless you do it on a system purely for testing/development.
 
Okay, thank you very much. And no problems are expected with upcoming updates from the pve-enterprise branch?
 
Subsequent released upgrades of the lxcfs packages will be compatible, yes.
 
I have the same issue and have been struggling with it for a couple of days...

I applied the fixes; hopefully it helps - it seems to happen once or twice a day.

I was convinced it was caused by ZFS; luckily that's not the issue!

Code:
[17520.128358] INFO: task pyzor:7408 blocked for more than 120 seconds.
[17520.130319]       Tainted: P           O    4.2.8-1-pve #1
[17520.131461] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17520.133198] pyzor           D ffff880792c16a00     0  7408  30261 0x00000104
[17520.134731]  ffff880515c13be8 0000000000000082 ffffffff81e14580 ffff8805f9afe040
[17520.136422]  0000000000000246 ffff880515c14000 ffff880515c13c38 ffff88076808c8a0
[17520.137779]  ffff880768765040 fffffffffffffe00 ffff880515c13c08 ffffffff81806967
[17520.138612] Call Trace:
[17520.138886]  [<ffffffff81806967>] schedule+0x37/0x80
[17520.139419]  [<ffffffff812fb863>] request_wait_answer+0x163/0x280
[17520.140088]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
[17520.140666]  [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
[17520.141311]  [<ffffffff812fba47>] fuse_request_send+0x27/0x30
[17520.141926]  [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
[17520.142529]  [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
[17520.143146]  [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
[17520.143797]  [<ffffffff811fd264>] new_sync_read+0x94/0xd0
[17520.144394]  [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
[17520.144940]  [<ffffffff811fd91a>] vfs_read+0x8a/0x130
[17520.145503]  [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
[17520.146032]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
 
