WebGUI unresponsive, qm * crashes, system won't even reboot

FanHi

New Member
Jun 19, 2020
Hi,

I have no idea why it would cause this problem, but judging from some other threads, it might have to do with a CIFS mount going offline.

I have an LXC container which provides a Samba share for some backup jobs. For monitoring purposes I added this share on the host system.
After installing some updates and rebooting the container, the web GUI went offline.
When connecting to the host via SSH, any command remotely related to PVE (qm, pvesm, etc.) immediately caused a frozen SSH session.
Multiple PVE services seem to have crashed and wouldn't restart, and I was not able to enter /etc/pve without the session crashing immediately.
The system wouldn't even reboot without a hardware reset.

I am pretty sure I have rebooted said lxc before without anything like this happening.
Why would the system behave that way, and how can I prevent this in the future?
If the cifs-share is not likely to be the problem - what else could it be?
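To check whether a path such as the CIFS mount point or /etc/pve still responds without freezing the shell that runs the check, something like the following could be used. This is only a minimal sketch: `probe_path` is a made-up helper name, the 5-second timeout is an arbitrary choice, and a "timeout" result only says the path did not answer in time, not why.

```python
import subprocess
import sys

def probe_path(path, timeout=5.0):
    """Stat `path` in a child process and return "ok", "error", or "timeout".

    The stat runs in a separate process so that a hung filesystem blocks
    the child rather than the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child may be stuck in uninterruptible sleep (state D).
        # Send SIGKILL but do not wait for it, or this call would hang too.
        child.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"

if __name__ == "__main__":
    # "/" and "/etc/pve" are real paths on a PVE host; add the mount point
    # of the share here (its location is not given in this thread).
    for candidate in ["/", "/etc/pve"]:
        print(candidate, probe_path(candidate))
```

Running this from cron or a monitoring agent would at least show which path stops answering first, without tying up an interactive session.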

Thank you very much in advance!
Best regards,
FanHi
 
Nothing particularly conclusive (at least for me):

It appears to have started with me rebooting vmid100:

Code:
 02:38:03 server kernel: [2977849.988788] fwbr100i1: port 2(veth100i1) entered disabled state
 02:38:03 server kernel: [2977849.990693] device veth100i1 left promiscuous mode
 02:38:03 server kernel: [2977849.990700] fwbr100i1: port 2(veth100i1) entered disabled state
 02:38:04 server kernel: [2977850.423163] audit: type=1400 audit(1592786284.214:207): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-st$
 02:38:05 server kernel: [2977851.530518] fwbr100i1: port 1(fwln100i1) entered disabled state
 02:38:05 server kernel: [2977851.531748] vmbr2: port 2(fwpr100p1) entered disabled state
 02:38:05 server kernel: [2977851.533632] device fwln100i1 left promiscuous mode
 02:38:05 server kernel: [2977851.533639] fwbr100i1: port 1(fwln100i1) entered disabled state
 02:38:05 server kernel: [2977851.564675] device fwpr100p1 left promiscuous mode
 02:38:05 server kernel: [2977851.564677] vmbr2: port 2(fwpr100p1) entered disabled state

and then the log contains identical blocks like this one for tasks qm and pveproxy:

Code:
03:43:26 server kernel: [2981772.845192] INFO: task qm:20874 blocked for more than 845 seconds.
 03:43:26 server kernel: [2981772.845858]       Tainted: P          IO      5.3.10-1-pve #1
 03:43:26 server kernel: [2981772.846404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 03:43:26 server kernel: [2981772.846936] qm              D    0 20874  20704 0x00004004
 03:43:26 server kernel: [2981772.846938] Call Trace:
 03:43:26 server kernel: [2981772.846948]  __schedule+0x2bb/0x660
 03:43:26 server kernel: [2981772.846949]  schedule+0x33/0xa0
 03:43:26 server kernel: [2981772.846954]  request_wait_answer+0x133/0x210
 03:43:26 server kernel: [2981772.846957]  ? wait_woken+0x80/0x80
 03:43:26 server kernel: [2981772.846959]  __fuse_request_send+0x69/0x90
 03:43:26 server kernel: [2981772.846960]  fuse_request_send+0x29/0x30
 03:43:26 server kernel: [2981772.846961]  fuse_simple_request+0xdd/0x1a0
 03:43:26 server kernel: [2981772.846963]  fuse_dentry_revalidate+0x1a0/0x310
 03:43:26 server kernel: [2981772.846967]  lookup_fast+0x292/0x310
 03:43:26 server kernel: [2981772.846969]  walk_component+0x49/0x330
 03:43:26 server kernel: [2981772.846970]  ? inode_permission+0x63/0x1a0
 03:43:26 server kernel: [2981772.846972]  link_path_walk.part.43+0x2c6/0x540
 03:43:26 server kernel: [2981772.846973]  path_parentat.isra.44+0x2f/0x80
 03:43:26 server kernel: [2981772.846975]  filename_parentat.isra.59.part.60+0xa4/0x180
 03:43:26 server kernel: [2981772.846977]  ? ext4_file_read_iter+0x54/0xf0
 03:43:26 server kernel: [2981772.846979]  filename_create+0x55/0x180
 03:43:26 server kernel: [2981772.846980]  ? getname_flags+0x6f/0x1e0
 03:43:26 server kernel: [2981772.846981]  do_mkdirat+0x59/0x110
 03:43:26 server kernel: [2981772.846983]  __x64_sys_mkdir+0x1b/0x20
 03:43:26 server kernel: [2981772.846986]  do_syscall_64+0x5a/0x130
 03:43:26 server kernel: [2981772.846989]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 03:43:26 server kernel: [2981772.846991] RIP: 0033:0x7fdad27860d7
 03:43:26 server kernel: [2981772.846995] Code: Bad RIP value.
 03:43:26 server kernel: [2981772.846996] RSP: 002b:00007ffec3c8b798 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
 03:43:26 server kernel: [2981772.846997] RAX: ffffffffffffffda RBX: 000055ec5fc9b260 RCX: 00007fdad27860d7
 03:43:26 server kernel: [2981772.846998] RDX: 0000000000000019 RSI: 00000000000001ff RDI: 000055ec63485b00
 03:43:26 server kernel: [2981772.846999] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000c
 03:43:26 server kernel: [2981772.846999] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ec611ff5f8
 03:43:26 server kernel: [2981772.847000] R13: 000055ec63485b00 R14: 000055ec63219890 R15: 00000000000001ff
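
Since the log contains many of these blocks, a small script can summarize them. The sketch below collects every "blocked for more than" report and flags the ones whose call trace contains FUSE functions, as the qm trace above does (`request_wait_answer`, `fuse_request_send`). On a PVE host /etc/pve is a FUSE mount provided by pmxcfs, so such frames would be consistent with that mount not answering; on their own they neither confirm nor rule out the CIFS share as the trigger. The function name and the regular expressions are my own assumptions, written against the line format shown above.

```python
import re

# Header of a hung-task report, e.g.
# "INFO: task qm:20874 blocked for more than 845 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace frame, e.g. "]  ? wait_woken+0x80/0x80"
FRAME_RE = re.compile(
    r"\]\s+(?:\? )?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def summarize_hung_tasks(lines):
    """Return one dict per hung-task report found in `lines`."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "blocked_secs": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append(frame["func"])
    for report in reports:
        # Flag traces that pass through the kernel's FUSE request path.
        report["waiting_on_fuse"] = any(
            func.startswith(("fuse_", "__fuse_")) for func in report["frames"]
        )
    return reports

if __name__ == "__main__":
    import sys
    # Usage: python3 hung_tasks.py < /var/log/kern.log
    for report in summarize_hung_tasks(sys.stdin):
        print(report["task"], report["pid"], report["blocked_secs"],
              "fuse" if report["waiting_on_fuse"] else "other")
```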


I can check the RAM in a few days; I don't have easy access to the machine right now.