WebGUI unresponsive, qm * crashes, system won't even reboot

FanHi

New Member
Jun 19, 2020
Hi,

I have no idea why it would cause this problem, but judging from some other threads, it might be related to a CIFS mount going offline.

I have an LXC container that provides a Samba share for some backup jobs. For monitoring purposes I added this share on the host system.
After I installed some updates and rebooted the container, the WebGUI went offline.
When I connected to the host via SSH, any command remotely related to PVE (qm, pvesm, etc.) immediately froze the SSH session.
Multiple PVE services seem to have crashed and wouldn't restart, and I was not able to enter /etc/pve without the session freezing immediately.
The system wouldn't even reboot without a hardware reset.
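
One way to test the suspicion without freezing the SSH session is to probe each path from a child process with a timeout. The following is only a minimal sketch: the mount point /mnt/backup-share is a hypothetical name, since the post does not say where the share is mounted, and the 5-second timeout is an arbitrary choice.

```python
import subprocess
import sys


def path_responds(path, timeout=5.0):
    """Return True if stat() on `path` succeeds within `timeout` seconds."""
    # Run the stat in a child process so that a hang inside the kernel
    # blocks the child and not this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child in uninterruptible sleep ignores the
        # kill until its I/O completes, so we deliberately do not wait for it.
        child.kill()
        return False


if __name__ == "__main__":
    # /etc/pve is taken from the post; /mnt/backup-share is an assumed name.
    for p in ["/etc/pve", "/mnt/backup-share"]:
        state = "ok" if path_responds(p) else "no answer (hung or missing)"
        print(p, state)
```

A path that does not exist also reports as failed, so the result only narrows things down. The child is never waited on after a timeout because a process blocked in the kernel cannot be reaped until its request returns.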

I am pretty sure I have rebooted this container before without anything like this happening.
Why would the system behave that way, and how can I prevent this in the future?
If the CIFS share is not likely to be the problem, what else could it be?

Thank you very much in advance!
Best regards,
FanHi
 
Nothing particularly conclusive (at least for me):

It appears to have started with me rebooting vmid100:

Code:
 02:38:03 server kernel: [2977849.988788] fwbr100i1: port 2(veth100i1) entered disabled state
 02:38:03 server kernel: [2977849.990693] device veth100i1 left promiscuous mode
 02:38:03 server kernel: [2977849.990700] fwbr100i1: port 2(veth100i1) entered disabled state
 02:38:04 server kernel: [2977850.423163] audit: type=1400 audit(1592786284.214:207): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-st$
 02:38:05 server kernel: [2977851.530518] fwbr100i1: port 1(fwln100i1) entered disabled state
 02:38:05 server kernel: [2977851.531748] vmbr2: port 2(fwpr100p1) entered disabled state
 02:38:05 server kernel: [2977851.533632] device fwln100i1 left promiscuous mode
 02:38:05 server kernel: [2977851.533639] fwbr100i1: port 1(fwln100i1) entered disabled state
 02:38:05 server kernel: [2977851.564675] device fwpr100p1 left promiscuous mode
 02:38:05 server kernel: [2977851.564677] vmbr2: port 2(fwpr100p1) entered disabled state

and then the log contains identical blocks like this one for tasks qm and pveproxy:

Code:
03:43:26 server kernel: [2981772.845192] INFO: task qm:20874 blocked for more than 845 seconds.
 03:43:26 server kernel: [2981772.845858]       Tainted: P          IO      5.3.10-1-pve #1
 03:43:26 server kernel: [2981772.846404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 03:43:26 server kernel: [2981772.846936] qm              D    0 20874  20704 0x00004004
 03:43:26 server kernel: [2981772.846938] Call Trace:
 03:43:26 server kernel: [2981772.846948]  __schedule+0x2bb/0x660
 03:43:26 server kernel: [2981772.846949]  schedule+0x33/0xa0
 03:43:26 server kernel: [2981772.846954]  request_wait_answer+0x133/0x210
 03:43:26 server kernel: [2981772.846957]  ? wait_woken+0x80/0x80
 03:43:26 server kernel: [2981772.846959]  __fuse_request_send+0x69/0x90
 03:43:26 server kernel: [2981772.846960]  fuse_request_send+0x29/0x30
 03:43:26 server kernel: [2981772.846961]  fuse_simple_request+0xdd/0x1a0
 03:43:26 server kernel: [2981772.846963]  fuse_dentry_revalidate+0x1a0/0x310
 03:43:26 server kernel: [2981772.846967]  lookup_fast+0x292/0x310
 03:43:26 server kernel: [2981772.846969]  walk_component+0x49/0x330
 03:43:26 server kernel: [2981772.846970]  ? inode_permission+0x63/0x1a0
 03:43:26 server kernel: [2981772.846972]  link_path_walk.part.43+0x2c6/0x540
 03:43:26 server kernel: [2981772.846973]  path_parentat.isra.44+0x2f/0x80
 03:43:26 server kernel: [2981772.846975]  filename_parentat.isra.59.part.60+0xa4/0x180
 03:43:26 server kernel: [2981772.846977]  ? ext4_file_read_iter+0x54/0xf0
 03:43:26 server kernel: [2981772.846979]  filename_create+0x55/0x180
 03:43:26 server kernel: [2981772.846980]  ? getname_flags+0x6f/0x1e0
 03:43:26 server kernel: [2981772.846981]  do_mkdirat+0x59/0x110
 03:43:26 server kernel: [2981772.846983]  __x64_sys_mkdir+0x1b/0x20
 03:43:26 server kernel: [2981772.846986]  do_syscall_64+0x5a/0x130
 03:43:26 server kernel: [2981772.846989]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 03:43:26 server kernel: [2981772.846991] RIP: 0033:0x7fdad27860d7
 03:43:26 server kernel: [2981772.846995] Code: Bad RIP value.
 03:43:26 server kernel: [2981772.846996] RSP: 002b:00007ffec3c8b798 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
 03:43:26 server kernel: [2981772.846997] RAX: ffffffffffffffda RBX: 000055ec5fc9b260 RCX: 00007fdad27860d7
 03:43:26 server kernel: [2981772.846998] RDX: 0000000000000019 RSI: 00000000000001ff RDI: 000055ec63485b00
 03:43:26 server kernel: [2981772.846999] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000c
 03:43:26 server kernel: [2981772.846999] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ec611ff5f8
 03:43:26 server kernel: [2981772.847000] R13: 000055ec63485b00 R14: 000055ec63219890 R15: 00000000000001ff
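
Since the log contains many of these blocks, a small script can summarize which tasks were reported as hung and whether FUSE frames appear in the traces. This is a rough sketch that assumes the log lines look like the excerpt above; the function names are made up for illustration. Frames such as fuse_request_send suggest the task was waiting on a FUSE filesystem request, which would fit the /etc/pve hang, but the script only reports what the log says.

```python
import re

# Matches the kernel's hung-task report line, e.g.
# "INFO: task qm:20874 blocked for more than 845 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(lines):
    """Return {(task name, pid): longest reported block time in seconds}."""
    worst = {}
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            key = (m.group("name"), int(m.group("pid")))
            worst[key] = max(worst.get(key, 0), int(m.group("secs")))
    return worst


def mentions_fuse(lines):
    """Return True if any line contains a FUSE function name."""
    return any("fuse_" in line for line in lines)


# Two lines copied from the log excerpt in this post.
sample = [
    "03:43:26 server kernel: [2981772.845192] INFO: task qm:20874 "
    "blocked for more than 845 seconds.",
    "03:43:26 server kernel: [2981772.846959]  __fuse_request_send+0x69/0x90",
]

print(summarize_hung_tasks(sample))
print(mentions_fuse(sample))
```

To run it against a real log, pass the lines of the saved kernel log (for example, an open file object) instead of `sample`.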


I can check the RAM in a few days; I don't have easy access to the machine right now.
 
