Repeatedly non-starting LXCs

Morning,

We have a couple of PVE nodes. Our largest one has 256 GB of RAM and is heavily used. After a reboot we sit at around 60-65% memory utilization. Over the course of hours it rises to 85-90%, but the host keeps 20+ GB of free memory available.

We now have LXCs and VMs failing to start, and it seems to be memory-related, but I have no clue where to check.

Code:
Sep 03 16:29:09 pve02 pvedaemon[1762147]: starting CT 108: UPID:pve02:001AE363:07104002:631364B5:vzstart:108:root@pam:
Sep 03 16:29:09 pve02 pvedaemon[3450029]: <root@pam> starting task UPID:pve02:001AE363:07104002:631364B5:vzstart:108:root@pam:
Sep 03 16:29:09 pve02 systemd[1]: Started PVE LXC Container: 108.
Sep 03 16:29:10 pve02 audit[1762165]: AVC apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762165 comm="apparmor_parser"
Sep 03 16:29:10 pve02 kernel: audit: type=1400 audit(1662215350.428:310): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762165 comm="apparmor_parser"
Sep 03 16:29:10 pve02 kernel: lxc-start: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=108,mems_allowed=0
Sep 03 16:29:10 pve02 kernel: CPU: 29 PID: 1762151 Comm: lxc-start Tainted: P           O      5.15.35-1-pve #1
Sep 03 16:29:10 pve02 kernel: Hardware name: primeLine Solutions egino BTO/H12SSL-C, BIOS 2.1 06/02/2021
Sep 03 16:29:10 pve02 kernel: Call Trace:
Sep 03 16:29:10 pve02 kernel:  <TASK>
Sep 03 16:29:10 pve02 kernel:  dump_stack_lvl+0x4a/0x5f
Sep 03 16:29:10 pve02 kernel:  dump_stack+0x10/0x12
Sep 03 16:29:10 pve02 kernel:  warn_alloc+0x137/0x160
Sep 03 16:29:10 pve02 kernel:  __alloc_pages_slowpath.constprop.0+0xdd0/0xe30
Sep 03 16:29:10 pve02 kernel:  __alloc_pages+0x308/0x320
Sep 03 16:29:10 pve02 kernel:  alloc_pages+0x9e/0x1e0
Sep 03 16:29:10 pve02 kernel:  kmalloc_order+0x2f/0xc0
Sep 03 16:29:10 pve02 kernel:  kmalloc_order_trace+0x1d/0x90
Sep 03 16:29:10 pve02 kernel:  __kmalloc+0x2ad/0x330
Sep 03 16:29:10 pve02 kernel:  veth_dev_init+0x88/0x120 [veth]
Sep 03 16:29:10 pve02 kernel:  register_netdevice+0x118/0x660
Sep 03 16:29:10 pve02 kernel:  ? get_random_bytes+0x43/0x90
Sep 03 16:29:10 pve02 kernel:  veth_newlink+0x1a1/0x410 [veth]
Sep 03 16:29:10 pve02 kernel:  __rtnl_newlink+0x76a/0xa20
Sep 03 16:29:10 pve02 kernel:  ? dmu_object_size_from_db+0x6c/0x80 [zfs]
Sep 03 16:29:10 pve02 kernel:  ? __cond_resched+0x1a/0x50
Sep 03 16:29:10 pve02 kernel:  ? mutex_lock+0x13/0x40
Sep 03 16:29:10 pve02 kernel:  ? __cond_resched+0x1a/0x50
Sep 03 16:29:10 pve02 kernel:  ? get_partial_node.part.0+0xdf/0x230
Sep 03 16:29:10 pve02 kernel:  rtnl_newlink+0x49/0x70
Sep 03 16:29:10 pve02 kernel:  rtnetlink_rcv_msg+0x160/0x410
Sep 03 16:29:10 pve02 kernel:  ? rtnl_calcit.isra.0+0x130/0x130
Sep 03 16:29:10 pve02 kernel:  netlink_rcv_skb+0x55/0x100
Sep 03 16:29:10 pve02 kernel:  rtnetlink_rcv+0x15/0x20
Sep 03 16:29:10 pve02 kernel:  netlink_unicast+0x221/0x330
Sep 03 16:29:10 pve02 kernel:  netlink_sendmsg+0x23f/0x4a0
Sep 03 16:29:10 pve02 kernel:  sock_sendmsg+0x65/0x70
Sep 03 16:29:10 pve02 kernel:  ____sys_sendmsg+0x257/0x2a0
Sep 03 16:29:10 pve02 kernel:  ? import_iovec+0x31/0x40
Sep 03 16:29:10 pve02 kernel:  ? sendmsg_copy_msghdr+0x7e/0xa0
Sep 03 16:29:10 pve02 kernel:  ___sys_sendmsg+0x82/0xc0
Sep 03 16:29:10 pve02 kernel:  ? wp_page_copy+0x2dc/0x570
Sep 03 16:29:10 pve02 kernel:  ? do_wp_page+0xef/0x300
Sep 03 16:29:10 pve02 kernel:  ? move_addr_to_user+0x4d/0xe0
Sep 03 16:29:10 pve02 kernel:  ? __handle_mm_fault+0xc5a/0x15c0
Sep 03 16:29:10 pve02 kernel:  __sys_sendmsg+0x62/0xb0
Sep 03 16:29:10 pve02 kernel:  __x64_sys_sendmsg+0x1f/0x30
Sep 03 16:29:10 pve02 kernel:  do_syscall_64+0x5c/0xc0
Sep 03 16:29:10 pve02 kernel:  ? exit_to_user_mode_prepare+0x37/0x1b0
Sep 03 16:29:10 pve02 kernel:  ? irqentry_exit_to_user_mode+0x9/0x20
Sep 03 16:29:10 pve02 kernel:  ? irqentry_exit+0x19/0x30
Sep 03 16:29:10 pve02 kernel:  ? exc_page_fault+0x89/0x160
Sep 03 16:29:10 pve02 kernel:  ? asm_exc_page_fault+0x8/0x30
Sep 03 16:29:10 pve02 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Sep 03 16:29:10 pve02 kernel: RIP: 0033:0x7f5c89a28e13
Sep 03 16:29:10 pve02 kernel: Code: 8b 15 b9 91 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
Sep 03 16:29:10 pve02 kernel: RSP: 002b:00007ffdf7f64f68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
Sep 03 16:29:10 pve02 kernel: RAX: ffffffffffffffda RBX: 000055e4c8f1a630 RCX: 00007f5c89a28e13
Sep 03 16:29:10 pve02 kernel: RDX: 0000000000004000 RSI: 00007ffdf7f64f90 RDI: 0000000000000008
Sep 03 16:29:10 pve02 kernel: RBP: 00007ffdf7f65020 R08: 000000000000000a R09: 0000000000000068
Sep 03 16:29:10 pve02 kernel: R10: 0000000000000004 R11: 0000000000000246 R12: 00007ffdf7f65150
Sep 03 16:29:10 pve02 kernel: R13: 000055e4c8f11a58 R14: 000055e4c8f17780 R15: 00007ffdf7f65020
Sep 03 16:29:10 pve02 kernel:  </TASK>
Sep 03 16:29:10 pve02 kernel: Mem-Info:
Sep 03 16:29:10 pve02 kernel: active_anon:18215031 inactive_anon:8137544 isolated_anon:0
 active_file:19425 inactive_file:16754 isolated_file:0
 unevictable:38868 dirty:138 writeback:0
 slab_reclaimable:255908 slab_unreclaimable:4421258
 mapped:45804 shmem:29525 pagetables:80469 bounce:0
 kernel_misc_reclaimable:0
 free:8821935 free_pcp:215 free_cma:0
Sep 03 16:29:10 pve02 kernel: Node 0 active_anon:72860124kB inactive_anon:32550176kB active_file:77700kB inactive_file:67016kB unevictable:155472kB isolated(anon):0kB isolated(file):0kB mapped:183216kB dirty:1056kB writeback:0kB shmem:118100kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3338240kB writeback_tmp:0kB kernel_stack:46784kB pagetables:321876kB all_unreclaimable? yes
Sep 03 16:29:10 pve02 kernel: Node 0 DMA free:11264kB min:0kB low:12kB high:24kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 03 16:29:10 pve02 kernel: lowmem_reserve[]: 0 2551 257499 257499 257499
Sep 03 16:29:10 pve02 kernel: Node 0 DMA32 free:1018944kB min:668kB low:3280kB high:5892kB reserved_highatomic:2048KB active_anon:1301716kB inactive_anon:270164kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2741616kB managed:2674808kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 03 16:29:10 pve02 kernel: lowmem_reserve[]: 0 0 254947 254947 254947
Sep 03 16:29:10 pve02 kernel: Node 0 Normal free:34256288kB min:950460kB low:1211524kB high:1472588kB reserved_highatomic:0KB active_anon:71558408kB inactive_anon:32280012kB active_file:77700kB inactive_file:67016kB unevictable:155472kB writepending:1304kB present:265534464kB managed:261073064kB mlocked:155472kB bounce:0kB free_pcp:1612kB local_pcp:0kB free_cma:0kB
Sep 03 16:29:10 pve02 kernel: lowmem_reserve[]: 0 0 0 0 0
Sep 03 16:29:10 pve02 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Sep 03 16:29:10 pve02 kernel: Node 0 DMA32: 2012*4kB (UM) 714*8kB (UM) 1170*16kB (UMH) 1225*32kB (UMEH) 973*64kB (UME) 912*128kB (UME) 831*256kB (UME) 557*512kB (UME) 258*1024kB (UME) 3*2048kB (M) 0*4096kB = 1018944kB
Sep 03 16:29:10 pve02 kernel: Node 0 Normal: 769151*4kB (UME) 1821559*8kB (UME) 1037974*16kB (UME) 120*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 34260500kB
Sep 03 16:29:10 pve02 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 03 16:29:10 pve02 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 03 16:29:10 pve02 kernel: 69988 total pagecache pages
Sep 03 16:29:10 pve02 kernel: 0 pages in swap cache
Sep 03 16:29:10 pve02 kernel: Swap cache stats: add 0, delete 0, find 0/0
Sep 03 16:29:10 pve02 kernel: Free swap  = 0kB
Sep 03 16:29:10 pve02 kernel: Total swap = 0kB
Sep 03 16:29:10 pve02 kernel: 67073019 pages RAM
Sep 03 16:29:10 pve02 kernel: 0 pages HighMem/MovableOnly
Sep 03 16:29:10 pve02 kernel: 1132211 pages reserved
Sep 03 16:29:10 pve02 kernel: 0 pages hwpoisoned
Sep 03 16:29:10 pve02 pvedaemon[1762147]: startup for container '108' failed
Sep 03 16:29:10 pve02 pvedaemon[3450029]: <root@pam> end task UPID:pve02:001AE363:07104002:631364B5:vzstart:108:root@pam: startup for container '108' failed
Sep 03 16:29:10 pve02 audit[1762176]: AVC apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762176 comm="apparmor_parser"
Sep 03 16:29:10 pve02 kernel: audit: type=1400 audit(1662215350.712:311): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762176 comm="apparmor_parser"
Sep 03 16:29:11 pve02 systemd[1]: pve-container@108.service: Main process exited, code=exited, status=1/FAILURE
Sep 03 16:29:11 pve02 systemd[1]: pve-container@108.service: Failed with result 'exit-code'.
Sep 03 16:29:16 pve02 pvedaemon[1762970]: starting CT 108: UPID:pve02:001AE69A:071042D7:631364BC:vzstart:108:root@pam:
Sep 03 16:29:16 pve02 pvedaemon[3409093]: <root@pam> starting task UPID:pve02:001AE69A:071042D7:631364BC:vzstart:108:root@pam:
Sep 03 16:29:17 pve02 systemd[1]: Started PVE LXC Container: 108.
Sep 03 16:29:17 pve02 audit[1762984]: AVC apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762984 comm="apparmor_parser"
Sep 03 16:29:17 pve02 kernel: audit: type=1400 audit(1662215357.684:312): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762984 comm="apparmor_parser"
Sep 03 16:29:17 pve02 pvedaemon[1762970]: startup for container '108' failed
Sep 03 16:29:17 pve02 pvedaemon[3409093]: <root@pam> end task UPID:pve02:001AE69A:071042D7:631364BC:vzstart:108:root@pam: startup for container '108' failed
Sep 03 16:29:17 pve02 audit[1762994]: AVC apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762994 comm="apparmor_parser"
Sep 03 16:29:18 pve02 kernel: audit: type=1400 audit(1662215357.912:313): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-108_</var/lib/lxc>" pid=1762994 comm="apparmor_parser"
Sep 03 16:29:18 pve02 systemd[1]: pve-container@108.service: Main process exited, code=exited, status=1/FAILURE
Sep 03 16:29:18 pve02 systemd[1]: pve-container@108.service: Failed with result 'exit-code'.

Sometimes the container starts fine after a couple of tries. Sometimes it helps to remove the NIC, but later on starting WITH the NIC works just fine.

I'm kinda at a loss here and would greatly appreciate any hints on where to look.

Thanks
Marie.
 
In case you are using ZFS, it might be that the host needs RAM but ZFS can't free the ARC fast enough, so the OOM killer triggers. By default the ARC grows to UP TO 50% of the host's total RAM, so up to 128 GB in your case. If that is what's happening, it might help to limit the ARC size as described here: https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage
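For reference, a minimal sketch of such a limit (the 16 GiB cap is only an example value, pick one that fits your workload):

Code:
# /etc/modprobe.d/zfs.conf
# Cap the ZFS ARC at 16 GiB (the value is in bytes: 16 * 1024^3)
options zfs zfs_arc_max=17179869184

After editing the file, run update-initramfs -u and reboot. The value can also be changed at runtime by writing it to /sys/module/zfs/parameters/zfs_arc_max.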

You can check the min/max/current ARC sizes with arc_summary.
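If arc_summary isn't available, the same counters can be read straight from the kstats, e.g.:

Code:
# current ARC size plus its min/max targets, in bytes
grep -E '^(size|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats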
 
Thank you very much.
I've amended the setting (which was set at the default) and will report back with my findings.
 
