Ubuntu 18 VM Locks up and maxes out CPU on hypervisor - Kernel and FS Errors in logs

jsalas424

I have had an Ubuntu 18 VM running ZoneMinder exclusively for a couple of months now. It has recently started locking up and causing the hypervisor CPU to max out. I have attached a picture from my Grafana dashboard; "TracheNode B Core Temps" is the temperature of the hypervisor. You'll see that this issue started around 2/6 and that the VM had been running stably the entire time before that.

PVE 7.1-10

I have included logs from around the latest crash. They contain some funky kernel messages that I can't interpret.

VM Config:
Code:
#scsi1%3A nodebzpool.1.local%3A420/vm-420-disk-2.qcow2,backup=0,discard=on,size=200G
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 2
cpu: kvm64,flags=+aes
efidisk0: local-zfs-dir:420/vm-420-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:01:00,pcie=1,x-vga=1
ide2: none,media=cdrom
machine: q35
memory: 10240
meta: creation-qemu=6.1.0,ctime=1637382539
name: ZoneMinderGPU
net0: virtio=26:77:A8:6E:5E:4F,bridge=vmbr0,tag=3
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs-dir:420/vm-420-disk-1.qcow2,discard=on,size=32G,ssd=1
scsi2: nodebzpool.1.hdd:420/vm-420-disk-2.qcow2,backup=0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=a15a0edc-bf44-480d-b910-1a21284c8f8b
sockets: 4
startup: order=5
vga: virtio
vmgenid: dcd95706-8e2f-431f-8411-6bc92d849eeb
vmstatestorage: local
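
For a quick check of which disks in a config like this have discard enabled, here is a minimal Python sketch. It assumes the config text has been copied out as plain text (for example from the VM's config file); the function names `parse_disk_options` and `disks_with_discard` are made up for illustration and are not part of any Proxmox tooling.

```python
import re

def parse_disk_options(config_text):
    """Return {disk_key: {option: value}} for scsiN/virtioN/sataN/ideN lines."""
    disks = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines and commented-out entries such as "#scsi1%3A ..."
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # Only keep real disk slots; "scsihw" and similar keys are ignored.
        if not re.fullmatch(r"(scsi|virtio|sata|ide)\d+", key):
            continue
        volume, *options = value.strip().split(",")
        parsed = {"volume": volume}
        for option in options:
            name, eq, val = option.partition("=")
            parsed[name] = val if eq else True
        disks[key] = parsed
    return disks

def disks_with_discard(config_text):
    """List the disk keys whose options include discard=on."""
    disks = parse_disk_options(config_text)
    return sorted(k for k, opts in disks.items() if opts.get("discard") == "on")
```

Run against the config above, this should report only `scsi0`, since `scsi2` has no `discard` option and the `scsi1` line is commented out.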

Syslog:
Code:
"timestamp","source","message"
"2022-02-12T14:39:01.000Z","zonemindergpu","(root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)"
"2022-02-12T14:39:01.000Z","zonemindergpu","pam_unix(cron:session): session closed for user root"
"2022-02-12T14:39:01.000Z","zonemindergpu","pam_unix(cron:session): session opened for user root by (uid=0)"
"2022-02-12T14:39:33.000Z","zonemindergpu","Started Clean php session files."
"2022-02-12T14:39:33.000Z","zonemindergpu","Starting Clean php session files..."
"2022-02-12T14:44:52.000Z","zonemindergpu","Started Cleanup of Temporary Directories."
"2022-02-12T14:44:52.000Z","zonemindergpu","Starting Cleanup of Temporary Directories..."
"2022-02-12T15:02:41.000Z","zonemindergpu","Normal exit (0 jobs run)"
"2022-02-12T15:02:41.000Z","zonemindergpu","Anacron 2.3 started on 2022-02-12"
"2022-02-12T15:02:41.000Z","zonemindergpu","Started Run anacron jobs."
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  ? kthread_create_worker_on_cpu+0x70/0x70"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_dentry_list+0xdb/0x320"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  ? mem_cgroup_shrink_node+0x190/0x190"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RIP: dentry_unlink_inode+0x43/0xe0 RSP: ffff9df402017bf0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  ret_from_fork+0x35/0x40"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  __dentry_kill+0xd4/0x170"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] ---[ end trace 4019d754d604056b ]---"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] CR2: 0000000000002008"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  kthread+0x121/0x140"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] Code: 07 48 c7 47 30 00 00 00 00 25 ff ff 8f fe 89 07 48 8b 87 b8 00 00 00 48 85 c0 74 29 48 8b 97 b0 00 00 00 48 85 d2 48 89 10 74 04 <48> 89 42 08 48 c7 83 b0 00 00 00 00 00 00 00 48 c7 83 b8 00 00"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_slab+0x29/0x30"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  super_cache_scan+0x104/0x1b0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  prune_dcache_sb+0x5a/0x80"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_node+0x117/0x2f0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] CR2: 0000000000002008 CR3: 00000002eb912000 CR4: 00000000000006e0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] Call Trace:"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_slab.part.49+0x1e7/0x440"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  kswapd+0x2b1/0x710"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] R13: ffff898d1423f058 R14: ffff898d1423f000 R15: ffff898d1423f080"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] CPU: 3 PID: 82 Comm: kswapd0 Tainted: P           O     4.15.0-167-generic #175-Ubuntu"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] R10: 00000000003623c2 R11: 00000000000000cb R12: ffff898c2a2f90e8"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  drm"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RSP: 0018:ffff9df402017bf0 EFLAGS: 00010206"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.698789] PGD 80000002ebe46067 P4D 80000002ebe46067 PUD 2ebf18067 PMD 0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RIP: 0010:dentry_unlink_inode+0x43/0xe0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RBP: ffff9df402017c00 R08: ffff9df402017c78 R09: ffff9df402017d68"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RAX: ffff898c2a2f9228 RBX: ffff898d1423f000 RCX: 0000000000000000"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] FS:  0000000000000000(0000) GS:ffff898d7fcc0000(0000) knlGS:0000000000000000"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.698855] Modules linked in: snd_hda_codec_hdmi nls_iso8859_1 nvidia_uvm(PO) snd_hda_intel snd_seq_midi snd_seq_midi_event snd_hda_codec snd_rawmidi nvidia(PO) snd_hda_core snd_hwdep snd_seq joydev snd_pcm snd_seq_device input_leds snd_timer snd mac_hid lpc_ich qemu_fw_cfg shpchp soundcore serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic pcbc virtio_gpu aesni_intel aes_x86_64 crypto_simd glue_helper ttm cryptd usbhid psmouse ahci drm_kms_helper syscopyarea virtio_net libahci sysfillrect hid virtio_scsi sysimgblt fb_sys_fops"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157] RDX: 0000000000002000 RSI: ffff9df400755af8 RDI: ffff898d1423f000"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.698830] Oops: 0002 [#1] SMP PTI"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.698710] BUG: unable to handle kernel paging request at 0000000000002008"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.698764] IP: dentry_unlink_inode+0x43/0xe0"
"2022-02-12T15:09:01.000Z","zonemindergpu","pam_unix(cron:session): session closed for user root"
"2022-02-12T15:09:01.000Z","zonemindergpu","(root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)"
"2022-02-12T15:09:01.000Z","zonemindergpu","pam_unix(cron:session): session opened for user root by (uid=0)"
"2022-02-12T15:09:33.000Z","zonemindergpu","Started Clean php session files."
"2022-02-12T15:09:33.000Z","zonemindergpu","Starting Clean php session files..."
"2022-02-12T15:17:01.000Z","zonemindergpu","pam_unix(cron:session): session closed for user root"
"2022-02-12T15:17:01.000Z","zonemindergpu","(root) CMD (   cd / && run-parts --report /etc/cron.hourly)"
"2022-02-12T15:17:01.000Z","zonemindergpu","pam_unix(cron:session): session opened for user root by (uid=0)"
"2022-02-12T15:28:33.000Z","zonemindergpu","Synchronized to time server 91.189.94.4:123 (ntp.ubuntu.com)."
"2022-02-12T15:28:33.000Z","zonemindergpu","enp6s18: Configured"
"2022-02-12T15:28:33.000Z","zonemindergpu","Network configuration changed, trying to establish connection."
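
Because the export above interleaves the oops lines out of order, it can help to pull out just the headline fields. The following is a rough Python sketch, assuming the log is a CSV export with the same `timestamp`, `source`, `message` columns as above; `summarize_oops` is a hypothetical helper name, and it only recognises the few line types shown in this log.

```python
import csv
import io
import re

# Matches the "[seconds.micros]" prefix the kernel puts on each message.
KERNEL_TS = re.compile(r"^\[\s*\d+\.\d+\]\s*(.*)$")

def summarize_oops(csv_text):
    """Extract the headline fields of a kernel oops from a CSV syslog export."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = KERNEL_TS.match(row.get("message") or "")
        if not match:
            continue  # not a kernel line (cron, systemd, ...)
        body = match.group(1)
        if body.startswith("BUG:"):
            summary["bug"] = body[len("BUG:"):].strip()
        elif body.startswith("IP:"):
            summary["ip"] = body[len("IP:"):].strip()
        elif body.startswith("CPU:"):
            comm = re.search(r"Comm: (\S+)", body)
            kernel = re.search(r"\d+\.\d+\.\d+-\S+", body)
            if comm:
                summary["comm"] = comm.group(1)
            if kernel:
                summary["kernel"] = kernel.group(0)
            summary["tainted"] = "Tainted:" in body
    return summary
```

For the log above, the fields it picks up are the faulting process (`kswapd0`), the guest kernel (`4.15.0-167-generic`), the faulting function (`dentry_unlink_inode+0x43/0xe0`), and the fact that the kernel is tainted, which matches the `nvidia(PO)` modules in the module list.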
Grafana Plots
1644855002212.png
1644855906756.png
 
After re-reading this segment of the error traceback:
Code:
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_slab+0x29/0x30"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  super_cache_scan+0x104/0x1b0"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  prune_dcache_sb+0x5a/0x80"
"2022-02-12T15:08:23.000Z","zonemindergpu","[88804.699157]  shrink_node+0x117/0x2f0"

I have now disabled the "Discard" option. The disk at scsi0 is on an SSD, in a directory storage with the path /rpool/data on the corresponding hypervisor. There are no known data errors for rpool and no SMART failures. The directory storage is configured as follows:
Code:
  GNU nano 5.4                                    /etc/pve/storage.cfg                                             
dir: local-zfs-dir
        path /rpool/data
        content snippets,iso,images,rootdir,vztmpl,backup
        nodes TracheNodeB,TracheServ
        shared 0

1644856236541.png
 
