PVE Crashing After 6 to 7 Upgrade

spl1974 · Jul 9, 2021

Hello,

I've been having multiple crashes a day after upgrading from pve 6 to 7. I first tried on my test system (a small i3 with 12GB of RAM) and it worked OK. This machine runs TrueNAS Scale for testing and supplies ISCSI storage to a VM on my main PVE system. I have passed through 2x 3TB drives without issue both on 6 and now pve 7 (but it's a very, very light workload).

Once my test system seemed OK I upgraded my main system (i7-8700K and Asus Motherboard, 64GB DDR4) from pve6 which has been running perfectly for about 2 years to pve7. I read through the wiki first, ran the pve6to7 --full and had no warnings.

The install went fine but now I'm having multiple crashes per day and nothing in the normal logs (I will have to look up how to enable additional logging).

My system right now:
i7-8700K (6 cores)
Linux 5.11.22-1-pve #1 SMP PVE 5.11.22-2
pve-manager/7.0-9
Proxmox VE updates Non production-ready repository enabled

Average load is low maybe 1.00 and a few percent utilized on the CPU. RAM is about 60% utilized.

I have disabled my only LXC container, and I also disabled a VM of TrueNAS Core that I use as backup (with 3x 8TB drives passed through) but it still crashes with them turned off. I hooked up a monitor and the last 2 times it crashed it was at the normal CLI page showing the Welcome page but was frozen.

Please let me what else may help to know. This is my homelab stuff and I think for the time-being I'm just going to leave it off as it stays up at like an hour or two and just goes down again.

Thanks

spl1974 · Jul 10, 2021

Quick update: I rebooted a while ago and selected an older kernel (Linux 5.4.124-1-pve) and it seems to be more stable. I shutdown my LXC and my VM running with hard drives passed through to it. Not sure what (if anything) has helped but I'm semi running for now.

If there are any suggestions on things to try let me know

spl1974 · Jul 13, 2021

After running stable for about a day on the older kernel (Linux 5.4.124-1-pve) I did a bios update to the latest because I read that may help. I also tried to setup dumps on a remote machine (but that failed). I started the latest 5.11 kernel and was up for about 4 or so hours before it crashed.

Keyboard locked up but took a picture (I couldn't get the end of the output to show)

Most of what I've seen has to do with AMD CPU's but in my case I'm on an i7 if that helps. My test system that's an i3 is still running OK (but mainly idle) so I don't know if it's a setting or something I need to change

Lutris · Jul 13, 2021

Experiencing the same after upgrading to pve 7.0.9 today. Have had 2 crashes now in a short time. Previously been running without much trouble.
When it crashes I can not ssh, connect to the webgui, ping it or get any response.

Info:
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
ASRock Z270M Extreme4
Corsair Vengance CMK32GX4M2A2666C16 - 64gigs total
1 SSD - VMS saved to LVM
Linux 5.11.22-1-pve #1 SMP PVE 5.11.22-2
pve-manager/7.0-9/228c9caa

Externally:
No monitor, keyboard or mouse connected.
I have a 4G usb modem connected and a CyberPower BR1000 UPS connected with USB, monitoring it through NUTS.

Internally:
I have a Dell PERC H310 (IT flashed) in pass-through mode
Also passing through the integrated intel 630 as a mediated device.

Running 5 vms with various OSes. System load is seemingly stable.

Troubleshooting:
Checking dmesg and journalctl at the time of the crash just shows an output of these characters ^@^@^@

I have installed kdump-tools and are currently waiting for a new crash to occour. Hopefully the kdump will show something.

GRUB

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on i915.enable_gvt=1"
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"

proxmox-ve: 7.0-2 (running kernel: 5.11.22-1-pve)
pve-manager: 7.0-9 (running version: 7.0-9/228c9caa)
pve-kernel-helper: 7.0-4
pve-kernel-5.11: 7.0-3
pve-kernel-5.4: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.11.22-1-pve: 5.11.22-2
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.1.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-4
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-9
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-2
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.1-1
proxmox-backup-file-restore: 2.0.1-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.2-4
pve-cluster: 7.0-3
pve-container: 4.0-8
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-10
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.4-pve1

VM info:

Code:

root@prox:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 nrp                  running    2500              50.00 5614
       101 hassos               running    10000             50.00 4962
       102 nginx                running    20000            200.00 5333
       103 tools-plex           running    20000            200.00 5418
       104 nas                  running    10000             80.00 5081

agent: 1
boot: order=ide2;scsi0;net0
cores: 1
ide2: none,media=cdrom
memory: 2500
name: nrp
net0: virtio=D2:E7:8A

4:35:F2,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-100-disk-0,discard=on,size=50G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=919d5277-7f4a-4778-841d-819db1327011
sockets: 1
vmgenid: 64c08681-920d-4cf0-b4b6-dd30f7d52fd6

agent: 1
bios: ovmf
bootdisk: sata0
cores: 1
efidisk0: data:vm-101-disk-0,size=4M
memory: 10000
name: hassos
net0: virtio=EE:B1:6C:69:96:B2,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
sata0: data:vm-101-disk-1,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=ce57c6db-8eb7-45e8-96a6-7e98e7e0d7e1
sockets: 1
startup: order=1
usb0: host=12d1:1506
usb1: host=3-2,usb3=1
vga: qxl
vmgenid: ca82ee3b-2af7-4d6d-85ec-6358a3a94b3d

agent: 1
bootdisk: scsi0
cores: 3
memory: 20000
name: nginx
net0: virtio=E2:07

B:F2:F4:48,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-102-disk-0,discard=on,size=200G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=a43becb6-1763-4f8a-84b5-9331fdf09e42
sockets: 1
startup: order=2
vga: virtio
vmgenid: 7fc97779-11d2-48e0-80da-637e2d0198b2

agent: 1
bios: seabios
bootdisk: scsi0
cores: 4
hostpci0: 0000:00:02.0,mdev=i915-GVTg_V5_4
memory: 20000
name: tools-plex
net0: virtio=A6

F:E3:89:A1:F4,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-103-disk-0,discard=on,size=200G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=6187827f-2b2c-4c32-b07c-bbc41aa8b6ea
sockets: 1
startup: order=2
vga: none
vmgenid: afe33a7c-d353-4d6c-8b7a-3406fc5d28ca

agent: 1
bootdisk: scsi0
cores: 2
hostpci0: 01:00
memory: 10000
name: nas
net0: virtio=2A:73:CF:5D:17:26,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-104-disk-0,size=80G
scsihw: virtio-scsi-pci
smbios1: uuid=c311ed73-cb92-4784-929a-1984ddd43609
sockets: 1
startup: order=1,up=30
vmgenid: 2319ef7f-d660-41af-87ec-a2eebeafc088

Edit: Just saw in a similar thread that it was suggested to specify the cpu type. I just changed them all to "host".

Edit2: Changing disks from no cache to Write Back is working as mentioned in this thread.

Been running some I/O tests with FIO and proxmox is crashing almost every run when cache is off, changing cache to write back and it is running stable for me as well.

FIO test parameters:

Code:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting

Edit3: The kdump-tools gives me the following error, so Im not able to catch the kernel panic in a dump. If anyone knows what to do, Ill be happy to run a couple of dumps to see if we can catch the problem.

Code:

Jul 14 03:17:17 prox kdump-tools[762]: Starting kdump-tools:
Jul 14 03:17:17 prox kdump-tools[769]: running makedumpfile -F -c -d 31 /proc/vmcore | compress > /var/crash/202107140317/dump-incomplete.
Jul 14 03:17:17 prox kdump-tools[787]: The kernel version is not supported.
Jul 14 03:17:17 prox kdump-tools[787]: The makedumpfile operation may be incomplete.

Search

Search

PVE Crashing After 6 to 7 Upgrade

spl1974

Member

spl1974

Member

spl1974

Member

Attachments

Lutris

Member