Proxmox random VM / host crash-reboot since migrating to version 7.

tlex

Good day,
I've been clueless for the last 2-3 days: my TrueNAS VM keeps crashing randomly (with error logs), and sometimes the Proxmox 7 host itself reboots (without any error logs, or at least I don't know where to find them). This wasn't happening before, or at least I can't recall when it last happened, yet now it happens almost every 4-5 hours.

I have a PCI passthrough for that VM (SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]) with 8 disks attached, configured inside the VM as a ZFS RAID-Z2 volume with 1 spare.
The VM logs below suggest something disk-related, but when I ran smartctl -t long / smartctl -a on each disk individually, I couldn't find any errors. "All Functions" and "ROM-Bar" are disabled for that passthrough.
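For reference, this is roughly how I checked them (just a sketch; da0-da8 are the device names on my box, adjust to yours):

Bash:
# start a long self-test on each disk behind the HBA
for d in /dev/da{0..8}; do
    smartctl -t long "$d"
done

# once the tests have finished, look for logged errors / failed results
for d in /dev/da{0..8}; do
    echo "== $d =="
    smartctl -a "$d" | grep -Ei 'result|error'
done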

At this point I'm considering managing the ZFS volume on the host itself to avoid the passthrough, just to test whether it still crashes, but that would be a fair amount of effort for me. The main reason I did it this way initially is that TrueNAS sends clear email notifications when one of the disks is getting fragile / failing (that happened in the past), and I don't know if Proxmox itself can do that.
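Edit: I've since read that ZED (the ZFS Event Daemon) can send similar mails from the host. A minimal sketch, assuming the Proxmox box has a working sendmail-compatible mailer:

Bash:
# /etc/zfs/zed.d/zed.rc -- relevant settings
ZED_EMAIL_ADDR="root"           # recipient for fault/degrade events
ZED_NOTIFY_INTERVAL_SECS=3600   # rate-limit repeated notifications
ZED_NOTIFY_VERBOSE=1            # also mail on scrub completion

# apply the changes
systemctl restart zfs-zed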

Any advice / recommendations appreciated :)

VM configuration:
Code:
root@pve:/etc/pve/qemu-server# cat 1000.conf
agent: 1
bios: ovmf
boot: order=scsi0;ide2
cores: 16
hostpci0: 0000:04:00.0,rombar=0
hotplug: disk,network
ide2: local:iso/TrueNAS-12.0-U2.1.iso,media=cdrom,size=917476K
memory: 16384
name: TrueNas
net0: virtio=DE:DA:F0:37:CF:C9,bridge=vmbr0,firewall=1
numa: 1
ostype: l26
protection: 1
scsi0: R1_1.6TB_SSD_EVO860:vm-1000-disk-0,cache=writeback,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=zzz
sockets: 1
startup: order=2,up=30
vga: std
vmgenid: zzz

Here are some samples of the crashes I get in TrueNAS:

Code:
cat /data/crash/info.0
 
Dump header from device: /dev/da8p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 602112
  Blocksize: 512
  Compression: none
  Dumptime: Sun Jul 18 22:09:22 2021
  Hostname: truenas.zzzzzz
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-RELEASE-p6 df578562304(HEAD) TRUENAS
  Panic String: general protection fault
  Dump Parity: 4158703732
  Bounds: 0
  Dump Status: good


cat /data/crash/info.1


Dump header from device: /dev/da6p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 630784
  Blocksize: 512
  Compression: none
  Dumptime: Sat Jul 17 18:55:51 2021
  Hostname: truenas.zzzzzz
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-RELEASE-p6 df578562304(HEAD) TRUENAS
  Panic String: page fault
  Dump Parity: 1491579462
  Bounds: 1
  Dump Status: good


cat /data/crash/info.2
Dump header from device: /dev/da7p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 624128
  Blocksize: 512
  Compression: none
  Dumptime: Sun Jul 18 01:45:36 2021
  Hostname: truenas.zzzzzz
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-RELEASE-p6 df578562304(HEAD) TRUENAS
  Panic String: page fault
  Dump Parity: 1334293062
  Bounds: 2
  Dump Status: good


cat /data/crash/info.3
Dump header from device: /dev/da8p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 631296
  Blocksize: 512
  Compression: none
  Dumptime: Sun Jul 18 21:06:08 2021
  Hostname: truenas.zzzzzz
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-RELEASE-p6 df578562304(HEAD) TRUENAS
  Panic String: privileged instruction fault
  Dump Parity: 517715731
  Bounds: 4
  Dump Status: good

On the host itself, I can't find any error log. In the case below, the host rebooted at around 00:40.
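One more place worth checking is the journal from the boot that ended in the reset (this only works when persistent journaling is enabled, i.e. /var/log/journal exists):

Bash:
# list recorded boots; the one that crashed is index -1
journalctl --list-boots
# jump to the end of the previous boot's journal, kernel messages included
journalctl -b -1 -e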

kernel.log :
Code:
...
Jul 18 21:27:07 pve kernel: [31256.199735] fwbr1000i0: port 2(tap1000i0) entered forwarding state
Jul 19 00:44:02 pve kernel: [    0.000000] Linux version 5.11.22-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-2 (Fri, 02 Jul 2021 16:22:45 +0200) ()
...

syslog :
Code:
...
Jul 19 00:42:00 pve systemd[1]: Starting Proxmox VE replication runner...
Jul 19 00:42:01 pve systemd[1]: pvesr.service: Succeeded.
Jul 19 00:42:01 pve systemd[1]: Finished Proxmox VE replication runner.
-- Reboot --
Jul 19 00:44:00 pve kernel: Linux version 5.11.22-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-2 (Fri, 02 Jul 2021 16:22:45 +0200) ()
Jul 19 00:44:00 pve kernel: Command line: initrd=\EFI\proxmox\5.11.22-1-pve\initrd.img-5.11.22-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
...

messages :
Code:
...
Jul 18 21:27:07 pve kernel: [31256.199735] fwbr1000i0: port 2(tap1000i0) entered forwarding state
Jul 19 00:44:02 pve kernel: [    0.000000] Linux version 5.11.22-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.11.22-2 (Fri, 02 Jul 2021 16:22:45 +0200) ()
...
 
I read somewhere that it might be related to IOMMU, but I don't understand much about it or know what I could try. Anyway, here is how the IOMMU groups are split:
find /sys/kernel/iommu_groups/ -type l


Code:
/sys/kernel/iommu_groups/17/devices/0000:05:00.0
/sys/kernel/iommu_groups/7/devices/0000:00:18.3
/sys/kernel/iommu_groups/7/devices/0000:00:18.1
/sys/kernel/iommu_groups/7/devices/0000:00:18.6
/sys/kernel/iommu_groups/7/devices/0000:00:18.4
/sys/kernel/iommu_groups/7/devices/0000:00:18.2
/sys/kernel/iommu_groups/7/devices/0000:00:18.0
/sys/kernel/iommu_groups/7/devices/0000:00:18.7
/sys/kernel/iommu_groups/7/devices/0000:00:18.5
/sys/kernel/iommu_groups/15/devices/0000:03:00.0
/sys/kernel/iommu_groups/5/devices/0000:0c:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:08.0
/sys/kernel/iommu_groups/5/devices/0000:0b:00.2
/sys/kernel/iommu_groups/5/devices/0000:0b:00.0
/sys/kernel/iommu_groups/5/devices/0000:0c:00.1
/sys/kernel/iommu_groups/5/devices/0000:00:08.1
/sys/kernel/iommu_groups/5/devices/0000:0b:00.3
/sys/kernel/iommu_groups/5/devices/0000:0b:00.1
/sys/kernel/iommu_groups/5/devices/0000:0b:00.6
/sys/kernel/iommu_groups/5/devices/0000:00:08.2
/sys/kernel/iommu_groups/5/devices/0000:0b:00.4
/sys/kernel/iommu_groups/13/devices/0000:07:00.0
/sys/kernel/iommu_groups/13/devices/0000:02:09.0
/sys/kernel/iommu_groups/3/devices/0000:00:02.0
/sys/kernel/iommu_groups/11/devices/0000:02:06.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.2
/sys/kernel/iommu_groups/18/devices/0000:09:00.0
/sys/kernel/iommu_groups/18/devices/0000:09:00.1
/sys/kernel/iommu_groups/8/devices/0000:01:00.0
/sys/kernel/iommu_groups/16/devices/0000:04:00.0 (that's the HBA)
/sys/kernel/iommu_groups/6/devices/0000:00:14.3
/sys/kernel/iommu_groups/6/devices/0000:00:14.0
/sys/kernel/iommu_groups/14/devices/0000:08:00.0
/sys/kernel/iommu_groups/14/devices/0000:02:0a.0
/sys/kernel/iommu_groups/4/devices/0000:00:02.1
/sys/kernel/iommu_groups/12/devices/0000:06:00.0
/sys/kernel/iommu_groups/12/devices/0000:02:08.0
/sys/kernel/iommu_groups/12/devices/0000:06:00.3
/sys/kernel/iommu_groups/12/devices/0000:06:00.1
/sys/kernel/iommu_groups/2/devices/0000:00:01.3
/sys/kernel/iommu_groups/10/devices/0000:02:02.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/19/devices/0000:0a:00.0
/sys/kernel/iommu_groups/9/devices/0000:02:01.0

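To make that list easier to read, here is a small sketch that wraps the same sysfs paths in lspci and prints each group with device names (which confirms the HBA sits alone in group 16):

Bash:
# print every IOMMU group with a human-readable device description
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=${dev%/devices/*}              # /sys/kernel/iommu_groups/<N>
    echo "group ${group##*/}: $(lspci -nns "${dev##*/}")"
done | sort -V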
I also read that I should probably blacklist that card's driver on the host, but I don't know whether I should (what's your advice?). I didn't have to do that when I was running Proxmox 6.
 
Blacklisting mpt3sas ended up working for me: the system hasn't crashed / rebooted for the last 12 hours.
Surprisingly, I didn't have to do this on v6.

Bash:
echo 'blacklist mpt3sas' >> /etc/modprobe.d/mpt3sas.conf
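Note the blacklist only takes effect on the next boot, and only if the initramfs is rebuilt so the module isn't loaded early. What I'd expect to need (04:00.0 is my HBA's address):

Bash:
# rebuild the initramfs so the blacklist applies from early boot
update-initramfs -u -k all
reboot
# after the VM has started, the HBA should be claimed by vfio-pci:
lspci -nnk -s 04:00.0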
 
