VMs (imported from ESXi) loosing access to disks/SCSI bus under heavy load

bluethunder · Jan 21, 2025

Hi.

We have recently installed Proxmox 8.3.2 on a new Dell server. We have imported some VMs from one of our ESXi nodes which is going to be scrapped.

Everything went smooth during the import and on the few hours of operations but we have noticed that two out of eight VMs become stuck after 20-24 hours of operation. It looks like they are loosing access to their drives. We have restarted both of them and one was stuck few hours later again.

Below is the log from one of our VMs:

Code:

Jan 20 17:36:04 app-log kernel: scsi target2:0:6: No MSG IN phase after reselection
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: [sda] tag#335 ABORT operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: ABORT operation timed-out.
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: [sda] tag#334 ABORT operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: ABORT operation timed-out.
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: [sda] tag#333 ABORT operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: ABORT operation timed-out.
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: [sda] tag#332 ABORT operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: ABORT operation timed-out.
(...)
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: [sda] tag#335 DEVICE RESET operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: DEVICE RESET operation timed-out.
Jan 20 17:40:24 app-log kernel: sd 2:0:1:0: [sdb] tag#339 DEVICE RESET operation started
Jan 20 17:40:24 app-log kernel: sd 2:0:1:0: DEVICE RESET operation timed-out.
(...)
Jan 20 17:40:24 app-log kernel: sd 2:0:6:0: BUS RESET operation complete.
Jan 20 17:40:24 app-log kernel: sd 2:0:6:0: Power-on or device reset occurred
Jan 20 17:40:24 app-log kernel: sd 2:0:0:0: Power-on or device reset occurred

This time the OS was able to handle this issue and VM kept working, but same errors occurred during next 4 hours which eventually resulted in system complete freeze.

Virtual SCSI controller: LSI 53C895A
QEMU agent is installed in VM and it's enabled in Proxmox.
No swap on the host.

No logs from hosts which indicate any problems.

Hardware below is DELL PERC HW RAID with RAID5 SSD disks.

fabian · Jan 21, 2025

could you please post
- pveversion -v
- the VM config in question
- /etc/pve/storage.cfg

do the issues correlate with load on the host/storage?

bluethunder · Jan 21, 2025

fabian said:
could you please post
- pveversion -v
- the VM config in question
- /etc/pve/storage.cfg

do the issues correlate with load on the host/storage?

pveversion -v:

Code:

proxmox-ve: 8.3.0 (running kernel: 6.8.12-6-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-6
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
amd64-microcode: 3.20240820.1~deb12u1
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

VM config:

Code:

agent: 1
bios: seabios
boot: order=scsi0;scsi1;scsi2;scsi3;scsi4;scsi5;scsi6;scsi7;scsi8;scsi9
cores: 24
cpu: x86-64-v2-AES
memory: 32768
meta: creation-qemu=9.0.2,ctime=1737153872
name: app-log
net0: virtio=00:0c:29:eb:41:ec,bridge=vmbr0
onboot: 1
ostype: l26
scsi0: ssd-raid5:vm-122-disk-0,size=16G
scsi1: ssd-raid5:vm-122-disk-1,size=750G
scsi2: ssd-raid5:vm-122-disk-2,size=700G
scsi3: ssd-raid5:vm-122-disk-3,size=200G
scsi4: ssd-raid5:vm-122-disk-4,size=16G
scsi5: ssd-raid5:vm-122-disk-5,size=205G
scsi6: ssd-raid5:vm-122-disk-6,size=210G
scsi7: ssd-raid5:vm-122-disk-7,size=215G
scsi8: ssd-raid5:vm-122-disk-8,size=400G
scsi9: ssd-raid5:vm-122-disk-9,size=600G
scsihw: lsi
smbios1: uuid=564df0aa-afc4-aeb8-a415-fd6e29eb41ec
sockets: 1
vmgenid: 3c6bac6f-12b6-43dd-91c2-4c25ffcc6dbf

storage.cfg:

Code:

lvm: ssd-raid10
        vgname ssd-raid10
        content rootdir,images
        saferemove 0
        shared 0

dir: local
        path /var/lib/vz
        content iso,backup,vztmpl,snippets,rootdir,images
        prune-backups keep-all=1

lvm: ssd-raid5
        vgname ssd-raid5
        content rootdir,images
        saferemove 0
        shared 0

The issues do correlate with storage load, but this is THE VM which causes most of the load. This VM worked fine without any issues few days ago on ESXi node, which was running on very similar hardware (same HW RAID config, similar SSDs, only real difference was CPU manufacturer: was Intel, now it's AMD).

This VM has a Elasticsearch instance which collects log from our application. This instance gathers like 100-150M messages per day, which isn't very much to be honest (like 75-100G of data per day).

fabian · Jan 21, 2025

I'd try enabling io threads for the disks!

bluethunder · Jan 21, 2025

fabian said:
I'd try enabling io threads for the disks!

I did so and looks like it's working. Thanks!

We have found out that we can issue an query to Elasticsearch which searches across all of our data at that crashes VM each time. After changing to io threads it's working fine. Time will tell if this is permanent fix or not.

Is it possible to set "threads" as default option for all new disks created in Promox? Or we need to select it each time?

fabian · Jan 21, 2025

needs to be enabled for each disk, yes.

bluethunder · Jan 21, 2025

Unfortunately it crashed again. We have tried to set "Async IO" to "threads" which looked like it did improve the situation, but large ES query still crashes the VM (now it doesn't do this every time, but this is still unacceptable).

Any more ideas?

EDIT: Now we have set disk settings just like they are set for new VM: "IO Thread": checked, "Async IO": "uring", SCSI ctrl: VirtIO single. We will leave this for next two or three days and report back.

fabian · Jan 21, 2025

what is the LVM backing storage?

bluethunder · Jan 21, 2025

fabian said:
what is the LVM backing storage?

Below LVM there is PERC H755 controller with 6x SSD`s MTFDDAK1T9TGA-1B configured as RAID 5.

fabian · Jan 21, 2025

okay, so local

Search

Search

VMs (imported from ESXi) loosing access to disks/SCSI bus under heavy load

bluethunder

New Member

fabian

Proxmox Staff Member

bluethunder

New Member

fabian

Proxmox Staff Member

bluethunder

New Member

fabian

Proxmox Staff Member

bluethunder

New Member

fabian

Proxmox Staff Member

bluethunder

New Member

fabian

Proxmox Staff Member

We value your privacy