[SOLVED] FreeBSD VMs' deinit issues under latest Proxmox

Hello,

Since upgrading from Proxmox VE 6.4 to 7.1 we have experienced many issues with FreeBSD VMs. Please keep in mind that we had no issues on 6.4.

For example, on our dovecot server (FreeBSD 13, latest patch release), during busier times dovecot processes start getting stuck in deinit, and the load average jumps from below 1 to over 200. Nothing is actually using any CPU, so this definitely looks like a disk I/O issue. Dovecot processes just keep starting and never exit, and we end up with thousands of processes stuck in deinit. The only solution is to reboot the VM.
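
For anyone hitting the same thing, this is roughly how I confirm the pile-up from inside the guest (the grep pattern matches the state dovecot reports in its process titles, and the PID is just a placeholder):

Code:
# count dovecot workers whose title reports deinit
ps axww | grep -c '[d]einit'
# inspect what one of them is blocked on (wait channel); 12345 is a placeholder PID
procstat -t 12345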

There are no errors logged anywhere, neither on the Proxmox host nor in the FreeBSD VM.

The Proxmox summary page shows excessively high CPU usage and disk I/O, while network traffic is low:

[Attachments: cpu.png, disk.png, network.png]

This happens with both raw and qcow2 VM disks. I have tried switching from the default io_uring to native and threads, as well as combinations of no cache and writeback, using VirtIO SCSI single.
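
For anyone wanting to try the same combinations, this is roughly how I switched them with qm set (VM 188 and the volume name match the config further down; aio/cache changes only take effect after a full stop and start of the VM):

Code:
# native AIO with no cache on the big data disk
qm set 188 --scsi1 storage:vm-188-disk-0,aio=native,cache=none,discard=on,format=raw,size=2T,ssd=1
# threads + writeback (the combination currently in the config)
qm set 188 --scsi1 storage:vm-188-disk-0,aio=threads,cache=writeback,discard=on,format=raw,size=2T,ssd=1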

On the FreeBSD VM, I have tried different timecounters, from HPET and TSC-low to kvmclock.
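
Switching the timecounter is done inside the guest via sysctl; a quick sketch (which choices are actually available depends on the VM's hardware config):

Code:
# list the timecounters the guest detected
sysctl kern.timecounter.choice
# switch at runtime, e.g. to TSC-low
sysctl kern.timecounter.hardware=TSC-low
# persist across reboots
echo 'kern.timecounter.hardware=TSC-low' >> /etc/sysctl.conf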

I've also disabled memory ballooning, just in case.

I have also tried different CPU types, from host (the actual host processor) to kvm64.
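
Both of those are plain qm set changes, for reference (again VM 188; CPU type changes need a cold restart to apply):

Code:
# disable ballooning
qm set 188 --balloon 0
# try the generic CPU model instead of passing the host CPU through
qm set 188 --cpu kvm64
# and back to host with the AES flag, as in the current config
qm set 188 --cpu host,flags=+aes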

This happens randomly, usually during busier times. Sometimes it happens within a few hours, sometimes it takes days.

I have also tried pve-kernel-5.13.19-1-pve and pve-kernel-5.15.7-1-pve.
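
In case it helps anyone reproduce this: the kernels the bootloader knows about can be listed with proxmox-boot-tool, and a specific version installed via apt; which one actually boots is then picked from the boot menu on the next reboot (a sketch; newer pve-kernel-helper versions may also offer a pin subcommand, but I'm not relying on that here):

Code:
# see which kernels the bootloader knows about
proxmox-boot-tool kernel list
# make sure a specific kernel package is installed
apt install pve-kernel-5.15.7-1-pve
# then select it in the GRUB/systemd-boot menu at boot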

I believe this has something to do with the I/O error issue that was occurring on Linux VMs; pve-qemu-kvm_6.1.0-3 doesn't fix it on FreeBSD VMs.

vm config:

Code:
qm config 188
agent: 1
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 24
cpu: host,flags=+aes
machine: q35
memory: 49152
name: garibaldi
net0: virtio=0A:06:9A:F4:7A:01,bridge=vmbr0,firewall=1,queues=8
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-188-disk-0,aio=threads,cache=writeback,discard=on,format=raw,size=256G,ssd=1
scsi1: storage:vm-188-disk-0,aio=threads,backup=0,cache=writeback,discard=on,format=raw,size=2T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4479ea4e-6825-42fd-bee4-1a194dacf635
sockets: 1
vmgenid: 9ead784d-701c-4817-8614-9cc019ebe2f6

pveversion:

Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-5.15: 7.1-7
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.4: 6.4-11
pve-kernel-5.15.7-1-pve: 5.15.7-1
pve-kernel-5.15.5-1-pve: 5.15.5-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.157-1-pve: 5.4.157-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3


Also notice that the disk-written counter is at almost 3.4 TB while incoming network traffic is only 6.7 GB, so it should be impossible for the guest to have legitimately written that much data:

Code:
qm status 188 --verbose
blockstat:
        scsi0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 1
                flush_total_time_ns: 269077
                idle_time_ns: 1478509445
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 12456033280
                rd_merged: 0
                rd_operations: 422602
                rd_total_time_ns: 145656193055
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 80045617152
                wr_highest_offset: 268511662080
                wr_merged: 0
                wr_operations: 1967992
                wr_total_time_ns: 226722352806
        scsi1:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                flush_operations: 1
                flush_total_time_ns: 382650
                idle_time_ns: 909307514
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                rd_bytes: 58119712768
                rd_merged: 0
                rd_operations: 1252234
                rd_total_time_ns: 383142803225
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 3310958258176
                wr_highest_offset: 2188121796608
                wr_merged: 0
                wr_operations: 100482026
                wr_total_time_ns: 11465880717734
cpus: 24
disk: 0
diskread: 70575746048
diskwrite: 3391003875328
maxdisk: 274877906944
maxmem: 51539607552
mem: 45357627372
name: garibaldi
netin: 6770949174
netout: 42921508171
nics:
        tap188i0:
                netin: 6770949174
                netout: 42921508171
pid: 3885990
proxmox-support:
        pbs-dirty-bitmap: 1
        pbs-dirty-bitmap-migration: 1
        pbs-dirty-bitmap-savevm: 1
        pbs-library-version: 1.2.0 (6e555bc73a7dcfb4d0b47355b958afd101ad27b5)
        pbs-masterkey: 1
        query-bitmap-info: 1
qmpstatus: running
running-machine: pc-q35-6.1+pve0
running-qemu: 6.1.0
status: running
uptime: 94978
vmid: 188
 
I have seen similar situations on my mail servers (exim + dovecot). Load gets high and queues get stuck. People can no longer fetch mail (not only does no new mail come in, connections to dovecot simply time out).
But the same issue occurs with other VMs (my poudriere build box, for instance).
I also tried different cache and async I/O settings.

In 99% of cases, manually issuing a sync from inside the VM 'fixed' things, and processes all of a sudden resumed processing their workloads.

It is as if the OS never gets around to actually flushing data to disk, filling up queues and buffers, which causes userland processes to wait in vain until a timeout occurs?
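
That theory is at least checkable from inside the guest: FreeBSD exposes the dirty buffer count via sysctl, so a crude watchdog can log when buffers pile up and force a sync. A minimal sketch, where the threshold of 1000 is arbitrary and just for illustration:

Code:
#!/bin/sh
# watch for dirty buffers piling up and force a flush (threshold is arbitrary)
while :; do
    dirty=$(sysctl -n vfs.numdirtybuffers)
    if [ "$dirty" -gt 1000 ]; then
        logger "numdirtybuffers=${dirty}, forcing sync"
        sync
    fi
    sleep 30
done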

Since I have not seen these issues with FreeBSD 12 based VMs yet (though the few I still have are not busy, so it may just be coincidence), my working assumption has been that FreeBSD 13 introduced some change or optimization that somehow triggers this.
But from what you describe, it was not an issue with older qemu/kvm?
 