PVE 9 + Alpine Linux 'freeze'

Hello !

I have set up a new PVE host using virtiofs on multiple VMs (mostly Alpine and 1 Debian).

On one VM I get some kind of 'stall' or 'freeze'.

When it occurs, the guest agent can't operate, and my SSH session doesn't respond to keyboard input.
I can open a shell from the PVE GUI, but even there I can't issue a reboot.
I have to do a STOP, then start the VM again… until the next freeze.

I pasted my dmesg log here (30 days):
https://paste.zogg.fr/?512336db9d5015ed#EATin1MAseHQpZm1hbqoyTJjTYg6czSM7LsnMcVbAwhe

And the package versions here (30 days):
https://paste.zogg.fr/?4df8e153154de9ee#8ovkGgWr7G8DHdtJ8iFyx5jvSVUq5bgqrkQbwBSdc2Md

What I noticed in the logs is:
Code:
[ 2363.063143] VFS: Busy inodes after unmount of virtiofs (virtiofs)
[ 2363.063233] WARNING: CPU: 3 PID: 19836 at fs/super.c:650 generic_shutdown_super+0x11e/0x190

I need help to find and fix whatever is causing this :)

2025-11-03: Edit: since it doesn't seem to be virtiofs related, I changed the title.
 

I forgot to mention that on my PVE host I have entries like this in my fstab:

Code:
# 2025-10-26
# (disks)
UUID=e6300f71-32bc-459e-b947-e085f49d6596 /mnt/ssd1 ext4 defaults,errors=remount-ro,discard,relatime 0 2
#
# 2025-10-26
# (mount points aliases)
#
/mnt/ssd1/opt/docker /shares/docker none defaults,x-systemd.requires=/mnt/ssd1,bind 0 0

This is to create a single virtiofsd share folder.
And this folder structure is independent of where my folders/files are really stored :p
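
The resulting bind mount can be checked on the host with findmnt, for example:
Code:
# shows the target, source and options of the bind mount behind the share
findmnt /shares/docker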
 
Hi,
are there any messages in the host's system journal around the time of the issue? Is the virtiofsd process still running on the host when the issue occurs? Please share the configuration of an Alpine VM and of a Debian VM with qm config ID replacing ID with the actual ID of the VM.
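
For example (the timestamps are placeholders, adjust them to around the time of the freeze):
Code:
# host journal around the time of the issue
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM"
# is the virtiofsd instance for the VM still running?
pgrep -af virtiofsd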
 
I didn't debug on the host because all the other VMs work fine when this one crashed.
PVE GUI / SSH are OK.

qm config ID (alpine)
Code:
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide2
cores: 4
cpu: x86-64-v4
description: [REDACTED]
efidisk0: ssd1:104/vm-104-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: q35
memory: 10240
meta: creation-qemu=10.0.2,ctime=1757683113
name: raijin
net0: virtio=BC:24:11:FD:13:B0,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: ssd1:104/vm-104-disk-1.qcow2,cache=writeback,iothread=1,size=64G
scsihw: virtio-scsi-single
smbios1: uuid=eccc4076-bd4e-4194-8b19-cb814da48bb0
sockets: 1
startup: order=6,up=60
tablet: 0
tags: alpine;ipv4
virtiofs0: SHARES,cache=always
vmgenid: ba1988d1-de4d-4705-aeeb-1d974e6e6dd9

I dug more into testing this.
This VM is the one with the most containers.
I used a docker-compose template with a common tmpfs.

I removed the size on these… It seems to crash…
I'm testing with the size set, and 'it seems' not to crash… so far :)

So now... I'm just waiting under normal usage to see if it happens again.
But if that's the case, I don't see the point...
 
I didn't debug on the host because all the other VMs work fine when this one crashed.
It's still worth checking the logs. There might be relevant information there.

I removed the size on these… It seems to crash…
I'm testing with the size set, and 'it seems' not to crash… so far :)
What size are you talking about exactly?

Are you passing the same virtiofs directory to all VMs?
 
I set up a folder on PVE with bind mounts from different disks.
This allows me to create a 'simple repo', i.e. a base folder with all my needed mounts.
This way I only have 1 virtiofsd share & mount.
Yes, it is passed to all my VMs; only 1 gets stuck.

What size are you talking about exactly?

When you declare a tmpfs in a Docker container you can cap its memory or not; if you don't, the default is used (half of system memory).
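
For example, with the long mount syntax in a compose file (the service name and size value are placeholders, not my actual template):
Code:
services:
  app:
    image: alpine:3.20
    volumes:
      - type: tmpfs
        target: /tmp
        tmpfs:
          # 256 MiB cap; without "size" the tmpfs defaults to half of the VM's RAM
          size: 268435456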

I left everything running for hours... doing a stressful backup, and no bug appeared.

Now... I'm not sure it's related to PVE... If it happens again I'll grab the VM logs and PVE logs to be more precise!
 
Is the time synchronized between VM and host?

The issue occurs here in the VM
Code:
[Tue Oct 28 05:01:11 2025] br-47e6e28694ca: port 1(veth2b6f3a5) entered disabled state
[Tue Oct 28 05:01:11 2025] vetha61686f: renamed from eth0
[Tue Oct 28 05:01:11 2025] br-47e6e28694ca: port 1(veth2b6f3a5) entered disabled state
[Tue Oct 28 05:01:11 2025] veth2b6f3a5 (unregistering): left allmulticast mode
[Tue Oct 28 05:01:11 2025] veth2b6f3a5 (unregistering): left promiscuous mode
[Tue Oct 28 05:01:11 2025] br-47e6e28694ca: port 1(veth2b6f3a5) entered disabled state
[Tue Oct 28 05:01:22 2025] ------------[ cut here ]------------
[Tue Oct 28 05:01:22 2025] VFS: Busy inodes after unmount of virtiofs (virtiofs)
[Tue Oct 28 05:01:22 2025] WARNING: CPU: 3 PID: 13420 at fs/super.c:650 generic_shutdown_super+0x11e/0x190
...
[Tue Oct 28 05:01:22 2025] CPU: 3 UID: 999 PID: 13420 Comm: node Not tainted 6.12.54-0-virt #1-Alpine
after something happens with a bridge/eth device (the command doing the unmount is, I guess, node)

The host dmesg doesn't have anything around that time. Are other VMs running the same kernels and software? If the issue always occurs with this single VM, it might be a kernel bug there triggered by the specific workload.
 
PVE & the VMs are on the UTC timezone with the same chronyd time sync.
So I guess they are time-synchronized.
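
For example, checked on both the host and in the VM with:
Code:
# compare offset/stratum and the current UTC time on host and guest
chronyc tracking
date -u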

I set up a master Alpine Linux VM, then cloned it multiple times.
The one having this problem is a clone.

All my other VMs (same setup) work well and I haven't noticed this 'bug' on them.
All the VMs are up to date with their packages, including the kernel.

I noticed
Code:
VFS: Busy inodes after unmount of virtiofs (virtiofs)
But this is not issued by a 'human' command ;p

I had to disable hardware offloading due to the e1000 bug (it caused PVE to hang).
I did this using a 'Proxmox helper script'. No hangs anymore since then.
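
(As far as I can tell, that workaround boils down to disabling offloads on the physical NIC, something like this; eno1 is just an example interface name:)
Code:
# e.g. run as a post-up hook on the host NIC in /etc/network/interfaces
ethtool -K eno1 tso off gso off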

My VMs are mainly Alpine Linux + Docker containers.
I can't find any specific software running on the VM when the 'bug' occurs.

I understand how difficult it is to debug this.
I haven't found any recurring process… except maybe 'resource usage' causing this.
But I don't notice any OOM.

I'm open to suggestions :)
 
Hi, I think I got the same thing this morning too: one of my VMs froze two times, and the only way to input anything in the console, either SSH or xterm.js, was to restart it. I'm also running PVE 9, kernel 6.14.11-4-pve, and the VM is a Debian 13 cloud-init image also running Docker. I'll try to take a look at the logs next time.
 
Hi,
Hi, I think I got the same thing this morning too: one of my VMs froze two times, and the only way to input anything in the console, either SSH or xterm.js, was to restart it. I'm also running PVE 9, kernel 6.14.11-4-pve, and the VM is a Debian 13 cloud-init image also running Docker. I'll try to take a look at the logs next time.
why wait for the next time? You could look at the guest and host logs now to get an idea about what happened. Are you also using virtiofs? Please share the VM configuration.
 
Hi,

why wait for the next time? You could look at the guest and host logs now to get an idea about what happened. Are you also using virtiofs? Please share the VM configuration.
Hello, I was wrong and SSH is still working; the only thing that breaks sometimes is xterm.js. VM config:
Code:
root@pve:~# qm config 106
agent: 1
boot: order=virtio0
cipassword: ****
ciuser: ****
cores: 8
cpu: EPYC-Genoa
description:
ide0: local-lvm:vm-106-cloudinit,media=cdrom,size=4M
ide2: none,media=cdrom
ipconfig0: ip=192.168.100.30/24,gw=192.168.100.1
memory: 24576
meta: creation-qemu=9.2.0,ctime=1748170135
name: vm-mine-pool
nameserver: 1.1.1.1 208.67.222.222 8.8.4.4
net0: virtio=BC:24:11:19:BA:10,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=4320ea66-98aa-40f3-9f45-05322411d382
sockets: 1
virtio0: local-lvm:vm-106-disk-0,iothread=1,size=610G
vmgenid: 6d193546-ba7e-4f72-9924-5ce87b266dcf
 
Update…

Switched the VM CPU to 'host' to do some testing.
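(On the host that is something like:)
Code:
qm set 104 --cpu host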

Code:
[Sat Nov  1 10:29:23 2025] watchdog: BUG: soft lockup - CPU#2 stuck for 23250s! [khugepaged:45]
[Sat Nov  1 10:29:23 2025] Modules linked in: tls nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat x_tables nf_tables libcrc32c nfnetlink xfrm_user xfrm_algo zram lz4_compress overlay 8021q mrp tcp_bbr sch_fq nls_utf8 nls_cp437 vfat fat af_packet mousedev psmouse efi_pstore snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd bochs drm_vram_helper drm_ttm_helper ttm kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic aesni_intel gf128mul crypto_simd cryptd qemu_fw_cfg evdev button efivarfs virtio_net net_failover failover virtiofs fuse virtio_scsi virtio_balloon sr_mod cdrom crc32_pclmul uhci_hcd ehci_pci ehci_hcd simpledrm drm_shmem_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_kms_helper drm i2c_core drm_panel_orientation_quirks fb loop ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2
[Sat Nov  1 10:29:23 2025]  usb_storage usbcore usb_common sd_mod
[Sat Nov  1 10:29:23 2025] CPU: 2 UID: 0 PID: 45 Comm: khugepaged Tainted: G      D W    L     6.12.55-0-virt #1-Alpine
[Sat Nov  1 10:29:23 2025] Tainted: [D]=DIE, [W]=WARN, [L]=SOFTLOCKUP

Code:
watchdog: BUG: soft lockup - CPU#2 stuck for 23250s! [khugepaged:45]

Maybe related to memory allocation?

Used allocator: mimalloc2-insecure (alpine linux)
On PVE I didn't use mimalloc.

Huge pages disabled:
Code:
vm.nr_hugepages=0
vm.nr_hugepages_mempolicy=0
vm.hugepages_treat_as_movable=0
vm.nr_overcommit_hugepages=0
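
Side note: khugepaged (the task named in the soft lockup) is the transparent hugepage daemon, which the sysctls above (they only control the static hugepage pool) don't touch; the THP state inside the guest can be checked with:
Code:
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag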

I found nothing advising against using another allocator or disabling hugepages.

I'll try to revert to the standard allocator and keep hugepages enabled... to see how it goes.
 

Switched the VM CPU to 'host' to do some testing.
Code:
[Sat Nov  1 10:29:23 2025] watchdog: BUG: soft lockup - CPU#2 stuck for 23250s! [khugepaged:45]
The error message is completely different now and not directly related to VirtioFS.
Used allocator: mimalloc2-insecure (alpine linux)
Sounds like a potential candidate. Since you changed the CPU type, it could also be related to having CPU type host. Do you have the latest CPU microcode installed: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu ?
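
For example, the currently loaded microcode revision shows up in the kernel log and in /proc/cpuinfo:
Code:
journalctl -k | grep -i microcode
grep -m1 microcode /proc/cpuinfo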
 
I think, with all the debugging done on my side... it's not related to virtiofs.

Code:
Do you have the latest CPU microcode
Always :p
geek since 1983

Since then, I have reverted to the normal allocator, removed all my sysctls and switched back to x86-64-v4.
I'm now in a standard configuration without 'tuning'.

I'm checking to see how it evolves.
 