After proxmox updates: Random segmentation faults in AlmaLinux 8 VMs on Proxmox VE 9.1

Is there anything noticeable (besides the segfaults) in the guest system logs starting from the boot after the upgrades were applied? Since the disk format might play a role: IO or filesystem errors?
 
Hello

After converting the disk from qcow2 to raw on a VM that had been hitting segmentation faults, this is the uptime so far:

$ uptime
07:57:08 up 16:43, 1 users, load average: 0.05, 0.08, 0.02

No segmentation faults have been reported since.

I can't say for sure whether a conversion is necessary, or whether a live migration of the disk alone would also solve the problem.

I'm here if I can help with anything else.

Best regards,
A.H.
 
Is there anything noticeable (besides the segfaults) in the guest system logs starting from the boot after the upgrades were applied? Since the disk format might play a role: IO or filesystem errors?


I checked the affected AlmaLinux 8 guests for guest-side I/O or filesystem errors around the failures.

I did not find disk or filesystem errors in the guest logs.

The logs mainly showed the segmentation fault messages. I did not see messages such as Buffer I/O errors, `blk_update_request` errors, ext4 corruption reports, read-only remounts, virtio-scsi resets, or similar disk-related errors inside the guest.

I had also run filesystem checks and they did not report filesystem errors.
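For anyone who wants to repeat this check on their own guests, here is a minimal sketch of such a log scan. The function name `scan_disk_errors` is hypothetical, and the patterns are only assumptions about how these kernel messages are typically worded, so they may need adjusting for a given kernel version.

```shell
# Hypothetical helper: filters kernel log lines on stdin for common
# disk and filesystem error messages. The pattern list is an assumption
# and is not exhaustive.
scan_disk_errors() {
    grep -Ei 'Buffer I/O error|blk_update_request|EXT4-fs error|Remounting filesystem read-only'
}

# Usage inside the guest (kernel messages of the current boot):
#   journalctl -k -b --no-pager | scan_disk_errors
# No output means that none of the patterns matched.
```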

Important update: yesterday I converted the affected VMs from `qcow2` disks to `raw`, as recommended by @fever_wits. Since then, there have been no further segmentation faults on those AlmaLinux 8 VMs so far. This is now roughly from 17:00 yesterday until this morning.

So at the moment the strongest correlation I can see is with the virtual disk/image format path rather than with a damaged guest filesystem.
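The thread does not state which method was used for the conversion. One common way on Proxmox VE is to move the disk while requesting a different target format. The following is a minimal sketch, assuming VM ID 100, disk `scsi0` and a file-based storage named `storage` (all placeholders). The wrapper `convert_to_raw` is hypothetical and only prints the command unless `RUN=1` is set.

```shell
# Hypothetical wrapper around "qm disk move". It prints the command by
# default and only executes it when RUN=1 is set in the environment.
convert_to_raw() {
    vmid=$1
    disk=$2
    storage=$3
    # Build the command line once so that the printed and the executed
    # variants cannot drift apart.
    set -- qm disk move "$vmid" "$disk" "$storage" --format raw
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Dry run with placeholder values:
convert_to_raw 100 scsi0 storage
```

Without the `--delete` option the source image is kept as an unused disk, so it can be reattached if the conversion turns out not to help.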
 
@PVasileff just wanted to confirm: all of my affected VMs are using qcow2, so you might be onto something.

For me, converting the disks is not an option, and IMHO it would just confirm that this is related to QEMU/KVM (11) and maybe to the issue I linked above. It all makes more and more sense and fits together into a consistent picture.

To be honest, I am seriously disappointed to see Proxmox jumping on a major .0 release in its distribution. I cannot understand this decision. This did not even happen in a major Proxmox release, but in a mere patch release of 9.1. With all due respect, this is very poor decision making.

What makes me wonder: I have not seen such a move in the last 20 years of using Proxmox, so what made you change how you work in this case?
 
@EugenMayer I can't agree with your statement about Proxmox.

Thousands of people currently use Proxmox, including me. I have VMs that are more than 4 years old, and personally I don't have this problem.

The situation that you and PVasileff describe looks like a coincidence of several things happening at the same time, which makes it very difficult to observe and reproduce.

My assumption is that qcow2 disks created a long time ago have this problem, and that this is why it is difficult to reproduce.

If that assumption is correct, then during a live migration of a disk from one disk array to another the qcow2 image and its metadata are rewritten as well, and as a result the problem disappears.

BUT this is just an assumption. I am NOT familiar with the problem in depth.
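Whether an image still uses an old qcow2 layout can at least be inspected. The following is a minimal sketch, assuming that the human-readable output of `qemu-img info` contains a `compat:` line in its format-specific section. The helper name `qcow2_compat` and the image path are hypothetical. It only reports the compatibility level of the image and neither proves nor disproves the assumption.

```shell
# Hypothetical helper: extracts the qcow2 compatibility level from the
# output of "qemu-img info" read on stdin.
qcow2_compat() {
    sed -n 's/^[[:space:]]*compat:[[:space:]]*//p'
}

# Usage on the host. -U (force-share) allows reading an image that is
# currently opened by a running VM; the path is a placeholder:
#   qemu-img info -U /path/to/vm-100-disk-0.qcow2 | qcow2_compat
```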

I have been using Proxmox since the end of version 1 or the beginning of version 2 (I don't remember exactly, it was many years ago). The people behind it do everything possible to fix problems in a timely manner.
But testing all possible combinations is very difficult, not to say impossible. No software in the world is 100% tested for all possible user combinations.
If there were, there would be no forums like this one where users can report problems :)


Best regards,
A.H.
 
@fever_wits we can disagree on this. I'm fine with that, and yours is a fair POV.

Please consider that my issue with this move is rather that it was done invisibly. This is not Proxmox 9.2, this is just a patch within 9.1. That makes it much harder for people to even realize that such a significant change in Proxmox is the cause.

When I migrate from 9.1 to 9.2, I can expect things to go south and I have that mindset. I can roll back more easily, and I can see it coming.

So my POV is not that 'it could have been tested better'. Yes, you're right, it is hard to test every combination. My point is rather that such changes should be bound to releases that can be seen, worked with, rolled back and traced more easily.

Look at how much time we spent finding the cause. I was not aware of a major qemu-kvm upgrade until I was told about it yesterday. Nobody suspected this to be the cause; we spent our time looking at kernels instead.

If you change something so significant in a hypervisor, you had better limit any other changes so that you can attribute where problems are coming from. Mixing so many changes with major upgrades, while not even announcing a minor or major release cut, is something I have never seen Proxmox do before, and I would go as far as to say that IMHO this is a mistake and harms its reputation (which I obviously love).

So take my criticism as tough love, not just as venting frustration. And by no means take it as 'it is all bad here now' or anything like that. I think Proxmox has come a long way, doing great and even huge things all year long. But we can still criticize some things when they stand out.
 
@EugenMayer - I understand your point of view :)
I apologize if I sounded offensive or insulting.

I'll think out loud here. I'm not imposing these thoughts on anyone, and they are NOT directed specifically at @EugenMayer. I'm just sharing.
If I understood correctly, @EugenMayer would like to have release notes and/or a changelog.
I tried to find release notes or a changelog and couldn't. I only found:
https://pve.proxmox.com/wiki/Roadmap#Release_History
There might be a place/page where they are described, but I couldn't find it. I apologize in advance if there is such a place/page.

And yes, maybe @EugenMayer is right that this is a major change and it would be nice to have that information.

Another question is how many administrators actually read release notes and/or changelogs. But that is a matter of professional discipline.
And even with release notes and/or a changelog, there is still no guarantee that there won't be a problem.

I'll share something else. I'm a user on the one hand, and an administrator on the other.
There have been many cases where I made changes to the office infrastructure and clients said:
You changed "X" and "Y" broke. In reality, what I changed did not cause the client's problem "Y". But the client sees the change and blames it.

What I am trying to say: even if there are release notes and/or a changelog, there will still be dissatisfied people :).

Someone will not have read what is new, or will have read it but not tested it in a test environment. And even those who test may not cover every case.

At least in my opinion, the only correct solution, and it depends on all users, is to write when there is a problem :) and to submit the correct information, so that Proxmox can quickly find a solution.
And to have patience, because Proxmox is quite complex in terms of configuration and dependencies.
I speak from personal experience: in an attempt to fix problem X, I have created problem Y, which apparently had no connection to problem X.

I think that right now we need to figure out the right workaround for the random segmentation faults until a global solution is found.


Best regards,
A.H.
 
To be honest, I am seriously disappointed to see Proxmox jumping on a major .0 release in its distribution. I cannot understand this decision. This did not even happen in a major Proxmox release, but in a mere patch release of 9.1. With all due respect, this is very poor decision making.

What makes me wonder: I have not seen such a move in the last 20 years of using Proxmox, so what made you change how you work in this case?
There was no change. We always picked up QEMU releases as they came in, moving them along to pve-test and pve-no-subscription before a Proxmox VE point release made them available in the enterprise repository as well. Also, QEMU does not have a concept of major releases in this way. The .0 is no different from the .1 or .2. The bump in front just happens once every year, regardless of the kind of changes.
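To see when a given QEMU package version actually arrived on a host, the dpkg log can be filtered for upgrades of that package. The following is a minimal sketch, assuming the standard Debian `dpkg.log` line layout (date, time, action, package:arch, old version, new version). The helper name `pkg_upgrades` is hypothetical.

```shell
# Hypothetical helper: prints "date time old -> new" for every upgrade
# of the given package found in dpkg log lines read on stdin.
pkg_upgrades() {
    awk -v pkg="$1" '
        $3 == "upgrade" && index($4, pkg ":") == 1 {
            print $1, $2, $5, "->", $6
        }'
}

# Usage on the host (zcat -f also reads rotated, compressed logs):
#   zcat -f /var/log/dpkg.log* | pkg_upgrades pve-qemu-kvm
```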

EDIT: grammar fix
 
@fiona, can @PVasileff or I help with anything to diagnose the problem faster?
If you think it would help, I can provide the Proxmox team with direct access to minihv later today to diagnose the problem.

This is just an idea/suggestion.

Just because I don't have the problem right now doesn't mean I won't have it in the future.

Best regards,
A.H.
 
I think I'm able to reproduce the issue somewhat reliably now, in a memory-constrained Debian 12 VM. It might be related to swap, many thanks to @t.lamprecht for that hunch! Will investigate further.
 
If you want to verify that it was the same issue for you, please test with the new pve-qemu-kvm=11.0.0-3 package version, which has the revert.
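To confirm that a host really has the package with the revert installed, the version can be read from the `pveversion -v` output. This is a minimal sketch; the helper name `qemu_pkg_version` is hypothetical.

```shell
# Hypothetical helper: extracts the pve-qemu-kvm version from the
# output of "pveversion -v" read on stdin.
qemu_pkg_version() {
    sed -n 's/^pve-qemu-kvm:[[:space:]]*//p'
}

# Usage on the host:
#   apt update && apt install pve-qemu-kvm=11.0.0-3
#   pveversion -v | qemu_pkg_version
```

VMs that are already running keep using the old QEMU binary until they are fully stopped and started again; a reboot from inside the guest is not enough.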
 
I think I'm able to reproduce the issue somewhat reliably now, in a memory-constrained Debian 12 VM. It might be related to swap, many thanks to @t.lamprecht for that hunch! Will investigate further.

fiona:
The problematic patch was identified and will be reverted for now, until we have a version which doesn't introduce this regression.
Upstream report: https://lore.kernel.org/qemu-devel/414848c6-3829-4120-b760-6db8d43c1ab5@proxmox.com/

This matches my affected AlmaLinux 8 VMs too.

The affected AlmaLinux 8 servers were actively using swap when the issue was happening. Their swap is configured as a swap file inside the existing ext4 filesystem, not as a separate swap partition.

My Debian VMs also have swap configured, but they were not actively using it in the same way.

So the swap / memory pressure angle fits my case as well.
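To check whether a guest is actively using swap, the values in `/proc/meminfo` are enough. The following is a minimal sketch with a hypothetical helper `swap_used_kb` that prints the used swap in kB.

```shell
# Hypothetical helper: computes used swap (SwapTotal - SwapFree) in kB
# from /proc/meminfo-style input read on stdin.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }'
}

# Usage inside the guest:
#   swap_used_kb < /proc/meminfo
#   swapon --show    # lists swap files and partitions with their usage
```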

Thank you @fiona and @t.lamprecht for localizing and reproducing the issue.
 
Hello,

I updated Proxmox.
I restarted minihv.
Code:
root@minihv:~# pveversion -v | grep pve-qemu-kvm
pve-qemu-kvm: 11.0.0-3

I restored the VM that had been causing segmentation faults from backup.
I reduced its RAM from 4GB to 2GB.
Apart from this, I haven't made any other changes.
Here is the config file.

Code:
root@minihv:~# qm config 100
agents: 1
boot: order=ide2;scsi0;net0
cores: 1
cpu: kvm64
ide2: none, media=cdrom
memory: 2048
meta: creation-qemu=9.0.2,ctime=1738392200
name: ISPCONFIG-ALMA
net0: virtio=BC:24:11:9B:FC:7B,bridge=vmbr0
numa: 0
otype: l26
scsi0: storage:100/vm-100-disk-0.qcow2,discard=on,iothread=1,size=34G,ssd=1
scsi1: storage:100/vm-100-disk-1.qcow2,discard=on,iothread=1,size=100G
scsihw: virtio-scsi-single
smbios1: uuid=2a7b8afb-1dae-4405-92a8-5b66fbb59d78
sockets: 4
vmgenid: d67366ec-2dd0-4b57-8f46-c08d7bf2365b

I did this about 5 minutes ago. We should know in about an hour whether it has an effect.
If you can suggest a way to test it, we will do it.
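One simple way to test is to let the VM run under memory pressure and count the segfault messages in the guest kernel log afterwards. The following is a minimal sketch, assuming the kernel logs them with the usual `segfault at` wording. The helper name `count_segfaults` is hypothetical.

```shell
# Hypothetical helper: counts kernel log lines on stdin that report a
# segmentation fault. Note that grep -c prints 0 but returns exit
# status 1 when nothing matches.
count_segfaults() {
    grep -c 'segfault at'
}

# Usage inside the guest, for the current boot:
#   journalctl -k -b --no-pager | count_segfaults
```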