FYI: I've just discovered that after following the 'wget proxmox-backup-client and dpkg -i' solution to resume a full-upgrade/dist-upgrade, /run/network is no longer created during boot due to other upgraded packages. This causes ifreload -a to fail, which in turn breaks OVS networking...
Supermicro, 4 near-identical hosts, though one was purchased much more recently. They should all be on the latest BIOS or near to it, particularly the more recent purchase. Today was coincidentally my last day with this company, though, so I don't have screenshots of the kernel stack trace on hand but...
Besides the GPU passthrough issues on my home lab, this 5.15.30 kernel has now caused 2 of our 4 commercial hosting dual-socket EPYC servers to crash and dump kernel stack traces repeatedly, and we are forced to downgrade to restore any kind of working hosting environment on these...
Same problem here with 5.15.30 using AMD CPUs and Nvidia cards. I've noticed that the mainline kernels 5.15.33 to .36 mention a lot of iommu and vfio changes/fixes but I'm unsure if any are relevant. I'm going to test 5.15.37 on my home lab this weekend and will report my findings to proxmox...
I've had to downgrade to 5.13 to get any kind of gpu passthrough working to my guest VMs again. On 5.15 the error messages below flood syslog *rapidly* to the point of filling the root partition with a multi-gigabyte sized /var/log/syslog
nb: this is with simplefb disabled as well as all other...
I've been having errors and breakage with GPU passthrough on all the hosts I've upgraded to 7.2 with kernel 5.15.30. Downgrading the PVE servers to kernel 5.13.x fixes the issue for me. I'm doing some testing and will file a bug when I have some concrete data to give the PVE devs.
I'll also note...
Regarding the efi boot issue, I can reproduce this reliably on other proxmox clusters using normal disk images.
Make a new vm, debian 10 iso, 32gb disk, ovmf 'bios' and q35 machine type.
Add a second 32gb disk image.
Boot VM.
In the debian installer select 'manual' partitioning.
Create a 500mb...
Yes, several times to be sure. The entries created by efibootmgr under linux weren't 'sticking' apparently with 'q35' as machine type, as well as only 1 drive showing up in early efi boot and grub was unable to assemble the mdadm raid device to read its main grub.cfg - unless exited back out to...
I'm hitting this issue now too. I thought I had fixed it by destroying and remaking the efi disk, but this apparently only works for 1 boot and then the problem manifests again. The *only* reliable workaround I've found is to manually change the machine type in the vm .conf from 'q35' to...
Hi @Fabio, @fabian,
This morning sanctuary dropped out of the cluster with a divide error, load would have been negligible at the time:
Aug 7 00:10:37 sanctuary kernel: [739314.684408] show_signal: 6 callbacks suppressed
Aug 7 00:10:37 sanctuary kernel: [739314.684410] traps...
Hi Apollon77, Marin Bernard's logs did show pmtud log entries, which led me to suspect their issue was the same as mine, but it's possible there are multiple things at work here (in which case I apologize for any potential thread hijacking!). You might want to grep your logs for it to confirm...
I did try that the night before you posted the test version of libknet, and that morning the cluster was green and not reporting the PMTUD issue. However, I undid this change to test that version of libknet.
I think previously, on the original libknet version, other hosts were flapping too. I can go back further in the logs to find out if needed. The workload on scramjet isn't very different in nature but it is higher: being an EPYC system with more RAM than the other hosts, it'll typically run more...
Hi Fabio, we have some hardware coming in this week for a new production cluster, so we're more than happy to do whatever we can to help fix this issue.
Attached are the logs from all three hosts from 12am Saturday morning until I manually restarted corosync on scramjet around midday. Scramjet is the...
This morning the cluster is green, no hosts marked as offline. I hope this means the specific issue with knet pmtud and crypto has been resolved, and the floating point exception I saw over the weekend was an anomaly. When I get to the office this morning I'll dig through logs for any sign of...
Aug 4 00:18:16 scramjet corosync[1472229]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 2 link 0 but the other node is not acknowledging packets of this size.
Aug 4 00:18:16 scramjet corosync[1472229]: [KNET ] pmtud: This can be...
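For anyone wanting to check their own logs for the same pmtud warning, here is a rough Python sketch that counts those lines per host. It assumes the syslog line layout shown in the excerpt above; the function name `count_pmtud_warnings` and the regex are my own and not part of corosync or libknet, so adjust them if your log format differs.

```python
import re
from collections import Counter

# Assumed layout, based on the log excerpt above:
# "Aug 4 00:18:16 scramjet corosync[1472229]: [KNET ] pmtud: <message>"
PMTUD_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+[\d:]{8}\s+"          # timestamp, e.g. "Aug 4 00:18:16"
    r"(?P<host>\S+)\s+corosync\[\d+\]:\s+"     # hostname and corosync PID
    r"\[KNET\s*\]\s+pmtud:\s+(?P<msg>.*)$"     # KNET subsystem tag and message
)


def count_pmtud_warnings(lines):
    """Count 'MTU misconfiguration' pmtud warnings per reporting host.

    `lines` is any iterable of log lines (an open file works too).
    This helper is hypothetical and only illustrates the idea.
    """
    counts = Counter()
    for line in lines:
        match = PMTUD_RE.match(line.strip())
        if match and "MTU misconfiguration" in match.group("msg"):
            counts[match.group("host")] += 1
    return counts
```

To use it, pass an open log file, e.g. `count_pmtud_warnings(open("/var/log/syslog", errors="replace"))`, and look at which hosts show up. The path is only an example and depends on where your system writes corosync messages.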
Hi @fabian, sorry to report this version still has issues. I don't see the PMTUD issue in the logs today, but one host still had issues keeping quorum and reported lost tokens for about an hour (all of which might be a separate issue elsewhere), and then its kernel reported a crash in libknet.so...