pverados segfault

FYI, still segfaults on 6.2.16-6-pve.

Code:
[  277.638986] pverados[19363]: segfault at 5588cfa55010 ip 00005588cb7fc09d sp 00007ffe2d1e6360 error 7 in perl[5588cb721000+195000] likely on CPU 28 (core 5, socket 0)
[  277.638999] Code: 0f 95 c2 c1 e2 05 08 55 00 41 83 47 08 01 48 8b 53 08 22 42 23 0f b6 c0 66 89 45 02 49 8b 07 8b 78 60 48 8b 70 48 44 8d 6f 01 <44> 89 68 60 41 83 fd 01 0f 8f 4d 04 00 00 48 8b 56 08 49 63 c5 48
 
Yes, it might take a while until the fix comes in via stable backports. It's not a crucial issue after all, only cosmetic.
 
Thanks for the quick reply @fiona.

Can you explain why I only see this on clusters that have been upgraded all the way from 6.x to 8, but not on clusters that were born as 7.x? I am just curious.

//edit: Sorry, I have to correct myself. I also see this on clusters that came from 7.x.

Regards.
 
The potential for the wrong logging is there in all kernels with this commit, i.e. starting from 6.2.16-4-pve. It is racy, so if you don't see it on certain machines, you might just be lucky.
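A quick way to check which kernel a node is actually running (just a sketch using standard tooling; uname and pveversion are both available on any PVE host):

Code:
# kernel the node is currently booted into
uname -r

# installed Proxmox package versions, including the kernel packages
pveversion -v | grep -i kernel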
 
With 6.2.16-10-pve I started additionally getting these messages:

Code:
2023-08-31T05:11:25.094116+03:00 kettu ceph-crash[559]: WARNING:ceph-crash:post
 /var/lib/ceph/crash/2023-08-31T02:00:53.493690Z_85057cb5-e910-495b-bf56-082c2af27a95
 as client.crash.kettu failed: Error initializing cluster client:
 ObjectNotFound('RADOS object not found (error calling conf_read_file)')

We are using a package called logcheck to stay aware of unexpected events on our systems. While this new message is easy to filter out, the original one is not, as it is a multi-line message, and we would rather not filter out SIGSEGV messages in general.

So while technically this is a "cosmetic" issue only, it is still impacting our daily ops, and we are looking forward to it being fixed.
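In case it helps anyone else using logcheck: the new ceph-crash warning can be suppressed with a local ignore rule along these lines (only a sketch; the file name is made up, it assumes the warning arrives as a single syslog line, and the pattern may need adjusting to your exact log prefix):

Code:
# /etc/logcheck/ignore.d.server/local-ceph-crash  (example file name)
# logcheck ignore rules are extended regexes, one per line, matched against whole log lines
^.* ceph-crash\[[0-9]+\]: WARNING:ceph-crash:post /var/lib/ceph/crash/.* failed: Error initializing cluster client: ObjectNotFound.*$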
 
Hi,
FYI, we did backport the fix and it will be included in the next kernel version (the one after 6.2.16-10-pve): https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=762b8cebe9fc4cc39f34808d2820a95ea13adfae
 
Oh, sorry, I didn't realize at first that this was a different issue. It has nothing to do with the kernel upgrade. Please see https://bugzilla.proxmox.com/show_bug.cgi?id=4759 for more information.
 
Following is the error. I was not able to use the df command; when I start df, it gets stuck partway through.

Code:
[    2.743699] AppArmor: AppArmor Filesystem Enabled
[ 2.813210] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 3.287863] RAS: Correctable Errors collector initialized.
[ 6.529780] EXT4-fs (dm-2): mounted filesystem 98630d6b-8864-4cdf-be53-bc0da31b6525 with ordered data mode. Quota mode: none.
[ 7.496684] ACPI Error: No handler for Region [SYSI] (0000000096bc81c9) [IPMI] (20221020/evregion-130)
[ 7.496789] ACPI Error: Region IPMI (ID=7) has no handler (20221020/exfldio-261)
[ 7.496894] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20221020/psparse-529)
[ 7.496998] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20221020/psparse-529)
[ 7.655095] ZFS: Loaded module v2.1.12-pve1, ZFS pool version 5000, ZFS filesystem version 5
[ 57.248652] usb 1-1.5: Failed to suspend device, error -71
[ 730.770999] pverados[8198]: segfault at 55b0f8c0a030 ip 000055b0f8c0a030 sp 00007ffdeebc9228 error 14 in perl[55b0f8bde000+195000] likely on CPU 61 (core 10, socket 1)
[ 750.313475] pverados[8280]: segfault at 55b0f8c0a030 ip 000055b0f8c0a030 sp 00007ffdeebc9228 error 14 in perl[55b0f8bde000+195000] likely on CPU 57 (core 8, socket 1)
[ 6950.517683] pverados[33951]: segfault at 55b0f8c0a030 ip 000055b0f8c0a030 sp 00007ffdeebc9228 error 14 in perl[55b0f8bde000+195000] likely on CPU 60 (core 10, socket 0)
 
Thanks for replying.

I checked the running kernel version; it shows the following output:

Code:
root@171:~# uname -a
Linux 171 6.2.16-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-8 (2023-08-02T12:17Z) x86_64 GNU/Linux

root@172:~# uname -a
Linux i172 6.2.16-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-8 (2023-08-02T12:17Z) x86_64 GNU/Linux

root@173:~# uname -a
Linux 173 6.2.16-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-8 (2023-08-02T12:17Z) x86_64 GNU/Linux

If I check it with the correct command, I see that it is using 6.2.16-8. It is a running cluster. If we have to upgrade it to 6.2.16-11, what is the correct way to do this without disturbing the cluster?
 
Since you have three nodes, you can upgrade and reboot each node individually. Just make sure the reboot of a node has finished and all services, e.g. for Ceph, are started before you reboot the next one.
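For reference, a rough per-node sequence (just a sketch with standard apt/Ceph/PVE commands; adapt it to your setup and only continue once the cluster is healthy again):

Code:
# optional: keep Ceph from rebalancing while the node reboots
ceph osd set noout

# on the node being upgraded
apt update
apt full-upgrade      # pulls in the newer pve-kernel
reboot

# after the node is back up, before touching the next one
ceph -s               # wait for HEALTH_OK, all OSDs/monitors up
pvecm status          # quorum re-established
systemctl --failed    # no failed services left

# once all nodes are done
ceph osd unset noout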
 

After the upgrade, it seems to be fixed. I upgraded all 3 Proxmox nodes to the latest version:

Code:
171 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64 GNU/Linux
 
