Segmentation fault on updating initramfs

Vasi

New Member
Feb 9, 2025
I have installed Proxmox VE 8.3.4 with kernel Linux 6.8.12-4-pve (2024-11-06T15:04Z).
I went through the tutorial for PCI(e) passthrough and updated /etc/modules:
Code:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

vfio
vfio_iommu_type1
vfio_pci

When I run update-initramfs -u -k all, I get segmentation faults:

Code:
update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.8.12-8-pve
Segmentation fault
Segmentation fault
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
update-initramfs: Generating /boot/initrd.img-6.8.12-4-pve
modinfo: symbol lookup error: /lib/x86_64-linux-gnu/libcrypto.so.3: undefined symbol: EVP_ASYM_CIPHER_get0_provider, version OPENSSL_3.0.0
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.

I also see some errors in journalctl. It says the following:

Code:
Mar 02 16:29:59 pve systemd[1]: Starting zfs-import-scan.service - Import ZFS pools by device scanning...
Mar 02 16:29:59 pve systemd[1]: zfs-import-cache.service - Import ZFS pools by cache file was skipped becau>
Mar 02 16:29:59 pve zpool[25648]: cannot import 'data': pool was previously in use from another system.
Mar 02 16:29:59 pve zpool[25648]: Last accessed by truenas (hostid=6997e95e) at Sun Mar  2 15:30:52 2025
Mar 02 16:29:59 pve zpool[25648]: The pool can be imported, use 'zpool import -f' to import the pool.
Mar 02 16:29:59 pve systemd[1]: zfs-import-scan.service: Main process exited, code=exited, status=1/FAILURE
Mar 02 16:29:59 pve systemd[1]: zfs-import-scan.service: Failed with result 'exit-code'.
Mar 02 16:29:59 pve systemd[1]: Failed to start zfs-import-scan.service - Import ZFS pools by device scanning.
 
The segmentation fault should actually be visible in the journal as well; could you post that part?

In general, segmentation faults are very often caused by faulty hardware. I'd check your memory!
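
For example, something like this should pull the relevant lines out of the journal (a sketch; the -b flag restricts output to the current boot):

Code:
journalctl -b | grep -iE 'segfault|segmentation'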
 
I ran memtest86+ a few weeks ago, and it passed.

This is what I found in journalctl:

Code:
journalctl | grep Segmentation
Mar 03 09:15:16 pve postmulti[1616]: Segmentation fault
 
It can also be the disks or the CPU; memory is just the most common cause. The undefined symbol error also looks quite strange, since libcrypto should definitely contain that symbol. Could you try running "objdump --dynamic-syms /lib/x86_64-linux-gnu/libcrypto.so.3 | grep EVP_ASYM"?
 
Code:
root@pve:~# objdump --dynamic-syms /lib/x86_64-linux-gnu/libcrypto.so.3 | grep EVP_ASYM
00000000001ec890 g    DF .text  0000000000000010  OPENSSL_3.0.0 EVP_ASYM_CIPHER_is_a
00000000001ec910 g    DF .text  0000000000000032  OPENSSL_3.0.0 EVP_ASYM_CIPHER_names_do_all
00000000001eba50 g    DF .text  000000000000000d  OPENSSL_3.0.0 EVP_ASYM_CIPHER_up_ref
00000000001ec9a0 g    DF .text  0000000000000047  OPENSSL_3.0.0 EVP_ASYM_CIPHER_settable_ctx_params
00000000001ec2e0 g    DF .text  000000000000003b  OPENSSL_3.0.0 EVP_ASYM_CIPHER_fetch
00000000001eba60 g    DF .text  0000000000000079  OPENSSL_3.0.0 EVP_ASYM_CIPHER_free
00000000001ec2d0 g    DF .text  0000000000000007  OPENSSL_3.0.0 EVP_ASYM_CIPHER_get0_provider
00000000001ec8b0 g    DF .text  0000000000000007  OPENSSL_3.0.0 EVP_ASYM_CIPHER_get0_name
00000000001ec8d0 g    DF .text  000000000000003b  OPENSSL_3.0.0 EVP_ASYM_CIPHER_do_all_provided
00000000001ec8c0 g    DF .text  0000000000000007  OPENSSL_3.0.0 EVP_ASYM_CIPHER_get0_description
00000000001ec950 g    DF .text  0000000000000047  OPENSSL_3.0.0 EVP_ASYM_CIPHER_gettable_ctx_params


Also, I have disabled the ZFS import services since I am using TRUENAS, and I no longer see the failing ZFS errors.
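
For reference, what I ran to disable them was roughly the following (assuming the stock unit names from the journal output above):

Code:
# stop the units now and keep them from starting at boot
systemctl disable --now zfs-import-scan.service zfs-import-cache.service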
 
So the symbol is there now, but was not found at upgrade time; that still very much smells like faulty hardware!
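
One way to rule out plain on-disk corruption of the library would be to verify the package checksums, e.g. with something like this (a sketch; debsums is not installed by default, and libssl3 is the Debian package shipping libcrypto.so.3):

Code:
apt install debsums
# print only files whose checksum does not match the package
debsums libssl3 | grep -v 'OK$'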
 
I have 6 hard disks that are set up in raidz2 using TRUENAS.

The strange thing is that when TRUENAS is booted, I no longer see these disks in the PVE Disks section. Is this how it is supposed to be?
 
I don't know, but that seems unrelated to your original issue?
 
Ohh, I ran update-initramfs -u -k all again. Initially, it caused segfaults again, and the system crashed.
And I see this error message in journalctl on boot:

Code:
Mar 03 09:59:37 pve pvedaemon[2777]: command '/usr/bin/termproxy 5900 --path /nodes/pve --perm Sys.Console -- /bin/login -f root' failed: exit code 1
Mar 03 09:59:37 pve pvedaemon[1737]: <root@pam> end task UPID:pve:00000AD9:000039D4:67C56F6E:vncshell::root@pam: command '/usr/bin/termproxy 5900 --path /nodes/pve --perm Sys.Console -- /bin/login -f root' failed: exit code 1

Is this related?

Also, I see a lot of these, but in yellow:

Code:
Mar 03 10:00:38 pve kernel: overlayfs: fs on '/var/lib/docker/overlay2/l/E4E77ULL6ZQF6M7DNOJ2637V66' does n>
Mar 03 10:00:38 pve kernel: overlayfs: fs on '/var/lib/docker/overlay2/l/PQGBL3KSOEKSQAXOII6BT5DYUM' does n>
Mar 03 10:00:38 pve kernel: overlayfs: fs on '/var/lib/docker/overlay2/l/JX2FPVCP64R3FLHSAQU3YPJQ3R' does n>
Mar 03 10:00:38 pve kernel: overlayfs: fs on '/var/lib/docker/overlay2/l/7TIFZO2POMPNTQWOIJNH4GZCPS
 
You are mixing very different problems and questions; please focus on one thing at a time, or it will be impossible to solve anything.

Please first ensure all your hardware is okay (the disk and CPU checks are sketched below):
- run a memtest
- check your disks' status
- verify you are not running a known broken CPU, such as the Intel 13900K/14900K ones
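
For example, the last two checks could look something like this (a sketch; device names will differ, and stress-ng is not installed by default):

Code:
# SMART health summary; repeat for each disk
smartctl -a /dev/sdg
# load all CPU cores for 10 minutes and report
apt install stress-ng
stress-ng --cpu 0 --timeout 600s --metrics-brief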
 
Okay. Sorry about that.

1. I will perform a memtest again.
2. At least my TRUENAS says I don't have any errors according to smartctl. Any other way to check?
3. I am running an i5-9600K. I will perform a stress test on the CPU as well.
 
Could you also provide

- lsblk
- mount
- the config of your truenas VM? (one way to dump it as text is sketched below)
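
For the VM config, the qm tool can dump it as plain text, e.g. (assuming the TRUENAS VM has ID 100; adjust to your VMID):

Code:
qm config 100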
 
Code:
root@pve:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sdg                            8:96   0 465.8G  0 disk
├─sdg1                         8:97   0  1007K  0 part
├─sdg2                         8:98   0     1G  0 part /boot/efi
└─sdg3                         8:99   0 464.8G  0 part
  ├─pve-swap                 252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           252:2    0   3.4G  0 lvm 
  │ └─pve-data-tpool         252:4    0 337.9G  0 lvm 
  │   ├─pve-data             252:5    0 337.9G  1 lvm 
  │   ├─pve-vm--100--disk--0 252:6    0    60G  0 lvm 
  │   ├─pve-vm--101--disk--0 252:7    0   250G  0 lvm 
  │   └─pve-vm--102--disk--0 252:8    0    32G  0 lvm 
  └─pve-data_tdata           252:3    0 337.9G  0 lvm 
    └─pve-data-tpool         252:4    0 337.9G  0 lvm 
      ├─pve-data             252:5    0 337.9G  1 lvm 
      ├─pve-vm--100--disk--0 252:6    0    60G  0 lvm 
      ├─pve-vm--101--disk--0 252:7    0   250G  0 lvm 
      └─pve-vm--102--disk--0 252:8    0    32G  0 lvm

Code:
root@pve:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=32726268k,nr_inodes=8181567,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=6552020k,mode=755,inode64)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=362)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/sdg2 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6552016k,nr_inodes=1638004,mode=700,inode64)

Config of TRUENAS VM

[Screenshots attached: Screenshot 2025-03-03 at 14.01.42.png, Screenshot 2025-03-03 at 14.01.32.png]
 
Check the health of your /dev/sdg disk, and please also post "lspci -vv".
 
Code:
root@pve:~# smartctl --all /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-8-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT500MX500SSD1
Serial Number:    2416E8A79E43
LU WWN Device Id: 5 00a075 1e8a79e43
Firmware Version: M3CR046
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar  3 15:31:57 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1262
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       128
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   099   099   000    Old_age   Always       -       23
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       55
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       54
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   074   064   000    Old_age   Always       -       26 (Min/Max 0/36)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       4803713163
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       54541074
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       37427572

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1262         -
# 2  Extended offline    Completed without error       00%       128         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Here is the pastebin link for lspci -vv
lspci -vv
 
Ohh, yeah, I had not put any blacklist on the host.

I have updated /etc/modprobe.d/pve-blacklist.conf:

Code:
# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist nouveau
blacklist nvidia
blacklist radeon
blacklist mpt3sas

I updated the initramfs again, and there are no more errors:

Code:
update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.12-8-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
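
For completeness, after the next reboot one could verify that the blacklisted modules really are no longer loaded, e.g. with something like this (matching the entries in my blacklist file; no output means none are loaded):

Code:
lsmod | grep -E 'nvidiafb|nouveau|nvidia|radeon|mpt3sas'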

Was this the issue in the end?
 
You may run into the issue again, because your iGPU is an Intel one that uses the i915 driver, and you have also passed it through to the VM. But I would watch for more errors before adding i915 to the modprobe blacklist, since that may break local console access (not the web UI, though).
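
If it ever does become necessary, the addition would presumably look something like this (untested sketch; note that blacklisting i915 disables the host's local console output):

Code:
echo "blacklist i915" >> /etc/modprobe.d/pve-blacklist.conf
update-initramfs -u -k all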
 