[SOLVED] Proxmox crashed with segfault (Help??)

AsankaG · Aug 29, 2019

Hi All,

Need some help with identifying the root cause of this issue. I'm not sure where to look for the root cause on this

Had to physically reboot the server to get this back on track.

Thanks a lot for your help.

Traces from syslog:

Code:

Aug 28 16:07:00 s1 systemd[1]: Starting Proxmox VE replication runner...
Aug 28 16:07:00 s1 systemd[1]: pvesr.service: Main process exited, code=killed, status=11/SEGV
Aug 28 16:07:00 s1 systemd[1]: pvesr.service: Failed with result 'signal'.
Aug 28 16:07:00 s1 systemd[1]: Failed to start Proxmox VE replication runner.
Aug 28 16:07:00 s1 kernel: [52555.769606] show_signal: 6 callbacks suppressed
Aug 28 16:07:00 s1 kernel: [52555.769607] traps: pvesr[20739] general protection fault ip:5655556454c4 sp:7ffbffffba78 error:0 in perl[565555586000+15d000]
Aug 28 16:07:03 s1 pvedaemon[17516]: <root@pam> starting task UPID:s1:0000510B:00503298:5D66A6A7:qmdestroy:111:root@pam:
Aug 28 16:07:03 s1 pvedaemon[20747]: destroy VM 111: UPID:s1:0000510B:00503298:5D66A6A7:qmdestroy:111:root@pam:
Aug 28 16:07:03 s1 pvedaemon[17513]: worker 17516 finished
Aug 28 16:07:03 s1 pvedaemon[17513]: starting 1 worker(s)
Aug 28 16:07:03 s1 pvedaemon[17513]: worker 20810 started
Aug 28 16:07:03 s1 kernel: [52558.265208] traps: pvedaemon worke[17516] general protection fault ip:56555564cfe8 sp:7ffbffffba70 error:0 in perl[565555586000+15d000]
Aug 28 16:07:47 s1 kernel: [52602.534948] traps: pvedaemon worke[17514] general protection fault ip:565555630589 sp:7ffbffffb9e0 error:0 in perl[565555586000+15d000]
Aug 28 16:07:47 s1 pvedaemon[17513]: worker 17514 finished
Aug 28 16:07:47 s1 pvedaemon[17513]: starting 1 worker(s)
Aug 28 16:07:47 s1 pvedaemon[17513]: worker 5749 started
Aug 28 16:08:00 s1 systemd[1]: Starting Proxmox VE replication runner...
Aug 28 16:08:00 s1 systemd[1]: pvesr.service: Succeeded.
Aug 28 16:08:00 s1 systemd[1]: Started Proxmox VE replication runner.
Aug 28 16:08:12 s1 kernel: [52627.055424] traps: pvedaemon worke[20810] general protection fault ip:56555564081d sp:7ffbffffbac0 error:0 in perl[565555586000+15d000]
Aug 28 16:08:12 s1 pvedaemon[17513]: worker 20810 finished
Aug 28 16:08:12 s1 pvedaemon[17513]: starting 1 worker(s)
Aug 28 16:08:12 s1 pvedaemon[17513]: worker 6064 started

Server Details:
Ryzen 3600
MSI B450 Carbon Pro AC
WD Green 240GB SSD for Boot
4x 1TB HDD on ZFS RAID 10
32GB DDR4 3000MHz Corsair LPX (2x 16GB)
nVidia 710 GPU

Code:

root@s1:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-1
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Stoiko Ivanov · Aug 29, 2019

multiple segfaults in perl processes seems odd.

* Could you please run a memtest86 for 1-2 complete runs? (last time I had multiple random segfaults it was due to bad memory)
* You could also check whether all packages' files are correctly installed with `debsums` (`man debsums` after potentially installing the 'debsums' package)

Thanks!

AsankaG · Aug 29, 2019

Stoiko Ivanov said:
multiple segfaults in perl processes seems odd.

* Could you please run a memtest86 for 1-2 complete runs? (last time I had multiple random segfaults it was due to bad memory)
* You could also check whether all packages' files are correctly installed with `debsums` (`man debsums` after potentially installing the 'debsums' package)

Thanks!

Thanks for the reply, will run a memtest tomorrow when i get some downtime. In th emeantime I just got another sigfault. Appreciate any feedback on this.

Code:

Aug 29 14:18:50 s1 kernel: [43445.560333] general protection fault: 0000 [#1] SMP NOPTI
Aug 29 14:18:50 s1 kernel: [43445.560356] CPU: 7 PID: 951 Comm: z_wr_iss Tainted: P           O      5.0.21-1-pve #1
Aug 29 14:18:50 s1 kernel: [43445.560374] Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.93 08/19/2019
Aug 29 14:18:50 s1 kernel: [43445.560432] RIP: 0010:zio_ready+0x1df/0x450 [zfs]
Aug 29 14:18:50 s1 kernel: [43445.560444] Code: da 00 00 00 e9 76 01 00 00 49 8b 86 30 01 00 00 4d 8b 24 04 4c 39 65 c8 0f 84 c4 01 00 00 49 29 c4 4d 85 e4 0f 84 e1 00 00 00 <49> 8b 04 24 48 89 45 d0 4d 8d af c0 03 00 00 41 8b 5e 70 4c 89 ef
Aug 29 14:18:50 s1 kernel: [43445.560483] RSP: 0018:ffffa31dc8157d60 EFLAGS: 00010282
Aug 29 14:18:50 s1 kernel: [43445.560495] RAX: 0000000000000010 RBX: ffff910fed594c68 RCX: 0000000000000000
Aug 29 14:18:50 s1 kernel: [43445.560511] RDX: ffff91157843c500 RSI: ffff910fed5949e0 RDI: ffff910fed594c88
Aug 29 14:18:50 s1 kernel: [43445.560526] RBP: ffffa31dc8157da8 R08: ffff910fdd52c360 R09: ffff910fdd52c360
Aug 29 14:18:50 s1 kernel: [43445.560541] R10: 0000000000000000 R11: 0000000000000000 R12: f7ff910fed5949d0
Aug 29 14:18:50 s1 kernel: [43445.560557] R13: ffff910fed594c88 R14: ffff910fed5948a8 R15: ffff9114db0c09b0
Aug 29 14:18:50 s1 kernel: [43445.560573] FS:  0000000000000000(0000) GS:ffff91157e7c0000(0000) knlGS:0000000000000000
Aug 29 14:18:50 s1 kernel: [43445.560590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 14:18:50 s1 kernel: [43445.560603] CR2: 0000000000000030 CR3: 00000007b088a000 CR4: 0000000000340ee0
Aug 29 14:18:50 s1 kernel: [43445.560618] Call Trace:
Aug 29 14:18:50 s1 kernel: [43445.560633]  ? taskq_member+0x18/0x30 [spl]
Aug 29 14:18:50 s1 kernel: [43445.560678]  zio_execute+0x99/0xf0 [zfs]
Aug 29 14:18:50 s1 kernel: [43445.560694]  taskq_thread+0x310/0x500 [spl]
Aug 29 14:18:50 s1 kernel: [43445.560708]  ? wake_up_q+0x80/0x80
Aug 29 14:18:50 s1 kernel: [43445.560750]  ? zio_taskq_member.isra.11.constprop.16+0x70/0x70 [zfs]
Aug 29 14:18:50 s1 kernel: [43445.560770]  kthread+0x120/0x140
Aug 29 14:18:50 s1 kernel: [43445.560782]  ? task_done+0xb0/0xb0 [spl]
Aug 29 14:18:50 s1 kernel: [43445.560794]  ? __kthread_parkme+0x70/0x70
Aug 29 14:18:50 s1 kernel: [43445.561241]  ret_from_fork+0x22/0x40
Aug 29 14:18:50 s1 kernel: [43445.561690] Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink edac_mce_amd kvm_amd nls_iso8859_1 kvm irqbypass zfs(PO) zunicode(PO) zlua(PO) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec btusb snd_hda_core btrtl btbcm snd_hwdep aesni_intel btintel snd_pcm bluetooth aes_x86_64 snd_timer crypto_simd cryptd snd ecdh_generic joydev glue_helper input_leds soundcore ccp wmi_bmof pcspkr mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbmouse hid_generic usbkbd usbhid hid i2c_piix4 r8169 igb realtek i2c_algo_bit ahci dca libahci wmi
Aug 29 14:18:50 s1 kernel: [43445.561721]  gpio_amdpt gpio_generic
Aug 29 14:18:50 s1 kernel: [43445.566918] ---[ end trace decf0768eb369969 ]---
Aug 29 14:18:50 s1 kernel: [43445.567410] RIP: 0010:zio_ready+0x1df/0x450 [zfs]
Aug 29 14:18:50 s1 kernel: [43445.567846] Code: da 00 00 00 e9 76 01 00 00 49 8b 86 30 01 00 00 4d 8b 24 04 4c 39 65 c8 0f 84 c4 01 00 00 49 29 c4 4d 85 e4 0f 84 e1 00 00 00 <49> 8b 04 24 48 89 45 d0 4d 8d af c0 03 00 00 41 8b 5e 70 4c 89 ef
Aug 29 14:18:50 s1 kernel: [43445.569184] RSP: 0018:ffffa31dc8157d60 EFLAGS: 00010282
Aug 29 14:18:50 s1 kernel: [43445.569625] RAX: 0000000000000010 RBX: ffff910fed594c68 RCX: 0000000000000000
Aug 29 14:18:50 s1 kernel: [43445.570072] RDX: ffff91157843c500 RSI: ffff910fed5949e0 RDI: ffff910fed594c88
Aug 29 14:18:50 s1 kernel: [43445.570512] RBP: ffffa31dc8157da8 R08: ffff910fdd52c360 R09: ffff910fdd52c360
Aug 29 14:18:50 s1 kernel: [43445.570952] R10: 0000000000000000 R11: 0000000000000000 R12: f7ff910fed5949d0
Aug 29 14:18:50 s1 kernel: [43445.571394] R13: ffff910fed594c88 R14: ffff910fed5948a8 R15: ffff9114db0c09b0
Aug 29 14:18:50 s1 kernel: [43445.571836] FS:  0000000000000000(0000) GS:ffff91157e7c0000(0000) knlGS:0000000000000000
Aug 29 14:18:50 s1 kernel: [43445.572286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 14:18:50 s1 kernel: [43445.572726] CR2: 0000000000000030 CR3: 00000007b088a000 CR4: 0000000000340ee0

AsankaG · Aug 29, 2019

Could this be ZFS related? I had to reset the server and I now have this error?

Code:

root@s1:~# zpool status -v
  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0    36
            sdd     ONLINE       0     0    36

errors: Permanent errors have been detected in the following files:

        storage/vmstorage/vm-111-disk-1:<0x1>

Stoiko Ivanov · Aug 30, 2019

Hmm - the stacktrace indeed happens in ZFS - but since it seems to happen on multiple independent places (perl-daemons from PVE - disk access for ZFS and the checksum errors I would really suggest to run memtest and check the hardware (are all cables properly connected) for a while before going further with the analysis

I hope this helps!

AsankaG · Aug 30, 2019

Stoiko Ivanov said:
Hmm - the stacktrace indeed happens in ZFS - but since it seems to happen on multiple independent places (perl-daemons from PVE - disk access for ZFS and the checksum errors I would really suggest to run memtest and check the hardware (are all cables properly connected) for a while before going further with the analysis

I hope this helps!

Thanks for this. I ran a memtest86 test today and the test passed without any issues. However, I was not able to load the memtest86+ app from the boot menu and used a bootable USB to get the test done.

Just to be on the safe side, i reduced teh RAM clock speeds to 2400 MHz from 3000MHz and removed a LAN card and a M2 SSD that was plugged in.

Will report back if this helps.

Is there anything i should look into on ZFS side?

AsankaG · Sep 1, 2019

Update: After the above changes, No more issues.

Thanks for all the help you guys gave me

Stoiko Ivanov · Sep 2, 2019

Glad you found a solution!

Please mark the thread as 'SOLVED' so that others know what to expect.
Thanks!

Search

Search

[SOLVED] Proxmox crashed with segfault (Help??)

AsankaG

Member

Stoiko Ivanov

Proxmox Staff Member

AsankaG

Member

AsankaG

Member

Stoiko Ivanov

Proxmox Staff Member

AsankaG

Member

AsankaG

Member

Stoiko Ivanov

Proxmox Staff Member