Proxmox Partial crash

max.nolent

Member
Aug 14, 2020
Hello,

In the last 4 days, one of my Proxmox nodes has crashed twice. The only thing the server reports is a kernel error:


Aug 29 22:09:45 proxmox-003 kernel: [24096734.686913] _copy_to_iter+0x2ed/0x410
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687047] ? _raw_spin_unlock_bh+0x1e/0x20
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687111] ? tcp_recvmsg+0x4d3/0xc70
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687186] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]
Aug 29 22:09:45 proxmox-003 kernel: [24096734.689992] xs_stream_data_receive_workfn+0x15/0x20 [sunrpc]
Aug 29 22:09:45 proxmox-003 kernel: [24096734.692668] kthread+0x120/0x140
Aug 29 22:09:45 proxmox-003 kernel: [24096734.695304] ret_from_fork+0x35/0x40
Aug 29 22:09:45 proxmox-003 kernel: [24096734.709860] ---[ end trace 85cfb4375017e7a9 ]---
Aug 29 22:09:45 proxmox-003 kernel: [24096734.777748] R10: 0000000000000000 R11: ffff8b1cc051dcc0 R12: ffff984c0d73bdf8
Aug 29 22:09:45 proxmox-003 kernel: [24096734.780914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 22:09:45 proxmox-003 kernel: [24096734.781977] CR2: 00007ff5ff22f25a CR3: 0000000cdf80e004 CR4: 00000000001626e0



Second crash:

Sep 2 14:57:21 proxmox-003 kernel: [289883.906752] RBP: ffffa3ab4f3d3a90 R08: ffff8aabc1ac4186 R09: 0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.907009] simple_copy_to_iter+0x2f/0x40
Sep 2 14:57:21 proxmox-003 kernel: [289883.907056] ? skb_kill_datagram+0x70/0x70
Sep 2 14:57:21 proxmox-003 kernel: [289883.908025] tcp_recvmsg+0x230/0xc70
Sep 2 14:57:21 proxmox-003 kernel: [289883.909793] ? _cond_resched+0x19/0x30
Sep 2 14:57:21 proxmox-003 kernel: [289883.911508] ? tcp_recvmsg+0x4d3/0xc70
Sep 2 14:57:21 proxmox-003 kernel: [289883.913161] sock_recvmsg+0x43/0x50
Sep 2 14:57:21 proxmox-003 kernel: [289883.914851] xs_read_stream_request.constprop.30+0x2c0/0x430 [sunrpc]
Sep 2 14:57:21 proxmox-003 kernel: [289883.916547] xs_stream_data_receive_workfn+0x15/0x20 [sunrpc]
Sep 2 14:57:21 proxmox-003 kernel: [289883.918229] worker_thread+0x34/0x400
Sep 2 14:57:21 proxmox-003 kernel: [289883.919902] ? process_one_work+0x410/0x410
Sep 2 14:57:21 proxmox-003 kernel: [289883.921548] ret_from_fork+0x35/0x40
Sep 2 14:57:21 proxmox-003 kernel: [289883.922377] ghash_clmulni_intel aesni_intel aes_x86_64 fb_sys_fops crypto_simd cryptd syscopyarea sysfillrect glue_helper joydev input_leds sysimgblt intel_cstate intel_rapl_perf dcdbas pcspkr zfs(PO) zunicode(PO) zlua(PO) mei_me mei mxm_wmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbmouse usbhid hid uas usb_storage tg3 ixgbe lpc_ich xfrm_algo dca mdio megaraid_sas ahci libahci wmi
Sep 2 14:57:21 proxmox-003 kernel: [289883.936284] ---[ end trace 32b2f8d2ee97c76d ]---
Sep 2 14:57:21 proxmox-003 kernel: [289883.980485] RIP: 0010:memcpy_erms+0x6/0x10
Sep 2 14:57:21 proxmox-003 kernel: [289883.981656] Code: ff ff ff 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
Sep 2 14:57:21 proxmox-003 kernel: [289883.986030] RDX: 0000000000000b3a RSI: ffff8aabc1ac4186 RDI: 6c80005671d47000
Sep 2 14:57:21 proxmox-003 kernel: [289883.987131] RBP: ffffa3ab4f3d3a90 R08: ffff8aabc1ac4186 R09: 0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.989327] R13: 0000000000000b3a R14: 0000000000000b3a R15: 0000000000000b3a
Sep 2 14:57:21 proxmox-003 kernel: [289883.990407] FS: 0000000000000000(0000) GS:ffff8aac1fb40000(0000) knlGS:0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.992567] CR2: 00007f1ddb2489a0 CR3: 000000157ba0e002 CR4: 00000000001626e0

When Proxmox crashes, some VMs and containers are still up, but their entries in the menu are greyed out.

My VMs are on shared storage. I checked and found no network errors.

I don't know where to look to find the problem.

Thank you in advance for any help you can give me.

Version:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-15 (running version: 6.0-15/52b91481)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-4
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-8
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-9
pve-cluster: 6.0-9
pve-container: 3.0-13
pve-docs: 6.0-9
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-8
pve-firmware: 3.0-4
pve-ha-manager: 3.0-5
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-16
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
Hi,

this crash is related to the network stack.
So the potential problems are located in:
  • Network HW
  • Network driver
  • Memory
What network services do you use on the host? NFS/SMB is also network related.
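To see why the traces point at the network stack, it can help to pull just the call-trace frames out of the syslog lines. The following is a minimal Python sketch, not an official tool: the regular expression and the `parse_frames` helper are assumptions written against the log format pasted in this thread, and other kernel versions may format their lines differently.

```python
import re

# One call-trace frame in a syslog line, for example:
#   "... kernel: [289883.908025] tcp_recvmsg+0x230/0xc70"
#   "... kernel: [289883.909793] ? _cond_resched+0x19/0x30"
#   "... kernel: [289883.914851] xs_read_stream_request.constprop.30+0x2c0/0x430 [sunrpc]"
#   "... kernel: [289883.980485] RIP: 0010:memcpy_erms+0x6/0x10"
FRAME_RE = re.compile(
    r"kernel: \[\s*\d+\.\d+\]\s+"
    r"(?P<guess>\? )?"              # "?" marks a frame the unwinder is unsure about
    r"(?:RIP: \d+:)?"               # the faulting instruction line
    r"(?P<func>[A-Za-z_][\w.]*)"    # function name, may contain ".constprop.N"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"    # offset/size
    r"(?: \[(?P<module>\w+)\])?"    # owning module, absent for built-in code
)


def parse_frames(log_lines):
    """Return (function, module, reliable) tuples for every frame found."""
    frames = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("func"),
                match.group("module") or "kernel",
                match.group("guess") is None,
            ))
    return frames


if __name__ == "__main__":
    sample = [
        "Sep 2 14:57:21 proxmox-003 kernel: [289883.908025] tcp_recvmsg+0x230/0xc70",
        "Sep 2 14:57:21 proxmox-003 kernel: [289883.914851] xs_read_stream_request.constprop.30+0x2c0/0x430 [sunrpc]",
        "Sep 2 14:57:21 proxmox-003 kernel: [289883.980485] RIP: 0010:memcpy_erms+0x6/0x10",
    ]
    for func, module, reliable in parse_frames(sample):
        print(f"{module:8} {func}{'' if reliable else '  (uncertain)'}")
```

Run against the traces above, every reliable frame is either in the sunrpc module (the RPC transport that NFS uses), in the TCP receive path (tcp_recvmsg), or in the copy routine where the fault is reported (memcpy_erms).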
 
I use NFS version 3 on a Synology NAS. For storage, I have a dedicated network card with a 10 Gb/s link on a 10 Gb/s switch.

Network HW = hardware?

Memory is good, because I was able to connect over SSH during the crash.

After your message, I think it comes from a network problem, but my DRAC doesn't show any problem. Do you know any tools to check the network card hardware?
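One possible first check, short of a real hardware test, is the per-interface error counters that the Linux kernel exposes under /sys/class/net/<iface>/statistics/. The following Python sketch reads a few of them. The interface name in the usage example is hypothetical, the choice of counters is an assumption, and clean counters do not prove the card is healthy.

```python
from pathlib import Path

# Counter files found in /sys/class/net/<iface>/statistics/ on Linux.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "rx_crc_errors")


def read_nic_counters(iface, sysfs_root="/sys/class/net"):
    """Read the selected error counters for one interface as integers."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    counters = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():  # skip counters this system does not provide
            counters[name] = int(path.read_text().strip())
    return counters


def counters_that_grew(before, after):
    """Return how much each counter increased between two readings."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


if __name__ == "__main__":
    # "ens1f0" is a hypothetical name; use the storage-network interface.
    print(read_nic_counters("ens1f0"))
```

Taking one reading before and one after a period of heavy NFS traffic and passing both to counters_that_grew shows whether errors accumulate on the storage link.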
 
Last night I checked all the disks in my NAS and found one with 40,000 errors. Could that be the origin of the errors?
 
Last night I checked all the disks in my NAS and found one with 40,000 errors. Could that be the origin of the errors?

Yes, that's very likely if that many errors came up in the SMART check.
 
Hello,

I replaced the disk and am still getting crashes:

Sep 6 15:35:09 proxmox-003 kernel: [346391.027160] RAX: 567b8de2ff600000 RBX: ffff9399d208ae00 RCX: 0000000000000b3a
Sep 6 15:35:09 proxmox-003 kernel: [346391.027669] inet_recvmsg+0x5c/0xd0
Sep 6 15:35:09 proxmox-003 kernel: [346391.027718] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]
Sep 6 15:35:09 proxmox-003 kernel: [346391.027792] xs_stream_data_receive+0x2f5/0x470 [sunrpc]
Sep 6 15:35:09 proxmox-003 kernel: [346391.027854] process_one_work+0x20f/0x410
Sep 6 15:35:09 proxmox-003 kernel: [346391.028678] kthread+0x120/0x140
Sep 6 15:35:09 proxmox-003 kernel: [346391.030268] ? __kthread_parkme+0x70/0x70
Sep 6 15:35:09 proxmox-003 kernel: [346391.031806] Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_recent iptable_nat nf_nat_ipv4 nf_nat xt_comment ipt_REJECT nf_reject_ipv4 xt_addrtype xt_mark iptable_mangle nf_conntrack_ftp nf_conntrack_sane nf_conntrack_tftp nf_conntrack_irc nf_conntrack_sip nf_conntrack_snmp ts_kmp nf_conntrack_amanda nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_h323 nf_conntrack_netbios_ns nf_conntrack_broadcast xt_tcpudp xt_CT xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth mpt3sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter binfmt_misc dell_rbu bonding softdog nfnetlink_log nfnetlink dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 ipmi_ssif ttm drm_kms_helper crct10dif_pclmul crc32_pclmul drm i2c_algo_bit
Sep 6 15:35:09 proxmox-003 kernel: [346391.044926] ---[ end trace e96f6d1058044bd2 ]---
Sep 6 15:35:09 proxmox-003 kernel: [346391.089338] RIP: 0010:memcpy_erms+0x6/0x10
Sep 6 15:35:09 proxmox-003 kernel: [346391.090453] Code: ff ff ff 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
Sep 6 15:35:09 proxmox-003 kernel: [346391.100035] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 6 15:35:09 proxmox-003 kernel: [346391.101111] CR2: 0000000000000030 CR3: 00000003db40e006 CR4: 00000000001626e0
 
pveversion -v shows outdated versions. Please update your installation first and see if the problem still occurs.

If you're having problems updating, you probably missed configuring the repositories [0].

[0]: https://pve.proxmox.com/wiki/Package_Repositories
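To confirm that an update actually changed the running kernel and package versions, the text printed by pveversion -v can be compared before and after. The following Python sketch parses output in the format pasted earlier in this thread; the helper names `parse_pveversion` and `changed_packages` are assumptions for illustration, not part of any Proxmox tool.

```python
import re

# Lines look like "pve-firewall: 4.0-8" or
# "proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)".
LINE_RE = re.compile(
    r"^(?P<pkg>[\w.+-]+): (?P<ver>\S+)"
    r"(?: \(running (?:kernel|version): (?P<running>[^)]+)\))?$"
)


def parse_pveversion(text):
    """Return ({package: version}, running_kernel) from pveversion -v text."""
    packages = {}
    running_kernel = None
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        packages[match.group("pkg")] = match.group("ver")
        if match.group("pkg") == "proxmox-ve" and match.group("running"):
            running_kernel = match.group("running")
    return packages, running_kernel


def changed_packages(before, after):
    """Return {package: (old, new)} for every version that differs."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }
```

If the running kernel reported after the update and a reboot is still 5.0.21-5-pve, the node has not booted into a newer kernel.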
 
