Proxmox Partial crash

max.nolent

Aug 14, 2020
Hello,

In four days, one of my Proxmox nodes has crashed twice. The only thing the server reports is a kernel error:


Aug 29 22:09:45 proxmox-003 kernel: [24096734.686913] _copy_to_iter+0x2ed/0x410
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687047] ? _raw_spin_unlock_bh+0x1e/0x20
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687111] ? tcp_recvmsg+0x4d3/0xc70
Aug 29 22:09:45 proxmox-003 kernel: [24096734.687186] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]
Aug 29 22:09:45 proxmox-003 kernel: [24096734.689992] xs_stream_data_receive_workfn+0x15/0x20 [sunrpc]
Aug 29 22:09:45 proxmox-003 kernel: [24096734.692668] kthread+0x120/0x140
Aug 29 22:09:45 proxmox-003 kernel: [24096734.695304] ret_from_fork+0x35/0x40
Aug 29 22:09:45 proxmox-003 kernel: [24096734.709860] ---[ end trace 85cfb4375017e7a9 ]---
Aug 29 22:09:45 proxmox-003 kernel: [24096734.777748] R10: 0000000000000000 R11: ffff8b1cc051dcc0 R12: ffff984c0d73bdf8
Aug 29 22:09:45 proxmox-003 kernel: [24096734.780914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 22:09:45 proxmox-003 kernel: [24096734.781977] CR2: 00007ff5ff22f25a CR3: 0000000cdf80e004 CR4: 00000000001626e0



Second crash:

Sep 2 14:57:21 proxmox-003 kernel: [289883.906752] RBP: ffffa3ab4f3d3a90 R08: ffff8aabc1ac4186 R09: 0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.907009] simple_copy_to_iter+0x2f/0x40
Sep 2 14:57:21 proxmox-003 kernel: [289883.907056] ? skb_kill_datagram+0x70/0x70
Sep 2 14:57:21 proxmox-003 kernel: [289883.908025] tcp_recvmsg+0x230/0xc70
Sep 2 14:57:21 proxmox-003 kernel: [289883.909793] ? _cond_resched+0x19/0x30
Sep 2 14:57:21 proxmox-003 kernel: [289883.911508] ? tcp_recvmsg+0x4d3/0xc70
Sep 2 14:57:21 proxmox-003 kernel: [289883.913161] sock_recvmsg+0x43/0x50
Sep 2 14:57:21 proxmox-003 kernel: [289883.914851] xs_read_stream_request.constprop.30+0x2c0/0x430 [sunrpc]
Sep 2 14:57:21 proxmox-003 kernel: [289883.916547] xs_stream_data_receive_workfn+0x15/0x20 [sunrpc]
Sep 2 14:57:21 proxmox-003 kernel: [289883.918229] worker_thread+0x34/0x400
Sep 2 14:57:21 proxmox-003 kernel: [289883.919902] ? process_one_work+0x410/0x410
Sep 2 14:57:21 proxmox-003 kernel: [289883.921548] ret_from_fork+0x35/0x40
Sep 2 14:57:21 proxmox-003 kernel: [289883.922377] ghash_clmulni_intel aesni_intel aes_x86_64 fb_sys_fops crypto_simd cryptd syscopyarea sysfillrect glue_helper joydev input_leds sysimgblt intel_cstate intel_rapl_perf dcdbas pcspkr zfs(PO) zunicode(PO) zlua(PO) mei_me mei mxm_wmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbmouse usbhid hid uas usb_storage tg3 ixgbe lpc_ich xfrm_algo dca mdio megaraid_sas ahci libahci wmi
Sep 2 14:57:21 proxmox-003 kernel: [289883.936284] ---[ end trace 32b2f8d2ee97c76d ]---
Sep 2 14:57:21 proxmox-003 kernel: [289883.980485] RIP: 0010:memcpy_erms+0x6/0x10
Sep 2 14:57:21 proxmox-003 kernel: [289883.981656] Code: ff ff ff 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
Sep 2 14:57:21 proxmox-003 kernel: [289883.986030] RDX: 0000000000000b3a RSI: ffff8aabc1ac4186 RDI: 6c80005671d47000
Sep 2 14:57:21 proxmox-003 kernel: [289883.987131] RBP: ffffa3ab4f3d3a90 R08: ffff8aabc1ac4186 R09: 0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.989327] R13: 0000000000000b3a R14: 0000000000000b3a R15: 0000000000000b3a
Sep 2 14:57:21 proxmox-003 kernel: [289883.990407] FS: 0000000000000000(0000) GS:ffff8aac1fb40000(0000) knlGS:0000000000000000
Sep 2 14:57:21 proxmox-003 kernel: [289883.992567] CR2: 00007f1ddb2489a0 CR3: 000000157ba0e002 CR4: 00000000001626e0

When Proxmox crashes, some VMs and containers stay up, but the menus are greyed out.

My VMs are on shared storage. I checked and found no network errors.

I don't know where to look to find the problem.

Thank you in advance for any help you can give me.

Version:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-15 (running version: 6.0-15/52b91481)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-4
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-8
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-9
pve-cluster: 6.0-9
pve-container: 3.0-13
pve-docs: 6.0-9
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-8
pve-firmware: 3.0-4
pve-ha-manager: 3.0-5
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-16
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
Hi,

this crash is related to the network stack.
So the potential problem sources are:
  • Network HW
  • Network driver
  • Memory
What network services do you use on the host? NFS/SMB is also network related.
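
One way to narrow down the first two items is to look at the per-interface error counters that the Linux kernel exposes under /sys/class/net/<iface>/statistics/ (the same kind of numbers that `ip -s link` and `ethtool -S <iface>` report). The following is only a minimal Python sketch: the function names and the choice of counters are assumptions made for illustration, and a clean result does not rule out a hardware or driver fault.

```python
from pathlib import Path

# Counter files the Linux kernel exposes per interface in sysfs.
# The selection below is an illustrative choice, not an exhaustive list.
ERROR_COUNTERS = (
    "rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
    "rx_crc_errors", "rx_frame_errors", "rx_fifo_errors",
    "rx_missed_errors", "tx_carrier_errors", "tx_fifo_errors",
)


def read_counters(iface, root="/sys/class/net"):
    """Return {counter_name: value} for one interface.

    Counters that are missing or unreadable are skipped, so an unknown
    interface simply yields an empty dict.
    """
    stats_dir = Path(root) / iface / "statistics"
    counters = {}
    for name in ERROR_COUNTERS:
        try:
            counters[name] = int((stats_dir / name).read_text().strip())
        except (OSError, ValueError):
            continue
    return counters


def nonzero(counters):
    """Keep only the counters that have counted at least one event."""
    return {name: value for name, value in counters.items() if value > 0}


if __name__ == "__main__":
    root = Path("/sys/class/net")
    # Only directories are interfaces; plain files such as bonding_masters are skipped.
    ifaces = sorted(p.name for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    for iface in ifaces:
        bad = nonzero(read_counters(iface))
        print(f"{iface}: {bad if bad else 'no errors or drops counted'}")
```

Counters that keep growing on the storage interface while NFS traffic is running would point toward the cable, switch port, or NIC rather than toward memory.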
 
I use NFS version 3 on a Synology NAS. For storage, I have a dedicated network card with a 10 Gb/s link to a 10 Gb/s switch.

Network HW = hardware?

Memory is good because I was able to connect over SSH during the crash.

From your message, I think it comes from a network problem, but my DRAC doesn't show any problem. Do you know any tools to check the network card hardware?
 
Last night I checked all the disks on my NAS and found a disk with 40,000 errors. Could it be the origin of the errors?
 
Last night I checked all the disks on my NAS and found a disk with 40,000 errors. Could it be the origin of the errors?

Yes, it's very likely if that many errors came up in the SMART check.
 
Hello,

I replaced the disk and am still getting crashes:

Sep 6 15:35:09 proxmox-003 kernel: [346391.027160] RAX: 567b8de2ff600000 RBX: ffff9399d208ae00 RCX: 0000000000000b3a
Sep 6 15:35:09 proxmox-003 kernel: [346391.027669] inet_recvmsg+0x5c/0xd0
Sep 6 15:35:09 proxmox-003 kernel: [346391.027718] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]
Sep 6 15:35:09 proxmox-003 kernel: [346391.027792] xs_stream_data_receive+0x2f5/0x470 [sunrpc]
Sep 6 15:35:09 proxmox-003 kernel: [346391.027854] process_one_work+0x20f/0x410
Sep 6 15:35:09 proxmox-003 kernel: [346391.028678] kthread+0x120/0x140
Sep 6 15:35:09 proxmox-003 kernel: [346391.030268] ? __kthread_parkme+0x70/0x70
Sep 6 15:35:09 proxmox-003 kernel: [346391.031806] Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_recent iptable_nat nf_nat_ipv4 nf_nat xt_comment ipt_REJECT nf_reject_ipv4 xt_addrtype xt_mark iptable_mangle nf_conntrack_ftp nf_conntrack_sane nf_conntrack_tftp nf_conntrack_irc nf_conntrack_sip nf_conntrack_snmp ts_kmp nf_conntrack_amanda nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_h323 nf_conntrack_netbios_ns nf_conntrack_broadcast xt_tcpudp xt_CT xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth mpt3sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter binfmt_misc dell_rbu bonding softdog nfnetlink_log nfnetlink dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 ipmi_ssif ttm drm_kms_helper crct10dif_pclmul crc32_pclmul drm i2c_algo_bit
Sep 6 15:35:09 proxmox-003 kernel: [346391.044926] ---[ end trace e96f6d1058044bd2 ]---
Sep 6 15:35:09 proxmox-003 kernel: [346391.089338] RIP: 0010:memcpy_erms+0x6/0x10
Sep 6 15:35:09 proxmox-003 kernel: [346391.090453] Code: ff ff ff 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
Sep 6 15:35:09 proxmox-003 kernel: [346391.100035] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 6 15:35:09 proxmox-003 kernel: [346391.101111] CR2: 0000000000000030 CR3: 00000003db40e006 CR4: 00000000001626e0
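
The frames that recur in these traces (the TCP receive path and the sunrpc receive worker that NFS uses) can be pulled out of the log mechanically, which makes it easier to compare one crash with the next. The following is a minimal Python sketch, assuming the syslog line layout shown above; the regular expression and the function name are illustrative and not part of any Proxmox tool. Frames prefixed with "?" are ones the kernel's stack unwinder could not verify, so they are flagged as unreliable.

```python
import re

# Matches call-trace lines such as
#   "... kernel: [346391.027718] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]"
# and captures the symbol, the optional "[module]" tag, and the optional "? " marker.
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\?\s+)?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(lines):
    """Return a list of (symbol, module_or_None, reliable) tuples."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["symbol"], m["module"], m["guess"] is None))
    return frames


# Lines copied from the Sep 6 trace above; the last one is a register dump
# and is expected to be ignored.
SAMPLE = [
    "Sep 6 15:35:09 proxmox-003 kernel: [346391.027669] inet_recvmsg+0x5c/0xd0",
    "Sep 6 15:35:09 proxmox-003 kernel: [346391.027718] xs_sock_recvmsg.constprop.32+0x2c/0x50 [sunrpc]",
    "Sep 6 15:35:09 proxmox-003 kernel: [346391.030268] ? __kthread_parkme+0x70/0x70",
    "Sep 6 15:35:09 proxmox-003 kernel: [346391.100035] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033",
]

if __name__ == "__main__":
    for symbol, module, reliable in parse_frames(SAMPLE):
        origin = module or "core kernel"
        marker = "" if reliable else " (unverified frame)"
        print(f"{symbol} [{origin}]{marker}")
```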
 
pveversion -v shows outdated versions. Please update your installation first and see if the problem still occurs.

If you have problems updating, you have probably not configured the repositories yet [0].

[0]: https://pve.proxmox.com/wiki/Package_Repositories
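
After updating and rebooting, it is worth confirming that the node is actually running the new kernel and not just carrying the newer package. The following is a minimal Python sketch that parses saved `pveversion -v` output; it assumes the "name: version (note)" line layout seen in the listing earlier in this thread, and the function name is made up for illustration.

```python
import re

# One "name: version" pair per line, optionally followed by a "(note)".
LINE_RE = re.compile(
    r"^(?P<name>[\w.+-]+):\s+(?P<version>\S+)(?:\s+\((?P<note>.*)\))?\s*$"
)


def parse_pveversion(text):
    """Return ({package: version}, running_kernel_or_None)."""
    packages = {}
    running_kernel = None
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        packages[m["name"]] = m["version"]
        note = m["note"] or ""
        if note.startswith("running kernel:"):
            running_kernel = note.split(":", 1)[1].strip()
    return packages, running_kernel


# Lines taken from the version listing posted above.
SAMPLE = """\
proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-15 (running version: 6.0-15/52b91481)
pve-kernel-5.0.21-5-pve: 5.0.21-10
"""

if __name__ == "__main__":
    packages, kernel = parse_pveversion(SAMPLE)
    print("running kernel:", kernel)
    print("pve-manager:", packages.get("pve-manager"))
```

Comparing the parsed values before and after the update shows at a glance whether the running kernel changed.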