Host crashes doing backup

Mar 12, 2022
11
0
6
69
Hi,

I am running PVE8 with the latest 6.2.16 kernel. Up to this time being very stable. One of the three hosts crashes during backup. I am using the Proxmox Backup Server also at the latest version.
The logs just before the crash are

Sep 23 20:30:00 sv4 pvescheduler[2297021]: <root@pam> starting task UPID:sv4:00230CBE:00CFB210:650F8308:vzdump::root@pam:
Sep 23 20:30:00 sv4 pvescheduler[2297022]: INFO: starting new backup job: vzdump 10190 10020 1018 1004 10170 10210 10010 1008 10180 1010 1011 1020 1009 --storage pbs --compress zstd --mailnotification always --mode snapshot --mailto feisal@qualitylabworks.com --notes-template '{{guestname}}-{{vmid}}' --quiet 1 --prune-backups 'keep-last=28'
Sep 23 20:30:00 sv4 pvescheduler[2297022]: INFO: Starting Backup of VM 1008 (lxc)
Sep 23 20:30:00 sv4 kernel: EXT4-fs (dm-6): mounted filesystem ccd67d72-4662-4ccc-8fa1-7a4cfb4bc7fc with ordered data mode. Quota mode: none.
Sep 23 20:30:01 sv4 pmxcfs[2021]: [status] notice: received log
Sep 23 20:30:03 sv4 pmxcfs[2021]: [status] notice: received log


The kernel is Linux 6.2.16-12-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 and
pre manager version is pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-12-pve)

This crash has been happening consistently over the past few days.
 
Hi,
you can try to SSH from another host to the crashing one, run journalctl -f and trigger/wait for the crash. It might be that the relevant log messages are not persisted to disk and if you are lucky you can see them live.

You could also try booting an older kernel to see whether the issue is present there or not.

Is there anything special about that node compared to the others?
 
Hi,

I switched to an earlier kernel 6.2.16-6-pve and will monitor today during backups. The node is a Dell R830 at the latest BIOS, the other two machines are Lenovo servers.
This is the first time I have had problems. At first I thought of maybe memory errors as I had a bad stick when I first got the server, I ran memtest86 for 12hrs and did not find any errors.
 
I used the 6.2.16-6-pve kernel and one LXC being backed
Hi,
you can try to SSH from another host to the crashing one, run journalctl -f and trigger/wait for the crash. It might be that the relevant log messages are not persisted to disk and if you are lucky you can see them live.

You could also try booting an older kernel to see whether the issue is present there or not.

Is there anything special about that node compared to the others?
I tried the older kernel and one LXC, backups worked and no crash. When I added a second LXC to the backup list, it crashed. I read somewhere there was a bug with backup and too many file handles but cannot find it anywhere. For now I am running VMs that don't require backups.
 
I used the 6.2.16-6-pve kernel and one LXC being backed

I tried the older kernel and one LXC, backups worked and no crash. When I added a second LXC to the backup list, it crashed. I read somewhere there was a bug with backup and too many file handles but cannot find it anywhere. For now I am running VMs that don't require backups.
Do you maybe mean: https://bugzilla.proxmox.com/show_bug.cgi?id=4507 ? That cannot crash the host, but only affect the guest. Did you get any output from the second node running journalctl -f via SSH?
 
Syslog output, as backup starts:
Code:
Nov 01 23:06:29 pve pvestatd[1808]: got timeout
Nov 01 23:06:29 pve pvestatd[1808]: unable to activate storage 'truenas.local' - directory '/mnt/pve/truenas.local' does not exist or is unreachable
Just a guess, but maybe the network storage is overloaded and causes a hang? Hanging NFS mounts can cause a lot of trouble on the host system. I'd also check the load and logs on the storage server during the time of the backup. You could try configuring a bandwidth limit for the backup.

Then the kernel panic (I think):
Code:
Nov 01 23:06:33 pve kernel: ------------[ cut here ]------------
Nov 01 23:06:33 pve kernel: NETDEV WATCHDOG: enp5s0 (igc): transmit queue 2 timed out
Nov 01 23:06:33 pve kernel: WARNING: CPU: 15 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack nft_chain_nat nfs lockd grace fscache netfs xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof i915 snd_sof_utils intel_rapl_msr snd_soc_hdac_hda intel_rapl_common snd_hda_ext_core drm_buddy ttm x86_pkg_temp_thermal snd_soc_acpi_intel_match intel_powerclamp snd_soc_acpi drm_display_helper coretemp cec soundwire_bus rc_core snd_soc_core drm_kms_helper snd_compress kvm_intel ac97_bus i2c_algo_bit kvm snd_pcm_dmaengine irqbypass crct10dif_pclmul
Nov 01 23:06:33 pve kernel:  polyval_clmulni syscopyarea snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi polyval_generic sysfillrect ghash_clmulni_intel snd_usb_audio asus_nb_wmi mei_pxp mei_hdcp asus_wmi snd_usbmidi_lib snd_rawmidi snd_hda_codec sha512_ssse3 snd_hda_core aesni_intel ledtrig_audio snd_hwdep snd_seq_device sysimgblt mei_me mei mc cmdlinepart spi_nor mtd snd_pcm snd_timer pmt_telemetry pmt_class snd soundcore sparse_keymap crypto_simd cryptd rapl intel_cstate intel_vsec platform_profile mac_hid pcspkr wmi_bmof acpi_pad acpi_tad zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbhid hid xhci_pci xhci_pci_renesas nvme crc32_pclmul thunderbolt xhci_hcd igc spi_intel_pci i2c_i801 nvme_core intel_lpss_pci ahci spi_intel i2c_smbus intel_lpss libahci
Nov 01 23:06:33 pve kernel:  nvme_common idma64 video wmi pinctrl_alderlake
Nov 01 23:06:33 pve kernel: CPU: 15 PID: 0 Comm: swapper/15 Tainted: P           O       6.2.16-14-pve #1
Nov 01 23:06:33 pve kernel: Hardware name: ASUS System Product Name/ROG STRIX Z690-I GAMING WIFI, BIOS 0229 09/01/2021
Nov 01 23:06:33 pve kernel: RIP: 0010:dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 60 9a 48 89 c2 e8 56 91 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4804fce38 EFLAGS: 00010246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffff986ad5b42000 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4804fce68 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff986ad5b424c8
Nov 01 23:06:33 pve kernel: R13: ffff986ad5b4241c R14: 0000000000000002 R15: 0000000000000000
Nov 01 23:06:33 pve kernel: FS:  0000000000000000(0000) GS:ffff98724d7c0000(0000) knlGS:0000000000000000
Nov 01 23:06:33 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 01 23:06:33 pve kernel: CR2: 00002a313d54113c CR3: 0000000359c10000 CR4: 0000000000750ee0
Nov 01 23:06:33 pve kernel: PKRU: 55555554
Nov 01 23:06:33 pve kernel: Call Trace:
Nov 01 23:06:33 pve kernel:  <IRQ>
Nov 01 23:06:33 pve kernel:  ? show_regs+0x6d/0x80
Nov 01 23:06:33 pve kernel:  ? __warn+0x89/0x160
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? report_bug+0x17e/0x1b0
Nov 01 23:06:33 pve kernel:  ? irq_work_queue+0x2f/0x70
Nov 01 23:06:33 pve kernel:  ? handle_bug+0x46/0x90
Nov 01 23:06:33 pve kernel:  ? exc_invalid_op+0x18/0x80
Nov 01 23:06:33 pve kernel:  ? asm_exc_invalid_op+0x1b/0x20
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  call_timer_fn+0x29/0x160
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  __run_timers+0x259/0x310
Nov 01 23:06:33 pve kernel:  run_timer_softirq+0x1d/0x40
Nov 01 23:06:33 pve kernel:  __do_softirq+0xd6/0x346
Nov 01 23:06:33 pve kernel:  ? hrtimer_interrupt+0x11f/0x250
Nov 01 23:06:33 pve kernel:  __irq_exit_rcu+0xa2/0xd0
Nov 01 23:06:33 pve kernel:  irq_exit_rcu+0xe/0x20
Nov 01 23:06:33 pve kernel:  sysvec_apic_timer_interrupt+0x92/0xd0
Nov 01 23:06:33 pve kernel:  </IRQ>
Nov 01 23:06:33 pve kernel:  <TASK>
Nov 01 23:06:33 pve kernel:  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Nov 01 23:06:33 pve kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Nov 01 23:06:33 pve kernel: Code: 12 77 66 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4801ebe38 EFLAGS: 00000246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffffd6c47fdc0a00 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4801ebe88 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9b0c3a40
Nov 01 23:06:33 pve kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000e4846dd28a9c
Nov 01 23:06:33 pve kernel:  ? cpuidle_enter_state+0xce/0x6f0
Nov 01 23:06:33 pve kernel:  cpuidle_enter+0x2e/0x50
Nov 01 23:06:33 pve kernel:  do_idle+0x216/0x2a0
Nov 01 23:06:33 pve kernel:  cpu_startup_entry+0x1d/0x20
Nov 01 23:06:33 pve kernel:  start_secondary+0x122/0x160
Nov 01 23:06:33 pve kernel:  secondary_startup_64_no_verify+0xe5/0xeb
Nov 01 23:06:33 pve kernel:  </TASK>
Nov 01 23:06:33 pve kernel: ---[ end trace 0000000000000000 ]---
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Dump
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Name   Value
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL            081c0641
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: STATUS          40280683
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL_EXT        10000040
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: MDIC            180a3800
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: ICR             00000081
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RCTL            0440803a
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDH[0-3]        0000004b 000000ce 000000b0 000000b5
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDT[0-3]        0000004a 000000cd 000000af 000000b4
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RXDCTL[0-3]     02040808 02040808 02040808 02040808
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAL[0-3]      0a475000 0aef7000 0ada4000 1e095000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TCTL            a503f0fa
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAL[0-3]      0aed9000 0e87c000 0e893000 0a474000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDH[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDT[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TXDCTL[0-3]     02100108 02100108 02100108 02100108
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Reset adapter
Nov 01 23:06:33 pve kernel: vmbr0: port 1(enp5s0) entered disabled state
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 3 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 1 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 has no active links
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 has no active links
Nov 01 23:06:36 pve corosync[1787]:   [TOTEM ] Token has not been received in 2737 ms
That seems to be related to the igc driver. I suggest upgrading to a more recent kernel (i.e. >= proxmox-kernel-6.2.16-16-pve) with fixes in that module: https://forum.proxmox.com/threads/kernel-bug-at-lib-dynamic_queue_limits-c-27.130935/post-595565 Could also be that this is the root cause of the issue.