Host crashes doing backup

Mar 12, 2022
8
0
1
67
Hi,

I am running PVE8 with the latest 6.2.16 kernel. Up to this time being very stable. One of the three hosts crashes during backup. I am using the Proxmox Backup Server also at the latest version.
The logs just before the crash are

Sep 23 20:30:00 sv4 pvescheduler[2297021]: <root@pam> starting task UPID:sv4:00230CBE:00CFB210:650F8308:vzdump::root@pam:
Sep 23 20:30:00 sv4 pvescheduler[2297022]: INFO: starting new backup job: vzdump 10190 10020 1018 1004 10170 10210 10010 1008 10180 1010 1011 1020 1009 --storage pbs --compress zstd --mailnotification always --mode snapshot --mailto feisal@qualitylabworks.com --notes-template '{{guestname}}-{{vmid}}' --quiet 1 --prune-backups 'keep-last=28'
Sep 23 20:30:00 sv4 pvescheduler[2297022]: INFO: Starting Backup of VM 1008 (lxc)
Sep 23 20:30:00 sv4 kernel: EXT4-fs (dm-6): mounted filesystem ccd67d72-4662-4ccc-8fa1-7a4cfb4bc7fc with ordered data mode. Quota mode: none.
Sep 23 20:30:01 sv4 pmxcfs[2021]: [status] notice: received log
Sep 23 20:30:03 sv4 pmxcfs[2021]: [status] notice: received log


The kernel is Linux 6.2.16-12-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-12 and
pre manager version is pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-12-pve)

This crash has been happening consistently over the past few days.
 
Hi,
you can try to SSH from another host to the crashing one, run journalctl -f and trigger/wait for the crash. It might be that the relevant log messages are not persisted to disk and if you are lucky you can see them live.

You could also try booting an older kernel to see whether the issue is present there or not.

Is there anything special about that node compared to the others?
 
Hi,

I switched to an earlier kernel 6.2.16-6-pve and will monitor today during backups. The node is a Dell R830 at the latest BIOS, the other two machines are Lenovo servers.
This is the first time I have had problems. At first I thought of maybe memory errors as I had a bad stick when I first got the server, I ran memtest86 for 12hrs and did not find any errors.
 
I used the 6.2.16-6-pve kernel and one LXC being backed
Hi,
you can try to SSH from another host to the crashing one, run journalctl -f and trigger/wait for the crash. It might be that the relevant log messages are not persisted to disk and if you are lucky you can see them live.

You could also try booting an older kernel to see whether the issue is present there or not.

Is there anything special about that node compared to the others?
I tried the older kernel and one LXC, backups worked and no crash. When I added a second LXC to the backup list, it crashed. I read somewhere there was a bug with backup and too many file handles but cannot find it anywhere. For now I am running VMs that don't require backups.
 
I used the 6.2.16-6-pve kernel and one LXC being backed

I tried the older kernel and one LXC, backups worked and no crash. When I added a second LXC to the backup list, it crashed. I read somewhere there was a bug with backup and too many file handles but cannot find it anywhere. For now I am running VMs that don't require backups.
Do you maybe mean: https://bugzilla.proxmox.com/show_bug.cgi?id=4507 ? That cannot crash the host, but only affect the guest. Did you get any output from the second node running journalctl -f via SSH?
 
Same thing happening here.

Last backup log shows:
INFO: starting new backup job: vzdump 100 --notes-template '{{guestname}}' --compress zstd --quiet 1 --prune-backups 'keep-last=28' --mode snapshot --storage truenas.local --mailnotification always
INFO: Starting Backup of VM 100 (lxc)
INFO: Backup started at 2023-11-01 22:30:01
INFO: status = running
INFO: CT Name: xxxxxx-backend
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating vzdump archive '/mnt/pve/truenas.local/dump/vzdump-lxc-100-2023_11_01-22_30_00.tar.zst'

A hard reset is required as the hypervisor stops responding.

I don't see any logs being stored for the hypervisor during this time.

Any recommendations at all? We were planning a move to ECC RAM as I'm wondering if the issue is due to non ECC?

Weird thing is this happens sporadically, so once every 4-5 container backup will cause this to happen.

A remote reset brings it back up and the container usually completes the backup task.
 
Syslog output, just before backup:

Code:
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4: AER: Multiple Corrected error received: 0000:00:1d.4
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4:   device [8086:7ab4] error status/mask=00000001/00002000
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4:    [ 0] RxErr                  (First)
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4: AER: Multiple Corrected error received: 0000:00:1d.4
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4:   device [8086:7ab4] error status/mask=00000001/00002000
Nov 01 22:57:43 pve kernel: pcieport 0000:00:1d.4:    [ 0] RxErr                  (First)
Nov 01 22:58:02 pve pvestatd[1808]: status update time (5.927 seconds)
Nov 01 22:58:57 pve nfsidmap[1802584]: nss_getpwnam: name 'root@local' does not map into domain 'xxx.local'
Nov 01 22:58:57 pve nfsidmap[1802586]: nss_name_to_gid: name 'storage@local' does not map into domain 'xxx.local'
Nov 01 22:59:04 pve pvestatd[1808]: status update time (7.356 seconds)
 
Syslog output, as backup starts:
Code:
Nov 01 23:00:03 pve sshd[1803512]: Accepted publickey for root from 192.168.3.30 port 32808 ssh2: 
Nov 01 23:00:03 pve sshd[1803512]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 01 23:00:03 pve systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Nov 01 23:00:03 pve systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 01 23:00:03 pve systemd-logind[1336]: New session 3654 of user root.
Nov 01 23:00:04 pve systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 01 23:00:04 pve systemd[1]: Starting user@0.service - User Manager for UID 0...
Nov 01 23:00:04 pve (systemd)[1803515]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Nov 01 23:00:04 pve systemd[1803515]: Queued start job for default target default.target.
Nov 01 23:00:04 pve systemd[1803515]: Created slice app.slice - User Application Slice.
Nov 01 23:00:04 pve systemd[1803515]: Reached target paths.target - Paths.
Nov 01 23:00:04 pve systemd[1803515]: Reached target timers.target - Timers.
Nov 01 23:00:04 pve systemd[1803515]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Nov 01 23:00:04 pve systemd[1803515]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 01 23:00:04 pve systemd[1803515]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 01 23:00:04 pve systemd[1803515]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 01 23:00:04 pve systemd[1803515]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 01 23:00:04 pve systemd[1803515]: Reached target sockets.target - Sockets.
Nov 01 23:00:04 pve systemd[1803515]: Reached target basic.target - Basic System.
Nov 01 23:00:04 pve systemd[1803515]: Reached target default.target - Main User Target.
Nov 01 23:00:04 pve systemd[1803515]: Startup finished in 286ms.
Nov 01 23:00:04 pve systemd[1]: Started user@0.service - User Manager for UID 0.
Nov 01 23:00:04 pve systemd[1]: Started session-3654.scope - Session 3654 of User root.
Nov 01 23:00:04 pve sshd[1803512]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 01 23:00:04 pve sshd[1803512]: Received disconnect from 192.168.3.30 port 32808:11: disconnected by user
Nov 01 23:00:04 pve sshd[1803512]: Disconnected from user root 192.168.3.30 port 32808
Nov 01 23:00:04 pve sshd[1803512]: pam_unix(sshd:session): session closed for user root
Nov 01 23:00:04 pve systemd[1]: session-3654.scope: Deactivated successfully.
Nov 01 23:00:04 pve systemd-logind[1336]: Session 3654 logged out. Waiting for processes to exit.
Nov 01 23:00:04 pve systemd-logind[1336]: Removed session 3654.
Nov 01 23:00:04 pve sshd[1803538]: Accepted publickey for root from 192.168.3.30 port 32820 ssh2: 
Nov 01 23:00:04 pve sshd[1803538]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 01 23:00:04 pve systemd-logind[1336]: New session 3656 of user root.
Nov 01 23:00:05 pve systemd[1]: Started session-3656.scope - Session 3656 of User root.
Nov 01 23:00:05 pve sshd[1803538]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 01 23:00:05 pve sshd[1803538]: Received disconnect from 192.168.3.30 port 32820:11: disconnected by user
Nov 01 23:00:05 pve sshd[1803538]: Disconnected from user root 192.168.3.30 port 32820
Nov 01 23:00:05 pve sshd[1803538]: pam_unix(sshd:session): session closed for user root
Nov 01 23:00:05 pve systemd[1]: session-3656.scope: Deactivated successfully.
Nov 01 23:00:05 pve systemd-logind[1336]: Session 3656 logged out. Waiting for processes to exit.
Nov 01 23:00:05 pve systemd-logind[1336]: Removed session 3656.
Nov 01 23:00:05 pve sshd[1803545]: Accepted publickey for root from 192.168.3.30 port 39608 ssh2: 
Nov 01 23:00:05 pve sshd[1803545]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 01 23:00:05 pve systemd-logind[1336]: New session 3657 of user root.
Nov 01 23:00:05 pve systemd[1]: Started session-3657.scope - Session 3657 of User root.
Nov 01 23:00:05 pve sshd[1803545]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 01 23:00:05 pve sshd[1803545]: Received disconnect from 192.168.3.30 port 39608:11: disconnected by user
Nov 01 23:00:05 pve sshd[1803545]: Disconnected from user root 192.168.3.30 port 39608
Nov 01 23:00:05 pve sshd[1803545]: pam_unix(sshd:session): session closed for user root
Nov 01 23:00:05 pve systemd[1]: session-3657.scope: Deactivated successfully.
Nov 01 23:00:05 pve systemd-logind[1336]: Session 3657 logged out. Waiting for processes to exit.
Nov 01 23:00:05 pve systemd-logind[1336]: Removed session 3657.
Nov 01 23:00:05 pve sshd[1803663]: Accepted publickey for root from 192.168.3.30 port 39620 ssh2:
Nov 01 23:00:05 pve sshd[1803663]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 01 23:00:05 pve systemd-logind[1336]: New session 3658 of user root.
Nov 01 23:00:06 pve systemd[1]: Started session-3658.scope - Session 3658 of User root.
Nov 01 23:00:06 pve sshd[1803663]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 01 23:00:06 pve sshd[1803663]: Received disconnect from 192.168.3.30 port 39620:11: disconnected by user
Nov 01 23:00:06 pve sshd[1803663]: Disconnected from user root 192.168.3.30 port 39620
Nov 01 23:00:06 pve sshd[1803663]: pam_unix(sshd:session): session closed for user root
Nov 01 23:00:06 pve systemd[1]: session-3658.scope: Deactivated successfully.
Nov 01 23:00:06 pve systemd-logind[1336]: Session 3658 logged out. Waiting for processes to exit.
Nov 01 23:00:06 pve systemd-logind[1336]: Removed session 3658.
Nov 01 23:00:12 pve kernel: pcieport 0000:00:06.0: AER: Corrected error received: 0000:01:00.0
Nov 01 23:00:12 pve kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov 01 23:00:12 pve kernel: nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Nov 01 23:00:12 pve kernel: nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Nov 01 23:00:15 pve pvestatd[1808]: status update time (8.655 seconds)
Nov 01 23:00:16 pve systemd[1]: Stopping user@0.service - User Manager for UID 0...
Nov 01 23:00:16 pve systemd[1803515]: Activating special unit exit.target...
Nov 01 23:00:16 pve systemd[1803515]: Stopped target default.target - Main User Target.
Nov 01 23:00:16 pve systemd[1803515]: Stopped target basic.target - Basic System.
Nov 01 23:00:16 pve systemd[1803515]: Stopped target paths.target - Paths.
Nov 01 23:00:16 pve systemd[1803515]: Stopped target sockets.target - Sockets.
Nov 01 23:00:16 pve systemd[1803515]: Stopped target timers.target - Timers.
Nov 01 23:00:16 pve systemd[1803515]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Nov 01 23:00:16 pve systemd[1803515]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 01 23:00:16 pve systemd[1803515]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 01 23:00:16 pve systemd[1803515]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 01 23:00:16 pve systemd[1803515]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 01 23:00:16 pve systemd[1803515]: Removed slice app.slice - User Application Slice.
Nov 01 23:00:16 pve systemd[1803515]: Reached target shutdown.target - Shutdown.
Nov 01 23:00:16 pve systemd[1803515]: Finished systemd-exit.service - Exit the Session.
Nov 01 23:00:16 pve systemd[1803515]: Reached target exit.target - Exit the Session.
Nov 01 23:00:16 pve systemd[1]: user@0.service: Deactivated successfully.
Nov 01 23:00:16 pve systemd[1]: Stopped user@0.service - User Manager for UID 0.
Nov 01 23:00:16 pve systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 01 23:00:16 pve systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 01 23:00:16 pve systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 01 23:00:16 pve systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 01 23:00:16 pve systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Nov 01 23:00:16 pve systemd[1]: user-0.slice: Consumed 1.067s CPU time.
Nov 01 23:01:26 pve pvestatd[1808]: status update time (9.780 seconds)
Nov 01 23:02:32 pve pvestatd[1808]: status update time (6.373 seconds)
Nov 01 23:06:29 pve pvestatd[1808]: got timeout
Nov 01 23:06:29 pve pvestatd[1808]: unable to activate storage 'truenas.local' - directory '/mnt/pve/truenas.local' does not exist or is unreachable
 
Then the kernel panic (I think):
Code:
Nov 01 23:06:33 pve kernel: ------------[ cut here ]------------
Nov 01 23:06:33 pve kernel: NETDEV WATCHDOG: enp5s0 (igc): transmit queue 2 timed out
Nov 01 23:06:33 pve kernel: WARNING: CPU: 15 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack nft_chain_nat nfs lockd grace fscache netfs xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof i915 snd_sof_utils intel_rapl_msr snd_soc_hdac_hda intel_rapl_common snd_hda_ext_core drm_buddy ttm x86_pkg_temp_thermal snd_soc_acpi_intel_match intel_powerclamp snd_soc_acpi drm_display_helper coretemp cec soundwire_bus rc_core snd_soc_core drm_kms_helper snd_compress kvm_intel ac97_bus i2c_algo_bit kvm snd_pcm_dmaengine irqbypass crct10dif_pclmul
Nov 01 23:06:33 pve kernel:  polyval_clmulni syscopyarea snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi polyval_generic sysfillrect ghash_clmulni_intel snd_usb_audio asus_nb_wmi mei_pxp mei_hdcp asus_wmi snd_usbmidi_lib snd_rawmidi snd_hda_codec sha512_ssse3 snd_hda_core aesni_intel ledtrig_audio snd_hwdep snd_seq_device sysimgblt mei_me mei mc cmdlinepart spi_nor mtd snd_pcm snd_timer pmt_telemetry pmt_class snd soundcore sparse_keymap crypto_simd cryptd rapl intel_cstate intel_vsec platform_profile mac_hid pcspkr wmi_bmof acpi_pad acpi_tad zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbhid hid xhci_pci xhci_pci_renesas nvme crc32_pclmul thunderbolt xhci_hcd igc spi_intel_pci i2c_i801 nvme_core intel_lpss_pci ahci spi_intel i2c_smbus intel_lpss libahci
Nov 01 23:06:33 pve kernel:  nvme_common idma64 video wmi pinctrl_alderlake
Nov 01 23:06:33 pve kernel: CPU: 15 PID: 0 Comm: swapper/15 Tainted: P           O       6.2.16-14-pve #1
Nov 01 23:06:33 pve kernel: Hardware name: ASUS System Product Name/ROG STRIX Z690-I GAMING WIFI, BIOS 0229 09/01/2021
Nov 01 23:06:33 pve kernel: RIP: 0010:dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 60 9a 48 89 c2 e8 56 91 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4804fce38 EFLAGS: 00010246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffff986ad5b42000 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4804fce68 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff986ad5b424c8
Nov 01 23:06:33 pve kernel: R13: ffff986ad5b4241c R14: 0000000000000002 R15: 0000000000000000
Nov 01 23:06:33 pve kernel: FS:  0000000000000000(0000) GS:ffff98724d7c0000(0000) knlGS:0000000000000000
Nov 01 23:06:33 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 01 23:06:33 pve kernel: CR2: 00002a313d54113c CR3: 0000000359c10000 CR4: 0000000000750ee0
Nov 01 23:06:33 pve kernel: PKRU: 55555554
Nov 01 23:06:33 pve kernel: Call Trace:
Nov 01 23:06:33 pve kernel:  <IRQ>
Nov 01 23:06:33 pve kernel:  ? show_regs+0x6d/0x80
Nov 01 23:06:33 pve kernel:  ? __warn+0x89/0x160
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? report_bug+0x17e/0x1b0
Nov 01 23:06:33 pve kernel:  ? irq_work_queue+0x2f/0x70
Nov 01 23:06:33 pve kernel:  ? handle_bug+0x46/0x90
Nov 01 23:06:33 pve kernel:  ? exc_invalid_op+0x18/0x80
Nov 01 23:06:33 pve kernel:  ? asm_exc_invalid_op+0x1b/0x20
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  call_timer_fn+0x29/0x160
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  __run_timers+0x259/0x310
Nov 01 23:06:33 pve kernel:  run_timer_softirq+0x1d/0x40
Nov 01 23:06:33 pve kernel:  __do_softirq+0xd6/0x346
Nov 01 23:06:33 pve kernel:  ? hrtimer_interrupt+0x11f/0x250
Nov 01 23:06:33 pve kernel:  __irq_exit_rcu+0xa2/0xd0
Nov 01 23:06:33 pve kernel:  irq_exit_rcu+0xe/0x20
Nov 01 23:06:33 pve kernel:  sysvec_apic_timer_interrupt+0x92/0xd0
Nov 01 23:06:33 pve kernel:  </IRQ>
Nov 01 23:06:33 pve kernel:  <TASK>
Nov 01 23:06:33 pve kernel:  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Nov 01 23:06:33 pve kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Nov 01 23:06:33 pve kernel: Code: 12 77 66 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4801ebe38 EFLAGS: 00000246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffffd6c47fdc0a00 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4801ebe88 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9b0c3a40
Nov 01 23:06:33 pve kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000e4846dd28a9c
Nov 01 23:06:33 pve kernel:  ? cpuidle_enter_state+0xce/0x6f0
Nov 01 23:06:33 pve kernel:  cpuidle_enter+0x2e/0x50
Nov 01 23:06:33 pve kernel:  do_idle+0x216/0x2a0
Nov 01 23:06:33 pve kernel:  cpu_startup_entry+0x1d/0x20
Nov 01 23:06:33 pve kernel:  start_secondary+0x122/0x160
Nov 01 23:06:33 pve kernel:  secondary_startup_64_no_verify+0xe5/0xeb
Nov 01 23:06:33 pve kernel:  </TASK>
Nov 01 23:06:33 pve kernel: ---[ end trace 0000000000000000 ]---
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Dump
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Name   Value
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL            081c0641
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: STATUS          40280683
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL_EXT        10000040
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: MDIC            180a3800
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: ICR             00000081
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RCTL            0440803a
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDH[0-3]        0000004b 000000ce 000000b0 000000b5
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDT[0-3]        0000004a 000000cd 000000af 000000b4
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RXDCTL[0-3]     02040808 02040808 02040808 02040808
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAL[0-3]      0a475000 0aef7000 0ada4000 1e095000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TCTL            a503f0fa
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAL[0-3]      0aed9000 0e87c000 0e893000 0a474000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDH[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDT[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TXDCTL[0-3]     02100108 02100108 02100108 02100108
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Reset adapter
Nov 01 23:06:33 pve kernel: vmbr0: port 1(enp5s0) entered disabled state
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 3 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 1 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 has no active links
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 has no active links
Nov 01 23:06:36 pve corosync[1787]:   [TOTEM ] Token has not been received in 2737 ms
 
I have obfuscated a few details but that should be relatively straightforward to follow.
 
Syslog output, as backup starts:
Code:
Nov 01 23:06:29 pve pvestatd[1808]: got timeout
Nov 01 23:06:29 pve pvestatd[1808]: unable to activate storage 'truenas.local' - directory '/mnt/pve/truenas.local' does not exist or is unreachable
Just a guess, but maybe the network storage is overloaded and causes a hang? Hanging NFS mounts can cause a lot of trouble on the host system. I'd also check the load and logs on the storage server during the time of the backup. You could try configuring a bandwidth limit for the backup.

Then the kernel panic (I think):
Code:
Nov 01 23:06:33 pve kernel: ------------[ cut here ]------------
Nov 01 23:06:33 pve kernel: NETDEV WATCHDOG: enp5s0 (igc): transmit queue 2 timed out
Nov 01 23:06:33 pve kernel: WARNING: CPU: 15 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack nft_chain_nat nfs lockd grace fscache netfs xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof i915 snd_sof_utils intel_rapl_msr snd_soc_hdac_hda intel_rapl_common snd_hda_ext_core drm_buddy ttm x86_pkg_temp_thermal snd_soc_acpi_intel_match intel_powerclamp snd_soc_acpi drm_display_helper coretemp cec soundwire_bus rc_core snd_soc_core drm_kms_helper snd_compress kvm_intel ac97_bus i2c_algo_bit kvm snd_pcm_dmaengine irqbypass crct10dif_pclmul
Nov 01 23:06:33 pve kernel:  polyval_clmulni syscopyarea snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi polyval_generic sysfillrect ghash_clmulni_intel snd_usb_audio asus_nb_wmi mei_pxp mei_hdcp asus_wmi snd_usbmidi_lib snd_rawmidi snd_hda_codec sha512_ssse3 snd_hda_core aesni_intel ledtrig_audio snd_hwdep snd_seq_device sysimgblt mei_me mei mc cmdlinepart spi_nor mtd snd_pcm snd_timer pmt_telemetry pmt_class snd soundcore sparse_keymap crypto_simd cryptd rapl intel_cstate intel_vsec platform_profile mac_hid pcspkr wmi_bmof acpi_pad acpi_tad zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbhid hid xhci_pci xhci_pci_renesas nvme crc32_pclmul thunderbolt xhci_hcd igc spi_intel_pci i2c_i801 nvme_core intel_lpss_pci ahci spi_intel i2c_smbus intel_lpss libahci
Nov 01 23:06:33 pve kernel:  nvme_common idma64 video wmi pinctrl_alderlake
Nov 01 23:06:33 pve kernel: CPU: 15 PID: 0 Comm: swapper/15 Tainted: P           O       6.2.16-14-pve #1
Nov 01 23:06:33 pve kernel: Hardware name: ASUS System Product Name/ROG STRIX Z690-I GAMING WIFI, BIOS 0229 09/01/2021
Nov 01 23:06:33 pve kernel: RIP: 0010:dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel: Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 60 9a 48 89 c2 e8 56 91 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4804fce38 EFLAGS: 00010246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffff986ad5b42000 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4804fce68 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff986ad5b424c8
Nov 01 23:06:33 pve kernel: R13: ffff986ad5b4241c R14: 0000000000000002 R15: 0000000000000000
Nov 01 23:06:33 pve kernel: FS:  0000000000000000(0000) GS:ffff98724d7c0000(0000) knlGS:0000000000000000
Nov 01 23:06:33 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 01 23:06:33 pve kernel: CR2: 00002a313d54113c CR3: 0000000359c10000 CR4: 0000000000750ee0
Nov 01 23:06:33 pve kernel: PKRU: 55555554
Nov 01 23:06:33 pve kernel: Call Trace:
Nov 01 23:06:33 pve kernel:  <IRQ>
Nov 01 23:06:33 pve kernel:  ? show_regs+0x6d/0x80
Nov 01 23:06:33 pve kernel:  ? __warn+0x89/0x160
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? report_bug+0x17e/0x1b0
Nov 01 23:06:33 pve kernel:  ? irq_work_queue+0x2f/0x70
Nov 01 23:06:33 pve kernel:  ? handle_bug+0x46/0x90
Nov 01 23:06:33 pve kernel:  ? exc_invalid_op+0x18/0x80
Nov 01 23:06:33 pve kernel:  ? asm_exc_invalid_op+0x1b/0x20
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? dev_watchdog+0x23a/0x250
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  call_timer_fn+0x29/0x160
Nov 01 23:06:33 pve kernel:  ? __pfx_dev_watchdog+0x10/0x10
Nov 01 23:06:33 pve kernel:  __run_timers+0x259/0x310
Nov 01 23:06:33 pve kernel:  run_timer_softirq+0x1d/0x40
Nov 01 23:06:33 pve kernel:  __do_softirq+0xd6/0x346
Nov 01 23:06:33 pve kernel:  ? hrtimer_interrupt+0x11f/0x250
Nov 01 23:06:33 pve kernel:  __irq_exit_rcu+0xa2/0xd0
Nov 01 23:06:33 pve kernel:  irq_exit_rcu+0xe/0x20
Nov 01 23:06:33 pve kernel:  sysvec_apic_timer_interrupt+0x92/0xd0
Nov 01 23:06:33 pve kernel:  </IRQ>
Nov 01 23:06:33 pve kernel:  <TASK>
Nov 01 23:06:33 pve kernel:  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Nov 01 23:06:33 pve kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Nov 01 23:06:33 pve kernel: Code: 12 77 66 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
Nov 01 23:06:33 pve kernel: RSP: 0018:ffffb6c4801ebe38 EFLAGS: 00000246
Nov 01 23:06:33 pve kernel: RAX: 0000000000000000 RBX: ffffd6c47fdc0a00 RCX: 0000000000000000
Nov 01 23:06:33 pve kernel: RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
Nov 01 23:06:33 pve kernel: RBP: ffffb6c4801ebe88 R08: 0000000000000000 R09: 0000000000000000
Nov 01 23:06:33 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9b0c3a40
Nov 01 23:06:33 pve kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000e4846dd28a9c
Nov 01 23:06:33 pve kernel:  ? cpuidle_enter_state+0xce/0x6f0
Nov 01 23:06:33 pve kernel:  cpuidle_enter+0x2e/0x50
Nov 01 23:06:33 pve kernel:  do_idle+0x216/0x2a0
Nov 01 23:06:33 pve kernel:  cpu_startup_entry+0x1d/0x20
Nov 01 23:06:33 pve kernel:  start_secondary+0x122/0x160
Nov 01 23:06:33 pve kernel:  secondary_startup_64_no_verify+0xe5/0xeb
Nov 01 23:06:33 pve kernel:  </TASK>
Nov 01 23:06:33 pve kernel: ---[ end trace 0000000000000000 ]---
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Dump
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Register Name   Value
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL            081c0641
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: STATUS          40280683
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: CTRL_EXT        10000040
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: MDIC            180a3800
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: ICR             00000081
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RCTL            0440803a
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDH[0-3]        0000004b 000000ce 000000b0 000000b5
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDT[0-3]        0000004a 000000cd 000000af 000000b4
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RXDCTL[0-3]     02040808 02040808 02040808 02040808
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAL[0-3]      0a475000 0aef7000 0ada4000 1e095000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: RDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TCTL            a503f0fa
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAL[0-3]      0aed9000 0e87c000 0e893000 0a474000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDBAH[0-3]      00000001 00000001 00000001 00000001
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDLEN[0-3]      00001000 00001000 00001000 00001000
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDH[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TDT[0-3]        00000062 000000d2 00000092 00000064
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: TXDCTL[0-3]     02100108 02100108 02100108 02100108
Nov 01 23:06:33 pve kernel: igc 0000:05:00.0 enp5s0: Reset adapter
Nov 01 23:06:33 pve kernel: vmbr0: port 1(enp5s0) entered disabled state
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 3 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] link: host: 1 link: 0 is down
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 3 has no active links
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 01 23:06:35 pve corosync[1787]:   [KNET  ] host: host: 1 has no active links
Nov 01 23:06:36 pve corosync[1787]:   [TOTEM ] Token has not been received in 2737 ms
That seems to be related to the igc driver. I suggest upgrading to a more recent kernel (i.e. >= proxmox-kernel-6.2.16-16-pve) with fixes in that module: https://forum.proxmox.com/threads/kernel-bug-at-lib-dynamic_queue_limits-c-27.130935/post-595565 Could also be that this is the root cause of the issue.
 
I'll give an update a go. See if that fixes it.

NAS does have plenty of performance but that's something I'll investigate too.

Will report back!
 
Oh not using FUSE mounts at all. Just completed an update to the latest kernel. I'll see how we get on. Will report back (if we're good for a week+ I'm happy :D).
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!