'kernel: Call Trace:' on my PVE NAS when running a backup job for my other node

  • Both nodes are in a cluster, which seems to work fine.
  • I have configured a backup from my main node to my NAS node in 'Datacentre'.
  • Both nodes are running 'Linux 6.2.16-6-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-7 (2023-08-01T11:23Z)'.
Recently, the backup job has started throwing up the following (up to now it has worked without fault):

I use TurnKey fileserver on my NAS node, and the SMB shares (I have three) randomly go offline. At the same time the syslog fills with zed delay entries for the ZFS pool my backup job uses, involving every disk in that pool. Both ZFS pools on the NAS node show no faults and look to be OK; I have scrubbed both multiple times, but the same problem persists.
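
For reference, this is roughly how I have been checking the pools and the zed events (the pool name below is just a placeholder for my redacted pool):

Code:
# overall pool health and per-device error counters
zpool status -v tank
# recent ZFS events, including the delay/deadman entries that zed reports
zpool events -v | tail -n 50
# start another scrub if needed
zpool scrub tank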

The backup job for all my VMs & containers completes successfully, but there is something wrong somewhere.

Code:
Aug 18 11:46:58 nas kernel:  <TASK>
Aug 18 11:46:58 nas kernel:  __schedule+0x402/0x1510
Aug 18 11:46:58 nas kernel:  ? try_to_unlazy+0x60/0xe0
Aug 18 11:46:58 nas kernel:  ? terminate_walk+0x65/0x100
Aug 18 11:46:58 nas kernel:  ? path_parentat+0x49/0x90
Aug 18 11:46:58 nas kernel:  schedule+0x63/0x110
Aug 18 11:46:58 nas kernel:  rwsem_down_write_slowpath+0x373/0x710
Aug 18 11:46:58 nas kernel:  down_write+0x65/0x90
Aug 18 11:46:58 nas kernel:  do_unlinkat+0x1df/0x310
Aug 18 11:46:58 nas kernel:  __x64_sys_unlink+0x42/0x70
Aug 18 11:46:58 nas kernel:  do_syscall_64+0x5b/0x90
Aug 18 11:46:58 nas kernel:  ? syscall_exit_to_user_mode+0x29/0x50
Aug 18 11:46:58 nas kernel:  ? do_syscall_64+0x67/0x90
Aug 18 11:46:58 nas kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 18 11:46:58 nas kernel: RIP: 0033:0x7f81c7c2a0c7
Aug 18 11:46:58 nas kernel: RSP: 002b:00007ffe4c95c9e8 EFLAGS: 00000297 ORIG_RAX: 0000000000000057
Aug 18 11:46:58 nas kernel: RAX: ffffffffffffffda RBX: 0000556b88f1e280 RCX: 00007f81c7c2a0c7
Aug 18 11:46:58 nas kernel: RDX: 0000000000000000 RSI: 00007f81c733baa5 RDI: 00007ffe4c95c9f0
Aug 18 11:46:58 nas kernel: RBP: 00007ffe4c95c9f0 R08: 0000000000000000 R09: 00007ffe4c95c870
Aug 18 11:46:58 nas kernel: R10: 00000000000027a2 R11: 0000000000000297 R12: 00007f81c733b870
Aug 18 11:46:58 nas kernel: R13: 0000556b88f17eb0 R14: 00007f81c7d63070 R15: 0000556b88f1e280
Aug 18 11:46:58 nas kernel:  </TASK>
Aug 18 11:46:58 nas kernel: INFO: task smbd:172788 blocked for more than 120 seconds.
Aug 18 11:46:58 nas kernel:       Tainted: P           O       6.2.16-6-pve #1
Aug 18 11:46:58 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 11:46:58 nas kernel: task:smbd            state:D stack:0     pid:172788 ppid:7398   flags:0x00000000
Aug 18 11:46:58 nas kernel: Call Trace:
Aug 18 11:46:58 nas kernel:  <TASK>
Aug 18 11:46:58 nas kernel:  __schedule+0x402/0x1510
Aug 18 11:46:58 nas kernel:  ? try_to_unlazy+0x60/0xe0
Aug 18 11:46:58 nas kernel:  ? terminate_walk+0x65/0x100
Aug 18 11:46:58 nas kernel:  ? path_parentat+0x49/0x90
Aug 18 11:46:58 nas kernel:  schedule+0x63/0x110
Aug 18 11:46:58 nas kernel:  rwsem_down_write_slowpath+0x373/0x710
Aug 18 11:46:58 nas kernel:  down_write+0x65/0x90
Aug 18 11:46:58 nas kernel:  do_unlinkat+0x1df/0x310
Aug 18 11:46:58 nas kernel:  __x64_sys_unlink+0x42/0x70
Aug 18 11:46:58 nas kernel:  do_syscall_64+0x5b/0x90
Aug 18 11:46:58 nas kernel:  ? exit_to_user_mode_prepare+0x39/0x190
Aug 18 11:46:58 nas kernel:  ? syscall_exit_to_user_mode+0x29/0x50
Aug 18 11:46:58 nas kernel:  ? do_syscall_64+0x67/0x90
Aug 18 11:46:58 nas kernel:  ? exit_to_user_mode_prepare+0x39/0x190
Aug 18 11:46:58 nas kernel:  ? irqentry_exit_to_user_mode+0x9/0x20
Aug 18 11:46:58 nas kernel:  ? irqentry_exit+0x43/0x50
Aug 18 11:46:58 nas kernel:  ? exc_page_fault+0x91/0x1b0
Aug 18 11:46:58 nas kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 18 11:46:58 nas kernel: RIP: 0033:0x7f81c7c2a0c7
Aug 18 11:46:58 nas kernel: RSP: 002b:00007ffe4c95c9e8 EFLAGS: 00000297 ORIG_RAX: 0000000000000057
Aug 18 11:46:58 nas kernel: RAX: ffffffffffffffda RBX: 0000556b88f1e280 RCX: 00007f81c7c2a0c7
Aug 18 11:46:58 nas kernel: RDX: 0000000000000000 RSI: 00007f81c733baa5 RDI: 00007ffe4c95c9f0
Aug 18 11:46:58 nas kernel: RBP: 00007ffe4c95c9f0 R08: 0000000000000000 R09: 00007ffe4c95c870
Aug 18 11:46:58 nas kernel: R10: 00000000000027a3 R11: 0000000000000297 R12: 00007f81c733b870
Aug 18 11:46:58 nas kernel: R13: 0000556b88f17eb0 R14: 00007f81c7d63070 R15: 0000556b88f1e280
Aug 18 11:46:58 nas kernel:  </TASK>
 
Update: I get a lot of the following when the backup job is running:

Code:
Aug 18 12:35:22 nas zed[219055]: eid=2597 class=deadman pool=[redacted] vdev=[redacted] size=40960 offset=149106790400 priority=3 err=0 flags=0x180880 delay=10638111ms bookmark=1157:3955:0:24002
Aug 18 12:35:22 nas zed[219058]: eid=2598 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149106745344 priority=3 err=0 flags=0x180880 delay=10638074ms bookmark=1157:3955:0:24001
Aug 18 12:35:22 nas zed[219061]: eid=2599 class=deadman pool=[redacted] vdev=[redacted] size=40960 offset=149106704384 priority=3 err=0 flags=0x180880 delay=10638178ms bookmark=1157:3955:0:24000
Aug 18 12:35:22 nas zed[219064]: eid=2600 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149106700288 priority=3 err=0 flags=0x180880 delay=10638178ms bookmark=1157:3955:0:24000
Aug 18 12:35:22 nas zed[219068]: eid=2601 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149106700288 priority=3 err=0 flags=0x180880 delay=335ms bookmark=1157:3955:0:24000
Aug 18 12:35:22 nas zed[219069]: eid=2602 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149106659328 priority=3 err=0 flags=0x180880 delay=10638166ms bookmark=1157:3955:0:23999
Aug 18 12:35:22 nas zed[219071]: eid=2603 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149106655232 priority=3 err=0 flags=0x180880 delay=10638087ms bookmark=1157:3955:0:23999
Aug 18 12:35:22 nas zed[219073]: eid=2604 class=deadman pool=[redacted] vdev=[redacted] size=40960 offset=149106393088 priority=3 err=0 flags=0x180880 delay=337ms bookmark=1157:3955:0:23993
Aug 18 12:35:29 nas pvestatd[3981]: status update time (16.504 seconds)
Aug 18 12:35:39 nas pvestatd[3981]: storage '[redacted]' is not online
Aug 18 12:35:47 nas pvestatd[3981]: status update time (18.537 seconds)
Aug 18 12:36:01 nas pvestatd[3981]: status update time (13.493 seconds)
Aug 18 12:36:11 nas pvestatd[3981]: status update time (10.748 seconds)
Aug 18 12:36:21 nas pvestatd[3981]: status update time (9.204 seconds)
Aug 18 12:36:23 nas zed[2694]: Missed 19 events
Aug 18 12:36:23 nas zed[2694]: Bumping queue length to 2147483647
Aug 18 12:36:23 nas zed[2694]: Missed 66 events
Aug 18 12:36:23 nas zed[2694]: Bumping queue length to 2147483647
Aug 18 12:36:23 nas zed[2694]: Missed 5 events
Aug 18 12:36:23 nas zed[2694]: Bumping queue length to 2147483647
Aug 18 12:36:23 nas zed[219836]: eid=2605 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149281435648 priority=3 err=0 flags=0x380880 bookmark=1157:3955:0:27950
Aug 18 12:36:23 nas zed[219840]: eid=2606 class=deadman pool=[redacted] vdev=[redacted] size=176128 offset=149281349632 priority=3 err=0 flags=0x40080c80 delay=10699427ms
Aug 18 12:36:23 nas zed[219843]: eid=2607 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149280022528 priority=3 err=0 flags=0x380880 bookmark=1157:3955:0:27918
Aug 18 12:36:23 nas zed[219846]: eid=2608 class=deadman pool=[redacted] vdev=[redacted] size=442368 offset=149279625216 priority=3 err=0 flags=0x40080c80 delay=863ms
Aug 18 12:36:23 nas zed[219852]: eid=2610 class=deadman pool=[redacted] vdev=[redacted] size=176128 offset=149281349632 priority=3 err=0 flags=0x40080c80 delay=10699427ms
Aug 18 12:36:23 nas zed[219855]: eid=2611 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149279977472 priority=3 err=0 flags=0x380880 bookmark=1157:3955:0:27917
Aug 18 12:36:23 nas zed[219850]: eid=2609 class=deadman pool=[redacted] vdev=[redacted] size=40960 offset=149281394688 priority=3 err=0 flags=0x380880 bookmark=1157:3955:0:27949
Aug 18 12:36:23 nas zed[219857]: eid=2612 class=deadman pool=[redacted] vdev=[redacted] size=442368 offset=149279625216 priority=3 err=0 flags=0x40080c80 delay=863ms
Aug 18 12:36:23 nas zed[219862]: eid=2614 class=deadman pool=[redacted] vdev=[redacted] size=176128 offset=149281349632 priority=3 err=0 flags=0x40080c80 delay=10699427ms
Aug 18 12:36:23 nas zed[219863]: eid=2613 class=deadman pool=[redacted] vdev=[redacted] size=45056 offset=149281349632 priority=3 err=0 flags=0x380880 bookmark=1157:3955:0:27948
 
Hello,

Thank you for the outputs!

I would check whether there is high I/O during the backup, either in your monitoring or in the Proxmox VE WebUI -> Datacenter -> {NodeName} -> Summary -> CPU usage. If yes, I would set a bandwidth limit. You can set the bandwidth limit in the `/etc/vzdump.conf` file.
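
As a rough sketch of what that could look like (the 51200 KiB/s value is only an example, pick whatever suits your hardware):

Code:
# /etc/vzdump.conf
# limit vzdump I/O bandwidth; the value is in KiB/s (51200 KiB/s is about 50 MiB/s)
bwlimit: 51200

A per-job limit (the `--bwlimit` option of vzdump, or the bandwidth limit field in the backup job's advanced options if your PVE version exposes it there) takes precedence over the default in `/etc/vzdump.conf`.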
 
