[SOLVED] Node reboot during (local) backup

baerm

Hi everyone,
during a nightly local backup (destination: a ZFS mirror with two slow 5 TB HDDs), we encountered a reboot of the Proxmox node performing the backup, but only once within a week-long period.

proxmox version:
Code:
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-9 (running version: 4.3-9/f7c6f0cd)
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-79
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-80
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1

log file (last lines before reboot):
Code:
Dec  8 02:07:52  kernel: [144239.678991] INFO: task txg_sync:5337 blocked for more than 120 seconds.
Dec  8 02:07:52  kernel: [144239.679014]  Tainted: P  O  4.4.21-1-pve #1
Dec  8 02:07:52  kernel: [144239.679030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  8 02:07:52  kernel: [144239.679052] txg_sync  D ffff882033b8fac8  0  5337  2 0x00000000
Dec  8 02:07:52  kernel: [144239.679056]  ffff882033b8fac8 ffffffff810abc6d ffff881038e8d280 ffff88203416c4c0
Dec  8 02:07:52  kernel: [144239.679058]  ffff882033b90000 ffff88103f9d7180 7fffffffffffffff ffff880e636d6ff8
Dec  8 02:07:52  kernel: [144239.679060]  0000000000000001 ffff882033b8fae0 ffffffff81850555 0000000000000000
Dec  8 02:07:52  kernel: [144239.679061] Call Trace:
Dec  8 02:07:52  kernel: [144239.679069]  [<ffffffff810abc6d>] ? ttwu_do_activate.constprop.89+0x5d/0x70
Dec  8 02:07:52  kernel: [144239.679073]  [<ffffffff81850555>] schedule+0x35/0x80
Dec  8 02:07:52  kernel: [144239.679075]  [<ffffffff81853785>] schedule_timeout+0x235/0x2d0
Dec  8 02:07:52  kernel: [144239.679076]  [<ffffffff810acb02>] ? default_wake_function+0x12/0x20
Dec  8 02:07:52  kernel: [144239.679080]  [<ffffffff810c3a82>] ? __wake_up_common+0x52/0x90
Dec  8 02:07:52  kernel: [144239.679082]  [<ffffffff810f5a0c>] ? ktime_get+0x3c/0xb0
Dec  8 02:07:52  kernel: [144239.679084]  [<ffffffff8184fa4b>] io_schedule_timeout+0xbb/0x140
Dec  8 02:07:52  kernel: [144239.679096]  [<ffffffffc005ccb3>] cv_wait_common+0xb3/0x130 [spl]
Dec  8 02:07:52  kernel: [144239.679098]  [<ffffffff810c4100>] ? wait_woken+0x90/0x90
Dec  8 02:07:52  kernel: [144239.679101]  [<ffffffffc005cd88>] __cv_wait_io+0x18/0x20 [spl]
Dec  8 02:07:52  kernel: [144239.679142]  [<ffffffffc03ef99f>] zio_wait+0x10f/0x1f0 [zfs]
Dec  8 02:07:52  kernel: [144239.679162]  [<ffffffffc037aff8>] dsl_pool_sync+0xb8/0x450 [zfs]
Dec  8 02:07:52  kernel: [144239.679185]  [<ffffffffc0393629>] spa_sync+0x369/0xb20 [zfs]
Dec  8 02:07:52  kernel: [144239.679186]  [<ffffffff810acb02>] ? default_wake_function+0x12/0x20
Dec  8 02:07:52  kernel: [144239.679211]  [<ffffffffc03a6974>] txg_sync_thread+0x3c4/0x610 [zfs]
Dec  8 02:07:52  kernel: [144239.679212]  [<ffffffff810ac6a9>] ? try_to_wake_up+0x49/0x400
Dec  8 02:07:52  kernel: [144239.679235]  [<ffffffffc03a65b0>] ? txg_sync_stop+0xe0/0xe0 [zfs]
Dec  8 02:07:52  kernel: [144239.679239]  [<ffffffffc0057e9a>] thread_generic_wrapper+0x7a/0x90 [spl]
Dec  8 02:07:52  kernel: [144239.679242]  [<ffffffffc0057e20>] ? __thread_exit+0x20/0x20 [spl]
Dec  8 02:07:52  kernel: [144239.679245]  [<ffffffff810a0fba>] kthread+0xea/0x100
Dec  8 02:07:52  kernel: [144239.679246]  [<ffffffff810a0ed0>] ? kthread_park+0x60/0x60
Dec  8 02:07:52  kernel: [144239.679248]  [<ffffffff81854a0f>] ret_from_fork+0x3f/0x70
Dec  8 02:07:52  kernel: [144239.679249]  [<ffffffff810a0ed0>] ? kthread_park+0x60/0x60
Dec  8 02:10:06  kernel: [144373.649848] fwbr5003i0: port 2(tap5003i0) entered disabled state
Dec  8 02:10:06  kernel: [144373.672039] fwbr5003i0: port 1(fwln5003i0) entered disabled state
Dec  8 02:10:06  kernel: [144373.672185] vmbr0: port 10(fwpr5003p0) entered disabled state
Dec  8 02:10:06  kernel: [144373.672353] device fwln5003i0 left promiscuous mode
Dec  8 02:10:06  kernel: [144373.672355] fwbr5003i0: port 1(fwln5003i0) entered disabled state
Dec  8 02:10:06  kernel: [144373.685646] device fwpr5003p0 left promiscuous mode
Dec  8 02:10:06  kernel: [144373.685649] vmbr0: port 10(fwpr5003p0) entered disabled state
Dec  8 02:10:08  qm[34072]: <root@pam> update VM 5004: -lock backup
Dec  8 02:10:09  kernel: [144376.428205] device tap5004i0 entered promiscuous mode
Dec  8 02:10:09  kernel: [144376.448460] device fwln5004i0 entered promiscuous mode
Dec  8 02:10:09  kernel: [144376.448499] fwbr5004i0: port 1(fwln5004i0) entered forwarding state
Dec  8 02:10:09  kernel: [144376.448506] fwbr5004i0: port 1(fwln5004i0) entered forwarding state
Dec  8 02:10:09  kernel: [144376.450811] device fwpr5004p0 entered promiscuous mode
Dec  8 02:10:09  kernel: [144376.450839] vmbr0: port 10(fwpr5004p0) entered forwarding state
Dec  8 02:10:09  kernel: [144376.450850] vmbr0: port 10(fwpr5004p0) entered forwarding state
Dec  8 02:10:09  kernel: [144376.452995] fwbr5004i0: port 2(tap5004i0) entered forwarding state
Dec  8 02:10:09  kernel: [144376.453003] fwbr5004i0: port 2(tap5004i0) entered forwarding state
Dec  8 02:11:02  kernel: [144429.104984] fwbr5004i0: port 2(tap5004i0) entered disabled state
Dec  8 02:11:02  kernel: [144429.123397] fwbr5004i0: port 1(fwln5004i0) entered disabled state
Dec  8 02:11:02  kernel: [144429.123543] vmbr0: port 10(fwpr5004p0) entered disabled state
Dec  8 02:11:02  kernel: [144429.123694] device fwln5004i0 left promiscuous mode
Dec  8 02:11:02  kernel: [144429.123696] fwbr5004i0: port 1(fwln5004i0) entered disabled state
Dec  8 02:11:02  kernel: [144429.141034] device fwpr5004p0 left promiscuous mode
Dec  8 02:11:02  kernel: [144429.141037] vmbr0: port 10(fwpr5004p0) entered disabled state
Dec  8 02:11:04 qm[34435]: <root@pam> update VM 5005: -lock backup

The txg_sync error seems to be ZFS related, but I couldn't find any useful hints as to how that could be the cause of a reboot.

Any help or hints appreciated!

Best regards,

Matthias
 
I see lots of problems like yours on the forums.

You can try the following (a rough command sketch follows the list):
1. Limit the amount of memory ZFS (the ARC) can use at the module level and update the initrd.
2. Reduce swappiness to 10 or even lower via sysctl.
3. Disable swap on ZFS zvols.
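As a rough sketch, those three steps could look like this on the host; the 8 GiB ARC limit and the zvol path are only assumptions, size and name them for your own system:
Code:
# limit the ZFS ARC at the module level (8 GiB here is just an example value)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u

# lower swappiness now and make it persistent across reboots
sysctl -w vm.swappiness=10
echo "vm.swappiness = 10" >> /etc/sysctl.conf

# check /proc/swaps, then disable any swap that lives on a ZFS zvol
# (/dev/zvol/rpool/swap is the usual default on a ZFS install, verify yours)
swapoff /dev/zvol/rpool/swap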

I have moved the swap for the host and the guests to a separate partition/disk with ext4 and kept its usage to a minimum to get a stable ZFS install.
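If it helps, moving the host swap to a plain partition on a non-ZFS disk looks roughly like this; /dev/sdc1 is purely an example device, use whatever spare partition you actually have (the guests' swap has to be reconfigured inside each VM the same way):
Code:
# turn the spare partition into swap space (example device, double-check before running)
mkswap /dev/sdc1
swapon /dev/sdc1

# make it permanent via fstab
echo "/dev/sdc1 none swap sw 0 0" >> /etc/fstab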
 
Thank you very much. I did search the forum, but did not realize the link between the local backup and ZFS swapping. But yes, we also run Proxmox on a ZFS mirror, so swapping could be the issue.
I will try changing the swappiness value (the ZFS ARC is already limited) and see if this helps.

Best regards,
Matthias
 
But most importantly, move swap away from ZFS. Swap on ZFS does not work (without _potentially_ generating high load).
Even if other people say swap on ZFS should or does work, it is often the cause of high load in the real world. I would tell the Proxmox guys to create a separate non-ZFS swap partition in the installer when ZFS is used, but they would not listen.
 
Also, you can use zram-config (e.g. from Ubuntu) to get additional swap that is not on disk.

Have you enabled deduplication on your pool (zpool list)?
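To check, something like this should do; rpool is just the usual default pool name:
Code:
# the DEDUP column reads 1.00x when deduplication is not in use
zpool list
# or query the property directly
zpool get dedup rpool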
 
We ended up disabling swap completely and moving the backup storage away from a ZFS pool to an mdadm RAID. For a few weeks now, there have been no more problems during the backups.
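For reference, a rough sketch of what that kind of switch involves; the device names and mount point below are only examples, not our exact layout:
Code:
# create a two-disk RAID1 array for the backup storage (example devices)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
mkdir -p /mnt/backup
mount /dev/md0 /mnt/backup
# then add the mount point as a directory storage in Proxmox (Datacenter -> Storage)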
 
