Hi everyone,
During a nightly local backup (destination: ZFS mirror with two slow 5 TB HDDs), the Proxmox node running the backup rebooted unexpectedly. So far this has happened only once in a one-week period.
Proxmox version:
Code:
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-9 (running version: 4.3-9/f7c6f0cd)
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-92
pve-firmware: 1.1-10
libpve-common-perl: 4.0-79
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-12
pve-qemu-kvm: 2.7.0-4
pve-container: 1.0-80
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1
Log file (last lines before the reboot):
Code:
Dec 8 02:07:52 kernel: [144239.678991] INFO: task txg_sync:5337 blocked for more than 120 seconds.
Dec 8 02:07:52 kernel: [144239.679014] Tainted: P O 4.4.21-1-pve #1
Dec 8 02:07:52 kernel: [144239.679030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 8 02:07:52 kernel: [144239.679052] txg_sync D ffff882033b8fac8 0 5337 2 0x00000000
Dec 8 02:07:52 kernel: [144239.679056] ffff882033b8fac8 ffffffff810abc6d ffff881038e8d280 ffff88203416c4c0
Dec 8 02:07:52 kernel: [144239.679058] ffff882033b90000 ffff88103f9d7180 7fffffffffffffff ffff880e636d6ff8
Dec 8 02:07:52 kernel: [144239.679060] 0000000000000001 ffff882033b8fae0 ffffffff81850555 0000000000000000
Dec 8 02:07:52 kernel: [144239.679061] Call Trace:
Dec 8 02:07:52 kernel: [144239.679069] [<ffffffff810abc6d>] ? ttwu_do_activate.constprop.89+0x5d/0x70
Dec 8 02:07:52 kernel: [144239.679073] [<ffffffff81850555>] schedule+0x35/0x80
Dec 8 02:07:52 kernel: [144239.679075] [<ffffffff81853785>] schedule_timeout+0x235/0x2d0
Dec 8 02:07:52 kernel: [144239.679076] [<ffffffff810acb02>] ? default_wake_function+0x12/0x20
Dec 8 02:07:52 kernel: [144239.679080] [<ffffffff810c3a82>] ? __wake_up_common+0x52/0x90
Dec 8 02:07:52 kernel: [144239.679082] [<ffffffff810f5a0c>] ? ktime_get+0x3c/0xb0
Dec 8 02:07:52 kernel: [144239.679084] [<ffffffff8184fa4b>] io_schedule_timeout+0xbb/0x140
Dec 8 02:07:52 kernel: [144239.679096] [<ffffffffc005ccb3>] cv_wait_common+0xb3/0x130 [spl]
Dec 8 02:07:52 kernel: [144239.679098] [<ffffffff810c4100>] ? wait_woken+0x90/0x90
Dec 8 02:07:52 kernel: [144239.679101] [<ffffffffc005cd88>] __cv_wait_io+0x18/0x20 [spl]
Dec 8 02:07:52 kernel: [144239.679142] [<ffffffffc03ef99f>] zio_wait+0x10f/0x1f0 [zfs]
Dec 8 02:07:52 kernel: [144239.679162] [<ffffffffc037aff8>] dsl_pool_sync+0xb8/0x450 [zfs]
Dec 8 02:07:52 kernel: [144239.679185] [<ffffffffc0393629>] spa_sync+0x369/0xb20 [zfs]
Dec 8 02:07:52 kernel: [144239.679186] [<ffffffff810acb02>] ? default_wake_function+0x12/0x20
Dec 8 02:07:52 kernel: [144239.679211] [<ffffffffc03a6974>] txg_sync_thread+0x3c4/0x610 [zfs]
Dec 8 02:07:52 kernel: [144239.679212] [<ffffffff810ac6a9>] ? try_to_wake_up+0x49/0x400
Dec 8 02:07:52 kernel: [144239.679235] [<ffffffffc03a65b0>] ? txg_sync_stop+0xe0/0xe0 [zfs]
Dec 8 02:07:52 kernel: [144239.679239] [<ffffffffc0057e9a>] thread_generic_wrapper+0x7a/0x90 [spl]
Dec 8 02:07:52 kernel: [144239.679242] [<ffffffffc0057e20>] ? __thread_exit+0x20/0x20 [spl]
Dec 8 02:07:52 kernel: [144239.679245] [<ffffffff810a0fba>] kthread+0xea/0x100
Dec 8 02:07:52 kernel: [144239.679246] [<ffffffff810a0ed0>] ? kthread_park+0x60/0x60
Dec 8 02:07:52 kernel: [144239.679248] [<ffffffff81854a0f>] ret_from_fork+0x3f/0x70
Dec 8 02:07:52 kernel: [144239.679249] [<ffffffff810a0ed0>] ? kthread_park+0x60/0x60
Dec 8 02:10:06 kernel: [144373.649848] fwbr5003i0: port 2(tap5003i0) entered disabled state
Dec 8 02:10:06 kernel: [144373.672039] fwbr5003i0: port 1(fwln5003i0) entered disabled state
Dec 8 02:10:06 kernel: [144373.672185] vmbr0: port 10(fwpr5003p0) entered disabled state
Dec 8 02:10:06 kernel: [144373.672353] device fwln5003i0 left promiscuous mode
Dec 8 02:10:06 kernel: [144373.672355] fwbr5003i0: port 1(fwln5003i0) entered disabled state
Dec 8 02:10:06 kernel: [144373.685646] device fwpr5003p0 left promiscuous mode
Dec 8 02:10:06 kernel: [144373.685649] vmbr0: port 10(fwpr5003p0) entered disabled state
Dec 8 02:10:08 qm[34072]: <root@pam> update VM 5004: -lock backup
Dec 8 02:10:09 kernel: [144376.428205] device tap5004i0 entered promiscuous mode
Dec 8 02:10:09 kernel: [144376.448460] device fwln5004i0 entered promiscuous mode
Dec 8 02:10:09 kernel: [144376.448499] fwbr5004i0: port 1(fwln5004i0) entered forwarding state
Dec 8 02:10:09 kernel: [144376.448506] fwbr5004i0: port 1(fwln5004i0) entered forwarding state
Dec 8 02:10:09 kernel: [144376.450811] device fwpr5004p0 entered promiscuous mode
Dec 8 02:10:09 kernel: [144376.450839] vmbr0: port 10(fwpr5004p0) entered forwarding state
Dec 8 02:10:09 kernel: [144376.450850] vmbr0: port 10(fwpr5004p0) entered forwarding state
Dec 8 02:10:09 kernel: [144376.452995] fwbr5004i0: port 2(tap5004i0) entered forwarding state
Dec 8 02:10:09 kernel: [144376.453003] fwbr5004i0: port 2(tap5004i0) entered forwarding state
Dec 8 02:11:02 kernel: [144429.104984] fwbr5004i0: port 2(tap5004i0) entered disabled state
Dec 8 02:11:02 kernel: [144429.123397] fwbr5004i0: port 1(fwln5004i0) entered disabled state
Dec 8 02:11:02 kernel: [144429.123543] vmbr0: port 10(fwpr5004p0) entered disabled state
Dec 8 02:11:02 kernel: [144429.123694] device fwln5004i0 left promiscuous mode
Dec 8 02:11:02 kernel: [144429.123696] fwbr5004i0: port 1(fwln5004i0) entered disabled state
Dec 8 02:11:02 kernel: [144429.141034] device fwpr5004p0 left promiscuous mode
Dec 8 02:11:02 kernel: [144429.141037] vmbr0: port 10(fwpr5004p0) entered disabled state
Dec 8 02:11:04 qm[34435]: <root@pam> update VM 5005: -lock backup
The txg_sync message seems to be ZFS-related, but I couldn't find any useful hints on how it could be the cause of a restart.
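To see whether this warning also shows up on nights without a reboot, the syslog could be scanned for hung-task messages. The following is only a rough sketch, assuming the log lines look like the ones above. The function name `find_hung_tasks` and the regular expression are my own and not part of any Proxmox or ZFS tool.

```python
import re

# Matches the kernel's hung-task warning, for example:
# "INFO: task txg_sync:5337 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, seconds) for each hung-task warning found.

    `lines` is any iterable of log lines, e.g. an open file object.
    (Hypothetical helper, written for illustration only.)
    """
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"),
                 int(match.group("pid")),
                 int(match.group("secs")))
            )
    return hits
```

It could be called with an open log file, for example `find_hung_tasks(open("/var/log/syslog", errors="replace"))`. The path is an assumption and depends on where the node writes its kernel messages.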
Any help or hints appreciated!
Best regards,
Matthias