I have two Proxmox servers I am testing for clients that regularly crash while doing their respective nightly backups. The boxes are nearly identical, both with 32 GB of RAM and only one or two VMs per machine. There is an SSD dedicated to the swap file, which I added as a troubleshooting step. Backups go to a USB3 external drive.
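For context, the nightly job is the standard scheduled Proxmox backup; I assume it amounts to roughly this vzdump call (the VMIDs, mode, and storage name below are placeholders, not copied from the actual job):
Code:
# Roughly what the nightly job runs (illustrative only; the real VMIDs, mode
# and target storage come from the scheduled backup configuration)
vzdump 100 --mode snapshot --compress lzo --storage usb-backup --mailto root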
Letting the backup run to local storage will always crash the system. Backing up to the USB drive might give me a week of uptime before it eventually crashes.
Often the VMs need to be restarted by hand afterwards by running:
qm stop 100
qm unlock 100
qm start 100
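For completeness, the manual recovery boils down to something like this (a rough sketch; the VMID list is just a placeholder for whichever guests are stuck or locked that night):
Code:
#!/bin/bash
# Stop, unlock, and restart each VM that got stuck after the crash.
# VMIDs here are placeholders; same qm commands as listed above.
for vmid in 100; do
    qm stop "$vmid"
    qm unlock "$vmid"
    qm start "$vmid"
done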
Thoughts? If there are additional configs or output that would be helpful, please let me know.
Recorded from kern.log
Code:
Sep 12 21:51:26 ccikvm01 kernel: [777529.260996] INFO: task txg_sync:1399 blocked for more than 120 seconds.
Sep 12 21:51:26 ccikvm01 kernel: [777529.261007] Tainted: P O 4.4.15-1-pve #1
Sep 12 21:51:26 ccikvm01 kernel: [777529.261021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 21:51:26 ccikvm01 kernel: [777529.261033] txg_sync D ffff88083f3c3aa8 0 1399 2 0x00000000
Sep 12 21:51:26 ccikvm01 kernel: [777529.261050] ffff88083f3c3aa8 ffff88083f3c3a88 ffff88084b688ec0 ffff8808402d6740
Sep 12 21:51:26 ccikvm01 kernel: [777529.261051] ffff88083f3c4000 ffff88086fc97180 7fffffffffffffff ffff8807beeb4c28
Sep 12 21:51:26 ccikvm01 kernel: [777529.261052] 0000000000000001 ffff88083f3c3ac0 ffffffff8184d945 0000000000000000
Sep 12 21:51:26 ccikvm01 kernel: [777529.261054] Call Trace:
Sep 12 21:51:26 ccikvm01 kernel: [777529.261058] [<ffffffff8184d945>] schedule+0x35/0x80
Sep 12 21:51:26 ccikvm01 kernel: [777529.261059] [<ffffffff81850b85>] schedule_timeout+0x235/0x2d0
Sep 12 21:51:26 ccikvm01 kernel: [777529.261062] [<ffffffff810b4d61>] ? wakeup_preempt_entity.isra.58+0x41/0x50
Sep 12 21:51:26 ccikvm01 kernel: [777529.261064] [<ffffffff8102d736>] ? __switch_to+0x256/0x5c0
Sep 12 21:51:26 ccikvm01 kernel: [777529.261066] [<ffffffff8184ce3b>] io_schedule_timeout+0xbb/0x140
Sep 12 21:51:26 ccikvm01 kernel: [777529.261071] [<ffffffffc00c7d7c>] cv_wait_common+0xbc/0x140 [spl]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261073] [<ffffffff810c4000>] ? wait_woken+0x90/0x90
Sep 12 21:51:26 ccikvm01 kernel: [777529.261076] [<ffffffffc00c7e58>] __cv_wait_io+0x18/0x20 [spl]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261114] [<ffffffffc022cc50>] zio_wait+0x120/0x200 [zfs]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261130] [<ffffffffc01b5eb8>] dsl_pool_sync+0xb8/0x440 [zfs]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261147] [<ffffffffc01cecb9>] spa_sync+0x369/0xb30 [zfs]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261149] [<ffffffff810ac9f2>] ? default_wake_function+0x12/0x20
Sep 12 21:51:26 ccikvm01 kernel: [777529.261168] [<ffffffffc01e2a74>] txg_sync_thread+0x3e4/0x6a0 [zfs]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261169] [<ffffffff810ac599>] ? try_to_wake_up+0x49/0x400
Sep 12 21:51:26 ccikvm01 kernel: [777529.261187] [<ffffffffc01e2690>] ? txg_sync_stop+0xf0/0xf0 [zfs]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261190] [<ffffffffc00c2e9a>] thread_generic_wrapper+0x7a/0x90 [spl]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261192] [<ffffffffc00c2e20>] ? __thread_exit+0x20/0x20 [spl]
Sep 12 21:51:26 ccikvm01 kernel: [777529.261193] [<ffffffff810a0eba>] kthread+0xea/0x100
Sep 12 21:51:26 ccikvm01 kernel: [777529.261194] [<ffffffff810a0dd0>] ? kthread_park+0x60/0x60
Sep 12 21:51:26 ccikvm01 kernel: [777529.261196] [<ffffffff81851e0f>] ret_from_fork+0x3f/0x70
Sep 12 21:51:26 ccikvm01 kernel: [777529.261197] [<ffffffff810a0dd0>] ? kthread_park+0x60/0x60
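From what I can tell, the trace shows the ZFS txg_sync thread stuck waiting on I/O (zio_wait) while the backup is writing. If more data would help, this is roughly what I can capture during the next backup window (a sketch; commands assume the stock tooling on PVE 4.x):
Code:
# Pool-level I/O while the backup is writing
zpool iostat -v rpool 5

# Memory / swap pressure alongside it
vmstat 5

# Any further hung-task reports afterwards
dmesg -T | grep -i 'blocked for more than'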
system info:
Code:
root@ccikvm01:/var/log# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  2.72T  1.66T  1.05T         -    49%    61%  1.00x  ONLINE  -
root@ccikvm01:/var/log# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0

errors: No known data errors
root@ccikvm01:/var/log# free
             total       used       free     shared    buffers     cached
Mem:      32904992   30333112    2571880      55980       1792     105504
-/+ buffers/cache:   30225816    2679176
Swap:     33554428          0   33554428
root@ccikvm01:/var/log#
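One thing that stands out to me in the free output: nearly all 32 GB shows as used with almost no page cache, which I assume is mostly the ZFS ARC (it is not counted under buffers/cache). A sketch of how I could confirm that and, if it does turn out to be the problem, cap it (zfs_arc_max is the standard ZFS-on-Linux module parameter; the 8 GiB figure is only an example):
Code:
# Current ARC size and ceiling, in bytes
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# Example: cap the ARC at 8 GiB (value is illustrative, pick per workload)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u   # rebuild the initramfs so the limit applies from boot (ZFS root)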
pveversion -v
Code:
proxmox-ve: 4.2-60 (running kernel: 4.4.15-1-pve)
pve-manager: 4.2-17 (running version: 4.2-17/e1400248)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.15-1-pve: 4.4.15-60
lvm2: 2.02.116-pve2
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-43
qemu-server: 4.0-85
pve-firmware: 1.1-8
libpve-common-perl: 4.0-72
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-56
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6-1
pve-container: 1.0-72
pve-firewall: 2.0-29
pve-ha-manager: 1.0-33
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.3-4
lxcfs: 2.0.2-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80