Proxmox 4.1 nodes crash after/during backup task

amcmillen

New Member
Apr 15, 2016
I currently have a three-node Proxmox cluster in my lab. The hardware specifications are identical across all nodes, and each runs Proxmox 4.1 with all of the latest updates installed (a reboot was performed post-upgrade).

The hardware specs (per node):
DL380 G7
LSI HBA
2x 300 GB 10K SAS in ZFS RAID 1 (set up as "local" storage)
6x 300 GB 10K SAS for Ceph OSDs (set up for distributed VM storage)
96 GB RAM
Firmware is up to date on all three nodes.

The Ceph cluster is healthy; each node serves OSDs and runs a monitor. I have two VMs running on the nodes, and both live migration and HA appear to work. The combined RAM allocation for the VMs is only 12 GB (about 4% of the cluster's total RAM).
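By "healthy" I mean the usual checks come back clean, roughly along these lines (commands shown only as an illustration of what was verified):
Code:
ceph -s              # HEALTH_OK, all OSDs up/in
ceph osd tree        # OSDs distributed across the three nodes
ha-manager status    # HA resources and quorum look normal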

When I attempt to back up either of the VMs to the "local" storage (Mode: Snapshot, Compression: LZO (fast)), the backup job either stalls at around 40% or completes successfully; in either case, the node performing the backup then reboots itself.

Has anyone experienced this behavior before? At first I suspected bad hardware, but I can reproduce the issue 100% of the time on all three nodes.
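For the record, the equivalent CLI invocation looks roughly like this (the VM ID 100 and the storage name are just examples):
Code:
vzdump 100 --storage local --mode snapshot --compress lzo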
 
Here is some output I managed to get from /var/log/messages on one of the nodes when the crash occurs:
Code:
Apr 19 12:36:49 prox01 kernel: [ 8277.691092] txg_sync        D ffff88181f956a00     0  1726      2 0x00000000
Apr 19 12:36:49 prox01 kernel: [ 8277.691096]  ffff8815b3b8fa18 0000000000000046 ffff880be8c48c80 ffff8817e2f20000
Apr 19 12:36:49 prox01 kernel: [ 8277.691099]  ffff8815b3b8fa48 ffff8815b3b90000 ffff88181f956a00 7fffffffffffffff
Apr 19 12:36:49 prox01 kernel: [ 8277.691101]  ffff8801a8eef988 0000000000000001 ffff8815b3b8fa38 ffffffff81806967
Apr 19 12:36:49 prox01 kernel: [ 8277.691103] Call Trace:
Apr 19 12:36:49 prox01 kernel: [ 8277.691111]  [<ffffffff81806967>] schedule+0x37/0x80
Apr 19 12:36:49 prox01 kernel: [ 8277.691115]  [<ffffffff81809bd1>] schedule_timeout+0x201/0x2a0
Apr 19 12:36:49 prox01 kernel: [ 8277.691121]  [<ffffffff810bd682>] ? __wake_up_common+0x52/0x90
Apr 19 12:36:49 prox01 kernel: [ 8277.691127]  [<ffffffff8101e299>] ? read_tsc+0x9/0x10
Apr 19 12:36:49 prox01 kernel: [ 8277.691131]  [<ffffffff81805f5b>] io_schedule_timeout+0xbb/0x140
Apr 19 12:36:49 prox01 kernel: [ 8277.691170]  [<ffffffffc02b1950>] ? zio_taskq_member.isra.6+0x80/0x80 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691177]  [<ffffffffc0093dea>] cv_wait_common+0xba/0x140 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691179]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Apr 19 12:36:49 prox01 kernel: [ 8277.691184]  [<ffffffffc0093ec8>] __cv_wait_io+0x18/0x20 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691213]  [<ffffffffc02b3bc0>] zio_wait+0x120/0x200 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691234]  [<ffffffffc023c6c8>] dsl_pool_sync+0xb8/0x450 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691257]  [<ffffffffc02577cc>] spa_sync+0x36c/0xb40 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691259]  [<ffffffff810bdd46>] ? autoremove_wake_function+0x16/0x40
Apr 19 12:36:49 prox01 kernel: [ 8277.691284]  [<ffffffffc02694ea>] txg_sync_thread+0x3ea/0x6a0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691287]  [<ffffffff8101e8c9>] ? sched_clock+0x9/0x10
Apr 19 12:36:49 prox01 kernel: [ 8277.691312]  [<ffffffffc0269100>] ? txg_delay+0x180/0x180 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691316]  [<ffffffffc008eeea>] thread_generic_wrapper+0x7a/0x90 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691319]  [<ffffffffc008ee70>] ? __thread_exit+0x20/0x20 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691322]  [<ffffffff8109b1fa>] kthread+0xea/0x100
Apr 19 12:36:49 prox01 kernel: [ 8277.691325]  [<ffffffff8109b110>] ? kthread_create_on_node+0x1f0/0x1f0
Apr 19 12:36:49 prox01 kernel: [ 8277.691327]  [<ffffffff8180af1f>] ret_from_fork+0x3f/0x70
Apr 19 12:36:49 prox01 kernel: [ 8277.691329]  [<ffffffff8109b110>] ? kthread_create_on_node+0x1f0/0x1f0
Apr 19 12:36:49 prox01 kernel: [ 8277.691620] ceph-mon        D ffff880befcd6a00     0  3767   3599 0x00000000
Apr 19 12:36:49 prox01 kernel: [ 8277.691623]  ffff880b79acf958 0000000000000082 ffff880be8c89900 ffff8817e4417080
Apr 19 12:36:49 prox01 kernel: [ 8277.691625]  0000000000000202 ffff880b79ad0000 ffff880be883a220 ffff880be883a248
Apr 19 12:36:49 prox01 kernel: [ 8277.691626]  ffff880be883a370 0000000000000000 ffff880b79acf978 ffffffff81806967
Apr 19 12:36:49 prox01 kernel: [ 8277.691628] Call Trace:
Apr 19 12:36:49 prox01 kernel: [ 8277.691630]  [<ffffffff81806967>] schedule+0x37/0x80
Apr 19 12:36:49 prox01 kernel: [ 8277.691635]  [<ffffffffc0093e39>] cv_wait_common+0x109/0x140 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691637]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
Apr 19 12:36:49 prox01 kernel: [ 8277.691642]  [<ffffffffc0093e85>] __cv_wait+0x15/0x20 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691667]  [<ffffffffc0268f1a>] txg_wait_open+0xba/0x100 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691687]  [<ffffffffc0239748>] ? dsl_dir_tempreserve_clear+0x138/0x150 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691704]  [<ffffffffc0223649>] dmu_tx_wait+0x389/0x3a0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691721]  [<ffffffffc02236fb>] dmu_tx_assign+0x9b/0x4f0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691724]  [<ffffffffc008cb23>] ? spl_kmem_zalloc+0xa3/0x180 [spl]
Apr 19 12:36:49 prox01 kernel: [ 8277.691753]  [<ffffffffc02a5660>] ? zfs_iput_async+0x70/0x70 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691768]  [<ffffffffc02145c3>] dmu_sync_late_arrival+0x53/0x140 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691779]  [<ffffffffc0206dd7>] ? arc_released+0x67/0x90 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691794]  [<ffffffffc0214a2b>] ? dmu_write_policy+0xcb/0x370 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691808]  [<ffffffffc0215037>] dmu_sync+0x367/0x440 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691821]  [<ffffffffc020d678>] ? dbuf_read+0x678/0x850 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691834]  [<ffffffffc020dde2>] ? dbuf_hold_impl+0x82/0xa0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691851]  [<ffffffffc0227cc7>] ? dnode_rele_and_unlock+0x57/0x90 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691880]  [<ffffffffc02a5660>] ? zfs_iput_async+0x70/0x70 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691909]  [<ffffffffc02a596e>] zfs_get_data+0x28e/0x2e0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691938]  [<ffffffffc02ab500>] ? zil_add_block+0x190/0x190 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691967]  [<ffffffffc02ad98b>] zil_commit.part.11+0x5fb/0x7e0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.691996]  [<ffffffffc02adb87>] zil_commit+0x17/0x20 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.692025]  [<ffffffffc02a3b0e>] zfs_fsync+0x7e/0x100 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.692053]  [<ffffffffc02b9a04>] zpl_fsync+0x74/0xb0 [zfs]
Apr 19 12:36:49 prox01 kernel: [ 8277.692057]  [<ffffffff812309fd>] vfs_fsync_range+0x3d/0xb0
Apr 19 12:36:49 prox01 kernel: [ 8277.692058]  [<ffffffff81230acd>] do_fsync+0x3d/0x70
Apr 19 12:36:49 prox01 kernel: [ 8277.692060]  [<ffffffff81230d93>] SyS_fdatasync+0x13/0x20
Apr 19 12:36:49 prox01 kernel: [ 8277.692062]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
 
Is there anything in the logs (messages, syslog and/or daemon.log) about fencing? I am not sure why this occurs, but my guess is that the node reboots because the watchdog timer expires and the node is then fenced.
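Something along these lines should show whether the watchdog/fencing is involved (standard Debian log paths assumed):
Code:
grep -iE 'watchdog|fenc' /var/log/messages /var/log/syslog /var/log/daemon.log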

If my guess is correct, the only remaining question is why the node is that busy while a backup task is running. Can you post your Ceph config? Which version of Ceph are you using (ceph -v)?
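For example (the config path assumes a pveceph-managed cluster; adjust if yours lives elsewhere):
Code:
ceph -v
cat /etc/pve/ceph.conf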
 
The Ceph version in use is 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403).

In further troubleshooting I was able to confirm that the issue has to do with writing the snapshot-mode backup to a ZFS volume. If I attach an external drive formatted as ext4, the same backup completes to it without a crash. This looks very similar to an issue another Proxmox user reported to the ZFS on Linux team about writing a VM dump to ZFS: https://github.com/zfsonlinux/zfs/issues/4263
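A minimal way to compare the two targets outside of vzdump is a large sequential write ending in an fdatasync, something like the following (paths and sizes are just examples):
Code:
dd if=/dev/zero of=/var/lib/vz/dump/testfile bs=1M count=20000 conv=fdatasync    # ZFS-backed "local" dump dir
dd if=/dev/zero of=/mnt/ext4-backup/testfile bs=1M count=20000 conv=fdatasync    # external ext4 drive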

There is nothing in the log regarding fencing.
 
I can't help you with this problem, but I will follow this thread closely, since we are planning to use ZFS for our backups too (we already use Ceph for our VM storage).
 
