ZFS rollback stuck in 'D' state

Hello,

I've recently run into an issue where rolling back a snapshot never completes; the process is stuck in the 'D' state:
Code:
root     26279  0.0  0.0  35188  3024 ?        D    09:50   0:00 zfs rollback rpool/data/subvol-104-disk-1@initialsetup

I'm trying to figure out what might have caused this, as I've done this multiple times on this node without issues. The problem only seems to affect this CT. There are no errors in dmesg or journalctl. Is there anywhere else I can check?


Here is some information:
Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.17-1-pve)
pve-manager: 5.2-2 (running version: 5.2-2/b1d1c7f4)
pve-kernel-4.15: 5.2-3
pve-kernel-4.13: 5.1-45
pve-kernel-4.15.17-3-pve: 4.15.17-13
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-3-pve: 4.13.16-49
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.13.4-1-pve: 4.13.4-26
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-33
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-28
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

Code:
Name             Type     Status           Total            Used       Available        %
nas-1            dir     active      7779824880      3327663864      4452161016   42.77%
local             dir     active      1318379648         5557632      1312822016    0.42%
local-zfs     zfspool     active      1679255580       366433528      1312822052   21.82%

Code:
[2018-08-20 08:45:46]  audit: type=1400 audit(1534768746.657:4687): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=20144 comm="(ionclean)" flags="rw, rslave"
[2018-08-20 09:15:45]  audit: type=1400 audit(1534770546.099:4688): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=1748 comm="(ionclean)" flags="rw, rslave"
[2018-08-20 09:15:46]  audit: type=1400 audit(1534770546.855:4689): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=1849 comm="(ionclean)" flags="rw, rslave"
[2018-08-20 09:45:46]  audit: type=1400 audit(1534772346.284:4690): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=18402 comm="(ionclean)" flags="rw, rslave"
[2018-08-20 09:45:46]  audit: type=1400 audit(1534772347.044:4691): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=18707 comm="(ionclean)" flags="rw, rslave"
[2018-08-20 09:50:54]  audit: type=1400 audit(1534772654.302:4692): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=4997 comm="(pachectl)" flags="rw, rslave"
[2018-08-20 09:50:54]  audit: type=1400 audit(1534772654.806:4693): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=5004 comm="(pachectl)" flags="rw, rslave"
[2018-08-20 09:53:23]  vmbr1: port 7(veth104i0) entered disabled state
[2018-08-20 09:53:23]  device veth104i0 left promiscuous mode

Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h57m with 0 errors on Sun Aug 12 01:21:05 2018
config:

    NAME                              STATE     READ WRITE CKSUM
    rpool                             ONLINE       0     0     0
      mirror-0                        ONLINE       0     0     0
        wwn-0x5000cca05936d10c-part2  ONLINE       0     0     0
        wwn-0x5000cca05934a518-part2  ONLINE       0     0     0
      mirror-1                        ONLINE       0     0     0
        wwn-0x5000cca059359884-part2  ONLINE       0     0     0
        wwn-0x5000cca0593405a8-part2  ONLINE       0     0     0
      mirror-2                        ONLINE       0     0     0
        wwn-0x5000cca059370f48-part2  ONLINE       0     0     0
        wwn-0x5000cca0593530d4-part2  ONLINE       0     0     0
    logs
      mirror-3                        ONLINE       0     0     0
        wwn-0x5002538050002de5-part2  ONLINE       0     0     0
        wwn-0x500253805001ed92-part2  ONLINE       0     0     0
    cache
      wwn-0x5002538050002de5-part3    ONLINE       0     0     0
      wwn-0x500253805001ed92-part3    ONLINE       0     0     0

errors: No known data errors

Code:
rpool/data/subvol-104-disk-1  1.24G  23.8G  1.18G  /rpool/data/subvol-104-disk-1

Code:
Aug 20 09:40:11 dev-proxmox-2 zed: eid=1654565 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:11 dev-proxmox-2 zed: eid=1654566 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:11 dev-proxmox-2 zed: eid=1654567 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:11 dev-proxmox-2 zed: eid=1654568 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:11 dev-proxmox-2 zed: eid=1654569 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:12 dev-proxmox-2 systemd[1]: Started Session 1502302 of user root.
Aug 20 09:40:13 dev-proxmox-2 zed: eid=1654570 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:13 dev-proxmox-2 ntpd[5566]: Soliciting pool server 144.217.181.221
Aug 20 09:40:17 dev-proxmox-2 systemd[1]: Started Session 1502303 of user root.
Aug 20 09:40:18 dev-proxmox-2 systemd[1]: Started Session 1502304 of user root.
Aug 20 09:40:20 dev-proxmox-2 zed: eid=1654571 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:20 dev-proxmox-2 zed: eid=1654572 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:20 dev-proxmox-2 zed: eid=1654573 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:20 dev-proxmox-2 zed: eid=1654574 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:20 dev-proxmox-2 zed: eid=1654575 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:20 dev-proxmox-2 systemd[1]: Started Session 1502305 of user root.
Aug 20 09:40:22 dev-proxmox-2 zed: eid=1654576 class=history_event pool_guid=0x82EE2CB119FC19AD
Aug 20 09:40:22 dev-proxmox-2 systemd[1]: Started Session 1502306 of user root.
Aug 20 09:40:25 dev-proxmox-2 systemd[1]: Started Session 1502307 of user root.

Code:
              total        used        free      shared  buff/cache   available
Mem:            125          58          24           0          42          66
Swap:             7           0           7
 
I haven't seen that happen before. Can you send a kill signal to it (not -9) and see if it will gracefully quit?
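For example (assuming 26279 is still the PID of the stuck rollback; a plain kill sends SIGTERM rather than SIGKILL):
Code:
kill 26279        # sends SIGTERM (signal 15) by default
# or explicitly:
kill -TERM 26279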
 
There have been two ZFS mount/umount deadlock fixes in the kernel modules that are not yet included in the kernel you are running. I suggest upgrading and, if the issue triggers again, enabling ZFS debug messages (zfs_dbgmsg) and/or collecting kernel stack traces.
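A rough sketch of how that could be done on the affected node (module parameter and kstat paths as in ZFS on Linux 0.7/0.8; the PID placeholder is whatever ps shows for the stuck rollback):
Code:
# enable the internal ZFS debug log, reproduce the hang, then read the log
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg

# kernel stack of the stuck process
cat /proc/<pid-of-zfs-rollback>/stack

# dump stacks of all blocked (D-state) tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200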
 
Same issue in 6.0!
Code:
root@pve2:~# pveversion
pve-manager/6.0-5/f8a710d7 (running kernel: 5.0.15-1-pve)

There were some messages in dmesg on Friday:
Code:
[794123.204085] INFO: task kworker/u82:3:21616 blocked for more than 120 seconds.
[794123.204122]       Tainted: P           O      5.0.15-1-pve #1
[794123.204138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[794123.204160] kworker/u82:3   D    0 21616      2 0x80000000
[794123.204169] Workqueue: writeback wb_workfn (flush-zfs-20)
[794123.204170] Call Trace:
[794123.204178]  __schedule+0x2d4/0x870
[794123.204180]  schedule+0x2c/0x70
[794123.204184]  io_schedule+0x16/0x40
[794123.204187]  __lock_page+0x122/0x220
[794123.204189]  ? file_check_and_advance_wb_err+0xe0/0xe0
[794123.204192]  write_cache_pages+0x23e/0x4d0
[794123.204261]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[794123.204263]  ? _cond_resched+0x19/0x30
[794123.204273]  ? mutex_lock+0x12/0x30
[794123.204302]  ? rrw_exit+0x5e/0x150 [zfs]
[794123.204332]  zpl_writepages+0x94/0x170 [zfs]
[794123.204334]  do_writepages+0x41/0xd0
[794123.204335]  ? __wb_calc_thresh+0x3e/0x130
[794123.204337]  __writeback_single_inode+0x40/0x350
[794123.204338]  writeback_sb_inodes+0x211/0x500
[794123.204340]  __writeback_inodes_wb+0x67/0xb0
[794123.204342]  wb_writeback+0x25f/0x2f0
[794123.204343]  wb_workfn+0x2fb/0x3f0
[794123.204347]  process_one_work+0x20f/0x410
[794123.204348]  worker_thread+0x34/0x400
[794123.204351]  kthread+0x120/0x140
[794123.204352]  ? process_one_work+0x410/0x410
[794123.204354]  ? __kthread_parkme+0x70/0x70
[794123.204355]  ret_from_fork+0x35/0x40

And now 3 processes are blocked:
Code:
root@pve2:/~# ps axu | grep " D "
root     13397  0.0  0.0  11548  4224 ?        D    14:50   0:00 zfs rollback rpool/data/subvol-118-disk-0@AD_connect
root     21616  0.0  0.0      0     0 ?        D    Jul26   0:00 [kworker/u82:3+flush-zfs-20]
root     23085  0.0  0.0   6072   884 pts/28   S+   15:28   0:00 grep  D
root     31414  0.0  0.0      0     0 ?        D    Jul27   0:00 [arc_prune]
root@pve2:~# cat /proc/13397/stack
[<0>] cv_wait_common+0x104/0x130 [spl]
[<0>] __cv_wait+0x15/0x20 [spl]
[<0>] rrw_enter_write+0x3c/0xa0 [zfs]
[<0>] rrm_enter+0x2a/0x80 [zfs]
[<0>] zfsvfs_teardown+0x8b/0x2a0 [zfs]
[<0>] zfs_suspend_fs+0x10/0x20 [zfs]
[<0>] zfs_ioc_rollback+0xf8/0x180 [zfs]
[<0>] zfsdev_ioctl+0x1e0/0x8f0 [zfs]
[<0>] do_vfs_ioctl+0xa9/0x640
[<0>] ksys_ioctl+0x67/0x90
[<0>] __x64_sys_ioctl+0x1a/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
root@pve2:~# cat /proc/21616/stack     
[<0>] io_schedule+0x16/0x40
[<0>] __lock_page+0x122/0x220
[<0>] write_cache_pages+0x23e/0x4d0
[<0>] zpl_writepages+0x94/0x170 [zfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x40/0x350
[<0>] writeback_sb_inodes+0x211/0x500
[<0>] __writeback_inodes_wb+0x67/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fb/0x3f0
[<0>] process_one_work+0x20f/0x410
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
root@pve2:~# cat /proc/31414/stack     
[<0>] __inode_wait_for_writeback+0xb9/0xf0
[<0>] inode_wait_for_writeback+0x26/0x40
[<0>] evict+0xb5/0x1a0
[<0>] iput+0x148/0x210
[<0>] zfs_prune+0x33c/0x380 [zfs]
[<0>] zpl_prune_sb+0x35/0x50 [zfs]
[<0>] arc_prune_task+0x22/0x40 [zfs]
[<0>] taskq_thread+0x2ec/0x4d0 [spl]
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
 
I seem to have a similar issue but the stack looks like this:
Code:
root@ex1:~# cat /proc/1005172/stack
[<0>] taskq_cancel_id+0xdb/0x110 [spl]
[<0>] zfs_unlinked_drain_stop_wait+0x47/0x70 [zfs]
[<0>] zfsvfs_teardown+0x25/0x2f0 [zfs]
[<0>] zfs_suspend_fs+0x10/0x20 [zfs]
[<0>] zfs_ioc_rollback+0xf8/0x180 [zfs]
[<0>] zfsdev_ioctl+0x1e0/0x8f0 [zfs]
[<0>] do_vfs_ioctl+0xa9/0x640
[<0>] ksys_ioctl+0x67/0x90
[<0>] __x64_sys_ioctl+0x1a/0x20
[<0>] do_syscall_64+0x5a/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

It also hangs in D state on zfs rollback.

pve-manager/6.1-7/13e58d5e (running kernel: 5.3.13-1-pve)
 
