PBS / GC Kernel Crash - ZFS Encryption

MarvinE

Hello,

we have a PBS datastore on a ZFS-encrypted volume.
When the garbage collection is running, the kernel crashes at 31% of the marking phase.

PBS Log:
Code:
2021-09-17T07:22:50+02:00: starting garbage collection on store synflow
2021-09-17T07:22:50+02:00: Start GC phase1 (mark used chunks)
2021-09-17T07:25:24+02:00: marked 1% (18 of 1703 index files)
2021-09-17T07:26:05+02:00: marked 2% (35 of 1703 index files)
2021-09-17T07:26:08+02:00: marked 3% (52 of 1703 index files)
2021-09-17T07:26:10+02:00: marked 4% (69 of 1703 index files)
2021-09-17T07:26:11+02:00: marked 5% (86 of 1703 index files)
2021-09-17T07:26:13+02:00: marked 6% (103 of 1703 index files)
2021-09-17T07:26:14+02:00: marked 7% (120 of 1703 index files)
2021-09-17T07:26:14+02:00: marked 8% (137 of 1703 index files)
2021-09-17T07:26:15+02:00: marked 9% (154 of 1703 index files)
2021-09-17T07:26:17+02:00: marked 10% (171 of 1703 index files)
2021-09-17T07:26:17+02:00: marked 11% (188 of 1703 index files)
2021-09-17T07:26:18+02:00: marked 12% (205 of 1703 index files)
2021-09-17T07:26:20+02:00: marked 13% (222 of 1703 index files)
2021-09-17T07:26:20+02:00: marked 14% (239 of 1703 index files)
2021-09-17T07:26:22+02:00: marked 15% (256 of 1703 index files)
2021-09-17T07:26:22+02:00: marked 16% (273 of 1703 index files)
2021-09-17T07:26:24+02:00: marked 17% (290 of 1703 index files)
2021-09-17T07:26:25+02:00: marked 18% (307 of 1703 index files)
2021-09-17T07:26:26+02:00: marked 19% (324 of 1703 index files)
2021-09-17T07:26:27+02:00: marked 20% (341 of 1703 index files)
2021-09-17T07:26:28+02:00: marked 21% (358 of 1703 index files)
2021-09-17T07:26:29+02:00: marked 22% (375 of 1703 index files)
2021-09-17T07:26:30+02:00: marked 23% (392 of 1703 index files)
2021-09-17T09:29:20+02:00: marked 24% (409 of 1703 index files)
2021-09-17T11:56:40+02:00: marked 25% (426 of 1703 index files)
2021-09-17T13:54:35+02:00: marked 26% (443 of 1703 index files)
2021-09-17T15:48:56+02:00: marked 27% (460 of 1703 index files)
2021-09-17T15:52:25+02:00: marked 28% (477 of 1703 index files)
2021-09-17T17:45:41+02:00: marked 29% (494 of 1703 index files)
2021-09-17T20:07:06+02:00: marked 30% (511 of 1703 index files)
2021-09-17T20:50:11+02:00: marked 31% (528 of 1703 index files)

Kernel Crash:
Code:
[27419.415569] perf: interrupt took too long (2616 > 2500), lowering kernel.perf_event_max_sample_rate to 76250
[50871.542432] task:txg_sync        state:D stack:    0 pid:  966 ppid:     2 flags:0x00004000
[50871.542435] Call Trace:
[50871.542438]  __schedule+0x2ca/0x880
[50871.542442]  schedule+0x4f/0xc0
[50871.542444]  schedule_timeout+0x8a/0x140
[50871.542447]  ? zio_issue_async+0x56/0x90 [zfs]
[50871.542545]  ? __next_timer_interrupt+0x110/0x110
[50871.542549]  io_schedule_timeout+0x51/0x80
[50871.542551]  __cv_timedwait_common+0x131/0x170 [spl]
[50871.542557]  ? wait_woken+0x80/0x80
[50871.542560]  __cv_timedwait_io+0x19/0x20 [spl]
[50871.542565]  zio_wait+0x133/0x2c0 [zfs]
[50871.542664]  dsl_pool_sync+0xcc/0x4c0 [zfs]
[50871.542733]  ? spa_suspend_async_destroy+0x50/0x50 [zfs]
[50871.542799]  spa_sync+0x56e/0x1000 [zfs]
[50871.542862]  ? mutex_lock+0x13/0x40
[50871.542864]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[50871.542930]  txg_sync_thread+0x2d3/0x460 [zfs]
[50871.542995]  ? txg_init+0x260/0x260 [zfs]
[50871.543060]  thread_generic_wrapper+0x79/0x90 [spl]
[50871.543066]  ? __thread_exit+0x20/0x20 [spl]
[50871.543071]  kthread+0x12b/0x150
[50871.543074]  ? set_kthread_struct+0x50/0x50
[50871.543076]  ret_from_fork+0x22/0x30

zpool status:
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 3 days 06:22:24 with 0 errors on Wed Sep 15 06:46:25 2021
config:

        NAME                                         STATE     READ WRITE CKSUM
        rpool                                        ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXX1-part3  ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXX2-part3  ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXX3-part3  ONLINE       0     0     0
            ata-XXXXXXXXXXXXXXXXXXXXXXXXXXXX4-part3  ONLINE       0     0     0

errors: No known data errors
 
Just a guess... did you check with arc_summary what your ARC is doing while the backup is running? Maybe it's running out of space somewhere. Or maybe SMR drives?
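For example, something along these lines can be run in a second shell while the job is running (arc_summary and arcstat ship with zfsutils-linux; smartctl needs the smartmontools package, and /dev/sdX is just a placeholder for one of the pool disks):
Code:
# ARC section of the summary: current size, target size, hit/miss ratios
arc_summary -s arc

# live ARC activity, one line every 5 seconds
arcstat 5

# drive identity, to look up whether the model is SMR or CMR
smartctl -i /dev/sdX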
 
At the moment we are on:
zfsutils-linux: 2.0.5-pve1
proxmox-backup-server: 2.0.11-1
pve-kernel-5.11.22-5-pve

We installed PBS on top of a PVE because we will use the host for direct restores. Since we reduced the ZFS encryption from aes-256-gcm to aes-128-ccm it seems to work, but very slowly. The GC marking takes around 5-6 days for 10 TB.
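For reference, the cipher in use can be checked like this (rpool/pbs is just a placeholder for our datastore dataset; the encryption property can only be set at creation time, so switching ciphers means creating a new dataset and moving the data over):
Code:
# show which cipher and key setup the datastore dataset uses
zfs get encryption,keyformat,keylocation rpool/pbs

# a different cipher requires a new dataset, e.g.:
zfs create -o encryption=aes-128-gcm -o keyformat=passphrase rpool/pbs-new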
 
Well, you are using spinning disks with raidz1, so IOPS-wise you are at the low end of the spectrum. GC is mostly small reads/writes (updating/reading metadata), so it is rather IOPS sensitive until it gets to the actual deletion part.
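One way to see that on the live system is to watch the per-vdev latencies while phase 1 is marking, for example:
Code:
# per-vdev IOPS with average wait/latency columns, a new sample every 5 seconds
# (-v per vdev, -l latency columns, -y skip the since-boot summary)
zpool iostat -vly rpool 5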
 
>When the garbage collection is running, the kernel crashes at 31% of the marking phase.

is this reproducible?
Yes, with aes-256-gcm encryption it was reproducible on every try. Around 14 TB of backup data, from 1 CT, 6 hosts, and 47 VMs.

Edit: Prune/GC was set to 14 days and 4 months.
 