(ZFS) Snapshots put a lot of processes in D state

Stereo973

Member
Nov 29, 2019
France
Hello all!

Since I updated my PVE cluster from 6.2-2 to 6.3-1, the replication tasks put a lot of processes in the "D" state (uninterruptible sleep) for about 1 second. It happens when the snapshot is created; I can reproduce the bug by creating a snapshot myself.
This is a problem because, when the bug occurs, the container being snapshotted freezes for 1 second, and a lot of employees lose time (replication runs every minute).

Three-node cluster with:
Intel Xeon E-2288G CPU
128 GB DDR4 ECC RAM
2x 1 TB Intel enterprise-class NVMe (no RAID controller; I use ZFS RAID1)

Thank you in advance for your help ;)

Package version
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
pve-zsync: 2.0-4
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 

Attachments

  • D State.png (303.8 KB)
  • PVE_graph.png (137.8 KB)
Hi there.

I think I might be facing a similar problem. Since I updated my nodes to 6.3.x, I've been running into a lot of problems during my scheduled nightly backups.

I haven't had the chance to confirm whether the processes enter the "D" state, but load increases heavily in most containers.

Are you using PBS for your snapshots or are you using your local storage?

Anyway, I'll try to do another test during office hours, so I can run some tests and try to get the state of the processes.
 
Hello!

The server itself does not run backups. The problem occurs during replication tasks, but those rely on snapshots, just like your backups.
Do you use ZFS too?
 
Ah, ok I understand.

Yes, I'm also using ZFS as the filesystem.

I don't back up the full server myself either, but as you suggest, my LXC snapshots must rely on the same procedure as your replications. So it may be related.

Anyway, I haven't yet been able to check whether my snapshots trigger the D state you mentioned, sorry.
 
Yes, we use snapshots and ZFS together.
To test it, run htop on the node, sort by the process state ("S") column, and take a snapshot. For me the bug occurs in about 4 out of 6 snapshots.
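
If watching htop for a one-second event is too fiddly, the same check can be scripted. This is only a rough sketch, assuming a Linux node with Python 3; it polls /proc/<pid>/stat and logs every process whose state field is "D". The function names (parse_stat, d_state_processes, watch) and the 0.1 s polling interval are my own choices, not anything from PVE.

```python
import os
import time


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found


def watch(duration=10.0, interval=0.1):
    """Poll for `duration` seconds and print each D-state process seen."""
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        stamp = time.strftime("%H:%M:%S")
        for pid, comm in d_state_processes():
            print(stamp, pid, comm)
        time.sleep(interval)
```

Start `watch(30)` in one shell on the node, take the snapshot from another, and compare the timestamps. Polling can miss very short stalls, so an empty log does not prove that nothing blocked.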

Thank you ;)
 
If you use KVM VMs and the filesystem freeze before each snapshot bothers you, you can turn it off by disabling the QEMU agent option on the VM.
 
