Random ZFS Replication failed exit code 4

dendi

Renowned Member
Nov 17, 2011
Hello,

I receive many emails with this error:

Code:
command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=HOSTNAME' root@1.2.3.4 -- pvesr prepare-local-job 184-0 --scan kvm kvm:vm-184-disk-0 --last_sync 1633621331' failed: exit code 4

The 9-node cluster has worked very well for a year, but I have since added VMs (120 VMs now), and I think the error is due to a timeout of the ZFS listing on the receiving node.

The backup node contains all the replicas and does nothing else (no running VMs):

Code:
zpool status -v
  pool: kvm
 state: ONLINE
  scan: scrub repaired 0B in 0 days 07:26:51 with 0 errors on Sun Sep 12 07:50:52 2021
config:

    NAME                                                     STATE     READ WRITE CKSUM
    kvm                                                      ONLINE       0     0     0
      mirror-0                                               ONLINE       0     0     0
        ata-HGST_HUH721008ALE600_2SG05MBF-part4              ONLINE       0     0     0
        ata-HGST_HUH721008ALE600_2SG08U8F-part4              ONLINE       0     0     0
    logs   
      mirror-1                                               ONLINE       0     0     0
        ata-Micron_5100_MTFDDAK240TCB_18271D4AB52F-part3     ONLINE       0     0     0
        ata-SAMSUNG_MZ7WD240HAFV-00003_S16LNYAD904567-part3  ONLINE       0     0     0

errors: No known data errors

Code:
ii  pve-cluster                          6.2-1                           amd64        "pmxcfs" distributed cluster filesystem for Proxmox Virtual Environment.
ii  pve-container                        3.2-3                           all          Proxmox VE Container management tool
ii  pve-docs                             6.2-6                           all          Proxmox VE Documentation
ii  pve-edk2-firmware                    2.20200531-1                    all          edk2 based firmware modules for virtual machines
ii  pve-firewall                         4.1-3                           amd64        Proxmox VE Firewall
ii  pve-firmware                         3.1-3                           all          Binary firmware code for the pve-kernel
ii  pve-ha-manager                       3.1-1                           amd64        Proxmox VE HA Manager
ii  pve-i18n                             2.2-2                           all          Internationalization support for Proxmox VE
ii  pve-kernel-5.4                       6.3-1                           all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.4.73-1-pve              5.4.73-1                        amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    6.3-1                           all          Function for various kernel maintenance tasks.
ii  pve-lxc-syscalld                     0.9.1-1                         amd64        PVE LXC syscall daemon
ii  pve-manager                          6.2-15                          amd64        Proxmox Virtual Environment Management Tools
ii  pve-qemu-kvm                         5.1.0-6                         amd64        Full virtualization on x86 hardware
ii  pve-xtermjs                          4.7.0-2                         amd64        Binaries built from the Rust termproxy crate
ii  pve-zsync                            2.0-3                           all          Proxmox VE ZFS syncing tool
ii  smartmontools                        7.1-pve2                        amd64        control and monitor storage systems using S.M.A.R.T.
ii  zfs-zed                              0.8.5-pve1                      amd64        OpenZFS Event Daemon
ii  zfsutils-linux                       0.8.5-pve1                      amd64        command-line tools to manage OpenZFS filesystems

When I run the command
Code:
zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hrp
sometimes the output is very fast, and I think it is cached, but sometimes it takes 5-10 seconds.
I think this causes the random replication failures.
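
To check how much the listing time actually varies, something like the following could be run on the backup node. This is only a minimal sketch, assuming a POSIX shell; `time_cmd` is a hypothetical helper name, not part of Proxmox or ZFS, and it only measures whole seconds.

```shell
#!/bin/sh
# time_cmd N CMD...: run CMD N times and print the elapsed whole seconds of each run.
# (Hypothetical helper for illustration; output of CMD itself is discarded.)
time_cmd() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}
```

For example, `time_cmd 10 zfs list -o name,volsize,origin,type,refquota -t volume,filesystem -Hrp` would show whether the slow runs appear at random or only on the first call after a pause.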

Is there a way to improve caching? I currently have two SSDs used as log devices, but I could use them for something else.