Replication Job: failed

gadreel

New Member
Apr 22, 2023
Hi, I have Proxmox 8.2.4

For the past 4 months my replication jobs have kept failing, usually while I am doing something disk-intensive.

For example, I noticed this when:
  1. I am restoring a VM without a bandwidth limit.
  2. I am backing up VMs.
  3. I am copying a large file (for example 30 GB) from one drive to another inside a VM.
The drives backing my local-zfs storage are NVMe SSDs.

Any suggestions on what I should do?
 
Hi,
please provide the task log of such a failed replication job as well as an excerpt from the systemd journal around that time, which you can get via journalctl --since <DATETIME> --until <DATETIME> > journal.txt. Does the issue also arise if you configure bandwidth limits? What models are these SSDs?
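As a rough sketch of the journal request above, the helper below dumps the journal for a window around the time of a failure. The function name `journal_around`, the 15-minute default, and the example timestamp (including its year) are illustrative assumptions, not something from this thread; it relies on GNU `date`, which Debian-based systems ship by default.

```shell
#!/bin/sh
# Dump the systemd journal for a window of +/- N minutes around a timestamp.
# journal_around is a hypothetical helper name; requires GNU date.
journal_around() {
    center="$1"            # e.g. the time shown in the failed task log
    minutes="${2:-15}"     # half-width of the window, default 15 minutes
    epoch=$(date -d "$center" +%s) || return 1
    since=$(date -d "@$((epoch - minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    until_ts=$(date -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    journalctl --since "$since" --until "$until_ts"
}

# Example (placeholder timestamp, replace with the real failure time):
# journal_around "2024-09-17 02:33:00" 15 > journal.txt
```

Redirecting the output to a file, as in the commented example, gives a journal.txt that can be attached to a reply.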
 
@Chris Hi, thanks for the reply.
Find attached the journal.

When I apply a bandwidth limit (when restoring) it does not happen, but today it happened when I was transferring a 30GB file.

The NVMe model is CT1000P3PSSD8.

Bash:
root@luffy:~# lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT,SIZE,MODEL
NAME        FSTYPE            LABEL            MOUNTPOINT   SIZE MODEL
sda                                                           1T iSCSI Disk
├─sda1                                                       16M
└─sda2      ntfs              Games                        1024G
sdb                                                           0B iSCSI Disk
├─sdb1      vfat              EFI                           200M
└─sdb2      apfs                                          299.8G
zd0                                                           1M
zd16                                                          1M
zd32                                                          1M
zd48                                                          1M
zd64                                                          1G
├─zd64p1    vfat              ARC1                           50M
├─zd64p2    ext2              ARC2                           50M
└─zd64p3    ext4              ARC3                          923M
zd80                                                        200G
├─zd80p1    linux_raid_member                                 8G
├─zd80p2    linux_raid_member                                 2G
├─zd80p3                                                      1K
└─zd80p5    linux_raid_member Usopp:2                     189.8G
zd96                                                         60G
├─zd96p1    ntfs              System Reserved               100M
├─zd96p2    ntfs                                           59.1G
└─zd96p3    ntfs              Windows RE tools              774M
zd112                                                       150G
├─zd112p1   vfat              EFI                           200M
└─zd112p2   apfs                                          149.8G
zd128                                                         1M
zd144                                                       100G
├─zd144p1   ntfs              Recovery                      450M
├─zd144p2   vfat                                             99M
├─zd144p3                                                    16M
└─zd144p4   ntfs                                           79.4G
zd160                                                        60G
├─zd160p1   ntfs              Recovery                      450M
├─zd160p2   vfat                                             99M
├─zd160p3                                                    16M
├─zd160p4   ntfs                                           58.6G
└─zd160p5   ntfs              Windows RE tools              843M
zd176                                                        60G
├─zd176p1   vfat                                            100M
├─zd176p2                                                    16M
├─zd176p3   ntfs                                           59.1G
└─zd176p4   ntfs              Windows RE tools              774M
zd192                                                        16G
├─zd192p1                                                   512K
└─zd192p2   zfs_member        boot-pool                      16G
zd208                                                       100G
├─zd208p1   vfat              EFI                           200M
└─zd208p2   apfs                                           99.8G
zd224                                                         1M
nvme0n1                                                   931.5G CT1000P3PSSD8
├─nvme0n1p1                                                1007K
├─nvme0n1p2 vfat                                              1G
└─nvme0n1p3 zfs_member        rpool                       930.5G
nvme1n1                                                   931.5G CT1000P3PSSD8
├─nvme1n1p1                                                1007K
├─nvme1n1p2 vfat                                              1G
└─nvme1n1p3 zfs_member        rpool                       930.5G
zd256                                                       100G
├─zd256p1   vfat              EFI                           200M
└─zd256p2   apfs                                           99.8G
 

According to the journal, you have had networking issues since Sep 17 02:30:00, when someone (a script or another host?) logged in via SSH. Since then there have been routing issues, which seem (among other things) to make your replication job fail. Please make sure you fix any networking issues.

Regarding the CT1000P3PSSD8 NVMe drives:
We recommend using enterprise-grade SSDs; consumer-grade SSDs in many cases do not provide the performance and resilience required for virtualization workloads. It could well be that you are saturating the disk I/O when operations such as backups run while other tasks like replication are in progress. Please check your IO delay during that time.
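One way to watch for I/O saturation from the shell is to sample the aggregate `cpu` line of `/proc/stat` twice and compute the share of time spent in iowait. This is only a sketch under that assumption: the helper names `iowait_pct` and `sample` are made up here, and the result approximates, rather than exactly reproduces, the IO delay figure shown in the node summary.

```shell
#!/bin/sh
# Compute the iowait share (percent) between two samples of the aggregate
# "cpu" line from /proc/stat. Fields after the label are:
# user nice system idle iowait irq softirq steal (guest times are part of user).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; w[NR] = $6 }      # $6 is the iowait counter
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
              else print "0.0" }'
}

# Read the aggregate cpu line (not the per-core cpu0, cpu1, ... lines).
sample() { grep '^cpu ' /proc/stat; }

# Example: measure over 5 seconds while a restore or backup is running.
# a=$(sample); sleep 5; b=$(sample); iowait_pct "$a" "$b"
```

A consistently high percentage while a restore, backup, or large copy is running would support the saturation theory; `zpool iostat -v rpool 5` gives a per-device view of the same period.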
 
@Chris
By routing issues, do you mean this? This is the PBS, which I shut down every day; it turns on automatically when it is time to do the backups (to save energy).
Is this a problem?
Code:
Sep 17 02:33:00 luffy pvestatd[2007]: pbs-vms: error fetching datastores - 500 Can't connect to 192.168.81.94:8007 (No route to host)

To be honest, it does not happen often when I am doing backups. I have seen the replication jobs fail so many times while restoring a VM that I am 100% sure about that trigger, which is why I now apply a bandwidth limit whenever I restore.
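For reference, a rate-limited restore from the command line might look like the sketch below. `qmrestore` takes `--bwlimit` in KiB/s; the helper names, the archive placeholder, the VM ID, and the 150 MiB/s figure are illustrative assumptions, not values from this thread. The function only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Convert MiB/s to KiB/s, the unit qmrestore's --bwlimit expects.
mib_to_kib() { echo $(( $1 * 1024 )); }

# Print (do not run) a rate-limited restore command.
# All arguments are placeholders supplied by the caller.
restore_cmd() {
    archive="$1"      # backup volume or archive path
    vmid="$2"         # target VM ID
    limit_mib="$3"    # desired limit in MiB/s
    printf 'qmrestore %s %s --bwlimit %s\n' \
        "$archive" "$vmid" "$(mib_to_kib "$limit_mib")"
}

# Example with placeholder values:
# restore_cmd "<ARCHIVE>" 100 150
```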
 