ZFS disk replication: "command 'zfs snapshot (...)' failed: got timeout" ONLY on weekends

neffets

Good morning,

For some time now I have been facing a curious problem: my ZFS disk replications fail much more often on weekends.
Every weekend I receive roughly 200 emails from two nodes in a cluster, with the following error message:

Node 1:
Code:
  Replication job 105-0 with target 'proxmox2' and schedule 'mon..sat 10,30,50' failed!
    Last successful sync: 2023-06-24 09:30:53
    Next sync try: 2023-06-24 09:55:00
    Failure count: 1
 
  Error:
  command 'zfs snapshot tank1/data1/vm-105-disk-0@__replicate_105-0_1687593216__' failed: got timeout

Node 2:
Code:
  Replication job 102-0 with target 'proxmox1' and schedule 'mon..sat 00,20,40' failed!
    Last successful sync: 2023-06-24 09:41:55
    Next sync try: 2023-06-24 10:05:00
    Failure count: 1
 
  Error:
  command 'zfs snapshot tank1/data1/vm-102-disk-0@__replicate_102-0_1687593695__' failed: got timeout

Proxmox VE version:
Kernel version: Linux 5.15.107-1-pve #1 SMP PVE 5.15.107-1 (2023-04-20T10:05Z)
PVE Manager version: pve-manager/7.4-3/9002ab8a

The replication schedules of the two nodes are offset by 10 minutes so that they do not get in each other's way, which works very well during the week.
The replications do not take long, about 30 seconds on average.
The nodes run four and five VMs respectively.
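
The replication snapshot names in the error mails above end in a number that looks like a Unix timestamp (for example 1687593216). Assuming it marks the start of the job run, decoding it allows a comparison between the job start and the time the error was logged. A minimal sketch follows; the helper names snapshot_epoch and snapshot_time_utc are hypothetical, and GNU date (as on Debian-based systems) is assumed:

```shell
#!/bin/sh
# Hypothetical helpers, not part of Proxmox VE.

# Extract the trailing number from replication snapshot names found in
# log lines on stdin, e.g. "...@__replicate_105-0_1687593216__' failed".
snapshot_epoch() {
    sed -n 's/.*@__replicate_[0-9]*-[0-9]*_\([0-9]*\)__.*/\1/p'
}

# Convert epoch seconds (one per line on stdin) to UTC timestamps.
# Assumes a date(1) that understands "-d @SECONDS" (GNU coreutils).
snapshot_time_utc() {
    while read -r epoch; do
        date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'
    done
}

# Example (not executed here):
#   journalctl -u pvescheduler.service | snapshot_epoch | snapshot_time_utc
```

The journal prints local time while this sketch prints UTC, so the time zone offset has to be taken into account when comparing the two.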

Node 1:
journalctl -u pvescheduler.service
Code:
Jun 24 09:53:36 proxmox1 pvescheduler[2449473]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687593003__' failed: got timeout
Jun 24 09:54:22 proxmox1 pvescheduler[2449473]: 105-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-105-disk-0@__replicate_105-0_1687593216__' failed: got timeout
Jun 24 11:53:46 proxmox1 pvescheduler[170416]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687600203__' failed: got timeout
Jun 24 12:05:11 proxmox1 pvescheduler[972162]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687600863__' failed: got timeout
Jun 24 12:37:11 proxmox1 pvescheduler[1882496]: command 'zfs destroy tank1/data1/vm-105-disk-0@__replicate_105-0_1687601403__' failed: got timeout
Jun 24 21:11:21 proxmox1 pvescheduler[2491271]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687633803__' failed: got timeout
Jun 25 02:02:04 proxmox1 pvescheduler[4150683]: <root@pam> starting task UPID:proxmox1:003F559C:1670DEB0:649783FC:vzdump::root@pam:
Jun 26 00:13:39 proxmox1 pvescheduler[1739862]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687731000__' failed: got timeout
Jun 26 00:20:34 proxmox1 pvescheduler[1739862]: 105-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-105-disk-0@__replicate_105-0_1687731219__' failed: got timeout
Jun 26 00:26:40 proxmox1 pvescheduler[1739862]: 108-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-108-disk-0@__replicate_108-0_1687731634__' failed: got timeout
Jun 26 00:32:15 proxmox1 pvescheduler[1739862]: 109-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-109-disk-0@__replicate_109-0_1687732000__' failed: got timeout
Jun 26 00:38:21 proxmox1 pvescheduler[1223911]: command 'zfs destroy tank1/data1/vm-104-disk-0@__replicate_104-0_1687731000__' failed: got timeout
Jun 26 00:43:54 proxmox1 pvescheduler[1223911]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687732380__' failed: got timeout
Jun 26 00:49:14 proxmox1 pvescheduler[1223911]: command 'zfs destroy tank1/data1/vm-105-disk-0@__replicate_105-0_1687731219__' failed: got timeout
Jun 26 00:55:03 proxmox1 pvescheduler[1223911]: 105-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-105-disk-0@__replicate_105-0_1687733034__' failed: got timeout
Jun 26 01:02:09 proxmox1 pvescheduler[1223911]: command 'zfs destroy tank1/data1/vm-108-disk-0@__replicate_108-0_1687731634__' failed: got timeout
Jun 26 01:08:09 proxmox1 pvescheduler[1223911]: 108-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-108-disk-0@__replicate_108-0_1687733703__' failed: got timeout
Jun 26 01:14:13 proxmox1 pvescheduler[1223911]: command 'zfs destroy tank1/data1/vm-109-disk-0@__replicate_109-0_1687732000__' failed: got timeout
Jun 26 01:19:27 proxmox1 pvescheduler[1223911]: 109-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-109-disk-0@__replicate_109-0_1687734489__' failed: got timeout
Jun 26 01:24:39 proxmox1 pvescheduler[629666]: command 'zfs destroy tank1/data1/vm-104-disk-0@__replicate_104-0_1687732380__' failed: got timeout
Jun 26 01:30:31 proxmox1 pvescheduler[629666]: 104-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-104-disk-0@__replicate_104-0_1687735200__' failed: got timeout
Jun 26 01:37:05 proxmox1 pvescheduler[629666]: command 'zfs destroy tank1/data1/vm-105-disk-0@__replicate_105-0_1687733034__' failed: got timeout
Jun 26 01:43:15 proxmox1 pvescheduler[629666]: 105-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-105-disk-0@__replicate_105-0_1687735831__' failed: got timeout
Jun 26 01:50:07 proxmox1 pvescheduler[629666]: command 'zfs destroy tank1/data1/vm-108-disk-0@__replicate_108-0_1687733703__' failed: got timeout
Jun 26 01:56:41 proxmox1 pvescheduler[629666]: 108-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-108-disk-0@__replicate_108-0_1687736595__' failed: got timeout
Jun 26 02:02:01 proxmox1 pvescheduler[3405527]: <root@pam> starting task UPID:proxmox1:00340114:16F4B3B8:6498D579:vzdump::root@pam:
Jun 26 02:02:10 proxmox1 pvescheduler[629666]: command 'zfs destroy tank1/data1/vm-109-disk-0@__replicate_109-0_1687734489__' failed: got timeout
Jun 26 02:10:38 proxmox1 pvescheduler[3483801]: command 'zfs destroy tank1/data1/vm-108-disk-0@__replicate_108-0_1687736595__' failed: got timeout

Node 2:
journalctl -u pvescheduler.service
Code:
Jun 24 21:52:25 proxmox2 pvescheduler[497827]: 106-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-106-disk-0@__replicate_106-0_1687636279__' failed: got timeout
Jun 24 22:13:42 proxmox2 pvescheduler[465118]: command 'zfs destroy tank1/data1/vm-101-disk-0@__replicate_101-0_1687635784__' failed: got timeout
Jun 24 22:14:46 proxmox2 pvescheduler[465118]: 101-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-101-disk-0@__replicate_101-0_1687637584__' failed: got timeout
Jun 24 22:16:15 proxmox2 pvescheduler[919679]: command 'zfs destroy tank1/data1/vm-102-disk-0@__replicate_102-0_1687635964__' failed: got timeout
Jun 24 22:16:28 proxmox2 pvescheduler[919679]: 102-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-102-disk-0@__replicate_102-0_1687637764__' failed: got timeout
Jun 24 22:18:25 proxmox2 pvescheduler[1184182]: command 'zfs destroy tank1/data1/vm-103-disk-0@__replicate_103-0_1687636084__' failed: got timeout
Jun 24 22:18:51 proxmox2 pvescheduler[1184182]: 103-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-103-disk-0@__replicate_103-0_1687637884__' failed: got timeout
Jun 24 22:19:25 proxmox2 pvescheduler[1304452]: command 'zfs destroy tank1/data1/vm-107-disk-0@__replicate_107-0_1687636137__' failed: got timeout
Jun 24 22:20:36 proxmox2 pvescheduler[1304452]: 107-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-107-disk-0@__replicate_107-0_1687637944__' failed: got timeout
Jun 24 22:22:35 proxmox2 pvescheduler[1719912]: command 'zfs destroy tank1/data1/vm-106-disk-0@__replicate_106-0_1687636279__' failed: got timeout
Jun 24 22:23:13 proxmox2 pvescheduler[1719912]: 106-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-106-disk-0@__replicate_106-0_1687638124__' failed: got timeout
Jun 24 22:43:35 proxmox2 pvescheduler[482491]: command 'zfs destroy tank1/data1/vm-101-disk-0@__replicate_101-0_1687637584__' failed: got timeout
Jun 24 22:44:39 proxmox2 pvescheduler[482491]: 101-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-101-disk-0@__replicate_101-0_1687639384__' failed: got timeout
Jun 24 22:46:22 proxmox2 pvescheduler[902775]: command 'zfs destroy tank1/data1/vm-102-disk-0@__replicate_102-0_1687637764__' failed: got timeout
Jun 24 22:47:09 proxmox2 pvescheduler[902775]: 102-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-102-disk-0@__replicate_102-0_1687639564__' failed: got timeout
Jun 24 22:48:36 proxmox2 pvescheduler[1278040]: command 'zfs destroy tank1/data1/vm-103-disk-0@__replicate_103-0_1687637884__' failed: got timeout
Jun 24 22:49:34 proxmox2 pvescheduler[1278040]: 103-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-103-disk-0@__replicate_103-0_1687639684__' failed: got timeout
Jun 24 22:50:11 proxmox2 pvescheduler[1278040]: command 'zfs destroy tank1/data1/vm-107-disk-0@__replicate_107-0_1687637944__' failed: got timeout
Jun 24 22:52:09 proxmox2 pvescheduler[1278040]: 107-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-107-disk-0@__replicate_107-0_1687639774__' failed: got timeout
Jun 24 22:52:46 proxmox2 pvescheduler[1278040]: command 'zfs destroy tank1/data1/vm-106-disk-0@__replicate_106-0_1687638124__' failed: got timeout
Jun 24 22:53:30 proxmox2 pvescheduler[1278040]: 106-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-106-disk-0@__replicate_106-0_1687639929__' failed: got timeout
Jun 24 23:13:26 proxmox2 pvescheduler[1691635]: command 'zfs destroy tank1/data1/vm-101-disk-0@__replicate_101-0_1687639384__' failed: got timeout
Jun 24 23:14:20 proxmox2 pvescheduler[1691635]: 101-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-101-disk-0@__replicate_101-0_1687641184__' failed: got timeout
Jun 24 23:16:31 proxmox2 pvescheduler[2152732]: command 'zfs destroy tank1/data1/vm-102-disk-0@__replicate_102-0_1687639564__' failed: got timeout
Jun 24 23:17:01 proxmox2 pvescheduler[2152732]: 102-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-102-disk-0@__replicate_102-0_1687641364__' failed: got timeout
Jun 24 23:18:24 proxmox2 pvescheduler[2461804]: command 'zfs destroy tank1/data1/vm-103-disk-0@__replicate_103-0_1687639684__' failed: got timeout
Jun 24 23:18:45 proxmox2 pvescheduler[2461804]: 103-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-103-disk-0@__replicate_103-0_1687641484__' failed: got timeout
Jun 24 23:20:36 proxmox2 pvescheduler[2784014]: command 'zfs destroy tank1/data1/vm-107-disk-0@__replicate_107-0_1687639774__' failed: got timeout
Jun 24 23:22:13 proxmox2 pvescheduler[2784014]: 107-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-107-disk-0@__replicate_107-0_1687641604__' failed: got timeout
Jun 24 23:22:46 proxmox2 pvescheduler[2784014]: command 'zfs destroy tank1/data1/vm-106-disk-0@__replicate_106-0_1687639929__' failed: got timeout
Jun 24 23:23:34 proxmox2 pvescheduler[2784014]: 106-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-106-disk-0@__replicate_106-0_1687641733__' failed: got timeout
Jun 25 02:02:05 proxmox2 pvescheduler[315156]: <root@pam> starting task UPID:proxmox2:0004CF32:16667D8B:649783FD:vzdump::root@pam:
Jun 26 02:02:00 proxmox2 pvescheduler[964754]: <root@pam> starting task UPID:proxmox2:000EB899:16EA51DE:6498D578:vzdump::root@pam:
Jun 26 02:06:46 proxmox2 pvescheduler[1027658]: 101-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-101-disk-0@__replicate_101-0_1687737780__' failed: got timeout
Jun 26 02:07:46 proxmox2 pvescheduler[1027658]: 102-0: got unexpected replication job error - command 'zfs snapshot tank1/data1/vm-102-disk-0@__replicate_102-0_1687738006__' failed: got timeout
Jun 26 02:08:35 proxmox2 pvescheduler[1027658]: command 'zfs destroy tank1/data1/vm-103-disk-1@__replicate_103-0_1687730460__' failed: got timeout

Does anyone have an idea why this fails only on weekends? Are there any logs I could search?
I have already disabled replication on Sundays, since no new data should arrive on that day.
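
To see when the timeouts cluster, the journal excerpts above can be condensed into a count per day and hour. This is only a sketch: the function name timeouts_per_hour is hypothetical, and it assumes the default journalctl line format shown in the excerpts ("Mon DD HH:MM:SS host ...").

```shell
#!/bin/sh
# Hypothetical helper: count "got timeout" lines per day and hour.
# Reads journal text on stdin; relies on the first three fields being
# month, day and HH:MM:SS, as in the excerpts above.
timeouts_per_hour() {
    awk '/failed: got timeout/ {
             n[$1 " " $2 " " substr($3, 1, 2) ":00"]++
         }
         END { for (k in n) print k, n[k] }' | sort
}

# Example (not executed here):
#   journalctl -u pvescheduler.service --since "2023-06-23" | timeouts_per_hour
```

The plain sort is enough within a single month; output spanning several months would need a date-aware sort.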

Replication jobs on Node 1:
1687760398201.png

Replication jobs on Node 2:
1687760537084.png

Regards!
 
Are other jobs (such as backups) running at the same time on weekends that could put load on the system?
 
A vzdump backup of all VMs runs every night at around 2 a.m. on both nodes, but nothing runs specifically on weekends.
 
Have you considered offsetting the individual syncs by one minute each?

Yes, that would certainly be possible, but during the week everything runs without problems with these settings and never produces the error message.
 
Hi,
what does zpool status -v report on the target and the source? What does the (I/O) load on the servers look like at that time? Is a scrub perhaps running?
 

Node 1:
Code:
 pool: tank1
 state: ONLINE
  scan: scrub repaired 0B in 02:25:54 with 0 errors on Sun Jun 11 02:49:55 2023
config:

        NAME                          STATE     READ WRITE CKSUM
        tank1                         ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94TYJ  ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94V20  ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94H4G  ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94HP5  ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94V92  ONLINE       0     0     0
            ata-MM1000GBKAL_9XG94V3G  ONLINE       0     0     0

errors: No known data errors

Node 2:
Code:
  pool: tank1
 state: ONLINE
  scan: scrub repaired 0B in 04:47:30 with 0 errors on Sun Jun 11 05:12:05 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        tank1                       ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            scsi-350000397b810c4c1  ONLINE       0     0     0
            scsi-350000397b81070d9  ONLINE       0     0     0
            scsi-350000397b810c621  ONLINE       0     0     0
            scsi-350000397b810a5dd  ONLINE       0     0     0
            scsi-350000397b8106899  ONLINE       0     0     0
            scsi-350000397b810a48d  ONLINE       0     0     0

errors: No known data errors

The last scrub was a while ago, so it is probably not interfering.
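
Whether a scrub is running at a given moment can also be read from the "scan:" line of zpool status, which reports "scrub in progress" while one is active. A minimal sketch with the hypothetical helper name scrub_in_progress follows. The scrub dates above (both on Sunday, Jun 11) would fit the monthly scrub that Debian-based installs commonly schedule in /etc/cron.d/zfsutils-linux; whether that applies to this setup is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the "zpool status" text on stdin
# reports a running scrub, non-zero otherwise.
scrub_in_progress() {
    grep -q 'scrub in progress'
}

# Example (not executed here), using the pool name from this thread:
#   zpool status tank1 | scrub_in_progress && echo "scrub running on tank1"
```

Run periodically over a weekend, this would show whether any of the timeouts coincide with a scrub.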

As for the (I/O) load, I unfortunately cannot say much at the moment because no external monitoring is in place yet. It is coming soon, though.
Does Proxmox keep load statistics in logs anywhere?
 
OK, I found where the server load can be looked up retroactively:

1687767981818.png

1687768015458.png

1687768036574.png

Interestingly, the server load on Saturdays and Sundays is in fact higher than during the week!
 
Are more users perhaps simply accessing your services on weekends? Otherwise I would check on the weekend which processes are generating the load.
 

Things are actually dead quiet here on weekends...

On Friday I will start a log like this and let it run over the weekend:

while true; do (echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail) >> ps.log; sleep 20; done

Perhaps that will show me where the load is coming from.
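
CPU percentages alone may miss I/O-bound work. On Linux the load average also counts processes in uninterruptible sleep (state D), which usually means they are waiting on I/O. As a possible complement to the ps loop above, this sketch records the load averages together with any D-state processes. The names filter_blocked and log_sample and the file name io.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for a weekend load log.

# Keep only processes in uninterruptible sleep (state D).
# Reads "STATE PID COMMAND" lines on stdin, as printed by
# "ps -eo state,pid,comm".
filter_blocked() {
    awk '$1 ~ /^D/ { print }'
}

# Print one sample: timestamp, 1/5/15-minute load averages, then any
# blocked processes.
log_sample() {
    echo "== $(date '+%F %T') load: $(cut -d' ' -f1-3 /proc/loadavg)"
    ps -eo state,pid,comm | filter_blocked
}

# Example loop (not executed here), same 20-second interval as above:
#   while true; do log_sample >> io.log; sleep 20; done
```

If the load rises while few processes use CPU and several sit in state D, storage is a more likely bottleneck than compute.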
 
