Dear Proxmox and ZFS supporters,
We don't have this problem with local send/recv from one cluster node to another, but it is reproducible on an externally hosted Proxmox host from which we pull the incremental snapshots via ssh. I have already ruled out the zfs pools on the receiving end by swapping the USB disks, and by destroying the pools on those external disks and recreating them.
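To make the setup easier to picture, here is a minimal sketch of what such a pull looks like, assuming a plain incremental `zfs send -i` on the remote host piped over ssh into a local `zfs receive`. The host, dataset, snapshot, and target names are placeholders, not our real ones, and our actual script may pass additional options.

```python
import subprocess


def build_pull_commands(remote, dataset, prev_snap, new_snap, target):
    """Return (send_cmd, recv_cmd) argv lists for one incremental pull.

    All names passed in are placeholders for illustration.
    """
    # Runs on the remote host via ssh: stream the delta between two snapshots.
    # ssh joins the trailing arguments into one remote command line, so the
    # names must not contain spaces or shell metacharacters.
    send_cmd = [
        "ssh", remote,
        "zfs", "send", "-i",
        f"{dataset}@{prev_snap}",
        f"{dataset}@{new_snap}",
    ]
    # Runs locally, where the external USB disks are attached.
    recv_cmd = ["zfs", "receive", target]
    return send_cmd, recv_cmd


def pull(remote, dataset, prev_snap, new_snap, target):
    """Wire the remote send into the local receive and return both exit codes."""
    send_cmd, recv_cmd = build_pull_commands(
        remote, dataset, prev_snap, new_snap, target
    )
    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    recv = subprocess.Popen(recv_cmd, stdin=send.stdout)
    # Close our copy of the pipe so the sender sees a broken pipe if recv exits.
    send.stdout.close()
    recv_rc = recv.wait()
    send_rc = send.wait()
    return send_rc, recv_rc
```

The stall shows up in the equivalent of `pull()`: both processes stay alive, but no more data moves through the pipe.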
I really need some help finding the source of this problem. It only started happening after the upgrade from Proxmox 6 to the current Proxmox 7 version.
How can I find out whether the zfs send on the remote server stalls? It is initiated via ssh from the local server, where the external destination disks are plugged in. Which logs could give me a hint about the root cause? Could it be a lack of available disk space on the source? Should I try scrubbing the pools on the remote server?
At the moment we keep the last 30 incremental snapshots on the remote server, and for more than a year this worked exceptionally well. Would it make sense to delete all snapshots on the remote server, create a single new snapshot in place of the 30 incremental ones, and try to send that one?
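For context, the retention scheme amounts to something like the following. This is only a minimal sketch, assuming the snapshot names are listed oldest first; the function names and the `daily-NNN` naming are hypothetical and do not reflect our real script.

```python
def snapshots_to_prune(snapshots, keep=30):
    """Return the snapshots that fall outside the newest `keep` entries.

    `snapshots` must be ordered oldest first.
    """
    if keep <= 0:
        raise ValueError("keep must be a positive integer")
    if len(snapshots) <= keep:
        return []
    return snapshots[:-keep]


def incremental_pairs(snapshots):
    """Return (previous, next) pairs, one per incremental send."""
    return list(zip(snapshots, snapshots[1:]))


if __name__ == "__main__":
    names = [f"daily-{i:03d}" for i in range(1, 33)]  # 32 placeholder names
    print(snapshots_to_prune(names))      # the two oldest snapshots
    print(incremental_pairs(names[-3:]))  # the two most recent increments
```

Replacing the 30 snapshots with a single new one would mean the next transfer is a full send rather than an increment, since the receiving side would no longer share a common snapshot with the source.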
I am happy to provide any information that might help find and eliminate the source of this problem.
Thanks in advance for your much appreciated know-how and support.
Below are the log entries from /var/log/syslog during the zfs send.
Code:
Aug 15 15:16:15 hostname smartd[3351]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113
Aug 15 15:41:08 hostname systemd[1]: Created slice User Slice of UID 0.
Aug 15 15:41:08 hostname systemd[1]: Starting User Runtime Directory /run/user/0...
Aug 15 15:41:08 hostname systemd[1]: Finished User Runtime Directory /run/user/0.
Aug 15 15:41:08 hostname systemd[1]: Starting User Manager for UID 0...
Aug 15 15:41:08 hostname systemd[3803914]: Queued start job for default target Main User Target.
Aug 15 15:41:08 hostname systemd[3803914]: Created slice User Application Slice.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Paths.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Timers.
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG network certificate management daemon.
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Sockets.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Basic System.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Main User Target.
Aug 15 15:41:08 hostname systemd[3803914]: Startup finished in 140ms.
Aug 15 15:41:08 hostname systemd[1]: Started User Manager for UID 0.
Aug 15 15:41:08 hostname systemd[1]: Started Session 548 of user root.
Aug 15 15:41:09 hostname systemd[1]: session-548.scope: Succeeded.
Aug 15 15:57:43 hostname systemd[1]: session-550.scope: Succeeded.
Aug 15 15:57:43 hostname systemd[1]: session-550.scope: Consumed 2.729s CPU time.
Aug 15 15:57:53 hostname systemd[1]: Stopping User Manager for UID 0...
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Main User Target.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Basic System.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Paths.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Sockets.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Timers.
Aug 15 15:57:53 hostname systemd[3803914]: dirmngr.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG network certificate management daemon.
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-browser.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-extra.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-ssh.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache.
Aug 15 15:57:53 hostname systemd[3803914]: Removed slice User Application Slice.
Aug 15 15:57:53 hostname systemd[3803914]: Reached target Shutdown.
Aug 15 15:57:53 hostname systemd[3803914]: systemd-exit.service: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Finished Exit the Session.
Aug 15 15:57:53 hostname systemd[3803914]: Reached target Exit the Session.
Aug 15 15:57:53 hostname systemd[1]: user@0.service: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: Stopped User Manager for UID 0.
Aug 15 15:57:53 hostname systemd[1]: Stopping User Runtime Directory /run/user/0...
Aug 15 15:57:53 hostname systemd[1]: run-user-0.mount: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: user-runtime-dir@0.service: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: Stopped User Runtime Directory /run/user/0.
Aug 15 15:57:53 hostname systemd[1]: Removed slice User Slice of UID 0.
Aug 15 15:57:53 hostname systemd[1]: user-0.slice: Consumed 2.904s CPU time.
Aug 15 16:04:08 hostname pvestatd[4330]: auth key pair too old, rotating..
Aug 15 16:15:56 hostname systemd[1]: Starting Daily apt download activities...
Aug 15 16:15:57 hostname systemd[1]: apt-daily.service: Succeeded.
Aug 15 16:15:57 hostname systemd[1]: Finished Daily apt download activities.
Aug 15 16:16:15 hostname smartd[3351]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 113 to 112
Aug 15 16:16:15 hostname smartd[3351]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 to 109
Below are the screenshots from the receiving side:
The window was then left open overnight, but nothing further happened; the zfs send/recv simply stayed stalled.
Is there a way to enable debug logging to see more relevant information, or does the provided info already indicate what the problem could be?