zfs send over ssh stalls

Elleni

Member
Jul 6, 2020
We have a local Proxmox cluster, and on one node I have plugged in two USB sticks and created zpools on them. A script sends daily incremental snapshots to them.

Lately the receive stops partway through and we get the following error message:

Code:
client_loop: send disconnect: Broken pipe
cannot receive new filesystem stream: incomplete stream

Any idea how to troubleshoot this and why the following command fails?
Code:
ssh extserver 'zfs send -i pool/diskname@snap1 pool/diskname@snap2' | zfs recv localpoolonstick/diskname@snap2
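
One thing I want to rule out first is whether the SSH session itself gets dropped by an idle timeout or a stateful firewall; keepalives should show that. A minimal sketch of what I have in mind, assuming the pull runs as root on the local node (host alias and dataset names as in the command above):

Code:
# either permanently in ~/.ssh/config on the pulling node ...
Host extserver
    ServerAliveInterval 30
    ServerAliveCountMax 6

# ... or ad hoc on the command line:
ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6 extserver \
    'zfs send -i pool/diskname@snap1 pool/diskname@snap2' \
    | zfs recv localpoolonstick/diskname@snap2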

I don't know if it's a problem on the sending Proxmox host, which is a hosted server and not in-house, or a problem with the local pools on the sticks.
 
Does anyone have a hint on how to analyze why, after a few hundred MB, the zfs send/recv stops, and why, if we wait long enough, there is a broken pipe? It only happens when pulling snapshots of VM disks from the remote site via SSH.
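
To narrow down which side stalls, I am thinking of splitting the pipe into two steps and pulling the stream into a file first (paths and dataset names are just placeholders):

Code:
# step 1: pull the stream into a local file, taking the usb pool out of the picture
# (make sure the local filesystem has enough space for the stream)
ssh extserver 'zfs send -v -i pool/diskname@snap1 pool/diskname@snap2' > /tmp/stream.zfs

# step 2: if step 1 completes, feed the saved stream into the usb pool
zfs recv localpoolonstick/diskname@snap2 < /tmp/stream.zfs

# if step 1 already stalls, the source or the ssh link is to blame;
# if only step 2 stalls, the usb sticks on the receiving node are suspect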

My next attempt will be to delete all older snapshots (around 30 per VM disk) except the newest, and then try to send an incremental snapshot once again.
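
If I go down that route, it would look roughly like this (dataset and snapshot names are placeholders; dry run with -nv first, to be safe):

Code:
# list the snapshots of one disk, oldest first
zfs list -t snapshot -o name,used,creation -s creation pool/diskname

# dry run: destroy the whole range of old snapshots except the newest
zfs destroy -nv pool/diskname@snap1%snap29

# same command without -n once the list looks right; the newest snapshot
# stays as the common base for the next incremental send
zfs destroy -v pool/diskname@snap1%snap29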
 
Dear Proxmox and ZFS supporters,

we don't have this problem with local send/recv from one cluster node to the other, but it is reproducible with the externally hosted Proxmox host from which we pull the incremental snapshots via SSH. On the receiving end I have already ruled out the zpools by swapping the USB disks and by destroying and recreating the pools on those external disks.

I really need some help in finding the source of this problem. It only started after the upgrade from Proxmox 6 to the current Proxmox 7 version.

How can I find out whether the zfs send on the remote server - initiated via SSH from the local server where the external destination disks are plugged in - stalls? Which logs could give me a hint on the root cause? Could it be a problem of available disk space on the source? Should I try scrubbing the remote server's pools?
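
These are the checks I would run on the remote (sending) host; pool and dataset names are placeholders:

Code:
# pool health and errors
zpool status -v pool

# free space and how much the snapshots consume
zfs list -o space pool/diskname

# start a scrub and check its progress later
zpool scrub pool
zpool status pool

# and follow the journal on both hosts while a transfer runs
journalctl -f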

At the moment we keep the last 30 incremental snapshots on the remote server, and for more than a year this worked exceptionally well. Does it make sense to delete all snapshots on the remote server, create a new one - getting rid of the 30 incremental snapshots - and try to send that single snapshot?
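
If that is worth a try, I picture it roughly like this (names are placeholders; the stale local copy has to be destroyed first so the full stream can be received):

Code:
# on the local node: remove the stale copy of this one disk (double-check the name!)
zfs destroy -r localpoolonstick/diskname

# on the remote host via ssh: fresh snapshot, then a full (non-incremental) send
ssh extserver 'zfs snapshot pool/diskname@fullbase && zfs send -v pool/diskname@fullbase' \
    | zfs recv localpoolonstick/diskname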

I am willing to provide any information that might help find the source of this problem and eliminate it.

Thanks in advance for your much appreciated know-how and support.

Below are the log entries from /var/log/syslog during the zfs send.

Code:
Aug 15 15:16:15 hostname smartd[3351]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113
Aug 15 15:41:08 hostname systemd[1]: Created slice User Slice of UID 0.
Aug 15 15:41:08 hostname systemd[1]: Starting User Runtime Directory /run/user/0...
Aug 15 15:41:08 hostname systemd[1]: Finished User Runtime Directory /run/user/0.
Aug 15 15:41:08 hostname systemd[1]: Starting User Manager for UID 0...
Aug 15 15:41:08 hostname systemd[3803914]: Queued start job for default target Main User Target.
Aug 15 15:41:08 hostname systemd[3803914]: Created slice User Application Slice.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Paths.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Timers.
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG network certificate management daemon.
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Aug 15 15:41:08 hostname systemd[3803914]: Listening on GnuPG cryptographic agent and passphrase cache.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Sockets.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Basic System.
Aug 15 15:41:08 hostname systemd[3803914]: Reached target Main User Target.
Aug 15 15:41:08 hostname systemd[3803914]: Startup finished in 140ms.
Aug 15 15:41:08 hostname systemd[1]: Started User Manager for UID 0.
Aug 15 15:41:08 hostname systemd[1]: Started Session 548 of user root.
Aug 15 15:41:09 hostname systemd[1]: session-548.scope: Succeeded.
Aug 15 15:57:43 hostname systemd[1]: session-550.scope: Succeeded.
Aug 15 15:57:43 hostname systemd[1]: session-550.scope: Consumed 2.729s CPU time.
Aug 15 15:57:53 hostname systemd[1]: Stopping User Manager for UID 0...
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Main User Target.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Basic System.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Paths.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Sockets.
Aug 15 15:57:53 hostname systemd[3803914]: Stopped target Timers.
Aug 15 15:57:53 hostname systemd[3803914]: dirmngr.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG network certificate management daemon.
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-browser.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-extra.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent-ssh.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Aug 15 15:57:53 hostname systemd[3803914]: gpg-agent.socket: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Closed GnuPG cryptographic agent and passphrase cache.
Aug 15 15:57:53 hostname systemd[3803914]: Removed slice User Application Slice.
Aug 15 15:57:53 hostname systemd[3803914]: Reached target Shutdown.
Aug 15 15:57:53 hostname systemd[3803914]: systemd-exit.service: Succeeded.
Aug 15 15:57:53 hostname systemd[3803914]: Finished Exit the Session.
Aug 15 15:57:53 hostname systemd[3803914]: Reached target Exit the Session.
Aug 15 15:57:53 hostname systemd[1]: user@0.service: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: Stopped User Manager for UID 0.
Aug 15 15:57:53 hostname systemd[1]: Stopping User Runtime Directory /run/user/0...
Aug 15 15:57:53 hostname systemd[1]: run-user-0.mount: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: user-runtime-dir@0.service: Succeeded.
Aug 15 15:57:53 hostname systemd[1]: Stopped User Runtime Directory /run/user/0.
Aug 15 15:57:53 hostname systemd[1]: Removed slice User Slice of UID 0.
Aug 15 15:57:53 hostname systemd[1]: user-0.slice: Consumed 2.904s CPU time.
Aug 15 16:04:08 hostname pvestatd[4330]: auth key pair too old, rotating..
Aug 15 16:15:56 hostname systemd[1]: Starting Daily apt download activities...
Aug 15 16:15:57 hostname systemd[1]: apt-daily.service: Succeeded.
Aug 15 16:15:57 hostname systemd[1]: Finished Daily apt download activities.
Aug 15 16:16:15 hostname smartd[3351]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 113 to 112
Aug 15 16:16:15 hostname smartd[3351]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 110 to 109

Below are the screenshots from the receiving side:

[screenshots of the stalled zfs send/recv on the receiving side]

The window was then left open overnight and nothing happened, just a stalled zfs send/recv.
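
While it hangs like that, these are the things I can check on the receiving node (pool name is a placeholder):

Code:
# kernel messages: usb resets or I/O errors on the sticks?
dmesg -T | tail -n 50

# is the pool degraded or has I/O been suspended?
zpool status -v localpoolonstick

# is the receive process stuck in uninterruptible sleep (state D)?
ps -eo pid,stat,wchan:32,cmd | grep '[z]fs recv'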

Is there a way to enable debug logging to get more relevant information, or does the provided info already indicate what the problem could be?
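
From what I can tell there is no single debug switch, but on OpenZFS 2.x / Proxmox 7 these knobs should exist (I treat the exact paths as an assumption and will verify them on the host):

Code:
# make sure the internal zfs debug log is enabled (read back via /proc/spl/kstat/zfs/dbgmsg)
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable

# verbose output from both the ssh transport and the sender, captured to a file
ssh -vvv extserver 'zfs send -v -i pool/diskname@snap1 pool/diskname@snap2' \
    2> /tmp/ssh-send-debug.log | zfs recv localpoolonstick/diskname@snap2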
 
Attached is the output of /proc/spl/kstat/zfs/dbgmsg.

Please tell me what could be helpful to further investigate why those snapshots stop being sent to the destination.
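
To keep the attachment smaller next time, my plan is to clear the debug ring buffer right before the transfer and dump it once the stall happens (assuming writing 0 clears it, as documented for OpenZFS):

Code:
# clear the debug ring buffer just before starting the send/recv
echo 0 > /proc/spl/kstat/zfs/dbgmsg

# ... start the transfer and wait for the stall ...

# then save only the messages from this window
cat /proc/spl/kstat/zfs/dbgmsg > /tmp/dbgmsg-during-stall.txt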
 

Attachments

  • zfs_dbgmsg.txt (705.5 KB)
