Hi all,
moving or restoring a disk causes a very high system load average, and the operation eventually fails with a timeout error.
The first few gigabytes, e.g. 15 GB to 30 GB, may transfer without noticeable delay or load, when suddenly the load average rises to ridiculous levels and the transfer process and all running VMs grind to a halt.
For example, moving a 40 GB disk from one ZFS mirrored storage to another stalled at about 30 GB transferred, and the load rose to 235, two-hundred-thirty-five!, on a 6-core Xeon. Meanwhile a lot of z_wr_iss kernel threads show up, each consuming 100% CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2316 root 1 -19 0 0 0 R 100.0 0.0 37:08.65 z_wr_iss
2317 root 1 -19 0 0 0 R 100.0 0.0 37:06.15 z_wr_iss
2312 root 1 -19 0 0 0 R 100.0 0.0 37:08.77 z_wr_iss
2315 root 1 -19 0 0 0 R 100.0 0.0 37:05.01 z_wr_iss
2319 root 1 -19 0 0 0 R 100.0 0.0 37:02.98 z_wr_iss
2320 root 1 -19 0 0 0 R 100.0 0.0 37:03.10 z_wr_iss
2313 root 1 -19 0 0 0 R 99.7 0.0 37:09.83 z_wr_iss
2314 root 1 -19 0 0 0 R 99.4 0.0 37:06.52 z_wr_iss
2318 root 1 -19 0 0 0 R 98.7 0.0 37:07.40 z_wr_iss
2606 root 20 0 0 0 0 R 69.0 0.0 14:55.58 txg_sync
661 root 20 0 0 0 0 R 67.7 0.0 102:51.16 l2arc_feed
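In case anyone wants to dig deeper, this is what I plan to capture the next time it happens. All of this is standard Linux / ZFS-on-Linux tooling; the PID is one of the spinning threads from the top output above, and the pool name "rpool" is just a placeholder for whichever pool is involved:

# kernel stack of one of the busy z_wr_iss threads
cat /proc/2316/stack

# where the CPU time is actually going (requires perf to be installed)
perf top -g

# per-vdev throughput, to confirm the disks themselves are nearly idle
zpool iostat -v rpool 1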
This situation can last quite some time, during which the VMs are more or less unusable. iostat and iotop show almost no write activity on the hard disks; ZFS seems to be busy with itself. Last time it took over an hour for the system to recover.
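The txg_sync thread at ~70% CPU in the listing above makes me suspect the transaction group sync is involved, so next time I will also watch the DMU/txg kstats. A minimal sketch, assuming ZFS on Linux (the per-pool txgs file only contains entries while zfs_txg_history is non-zero):

# throttle/stall counters of the DMU transaction engine
cat /proc/spl/kstat/zfs/dmu_tx

# enable txg history, then watch sync times per transaction group
echo 50 > /sys/module/zfs/parameters/zfs_txg_history
cat /proc/spl/kstat/zfs/rpool/txgs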
When the operation is retried it mostly goes through without problems, and another transfer might again just work. The size of the transfer does not seem to matter.
This is on Proxmox 5.0-31, but it also occurred on 4.4 and on different hardware, always with SATA HDDs and ZFS mirrors or raidz1.
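As a workaround experiment I am considering capping the amount of dirty data ZFS will buffer, so the write pipeline cannot pile up as far before the disks have to catch up. This is only an idea, not a verified fix; zfs_dirty_data_max is a real ZoL module parameter, but the 1 GiB value below is just a guess for this box:

# current limit (bytes)
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# try a lower cap at runtime, e.g. 1 GiB
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max

# make it persistent across reboots
echo "options zfs zfs_dirty_data_max=1073741824" >> /etc/modprobe.d/zfs.conf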
Regards
....Volker