Today I upgraded 2 of 3 nodes in a cluster from
pve-manager/6.2-15/48bd51b6
Linux 5.4.73-1-pve #1 SMP PVE 5.4.73-1 (Mon, 16 Nov 2020 10:52:16 +0100)
to
pve-manager/6.3-3/eee5f901
Linux 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100)
and anything involving ZFS snapshots seems to have either broken or become horribly slow on the upgraded nodes.
Replication of a container gets me:
2020-12-07 21:45:01 100-0: start replication job
2020-12-07 21:45:01 100-0: guest => CT 100, running => 1
2020-12-07 21:45:01 100-0: volumes => local-zfs:subvol-100-disk-1
2020-12-07 21:45:02 100-0: freeze guest filesystem
2020-12-07 21:45:03 100-0: create snapshot '__replicate_100-0_1607377501__' on local-zfs:subvol-100-disk-1
2020-12-07 22:22:53 100-0: thaw guest filesystem
2020-12-07 22:22:53 100-0: end replication job with error: command 'zfs snapshot rpool/data/subvol-100-disk-1@__replicate_100-0_1607377501__' failed: got timeout
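The failing step in that log is just a plain zfs snapshot, so one thing I plan to try is running and timing the same command by hand on an upgraded node (dataset name taken from the log above; the snapshot name here is only a throwaway example):

time zfs snapshot rpool/data/subvol-100-disk-1@manual-test
zfs destroy rpool/data/subvol-100-disk-1@manual-test

If that alone takes minutes, the problem is presumably below PVE in ZFS itself; if it's instant, it's presumably the PVE tooling or its timeout handling.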
If I explicitly take a snapshot, what used to be essentially instantaneous now hangs for a long time with the status stuck at "prepare". (I'll add a note here if it ever moves along and gives me some output.) It looks like a coordination problem between PVE and the underlying ZFS state: in the GUI the snapshot job never finishes and the status still says "prepare", yet if I go to the CLI and look at the snapshot on disk it appears to be quite normal. A snapshot taken on the node still running 6.2 works fine.
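By "looks normal from the CLI" I mean roughly this (the snapshot name being whichever one the stuck task created):

zfs list -t snapshot -r rpool/data/subvol-100-disk-1
zfs get creation,used rpool/data/subvol-100-disk-1@<snapshot-name>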
Migration is pretty iffy as well. Rebooting one of the upgraded nodes for the second time since the upgrade was very slow because it had trouble unmounting the filesystem of one of the containers, and so on. I've also noticed that overall system load has risen and CPU idle % has dropped on the upgraded nodes.
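These are the checks I've thought of so far, though I'm not at all sure they're the right place to look (standard commands, nothing Proxmox-specific):

dmesg -T | grep -iE 'hung|blocked|txg'            # stuck kernel / ZFS transaction-group tasks
zpool status -v rpool                             # pool health, errors, any scrub/resilver in progress
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'   # processes stuck in uninterruptible sleep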
Where do I look next?