Last Saturday, we tried to upgrade our 4-node cluster environment from PVE5 to PVE6, using the guide and the pve5to6 tool. Following the guide, we upgraded Corosync on the 4 nodes, then proceeded to upgrade the first node (PMX01) to buster. All smooth, until the system rebooted and came up in maintenance mode.
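For anyone following along, the per-node part of the procedure boils down to roughly this (paraphrased from memory of the official wiki guide; repository file paths may differ on your setup):

Code:
pve5to6                                   # checklist tool shipped with PVE 5.4; re-run until it reports no errors
sed -i 's/stretch/buster/g' /etc/apt/sources.list   # also update the PVE repo entry under /etc/apt/sources.list.d/
apt update && apt dist-upgrade            # pulls the node up to buster / PVE6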
Looking through the boot logs, it appears that mounting /dev/pve/data timed out, which in turn caused a bunch of dependencies to fail:
Code:
Aug 25 11:26:16 arwpmx01 systemd[1]: dev-pve-data.device: Job dev-pve-data.device/start timed out.
Aug 25 11:26:16 arwpmx01 systemd[1]: Timed out waiting for device /dev/pve/data.
-- Subject: A start job for unit dev-pve-data.device has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit dev-pve-data.device has finished with a failure.
--
-- The job identifier is 14 and the job result is timeout.
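Since the device that times out is an LVM logical volume, this is what we plan to check next from the maintenance shell (assuming the default volume group name pve, as on a stock install; adjust if yours differs):

Code:
# Check whether the LV backing /dev/pve/data was activated at boot
lvs -o lv_name,lv_attr,vg_name
# If pve/data shows as inactive (5th lv_attr character is '-'), activate it manually
lvchange -ay pve/data
# With the LV active, the mount should go through without touching fstab
mount /dev/pve/data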
We can get the system to boot to a functional multi-user state by:
- Logging on to maintenance mode as root
- Editing /etc/fstab and commenting out the /dev/pve/data mount (the entry is shown after this list)
- Running reboot
- Editing /etc/fstab and uncommenting the /dev/pve/data mount
- Running mount -a
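For reference, the entry we toggle looks like this on our nodes (the /var/lib/vz mount point is the stock location from the original PVE install, and the options shown are illustrative; yours may differ):

Code:
# /etc/fstab -- the /dev/pve/data line we comment out before rebooting
/dev/pve/data /var/lib/vz ext4 defaults 0 2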
After this, the server seems to work as expected: all the images are there, and they can be started, migrated, and so on. However, every subsequent reboot requires the same manual intervention.
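One stopgap we have not tried yet, based purely on the systemd.mount(5)/fstab documentation, would be to make the mount non-fatal with a shorter device timeout, so a failed activation no longer drops the node into maintenance mode (the options are standard systemd ones, but untested here):

Code:
/dev/pve/data /var/lib/vz ext4 defaults,nofail,x-systemd.device-timeout=30s 0 2

That would only paper over the timeout, though; the underlying activation problem would remain.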
To see whether it was a machine-specific issue, we also upgraded the second server. The same issue occurred, and the same work-around got it going. Right now we are running a half-upgraded cluster, two nodes on PVE6 and two on PVE5, which is not ideal.
I found one other thread in this forum from somebody describing the same issue, but no resolution was posted and the last entry was from early August. I spent the whole day yesterday banging my head against the wall trying to find a fix, but nothing I found even points me in the right direction.
Short of trying to rebuild one of the 'broken' cluster nodes, which may lead to rebuilding the whole Proxmox environment, does anyone here have any idea where to look?
Any help is greatly appreciated!