Background:
I have a 6-host cluster that I built and imaged with the PVE 4.4 ISO, and I am working out a process for upgrading this cluster to PVE 5.1 installed on top of a vanilla Debian Stretch image, as per the various wiki instructions.
This is part of a project to upgrade the production cluster my company runs. I'm using this as a test cluster to hammer out any problems in the process so we don't break our production cluster.
I start by upgrading Ceph on all 6 of the nodes, then I pick a node to pull. I destroy any OSDs and monitors it has, migrate all VMs away, and power it off to install Debian Stretch on it, then Proxmox 5.1 on top of that. I then add the node back into the cluster and add it to Ceph. This goes mostly smoothly. The main issues arise when I try to create OSDs to replace the ones that were on the host before I pulled it: pveceph doesn't perform some crucial steps.
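For reference, the per-node teardown I'm doing looks roughly like this; the OSD ID, monitor ID, VM ID, and target node name are placeholders for whatever is actually on the node:

    # mark the OSD out, stop it, and remove it (repeated for each OSD on the node)
    ceph osd out 3
    systemctl stop ceph-osd@3
    pveceph destroyosd 3
    # remove the node's monitor, if it has one
    pveceph destroymon 2
    # migrate the VMs off before powering the node down
    qm migrate 101 pve02 --online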
First, it does not create a mountpoint for the new OSD. Even if I manually create the mount point, give it the ownership it needs (ceph:ceph), and then create the OSD with pveceph, it still will not mount the OSD. It also does not set up systemd to autostart the OSD on reboot. I was able to work around this by manually adding the partition UUIDs to fstab and running systemctl daemon-reload, but I just noticed that I did not need to do this in Proxmox 4.4. Is this new behavior in PVE 5.1, or is it because I installed the software on top of Debian instead of from an ISO?
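Concretely, the workaround that got the OSD mounting and autostarting looks roughly like this; the disk, OSD ID, and filesystem type are placeholders from my setup and may differ on yours:

    # create the mountpoint pveceph expects and hand it to the ceph user
    mkdir -p /var/lib/ceph/osd/ceph-3
    chown ceph:ceph /var/lib/ceph/osd/ceph-3
    pveceph createosd /dev/sdb
    # look up the UUID of the small data partition and add it to fstab by hand
    blkid /dev/sdb1
    echo 'UUID=<uuid-from-blkid> /var/lib/ceph/osd/ceph-3 xfs defaults,noatime 0 0' >> /etc/fstab
    systemctl daemon-reload
    mount /var/lib/ceph/osd/ceph-3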
Secondly, I noticed that if the OSD I am recreating was bluestore before I destroyed it, neither the destruction nor the creation process overwrites the old fsid left over from the OSD. This means that even if I delete the partitions on the disk and recreate the OSD, the second partition will have a stale fsid and the OSD won't start. I had to use dd to write zeros to the front of the disk to wipe out the old fsid. Is this intentional? I would think that if you are destroying an OSD, its fsid should be destroyed along with it.
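The wipe itself is just zeroing the start of the disk before recreating the OSD; the disk is a placeholder, and this obviously destroys everything on it. I believe ceph-disk zap does roughly the same thing, but I only tested the dd approach:

    # blow away the old partition table and bluestore labels, including the stale fsid
    dd if=/dev/zero of=/dev/sdb bs=1M count=200
    # alternatively (untested here): ceph-disk zap /dev/sdb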
Lastly, when I pull and re-add the next node, I'm unable to join it to the cluster by specifying the IP of a node that I installed with Debian; I get 'unable to copy ssh ID: exit code 1'. Yet I can join it by specifying the IP of a 4.4 node. The authorized_keys files have the same ownership and permissions on all the nodes, so I'm not sure what's going on there.
Edit: never mind on that last one; it's because of the ssh config I have on the Stretch hosts. Disregard this last issue.