Hello.
I've been banging my head against this for about a week now.
I have a 9-node cluster (virtN, N=1..9, 192.168.1.3N/24). I now have to replace all the nodes with "new" hardware, so I started with nodes 4..6.
As described in the docs, for each node (see the command sketch after this list):
- shut down virtX and start the install on the new HW, so there is no risk of the old virtX coming alive again
- from virt1 (currently not being reinstalled):
- pvecm delnode virtX
- edit /etc/ssh/ssh_known_hosts to remove the two lines pertaining to virtX
- pvecm updatecerts
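For reference, in shell terms those per-node steps on virt1 were roughly the following (a sketch with virt4 as the example node; the sed pattern is just my shorthand, I actually edited the file by hand):
Code:
# on virt1, once the old virt4 is powered off for good
pvecm delnode virt4                                  # remove the node from the cluster

# drop the two stale host-key lines (hostname and IP) for virt4
sed -i '/virt4/d;/192\.168\.1\.34 /d' /etc/ssh/ssh_known_hosts

pvecm updatecerts                                    # regenerate certs / known_hosts info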
After the reinstall of the new virtX (from a freshly downloaded 7.2 ISO) is complete, from the new node's web interface I:
- disable pve-enterprise and enable pve-no-subscription repo
- upgrade all packages
- join the cluster (by pasting the join data obtained from virt1 and entering the correct root password; tried both with the default values); the CLI equivalent is sketched after this list
- reboot
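I did those steps through the GUI; for completeness, I believe the CLI equivalent on the freshly installed node would be roughly this (a sketch based on the standard PVE 7 repos and pvecm, not a transcript of what I typed):
Code:
# switch to the no-subscription repository
rm /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt dist-upgrade -y

# join the existing cluster via virt1 (prompts for virt1's root password)
pvecm add 192.168.1.31
reboot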
Repeat for the other nodes.
Now, from every virt I can see all the nodes, and ssh works without issues between new and old ones.
So I assume everything is OK. If more tests are needed, I can run them.
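For what it's worth, these are the kinds of sanity checks I can run and post output for (a sketch):
Code:
pvecm status    # quorum state, expected votes, member list
pvecm nodes     # every virtN should show up with the right node id
for n in 1 2 3 4 5 6 7 8 9; do ssh virt$n true && echo "virt$n ssh ok"; done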
Now the real problem starts: I follow the guide to install Ceph. I first tried with Quincy.
On virt4: select 'virt4', select 'Ceph', click 'Install Ceph'. It asks for the release (I tried both Quincy and Pacific, no difference) and for the network to use (I select the node address, 192.168.1.34/24). The install seems to proceed and I (often) end up with mon.virt4 and mgr.virt4 processes listed. Sometimes it died with a timeout, but the mon and mgr processes eventually appeared.
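As far as I understand, the GUI wizard is doing roughly the equivalent of these pveceph commands (a sketch of the standard CLI, not what I actually typed, since I used the web interface):
Code:
pveceph install --version pacific        # or quincy; installs the ceph packages on the node
pveceph init --network 192.168.1.0/24    # writes the initial /etc/pve/ceph.conf
pveceph mon create                       # first monitor -> mon.virt4
pveceph mgr create                       # manager -> mgr.virt4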
"ceph status" on virt4:
Code:
root@virt4:~# ceph status
  cluster:
    id:     40833458-1c2a-45a0-9216-76710d9f3f7e
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum virt4 (age 9m)
    mgr: virt4(active, since 8m)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
I select "virt5". "select "ceph", click "install ceph", it detects the existing Ceph instance: "Newest ceph version in cluster is Pacific (16.2.9)", I confirm that 16.2 is to be installed on the node, it installs the packages then the timeouts start. No way to configure a monitor on virt5. Every ceph-related operation ends in an error like "Could not connect to ceph cluster despite configured monitors (500)".
Trying ceph status from the CLI results in messages like "2022-09-19T11:23:22.837+0200 7f76e125d700 0 monclient(hunting): authenticate timed out after 300".
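On virt5 I can collect some diagnostics if that helps; this is a sketch of what I'd check (the ports are the standard Ceph monitor ports, and the ceph.conf layout is my assumption of the PVE default):
Code:
grep -A3 '^\[global\]' /etc/pve/ceph.conf   # mon_host should list 192.168.1.34
ls -l /etc/ceph/ceph.conf                   # should be a symlink into /etc/pve
nc -zv 192.168.1.34 6789                    # legacy (msgr v1) monitor port reachable?
nc -zv 192.168.1.34 3300                    # msgr v2 monitor port reachable?
ceph -s --connect-timeout 10                # fail fast instead of hunting for 5 minutes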
Any hint before I ditch everything?
Tks.
Diego