Hey all,
In a nutshell, I've been having a lot of issues with my cluster since production VMs were deployed, and I've been working through them in separate forum posts - I really appreciate all your help so far. My setup:
- 5x nodes
- ceph storage
For some back story, this post is about two VMs that were ONLY causing issues on Node1. After two crashes we concluded the node itself was the problem. I eventually swapped the disk out and reinstalled PVE. I was under the impression the OSD data should remain intact - was I wrong about that?
So anyways, after Node1 died the first time, I went onsite, restarted the server a few times, and it eventually rebuilt itself. There was a read-only file system error, and I figured that's why the server died in the first place. Then it happened again, so I swapped the disk and reinstalled Ceph. I only removed the node from the cluster ("del node1") and was able to rejoin it after PVE was reinstalled. The VMs in question and all their data were still there.
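For reference, the removal/rejoin was roughly this (I'm reconstructing from memory, and the cluster member IP here is just a placeholder, not my real one):

```shell
# On a surviving cluster node: drop the dead node from the cluster
pvecm delnode node1

# After reinstalling PVE on the replacement disk, rejoin from node1,
# pointing at the IP of any existing cluster member (example IP assumed)
pvecm add 192.0.2.10
```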
The server then died again. I went on-site and saw similar errors (sector errors and a few other things on sda, which is the disk PVE was installed on). I swapped it out again and figured I'd completely remove that node this time to ensure a clean rejoin. I consulted many forum posts to make sure I cleaned out every mention of node1.
This time, after rejoining the node, I had to manually re-add it to the CRUSH map, and since doing so, the VMs are not appearing.
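The re-add to the CRUSH map looked something like this (bucket and root names assumed to match the defaults - correct me if this is where I went wrong):

```shell
# Create a host bucket for the reinstalled node and place it under the default root
ceph osd crush add-bucket node1 host
ceph osd crush move node1 root=default
```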
I remember coming across a forum post about something similar while researching another issue, but I can't find it again. Where can I check whether the VM data is still on Node1/its OSDs?
Sorry if I'm jumping around - let me know if you have questions. If you're wondering why I didn't pull the VMs off that node beforehand: all communication with that node ceased, so I was unable to move them. From my research on Proxmox and Ceph, I was under the impression that if you don't zap the disks, Ceph should auto-detect the OSD data on them.
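In case it helps anyone answer: here's what I was planning to run on Node1 to see whether the OSDs and their data are still visible (the pool name is a placeholder for whatever RBD pool backs the PVE storage):

```shell
# List any OSD volumes ceph-volume can still see on the node's disks
ceph-volume lvm list

# If OSDs come up, confirm they're in the cluster and look for the VM disk images
ceph osd tree
rbd ls <pool>   # replace <pool> with the actual RBD pool name
```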