Hi,
Yesterday I ran into a peculiar problem while adding a storage.
We have a file server on which I created an SMB share.
For our PBS, I created a 10 TB vmdisk on that SMB share. When I added it, the process timed out.
What is weird is that on the file server, I could see the object being created and growing in size little by little. At around 8 TB, on my PVE (I guess on the node that held the quorum), some of the VMs started to have kernel panics, with CPU time climbing and processes hanging for a long time. (We have a dual-socket configuration, so I guess only one CPU was affected, because only about half of the VMs were impacted.)
I had to wait for the "disk" on the filer to be fully created, at which point it was immediately deleted automatically. The machines were still experiencing problems after that, so I had to restart the node entirely, which caused some downtime for our customers.
While all this was going on, I tried to restart some of the machines, with no improvement. Even worse, it corrupted the file system on the one I restarted the most often, forcing me to restore a backup.
tldr: yesterday was a rough day.
Now everything is back in order (after the node restart everything returned to normal), but I would like to understand what went wrong.
Can you maybe help me run some diagnostics to understand what happened? Given the urgency and impact of the situation, I didn't take the time to take screenshots or dump any log files. It started happening about 24 hours ago and stopped about 20 hours ago.
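Since I didn't save anything at the time, here is roughly what I was planning to run to pull logs from the incident window. This is just a sketch assuming GNU date and a systemd journal on the node; the timestamps are approximate and I'd adjust them before dumping anything:

```shell
# Approximate incident window: started ~24h ago, ended ~20h ago.
START=$(date -d '24 hours ago' '+%Y-%m-%d %H:%M')
END=$(date -d '20 hours ago' '+%Y-%m-%d %H:%M')

# Print the journalctl command I'd run on the affected node, so I can
# double-check the window before dumping anything:
echo "journalctl --since \"$START\" --until \"$END\" -o short-precise > incident-journal.log"
```

If the journal on that node isn't persistent across reboots, I guess the syslog/kern.log files from that window would be the fallback.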
The only thing I noticed is that when I ran htop, I didn't see huge overall CPU consumption, but I did see about 15 cores out of 128 that were stuck at 100%.
Thank you in advance for your help.