Hello everyone!
I hope you are all doing well!
First, sorry for my bad English!
My setup: a PVE cluster with 3 nodes, ZFS RAID1 storage on NVMe, Intel Xeon E-2288G CPU, 128 GB DDR4 ECC RAM, 1 master node with 16 LXC containers, 1 slave node, and a last node used only for quorum, with redundant network links.
One of the nodes (the master) ran into a storage problem this weekend.
On Saturday morning I saw that the node was showing an I/O delay of 99% in the PVE panel!
The server could not read from or write to the ZFS pool, so all the services in all my containers were running only from the code cached in RAM.
I hard-restarted the server and all services came back up normally, without any errors.
I checked all the logs. None of them show any errors, but they all stop at 00:50:51, when the crash began, and contain nothing more until I hard-rebooted the server at 11:43:11.
My monitoring system (Observium) just shows 100% CPU usage for the whole night and no storage I/O activity at all.
After that I migrated all the LXC containers to the slave node and ran all the hardware tests on the master; no problem was found.
It feels like the ZFS storage crashed, or something like that.
I need your help. Maybe someone has encountered this problem before?
Thank you very much and have a good day!