Hi,
For testing purposes I tried to migrate a container from one server to another. They are in a non-HA cluster.
I ran:
pct migrate <VMID> <node_name> -restart
Then I cancelled the process by pressing Ctrl+C.
Suddenly the server I was migrating the LXC container to became partially inaccessible through the GUI (an "x" appeared on the server's icon). None of its containers (around 10) could be accessed either, and systemctl -t service showed that three of them had failed to start (the server itself was perfectly accessible through SSH).
I/O delay showed around 99%.
lsof didn't display anything, but I was able to install iotop, which showed jbd2 running somewhat high, sometimes around 50% or more. The thing is, jbd2 takes a good chunk of I/O even in normal circumstances, like now, when the server is OK. I tried to start the affected containers, but nothing happened; there was no output.
Eventually I simply restarted the server and everything worked as expected. I have no idea what happened.
I should probably also mention that there used to be another node in this cluster. Later on we moved this node away. It still shows up in the GUI, because we haven't actually removed it (with pvecm).
But pvecm status shows only two nodes:
pvecm status

Quorum information
------------------
Date:             Tue Jun 12 16:01:42 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/660
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 pub_ip (local)
0x00000002          1 pub_ip2

I really want to understand what actually happened, and next time solve it without needing to restart the whole server, as if my servers were running on a Microsoft product.