I/O delay 99% after cancelling LXC migration

lethargos

Well-Known Member
Jun 10, 2017
Hi,
For testing purposes I tried to migrate a container from one server to another. They are in a non-HA cluster.
I ran:
pct migrate <VMID> <node_name> -restart

Then I cancelled the process by pressing ctrl+c.
Then, all of a sudden, the server to which I tried to migrate the LXC could only partially be accessed through the GUI (an "x" appeared on the server's icon). None of the containers (around 10) could be accessed either, and systemctl -t service showed that three of them had failed to start (the server was perfectly accessible through SSH).

I/O delay showed around 99% usage.

lsof didn't display anything, but I was able to install iotop, which showed jbd2 running somewhat high, sometimes around 50% or more. The thing is, jbd2 takes a good chunk of I/O even under normal circumstances, like now, when the server is fine. I tried to start the affected containers, but nothing happened; there was no output.
Eventually I simply restarted the server and everything worked as expected. I have no idea what happened.
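In case it helps anyone hitting the same thing, a few generic commands that might narrow down such an I/O stall without rebooting (iostat comes from the sysstat package and may need installing; nothing here is specific to this particular incident):

# look for hung-task warnings and processes stuck in uninterruptible sleep (D state)
dmesg | grep -i "blocked for more than"
ps axo pid,stat,cmd | awk '$2 ~ /^D/'

# per-device utilisation and wait times, refreshed every second
iostat -x 1

# cumulative I/O per process since iotop was started
iotop -ao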

I should also probably mention that there used to be another node in this cluster. Later on we moved that node away. It still shows up in the GUI, because we haven't actually removed it (with pvecm).
But pvecm status shows only two nodes:
pvecm status
Quorum information
------------------
Date: Tue Jun 12 16:01:42 2018
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/660
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 pub_ip (local)
0x00000002 1 pub_ip2
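As a side note, if the old node is gone for good, the usual cleanup (per the Proxmox cluster documentation; <old_node_name> is a placeholder) would be roughly the following. If pvecm status already no longer lists the node, only the directory removal should matter:

# remove the node from the cluster configuration (only if pvecm status still lists it)
pvecm delnode <old_node_name>

# clear its leftover entry from the GUI tree
rm -r /etc/pve/nodes/<old_node_name>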

I really want to understand what actually happened, so that next time I can solve it without needing to restart the whole server, as if my servers were running on a Microsoft product.
 
Your hardware is too slow to cope with the migration of the LXC container. You need faster hardware.
 
Really? So two Xeon E5-2620 v4 CPUs with 8 cores each, 64 GB of RAM and SSDs on each server aren't enough to migrate a STOPPED LXC container (this is what Proxmox did first when I ran pct migrate, it stopped it first) taking up less than 20 GB of space? If that's your answer, then I bet Hyper-V could pull this off, and even live migrate it without too much trouble...
 
What model of SSDs, and how are they configured? jbd2 writes the journal on the ext4 filesystem.
https://en.wikipedia.org/wiki/Journaling_block_device
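On ext4, each jbd2 kernel thread is named after the device whose journal it handles, so something like this shows which filesystem a given thread belongs to (the device name below is only an example):

ps ax | grep '[j]bd2'
# e.g. [jbd2/sda1-8] is the journal thread for the ext4 filesystem on /dev/sda1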
 
It's a RAID 6 with seven Samsung 850 EVO SSDs.
Yes, I did search for jbd2 earlier and saw what it does, but I'm not sure how to connect it to this event.
 
These are not enterprise SSDs; they have bad numbers for 4K sync writes. Try to limit the migration speed so you don't overload the storage with I/O.
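A minimal sketch of a rate-limited migration, assuming your pct version supports the --bwlimit option (value in KiB/s); newer versions also allow a cluster-wide default in datacenter.cfg:

# restart-mode migration capped at roughly 100 MiB/s
pct migrate <VMID> <node_name> --restart --bwlimit 102400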
 
What would a good number for 4K sync writes be? What if I used 15k rpm enterprise HDDs? Even the worst SSDs have better IOPS than the best HDDs, so I don't get it. The servers are not particularly overloaded, in any case.
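One way to put a number on it is a 4K synchronous write test with fio (the filename and size below are placeholders; run it on the array in question and remember it writes real data):

# 4K writes with an fsync after every write, single job, capped at 60 seconds
fio --name=sync4k --filename=/root/fio-testfile --rw=write --bs=4k --direct=1 --fsync=1 --size=1G --runtime=60 --numjobs=1 --group_reporting

Consumer drives without power-loss protection usually score far lower here than enterprise SSDs, which is presumably what the previous poster was getting at.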
 
