I/O delay 99% after cancelling LXC migration

lethargos

Well-Known Member
Jun 10, 2017
Hi,
For testing purposes I tried to migrate a container from one server to another. They are in a non-HA cluster.
I ran:
pct migrate <VMID> <node_name> -restart

Then I cancelled the process by pressing ctrl+c.
Then, all of a sudden, the server to which I tried to migrate the LXC could only partially be accessed through the GUI (an "x" appeared on the server's icon). None of the containers (around 10) could be accessed either, and systemctl -t service showed that three of them had failed to start (the server was perfectly accessible through SSH).

I/O delay showed around 99% usage.

lsof didn't display anything, but I was able to install iotop, which showed jbd2 running somewhat high, sometimes around 50% or more. The thing is, jbd2 takes a good chunk of I/O even under normal circumstances, like now, when the server is fine. I tried to start the affected containers, but nothing happened; there was no output.
Eventually I simply restarted the server and everything worked as expected. I have no idea what happened.
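In case it helps anyone hitting the same thing, a few generic commands that might narrow down such an I/O stall without rebooting (iostat comes from the sysstat package and may need installing; nothing here is specific to this particular incident):

# look for hung-task warnings and processes stuck in uninterruptible sleep (D state)
dmesg | grep -i "blocked for more than"
ps axo pid,stat,cmd | awk '$2 ~ /^D/'

# per-device utilisation and wait times, refreshed every second
iostat -x 1

# cumulative I/O per process since iotop was started
iotop -ao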

I should also probably mention that there used to be another node in this cluster. Later on we moved that node away. It still shows up in the GUI, because we haven't actually removed it (with pvecm).
But pvecm status shows only two nodes:
pvecm status
Quorum information
------------------
Date: Tue Jun 12 16:01:42 2018
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/660
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 pub_ip (local)
0x00000002 1 pub_ip2
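As a side note, if the old node is gone for good, the usual cleanup (per the Proxmox cluster documentation; <old_node_name> is a placeholder) would be roughly the following. If pvecm status already no longer lists the node, only the directory removal should matter:

# remove the node from the cluster configuration (only if pvecm status still lists it)
pvecm delnode <old_node_name>

# clear its leftover entry from the GUI tree
rm -r /etc/pve/nodes/<old_node_name>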

I really want to understand what actually happened, so that next time I can solve it without needing to restart the whole server, as if my servers were running on a Microsoft product.
 
Your hardware is too slow to cope with the migration of the LXC container. You need faster hardware.
 
Really? So two Xeon E5-2620 v4 CPUs with 8 cores each, 64 GB of RAM and SSDs on each server aren't enough to migrate a STOPPED LXC container (this is what Proxmox did first when I ran pct migrate, it stopped it first) taking up less than 20 GB of space? If that's your answer, then I bet Hyper-V could pull this off, and even live migrate it without too much trouble...
 
What model of SSDs, and how are they configured? jbd2 writes the journal on the ext4 filesystem.
https://en.wikipedia.org/wiki/Journaling_block_device
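On ext4, each jbd2 kernel thread is named after the device whose journal it handles, so something like this shows which filesystem a given thread belongs to (the device name below is only an example):

ps ax | grep '[j]bd2'
# e.g. [jbd2/sda1-8] is the journal thread for the ext4 filesystem on /dev/sda1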
 
It's a RAID 6 with seven Samsung 850 EVO SSDs.
Yes, I did search for jbd2 earlier and saw what it does, but I'm not sure how to connect it to this event.
 
These are not enterprise SSDs; they have bad numbers for 4K sync writes. Try to limit the migration speed so you don't overload the storage with I/O.
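A minimal sketch of a rate-limited migration, assuming your pct version supports the --bwlimit option (value in KiB/s); newer versions also allow a cluster-wide default in datacenter.cfg:

# restart-mode migration capped at roughly 100 MiB/s
pct migrate <VMID> <node_name> --restart --bwlimit 102400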
 
What would a good number for 4K sync writes be? What if I used 15k rpm enterprise HDDs? Even the worst SSDs have better IOPS than the best HDDs, so I don't get it. The servers are not particularly overloaded, in any case.
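One way to put a number on it is a 4K synchronous write test with fio (the filename and size below are placeholders; run it on the array in question and remember it writes real data):

# 4K writes with an fsync after every write, single job, capped at 60 seconds
fio --name=sync4k --filename=/root/fio-testfile --rw=write --bs=4k --direct=1 --fsync=1 --size=1G --runtime=60 --numjobs=1 --group_reporting

Consumer drives without power-loss protection usually score far lower here than enterprise SSDs, which is presumably what the previous poster was getting at.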
 
